SAP and ICL: Contextual Optimization of Tabular Models

Key Takeaways

1. SAP launched the SAP-RPT-1 model suite in 2025, using in-context learning (ICL) to address ERP tasks.

2. ICL lets a model adapt without retraining, but it forces a trade-off between latency and accuracy.

3. Optimizing the contextual payload is crucial for balancing efficiency and cost in modern AI systems.

Why it matters: Contextual optimization directly influences the performance and user experience of AI systems.

SAP and ICL: A New Era for Tabular Models

The Rise of Contextual Learning

In recent years, investment in tabular foundation models built around In-Context Learning (ICL), both open-source and commercial, has grown significantly. In 2025, the software giant SAP introduced the SAP-RPT-1 model suite, designed to address ERP-centric tasks in areas such as financial planning, sales and procurement order processing, and supply chain management. Unlike traditional supervised learning, where models are trained and fine-tuned for specific tasks, ICL allows a pre-trained model to adapt at inference time using relatively small amounts of task-specific data provided in the contextual payload, which acts as an ephemeral training set.

The Challenges of ICL: Accuracy vs Latency

While the shift to ICL eliminates the need to retrain task-specific tabular models, it introduces a significant trade-off between accuracy and latency at inference time, particularly for centrally hosted models like SAP-RPT-1. On one hand, the time required to send the contextual payload to the model's server, and for the model to interpret and learn from this payload, directly contributes to the overall response latency. Smaller payloads can reduce latency. On the other hand, the model may need to infer complex patterns and data distributions from heterogeneous contextual data that may contain outliers, missing values, and long-tail patterns. Accurate predictions generally depend on large and well-organized contextual payloads. In practice, this means finding ways to distill the contextual payload to reduce response time without degrading the model's predictive performance. Secondary trade-offs involve factors such as the model's service throughput, response stability, and the monetary cost of using the model. All these challenges make optimizing the contextual payload a central architectural concern in ICL-based workflows.

Trade-offs at Inference Time

An effective way to analyze inference-time trade-offs for ICL-based tabular foundation models is the "iron triangle" framework, which is analogous to the classic "triple constraint" in project management. It helps navigate the inherent tensions between response quality, inference cost, and latency. Improving one of these dimensions generally puts pressure on the others: higher-quality responses tend to be more computationally intensive, increasing both latency and cost; reducing latency often requires sacrificing quality or paying more for faster hardware; and decreasing cost generally means accepting slower or lower-quality responses.

The same triangular tension applies to ICL-based tabular foundation models. The primary trade-off is between response quality (measured in terms of accuracy, recall, etc.) and latency. Consider a real-time fraud detection system deployed at ATMs: both accuracy and speed are critical, but they pull the construction of the contextual payload in different directions. Larger and richer payloads give the model more examples from which to infer the underlying pattern and recognize rare, long-tail cases, and thus produce higher-quality predictions. At the same time, each additional row or feature increases the volume of data that must be sent to the model's server and interpreted during inference, which adds measurable overhead to the end-to-end response time. In real-time applications, even a small increase in payload size can noticeably degrade system responsiveness and ultimately harm the user experience.

Secondary Trade-offs and Implications

Moreover, several secondary trade-offs emerge in practice. A larger contextual payload not only slows down inference but also consumes more tokens. In the context of token-based billing, this creates a tension between response latency and the monetary cost of using the model for clients, which becomes particularly relevant for centrally hosted models like SAP-RPT-1. A larger payload can increase the computation time per request, creating a latency-throughput trade-off that may force the AI system development team to make difficult scalability decisions. There is also a potential trade-off between quality and stability: increasing the volume and variety of contextual data may improve predictive accuracy but can reduce determinism by introducing noise and making outputs more sensitive to small variations in the data. Finally, more sophisticated payload selection methods, such as KNN-based retrieval, can enhance prediction quality but also increase the time required to construct the payload, adding to overall latency.
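To get a feel for how payload size drives both transfer time and token-based cost, a rough estimator can help. The sketch below is illustrative only: the JSON serialization, the bytes-per-token ratio, the price per 1,000 tokens, and the uplink bandwidth are all assumed placeholder values, not SAP-RPT-1 figures, and the function name is hypothetical.

```python
import json


def estimate_payload(rows, bytes_per_token=4.0, price_per_1k_tokens=0.01,
                     uplink_bytes_per_s=1_000_000):
    """Rough size, token, cost, and upload-time estimate for a context payload.

    All default parameters are placeholder assumptions for illustration.
    """
    # Serialize compactly, as a client might before sending the request.
    body = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    n_bytes = len(body)
    est_tokens = n_bytes / bytes_per_token
    return {
        "bytes": n_bytes,
        "est_tokens": est_tokens,
        "est_cost": est_tokens / 1000 * price_per_1k_tokens,
        "est_upload_s": n_bytes / uplink_bytes_per_s,
    }


# Compare a small and a large payload built from the same example row.
row = {"amount": 10.5, "country": "DE", "is_fraud": False}
small = estimate_payload([row] * 10)
large = estimate_payload([row] * 100)
print(small["bytes"], large["bytes"])
```

Under these assumptions, cost and upload time both scale roughly linearly with the number of rows, which is why trimming the payload affects several corners of the triangle at once.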

Strategies for Optimizing Contextual Payloads

In general, strategies for optimizing contextual payloads span two orthogonal dimensions: the method and the timing of optimization. The optimization method determines how the payload is crafted, i.e., the specific filtering, grouping, or encoding techniques used to condense the rows of the raw context. The timing of optimization concerns when and where the optimization is performed: whether it is precomputed offline or derived on-the-fly at inference time, and whether it is done by the client or by the model service. The choice of timing can have significant implications for inference latency and maintainability. Both the method and the timing must align with the scope, budget, latency threshold, and quality requirements of a given AI use case.

Optimization Methods

We can generally distinguish between task-agnostic and task-sensitive optimization methods. Task-agnostic methods rely on techniques such as random sampling and recency-based sampling, which do not require knowledge of the specific prediction task or the semantic structure of the data. Random sampling is easy to implement, fast, and unbiased, making it a useful baseline or fallback strategy. However, it may inadvertently discard rows that capture rare patterns that are crucial for model performance. Recency-based sampling assumes that timestamps are recorded in the data and retrieves the most recent rows, which can be valuable for time-dependent data distributions (e.g., seasonal ones) or those prone to temporal drift. However, recency-based sampling ignores the broader structure of the dataset and may overweight short-term noise. Overall, task-agnostic methods offer simplicity and speed but provide limited control over the representativeness and relevance of the resulting payload.
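As a minimal sketch of the two task-agnostic methods, assume the historical data is a list of dictionaries and that each row carries a sortable value under a hypothetical "timestamp" key. The function names and the data layout are illustrative assumptions, not part of any SAP API.

```python
import random
from datetime import datetime, timedelta


def random_sample(rows, k, seed=None):
    """Uniform random sample of k rows (all rows if fewer than k exist)."""
    rng = random.Random(seed)  # a seed makes the payload reproducible
    if k >= len(rows):
        return list(rows)
    return rng.sample(rows, k)


def recency_sample(rows, k, ts_key="timestamp"):
    """The k most recent rows, newest first, based on a timestamp field."""
    return sorted(rows, key=lambda r: r[ts_key], reverse=True)[:k]


# Toy history: one row per day, starting 1 January 2025.
start = datetime(2025, 1, 1)
history = [{"id": i, "timestamp": start + timedelta(days=i)} for i in range(100)]

baseline_payload = random_sample(history, 10, seed=0)
recent_payload = recency_sample(history, 3)
print([r["id"] for r in recent_payload])  # ids of the three newest rows
```

Neither function looks at the query or the prediction target, which is what makes them cheap and also what limits their relevance.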

In contrast, task-sensitive methods can incorporate information about the prediction task, the query rows, and the underlying data distribution to select the most relevant rows for the contextual payload. A common approach is K-nearest neighbors (KNN) sampling, which identifies rows in the historical data that are similar to the query rows. This can produce highly relevant contextual data and strong empirical performance, but it requires a distance metric (e.g., cosine) and auxiliary models to vectorize or embed the data, and can therefore be computationally expensive at scale. Another class of techniques uses clustering algorithms (e.g., K-means, hierarchical clustering, DBSCAN) to draw representative samples from the clusters relevant to the query rows. This can ensure sufficient coverage of diverse patterns in the data while avoiding redundancy, although it typically requires computing the clusters offline and recomputing them periodically so that they remain up to date.
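The KNN idea can be sketched with a brute-force search over cosine similarity. This assumes each historical row has already been turned into a numeric vector by some upstream embedding step, stored here under a hypothetical "vector" key; a production system would more likely use an approximate nearest-neighbor index than a linear scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as maximally dissimilar
    return dot / (norm_a * norm_b)


def knn_context(history, query_vec, k, vector_key="vector"):
    """Return the k historical rows most similar to the query vector."""
    return heapq.nlargest(
        k, history, key=lambda r: cosine_similarity(r[vector_key], query_vec)
    )


# Toy example with 2-dimensional vectors.
history = [
    {"id": 0, "vector": [1.0, 0.0]},
    {"id": 1, "vector": [0.0, 1.0]},
    {"id": 2, "vector": [0.9, 0.1]},
]
neighbors = knn_context(history, [1.0, 0.0], k=2)
print([r["id"] for r in neighbors])  # [0, 2]
```

The linear scan costs one similarity computation per historical row for every query, which is the scaling concern mentioned above.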

More sophisticated task-sensitive methods are also possible. For example, the raw context and query rows can be embedded in a low-dimensional vector space, encoded on the query side and decoded from the foundation model's API response. This amounts to a form of lossy compression that sacrifices some accuracy for the latency and cost benefits of a smaller payload. Retrieval-augmented generation (RAG) techniques can further enrich the payload with domain-specific grounding to improve response relevance.

In summary, task-sensitive methods generally produce higher quality contextual payloads but come with a greater engineering and computational overhead.

Timing of Optimization

A key timing decision is whether certain payload optimization steps can be precomputed offline (i.e., the "when"). For example, a "golden" dataset can be precomputed from historical data, optimized for information density, and enriched with metadata (e.g., cluster identifiers, hashtags). At inference time, relevant rows can then be selected from this leaner golden dataset to construct and send the contextual payload quickly. Golden datasets are well suited to stable patterns and repetitive tasks (e.g., auto-completing common sales orders in the ERP domain), but building and maintaining them creates additional overhead for the development team. In contrast, on-the-fly optimization derives the payload at inference time from the current query rows and the available historical data. This approach is more adaptive but can increase computation cost and latency for each inference call. On-the-fly optimization does not necessarily reduce the development team's overhead: savings from not maintaining a golden dataset may be offset by the engineering effort required to optimize the contextual payload dynamically.
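The split between offline and inference-time work can be illustrated with a small sketch. It assumes each historical row already carries a hypothetical "cluster_id" (for example, from an offline clustering run) and a sortable "timestamp", and that the query's cluster is determined by some upstream step; the function names and the per-cluster cap are illustrative choices, not a prescribed design.

```python
from collections import defaultdict


def build_golden_dataset(history, per_cluster=50,
                         cluster_key="cluster_id", ts_key="timestamp"):
    """Offline step: keep only the newest per_cluster rows of each cluster."""
    buckets = defaultdict(list)
    for row in history:
        buckets[row[cluster_key]].append(row)
    golden = {}
    for cluster_id, rows in buckets.items():
        rows.sort(key=lambda r: r[ts_key], reverse=True)
        golden[cluster_id] = rows[:per_cluster]
    return golden


def payload_for_query(golden, query_cluster_id, k):
    """Inference-time step: a dictionary lookup and a slice, no heavy work."""
    return golden.get(query_cluster_id, [])[:k]


# Toy history: ten rows alternating between two clusters.
history = [{"id": i, "cluster_id": i % 2, "timestamp": i} for i in range(10)]
golden = build_golden_dataset(history, per_cluster=3)
payload = payload_for_query(golden, query_cluster_id=0, k=2)
print([r["id"] for r in payload])  # [8, 6]
```

The expensive grouping and sorting happen once, offline; the inference path pays only for a lookup. The flip side is that the golden dataset goes stale until the offline step is rerun.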

Another timing-related decision concerns whether the optimization occurs on the client side or the service side.