Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase
We’ve rolled out major improvements to our Batch Inference API, making it simpler, faster, and more powerful for teams processing massive datasets.

What’s New
Streamlined UI
Create and track batch jobs in an intuitive interface — no complex API calls required.
Universal Model Access
The Batch Inference API now supports all serverless models and private deployments, so you can run batch workloads on exactly the models you need.
Massive Scale Jump
Rate limits are up from 10M to 30B enqueued tokens per model per user, a 3000× increase. Need more? We’ll work with you to customize.
Lower Cost
For most serverless models, the Batch Inference API runs at 50% of the cost of our real-time API, making it the most economical way to process high-throughput workloads.
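For teams that prefer scripting over the UI, a batch workflow typically means uploading a file of requests, enqueuing a job, and polling for completion. The sketch below illustrates that flow in Python; the base URL, endpoint paths, field names, and status values are hypothetical placeholders rather than our exact API surface, so consult the API reference for the real interface.

```python
import time

import requests

API_BASE = "https://api.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Upload a JSONL file of requests, then enqueue it as a batch job.
# Endpoint paths and field names are illustrative, not a documented API.
with open("requests.jsonl", "rb") as f:
    upload = requests.post(f"{API_BASE}/files", headers=HEADERS,
                           files={"file": f}).json()

job = requests.post(
    f"{API_BASE}/batches",
    headers=HEADERS,
    json={"input_file_id": upload["id"]},
).json()

# Poll until the job finishes. Jobs complete within a 24-hour SLA,
# often within hours, so infrequent polling is fine.
while True:
    status = requests.get(f"{API_BASE}/batches/{job['id']}",
                          headers=HEADERS).json()
    if status["status"] in ("completed", "failed", "expired"):
        break
    time.sleep(60)

print("Batch job ended with status:", status["status"])
```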
Batch Inference API in Action
“We rely on the Batch Inference API to process very large amounts of requests. The high rate limits—up to 30B enqueued tokens—let us run massive experiments without bottlenecks, and jobs consistently finish well under the 24-hour SLA, often within just hours. It’s transformed the pace at which we can test and iterate.”
— Volodymyr Kuleshov, Co-Founder, Inception Labs
Inception Labs is one of many teams leveraging the Batch Inference API to accelerate experimentation and production workloads. From research datasets to customer-facing applications, Batch enables large-scale processing that simply wasn’t feasible before.
Ideal Use Cases
The Batch Inference API is perfect when you need high throughput without real-time constraints:
- Large-scale text analysis: Sentiment analysis, document classification, content tagging (see the sketch after this list)
- Fraud detection: Scan millions of transactions for anomalies
- Synthetic data generation: Create massive training datasets
- Embedding generation: Turn large corpora into vector representations
- Content moderation: Process user-generated content at scale
- Model evaluation: Run large benchmark suites
- Customer support automation: Handle tickets with longer SLAs efficiently
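To make the first use case concrete, here is a minimal sketch of preparing a batch input file for sentiment analysis. The JSONL schema shown (one request per line with a custom_id and a chat-style body) is an assumed format modeled on common batch APIs; check the API reference for the exact fields.

```python
import json

# Documents to classify; in practice this could be millions of records.
documents = ["Great product, would buy again.", "Shipping took forever."]

# Hypothetical JSONL batch-input format: one request object per line,
# tagged with a custom_id so results can be matched back to inputs.
with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",
            "body": {
                "model": "your-model-name",  # any serverless model or private deployment
                "messages": [
                    {"role": "system",
                     "content": "Classify the sentiment of the text as positive, negative, or neutral."},
                    {"role": "user", "content": doc},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")
```

Because batch jobs like this bill at 50% of real-time pricing for most serverless models, this pattern is well suited to one-off sweeps over large corpora.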
Looking Ahead
These updates mark a major step forward in making large-scale inference both accessible and cost-effective. With an upgraded UI, universal model support, and dramatically higher rate limits, all at typically half the cost of real-time APIs, the Batch Inference API is the most efficient way to handle massive workloads.
Try the Batch Inference API today and start scaling your experiments without limits.