Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA support
Happy Developer Week 2025! Workers AI is excited to announce a couple of new features and improvements available today. Check out our blog ↗ for all the announcement details.
We’re rolling out some in-place improvements to our models that can help speed up inference by 2-4x! Users of the models below will enjoy an automatic speed boost starting today:
- @cf/meta/llama-3.3-70b-instruct-fp8-fastgets a speed boost of 2-4x, leveraging techniques like speculative decoding, prefix caching, and an updated inference backend.
- @cf/baai/bge-small-en-v1.5,- @cf/baai/bge-base-en-v1.5,- @cf/baai/bge-large-en-v1.5get an updated back end, which should improve inference times by 2x.- With the bgemodels, we’re also announcing a new parameter calledpoolingwhich can takeclsormeanas options. We highly recommend usingpooling: clswhich will help generate more accurate embeddings. However, embeddings generated with cls pooling are not backwards compatible with mean pooling. For this to not be a breaking change, the default remains as mean pooling. Please specifypooling: clsto enjoy more accurate embeddings going forward.
 
- With the 
We’re also excited to launch a few new models in our catalog to help round out your experience with Workers AI. We’ll be deprecating some older models in the future, so stay tuned for a deprecation announcement. Today’s new models include:
- @cf/mistralai/mistral-small-3.1-24b-instruct: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.
- @cf/google/gemma-3-12b-it: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages.
- @cf/qwen/qwq-32b: a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.
- @cf/qwen/qwen2.5-coder-32b-instruct: the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
Introducing a new batch inference feature that allows you to send us an array of requests, which we will fulfill as fast as possible and send them back as an array. This is really helpful for large workloads such as summarization, embeddings, etc. where you don’t have a human-in-the-loop. Using the batch API will guarantee that your requests are fulfilled eventually, rather than erroring out if we don’t have enough capacity at a given time.
Check out the tutorial to get started! Models that support batch inference today include:
- @cf/meta/llama-3.3-70b-instruct-fp8-fast
- @cf/baai/bge-small-en-v1.5
- @cf/baai/bge-base-en-v1.5
- @cf/baai/bge-large-en-v1.5
- @cf/baai/bge-m3
- @cf/meta/m2m100-1.2b
We’ve upgraded our LoRA experience to include 8 newer models, and can support ranks of up to 32 with a 300MB safetensors file limit (previously limited to rank of 8 and 100MB safetensors) Check out our LoRAs page to get started. Models that support LoRAs now include:
- @cf/meta/llama-3.2-11b-vision-instruct
- @cf/meta/llama-3.3-70b-instruct-fp8-fast
- @cf/meta/llama-guard-3-8b
- @cf/meta/llama-3.1-8b-instruct-fast(coming soon)
- @cf/deepseek-ai/deepseek-r1-distill-qwen-32b(coming soon)
- @cf/qwen/qwen2.5-coder-32b-instruct
- @cf/qwen/qwq-32b
- @cf/mistralai/mistral-small-3.1-24b-instruct
- @cf/google/gemma-3-12b-it
Was this helpful?
- Resources
- API
- New to Cloudflare?
- Directory
- Sponsorships
- Open Source
- Support
- Help Center
- System Status
- Compliance
- GDPR
- Company
- cloudflare.com
- Our team
- Careers
- © 2025 Cloudflare, Inc.
- Privacy Policy
- Terms of Use
- Report Security Issues
- Trademark