Rapid-MLX
Rapid-MLX is an OpenAI-compatible inference server optimized for Apple Silicon (MLX). It runs 2-4x faster than Ollama and supports full tool calling, reasoning separation, and prompt caching.
| Property | Details |
|---|---|
| Description | Local LLM inference server for Apple Silicon. Docs |
| Provider Route on LiteLLM | rapid_mlx/ |
| Provider Doc | Rapid-MLX ↗ |
| Supported Endpoints | /chat/completions |
Quick Start
Install and start Rapid-MLX
brew tap raullenchai/rapid-mlx
brew install rapid-mlx
rapid-mlx serve qwen3.5-9b
Or install via pip:
pip install vllm-mlx
rapid-mlx serve qwen3.5-9b
Usage - litellm.completion (calling OpenAI compatible endpoint)
SDK
import litellm
response = litellm.completion(
    model="rapid_mlx/default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
PROXY
- Add to config.yaml
model_list:
  - model_name: my-model
    litellm_params:
      model: rapid_mlx/default
      api_base: http://localhost:8000/v1
- Start the proxy
$ litellm --config /path/to/config.yaml
- Send a request
curl --location 'http://0.0.0.0:4000/chat/completions' \
  --header 'Authorization: Bearer sk-1234' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
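The same request can also be sent from Python using only the standard library. This is a minimal sketch, assuming the proxy address (`http://0.0.0.0:4000`), the key (`sk-1234`), and the model alias (`my-model`) from the curl example above; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of LiteLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="my-model",
                       base_url="http://0.0.0.0:4000", api_key="sk-1234"):
    """Build a POST request for the proxy's /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text.

    Needs the LiteLLM proxy and a Rapid-MLX server to be running.
    """
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With both servers running, `send_chat_request(build_chat_request("what llm are you"))` should return the model's reply as a string.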
Environment Variables
| Variable | Description | Default |
|---|---|---|
| RAPID_MLX_API_KEY | API key (optional, Rapid-MLX does not require auth by default) | not-needed |
| RAPID_MLX_API_BASE | Server URL | http://localhost:8000/v1 |
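The fallback behavior in the table can be pictured with a short sketch. It only mirrors the documented defaults and is not LiteLLM's actual resolution code; the helper `resolve_rapid_mlx_settings` and the example address `http://192.168.1.20:8000/v1` are hypothetical.

```python
import os

# Defaults as listed in the table above.
DEFAULT_API_BASE = "http://localhost:8000/v1"
DEFAULT_API_KEY = "not-needed"


def resolve_rapid_mlx_settings(env=None):
    """Return (api_base, api_key), falling back to the documented defaults."""
    env = os.environ if env is None else env
    api_base = env.get("RAPID_MLX_API_BASE") or DEFAULT_API_BASE
    api_key = env.get("RAPID_MLX_API_KEY") or DEFAULT_API_KEY
    return api_base, api_key


# Nothing set: both values fall back to the defaults.
print(resolve_rapid_mlx_settings({}))
# Only the server URL is overridden (hypothetical address on the local network).
print(resolve_rapid_mlx_settings({"RAPID_MLX_API_BASE": "http://192.168.1.20:8000/v1"}))
```

Setting `RAPID_MLX_API_BASE` is typically only needed when the server runs on a different host or port than the default.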
Supported Models
Any MLX model served by Rapid-MLX works. Use the model name as loaded by the server. Common choices:
- rapid_mlx/default - Whatever model is currently loaded
- rapid_mlx/qwen3.5-9b - Best small model for general use
- rapid_mlx/qwen3.5-35b - Smart and fast
- rapid_mlx/qwen3.5-122b - Frontier-level MoE model
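As the list shows, the LiteLLM model string is the server-side model name prefixed with the `rapid_mlx/` route. A small illustrative helper (the name `to_litellm_model` is hypothetical, not a LiteLLM function) makes the mapping explicit:

```python
PROVIDER_PREFIX = "rapid_mlx/"


def to_litellm_model(server_model_name="default"):
    """Prefix a Rapid-MLX model name with the LiteLLM provider route."""
    name = server_model_name.strip()
    if name.startswith(PROVIDER_PREFIX):
        return name  # already carries the provider route
    return PROVIDER_PREFIX + name


print(to_litellm_model("qwen3.5-9b"))
print(to_litellm_model())
```

The returned string is what would be passed as `model=` to `litellm.completion` or as `model:` under `litellm_params` in config.yaml.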
Features
- Streaming - Full SSE streaming support
- Tool calling - 17 parser formats (Qwen, Hermes, MiniMax, GLM, etc.)
- Reasoning separation - Native support for thinking models (Qwen3, DeepSeek-R1)
- Prompt caching - KV cache reuse and DeltaNet state snapshots for fast time to first token (TTFT)
- Multi-Token Prediction - Speculative decoding for supported models
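To illustrate the streaming feature listed above, the sketch below reassembles text from a server-sent event (SSE) stream. It assumes Rapid-MLX emits chunks in the usual OpenAI chat-completions streaming shape (`data: {...}` lines ending with `data: [DONE]`); the function name `iter_stream_text` and the sample events are made up for illustration and were not captured from a real server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative events only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello!"
```

When calling through the LiteLLM SDK, passing `stream=True` to `litellm.completion` returns an iterator of already-parsed chunks, so manual SSE parsing like this is only needed when talking to the HTTP endpoint directly.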