Download, serve, quantize, and fine-tune AI models. From laptop to cluster. OpenAI-compatible API out of the box.
Download models from HuggingFace, Koder AI Hub, or any direct URL. Supports GGUF and safetensors formats with progress tracking.
Expose any model via a fully OpenAI-compatible REST API — a drop-in replacement supporting /v1/chat/completions, streaming, and embeddings.
Chat with models directly from your terminal. Full conversation history, streaming responses, and special commands.
Reduce model size with GGUF quantization. Supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, and other quantization levels.
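As a rough illustration of the size reduction, on-disk size scales with bits per weight. The figures below are approximate rules of thumb (actual sizes vary by model architecture and quantization implementation):

```python
# Approximate on-disk size of a 7B-parameter model at different precisions.
# Bits-per-weight values are rough rule-of-thumb figures, not exact.
PARAMS = 7_000_000_000

bits_per_weight = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
}

for name, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:8s} ~{gib:.1f} GiB")
```

A 7B model drops from roughly 13 GiB at FP16 to around 4 GiB at Q4_K_M, at some cost in output quality.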
Fine-tune models on your own data with LoRA and QLoRA methods. Powered by unsloth or axolotl backends.
Measure inference performance: tokens/sec, latency percentiles (p50/p95/p99), and throughput across multiple prompts.
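The percentile and throughput figures can be derived from raw per-request timings. A minimal sketch of the arithmetic (the sample numbers below are made up, not output from the tool):

```python
# Hypothetical per-request latencies (ms) and generated-token counts,
# as might be collected while replaying a set of benchmark prompts.
latencies_ms = [112, 98, 130, 105, 340, 101, 99, 150, 120, 95]
tokens_generated = [64] * len(latencies_ms)

def percentile(data, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(data)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Aggregate throughput: total tokens divided by total wall-clock time.
tokens_per_sec = sum(tokens_generated) / (sum(latencies_ms) / 1000)

print(f"p50={p50} ms  p95={p95} ms  p99={p99} ms  {tokens_per_sec:.0f} tok/s")
```

Note how a single slow request (340 ms) barely moves the p50 but dominates the tail percentiles — which is exactly why benchmarks report p95/p99 alongside the median.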
Auto-detect NVIDIA (CUDA) and AMD (ROCm) GPUs. Monitor VRAM usage, utilization, and temperature in real time.
Same binary works everywhere. Run on your laptop for development, deploy to a GPU cluster for production.
Define model configurations with Modelfiles. Set system prompts, parameters, and templates in a simple text format.
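A Modelfile might look like the following. The directive names here are illustrative, assuming an Ollama-style syntax — consult the CLI's own documentation for the exact supported keywords:

```text
# llama2-assistant.Modelfile (hypothetical example)
FROM llama-2-7b

SYSTEM "You are a concise, helpful assistant."

PARAMETER temperature 0.7
PARAMETER top_p 0.9
```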
```shell
# Download from HuggingFace
$ koder-ai-runtime pull TheBloke/Llama-2-7B-GGUF

# Download a specific quantization
$ koder-ai-runtime pull TheBloke/Llama-2-7B-GGUF \
    llama-2-7b.Q4_K_M.gguf

# Download from direct URL
$ koder-ai-runtime pull https://example.com/model.gguf
```
```shell
# Start the server
$ koder-ai-runtime serve -m llama-2-7b -p 7802

# Use with any OpenAI client
$ curl http://localhost:7802/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
```shell
$ koder-ai-runtime run llama-2-7b
Model: huggingface/TheBloke--Llama-2-7B-GGUF
Type your message and press Enter.

You: What is the meaning of life?
Assistant: The meaning of life is a deeply personal question that philosophers have debated for millennia...
```
```shell
# Quantize a model
$ koder-ai-runtime quantize model.gguf Q4_K_M

# Fine-tune with LoRA
$ koder-ai-runtime finetune \
  -m llama-2-7b \
  -d dataset.jsonl \
  --method qlora \
  --epochs 3

# Benchmark performance
$ koder-ai-runtime benchmark -m llama-2-7b
```
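The fine-tuning dataset is a JSONL file with one training example per line. A plausible shape, assuming a chat-style schema (check the backend documentation for the exact fields unsloth or axolotl expect):

```jsonl
{"messages": [{"role": "user", "content": "What is GGUF?"}, {"role": "assistant", "content": "GGUF is a binary file format for storing quantized model weights."}]}
{"messages": [{"role": "user", "content": "Summarize LoRA in one sentence."}, {"role": "assistant", "content": "LoRA fine-tunes a model by training small low-rank adapter matrices instead of the full weights."}]}
```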
Drop-in replacement. Works with any OpenAI SDK or client library.
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (supports streaming via SSE) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Generate text embeddings |
| GET | /health | Health check endpoint |
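When streaming is enabled on /v1/chat/completions, the server replies with OpenAI-style server-sent events: `data:` lines carrying JSON chunks, terminated by `data: [DONE]`. A minimal sketch of assembling the streamed text from those frames (the sample payloads below are illustrative):

```python
import json

def collect_stream(sse_lines):
    """Concatenate content deltas from OpenAI-style chat-completion SSE frames."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # first delta may carry only a role
    return "".join(text)

# Illustrative frames, as a client would receive them over SSE.
frames = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world!"}}]}',
    'data: [DONE]',
]
print(collect_stream(frames))  # prints "Hello, world!"
```

In practice the OpenAI SDKs handle this parsing for you; the sketch only shows what travels over the wire.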