Download, serve, quantize, and fine-tune AI models. From laptop to cluster. OpenAI-compatible API out of the box.
Download models from HuggingFace, Koder AI Hub, or any direct URL. Supports GGUF and safetensors formats with progress tracking.
Expose any model via a fully OpenAI-compatible REST API — a drop-in replacement supporting /v1/chat/completions, streaming, and embeddings.
Chat with models directly from your terminal. Full conversation history, streaming responses, and special commands.
Reduce model size with GGUF quantization. Supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, and other quantization levels.
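As a rough illustration of the size reduction, on-disk size scales with bits per weight. The figures below are approximate rules of thumb (actual sizes vary by model architecture and quantization implementation):

```python
# Approximate on-disk size of a 7B-parameter model at different precisions.
# Bits-per-weight values are rough rule-of-thumb figures, not exact.
PARAMS = 7_000_000_000

bits_per_weight = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
}

for name, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:8s} ~{gib:.1f} GiB")
```

A 7B model drops from roughly 13 GiB at FP16 to around 4 GiB at Q4_K_M, at some cost in output quality.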
Fine-tune models on your own data with LoRA and QLoRA methods. Powered by unsloth or axolotl backends.
Measure inference performance: tokens/sec, latency percentiles (p50/p95/p99), and throughput across multiple prompts.
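The percentile and throughput figures can be derived from raw per-request timings. A minimal sketch of the arithmetic (the sample numbers below are made up, not output from the tool):

```python
# Hypothetical per-request latencies (ms) and generated-token counts,
# as might be collected while replaying a set of benchmark prompts.
latencies_ms = [112, 98, 130, 105, 340, 101, 99, 150, 120, 95]
tokens_generated = [64] * len(latencies_ms)

def percentile(data, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(data)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Aggregate throughput: total tokens divided by total wall-clock time.
tokens_per_sec = sum(tokens_generated) / (sum(latencies_ms) / 1000)

print(f"p50={p50} ms  p95={p95} ms  p99={p99} ms  {tokens_per_sec:.0f} tok/s")
```

Note how a single slow request (340 ms) barely moves the p50 but dominates the tail percentiles — which is exactly why benchmarks report p95/p99 alongside the median.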
Auto-detect NVIDIA (CUDA) and AMD (ROCm) GPUs. Monitor VRAM usage, utilization, and temperature in real time.
Same binary works everywhere. Run on your laptop for development, deploy to a GPU cluster for production.
Define model configurations with Modelfiles. Set system prompts, parameters, and templates in a simple text format.
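A Modelfile might look like the following. The directive names here are illustrative, assuming an Ollama-style syntax — consult the CLI's own documentation for the exact supported keywords:

```text
# llama2-assistant.Modelfile (hypothetical example)
FROM llama-2-7b

SYSTEM "You are a concise, helpful assistant."

PARAMETER temperature 0.7
PARAMETER top_p 0.9
```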
```shell
# Download from HuggingFace
$ koder-ai-runtime pull TheBloke/Llama-2-7B-GGUF

# Download a specific quantization
$ koder-ai-runtime pull TheBloke/Llama-2-7B-GGUF \
    llama-2-7b.Q4_K_M.gguf

# Download from direct URL
$ koder-ai-runtime pull https://example.com/model.gguf
```
```shell
# Start the server
$ koder-ai-runtime serve -m llama-2-7b -p 7802

# Use with any OpenAI client
$ curl http://localhost:7802/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
```shell
$ koder-ai-runtime run llama-2-7b
Model: huggingface/TheBloke--Llama-2-7B-GGUF
Type your message and press Enter.

You: What is the meaning of life?
Assistant: The meaning of life is a deeply personal question that philosophers have debated for millennia...
```
```shell
# Quantize a model
$ koder-ai-runtime quantize model.gguf Q4_K_M

# Fine-tune with LoRA
$ koder-ai-runtime finetune \
  -m llama-2-7b \
  -d dataset.jsonl \
  --method qlora \
  --epochs 3

# Benchmark performance
$ koder-ai-runtime benchmark -m llama-2-7b
```
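The fine-tuning dataset is a JSONL file with one training example per line. A plausible shape, assuming a chat-style schema (check the backend documentation for the exact fields unsloth or axolotl expect):

```jsonl
{"messages": [{"role": "user", "content": "What is GGUF?"}, {"role": "assistant", "content": "GGUF is a binary file format for storing quantized model weights."}]}
{"messages": [{"role": "user", "content": "Summarize LoRA in one sentence."}, {"role": "assistant", "content": "LoRA fine-tunes a model by training small low-rank adapter matrices instead of the full weights."}]}
```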
Drop-in replacement. Works with any OpenAI SDK or client library.
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (supports streaming via SSE) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Generate text embeddings |
| GET | /health | Health check endpoint |
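When streaming is enabled on /v1/chat/completions, the server replies with OpenAI-style server-sent events: `data:` lines carrying JSON chunks, terminated by `data: [DONE]`. A minimal sketch of assembling the streamed text from those frames (the sample payloads below are illustrative):

```python
import json

def collect_stream(sse_lines):
    """Concatenate content deltas from OpenAI-style chat-completion SSE frames."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # first delta may carry only a role
    return "".join(text)

# Illustrative frames, as a client would receive them over SSE.
frames = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world!"}}]}',
    'data: [DONE]',
]
print(collect_stream(frames))  # prints "Hello, world!"
```

In practice the OpenAI SDKs handle this parsing for you; the sketch only shows what travels over the wire.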