How to Run LLMs Locally with an OpenAI-Compatible Server

Install Octomil and start a local LLM inference server that speaks the OpenAI API. Run Llama, Gemma, Phi, and other models on your own hardware in one command.

Running an LLM locally means the model executes on your own hardware (laptop, desktop, or server) instead of calling a cloud API. The fastest way: install Octomil and run octomil serve llama-3.2-3b. That gives you an OpenAI-compatible inference server at http://localhost:8080/v1. Existing code that talks to https://api.openai.com/v1 works against it after you change base_url and pass a placeholder API key.

What you get: no per-request API spend, no rate limits, no data leaving the machine, and no network round-trip added to each request. What used to be the catch (setting up runtimes, downloading weights, picking a quantization, wiring up an API layer) collapses into one command.

This guide walks through setup on macOS, Linux, and Windows, running your first model, and connecting it to real application code.


Install Octomil

macOS and Linux

Open a terminal and run:

curl -fsSL https://get.octomil.com | sh

This installs the octomil CLI to /usr/local/bin (or ~/.local/bin if that isn’t writable), with the runtime bundle under ~/.local/lib/octomil. Requires macOS 14+ on Apple Silicon, or Ubuntu 20.04+ / Debian 11+ / Fedora 36+ on x86_64. On Intel Macs, install with pip install octomil.

Windows

Open PowerShell and run:

irm https://octomil.com/install.ps1 | iex

Requires Windows 10 or later with PowerShell 5.1+.

Verify the installation

octomil --version

You should see a version string like octomil 0.x.y. If the command isn’t found, restart your terminal so the PATH update takes effect.


Start a local inference server

octomil serve llama-3.2-3b

That’s it. Octomil downloads the model if needed, starts the inference engine, and exposes an OpenAI-compatible API at http://localhost:8080/v1.

You’ll see output like:

Model:    llama-3.2-3b
Runtime:  llama.cpp (Metal)
Endpoint: http://localhost:8080/v1
Status:   ready

Choosing a different model

octomil serve gemma3-1b
octomil serve phi-4-mini
octomil serve llama-3.2-1b

Run octomil list models to see available models. Octomil handles downloading weights, selecting the right quantization for your hardware, and configuring the runtime.


Talk to it with curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [
      {"role": "user", "content": "Explain TCP in one paragraph."}
    ]
  }'

The response follows the standard OpenAI Chat Completions format:

{
  "id": "chatcmpl-local-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "TCP (Transmission Control Protocol) is a connection-oriented protocol..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 87,
    "total_tokens": 99
  }
}
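
The curl call and the response shown above can also be reproduced with Python's standard library alone. The sketch below is illustrative rather than an official client: the helper names build_chat_request, extract_reply, and ask are hypothetical, and it assumes the server is reachable at the endpoint used throughout this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # endpoint used in this guide


def build_chat_request(prompt, model="llama-3.2-3b", base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return (assistant text, total token count) from a Chat Completions body."""
    data = json.loads(response_body)
    text = data["choices"][0]["message"]["content"]
    total = data.get("usage", {}).get("total_tokens")  # None if usage is absent
    return text, total


def ask(prompt):
    """Send the request to the running local server and parse the reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(resp.read())
```

With the server running, ask("Explain TCP in one paragraph.") returns the assistant text together with the total token count reported in the usage block.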

Use the OpenAI SDK (Python)

If your application already uses the OpenAI Python SDK, switching to local inference requires changing two lines:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # local server doesn't require auth
)

response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

print(response.choices[0].message.content)

The base_url and api_key are the only changes. Everything else — streaming, function calling, message format — works the same way.
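
Because only the constructor arguments differ, one way to keep a single code path for local and cloud inference is to build those arguments from the environment. This is a minimal sketch under stated assumptions: the USE_LOCAL_LLM variable and the client_kwargs helper are hypothetical names chosen for illustration, while OPENAI_API_KEY is the variable conventionally used for the cloud key.

```python
import os

LOCAL_BASE_URL = "http://localhost:8080/v1"  # local endpoint from this guide


def client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...) based on the environment.

    USE_LOCAL_LLM is a hypothetical switch for this example:
    "1" (the default) targets the local server, anything else the cloud API.
    """
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM", "1") == "1":
        # The local server does not check the key, but the SDK expects a value.
        return {"base_url": LOCAL_BASE_URL, "api_key": "not-needed"}
    return {"api_key": env["OPENAI_API_KEY"]}


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())
```

Keeping the switch in one helper means the rest of the application never needs to know which backend is in use.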

Streaming

stream = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Use the OpenAI SDK (Node.js)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed",
});

const completion = await client.chat.completions.create({
  model: "llama-3.2-3b",
  messages: [{ role: "user", content: "Summarize quantum computing in two sentences." }],
});

console.log(completion.choices[0].message.content);

What’s happening under the hood

When you run octomil serve llama-3.2-3b, the CLI:

  1. Selects the runtime — llama.cpp on CPU/Metal/CUDA, or a hardware-specific backend if available
  2. Downloads and caches the model — weights are stored in ~/.octomil/models/ and reused across runs
  3. Applies quantization — picks the best quantization scheme for your hardware (Q4_K_M on machines with 8GB+ RAM, Q3_K_S on constrained devices)
  4. Starts the HTTP server — OpenAI-compatible API on port 8080, with health check at /health

The server runs in the foreground. Stop it with Ctrl+C.
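
Since the first run may spend time downloading weights, a script that launches the server may want to wait until it answers before sending requests. The following is a minimal sketch, assuming the /health endpoint mentioned above returns a 2xx status once the server is ready; the shape of its response body is not documented here, so only the status code is checked. The function names are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # health check path from this guide


def http_ok(url=HEALTH_URL, timeout=2.0):
    """Return True if a GET on url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe=http_ok, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() before the first request avoids connection errors during startup; the probe and sleep parameters are injectable so the retry logic can be exercised without a running server.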


Hardware requirements

Model               Min RAM   Recommended   Tokens/sec (M3 MacBook Air)
Llama 3.2 1B (Q4)   2 GB      4 GB          ~45 tok/s
Llama 3.2 3B (Q4)   4 GB      8 GB          ~28 tok/s
Gemma 2 2B (Q4)     3 GB      6 GB          ~35 tok/s
Phi-3 Mini (Q4)     4 GB      8 GB          ~30 tok/s
Llama 3.1 8B (Q4)   6 GB      16 GB         ~15 tok/s

Performance scales with available memory bandwidth. Apple Silicon Macs with unified memory perform particularly well. Machines with NVIDIA GPUs (CUDA) see additional speedups.
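
As a rough back-of-the-envelope check on the table, the size of the weights is approximately parameters times bits per weight divided by 8, and generation time is completion tokens divided by throughput. The sketch below is only an estimate: real 4-bit quantization schemes average somewhat more than 4 bits per weight, and the runtime needs additional memory for context and buffers, which is one reason minimum RAM exceeds the raw weight size.

```python
def weights_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def generation_seconds(completion_tokens, tokens_per_second):
    """Approximate time to generate a reply, ignoring prompt processing."""
    return completion_tokens / tokens_per_second


# A 3B-parameter model at a nominal 4 bits per weight:
print(weights_size_gb(3, 4))  # 1.5

# An 87-token reply at the ~28 tok/s listed for Llama 3.2 3B:
print(round(generation_seconds(87, 28), 1))  # 3.1
```

The same arithmetic gives about 4 GB of weights for an 8B model at a nominal 4 bits, before any runtime overhead.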


When you outgrow one machine

A local inference server is enough for prototyping, development, and single-user applications. When your project moves beyond a single machine — deploying to a fleet of devices, managing rollouts across mobile apps, routing between local and cloud inference, or monitoring model performance across real users — that’s when the Octomil platform takes over.

The CLI you just installed is the same entry point. The transition from local server to managed fleet deployment doesn’t require switching tools or rewriting integration code.


FAQ

Can I use this with LangChain, LlamaIndex, or other frameworks?

Yes. Any framework that uses the OpenAI SDK or makes HTTP requests to an OpenAI-compatible endpoint works without modification. Point the base URL at http://localhost:8080/v1.

How does this compare to Ollama?

Both run models locally with a simple CLI, and both can serve an OpenAI-compatible API. Octomil additionally provides a path to multi-device deployment, routing, and fleet operations when your project grows beyond one machine.

How does this compare to LM Studio?

LM Studio provides a desktop GUI for chatting with local models. Octomil is a CLI and API server designed for integration into applications and deployment pipelines. They target different workflows — LM Studio for interactive exploration, Octomil for building products.

Can I run multiple models simultaneously?

Yes. Start multiple octomil serve instances on different ports. Each server runs in the foreground, so use a separate terminal for each:

octomil serve llama-3.2-3b --port 8080
octomil serve gemma3-1b --port 8081
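
With one server per model, client code needs to know which port serves which model. A small lookup table is one way to handle that. This sketch assumes the two instances started above; MODEL_PORTS and base_url_for are hypothetical names for illustration.

```python
# Model name -> port, matching the two commands above.
MODEL_PORTS = {
    "llama-3.2-3b": 8080,
    "gemma3-1b": 8081,
}


def base_url_for(model, host="localhost"):
    """Return the OpenAI-compatible base URL of the server hosting `model`."""
    try:
        port = MODEL_PORTS[model]
    except KeyError:
        raise ValueError(f"no local server configured for model {model!r}") from None
    return f"http://{host}:{port}/v1"


print(base_url_for("gemma3-1b"))  # http://localhost:8081/v1
```

The returned URL can be passed as base_url when constructing a client for that model.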

Does this work offline?

After the initial model download, yes. No internet connection is required for inference. Models are cached locally at ~/.octomil/models/.

What model formats are supported?

Octomil supports GGUF (llama.cpp), CoreML, ONNX, and ExecuTorch model formats. The CLI automatically selects the best format and quantization level for your hardware.