{% extends "base.html" %} {% block title %}LLM Settings - AUTARCH{% endblock %} {% block content %}

Active Backend

Configured backend: {{ llm_backend }} — select a tab, fill in settings, and click Save & Activate, then Load Model to initialise.

Not loaded — click Load Model to initialise the current backend.

Local GGUF models may take 10–60 s to load depending on size. The page will wait — check the Debug Log for live output.

Model Selection

Scans for .gguf, .ggml, .bin files and SafeTensors model directories.

llama.cpp Parameters

GPU / Compute Backend

CUDA
7.5 GB VRAM
NVIDIA GPU
n_gpu_layers = -1
Intel · Vulkan
15 GB RAM
Intel GPU / iGPU
n_gpu_layers = -1
CPU Only
24 GB RAM
Any CPU
n_gpu_layers = 0

Context & Threading

Token window. Higher = more memory.
Match physical cores.
-1 = all, 0 = CPU only.
Prompt-processing batch size.
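As a rough sketch of how the backend cards and the context/threading fields above map onto loader arguments, assuming the llama-cpp-python bindings (the helper name and defaults here are illustrative, not the page's):

```python
def llama_load_kwargs(model_path, n_gpu_layers=-1, n_ctx=4096,
                      n_threads=8, n_batch=512):
    """Assemble constructor kwargs for llama_cpp.Llama."""
    return {
        "model_path": model_path,
        "n_gpu_layers": n_gpu_layers,  # -1 = offload all layers, 0 = CPU only
        "n_ctx": n_ctx,                # token window; higher = more memory
        "n_threads": n_threads,        # match physical core count
        "n_batch": n_batch,            # prompt-processing batch size
    }

# With llama-cpp-python installed, loading would then be:
#   from llama_cpp import Llama
#   llm = Llama(**llama_load_kwargs("models/my-model.gguf"))
```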

Sampling / Generation

0 = deterministic, 1 = creative.
Nucleus sampling threshold.
0 = disabled.
Penalises repeated tokens.
Max new tokens generated.
-1 = random.
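The sampling fields above translate into per-request generation kwargs. A minimal sketch, assuming llama-cpp-python's `create_completion` parameter names (the defaults shown are illustrative):

```python
def sampling_kwargs(temperature=0.8, top_p=0.95, top_k=40,
                    repeat_penalty=1.1, max_tokens=512, seed=-1):
    """Per-request sampling settings for generation."""
    return {
        "temperature": temperature,        # 0 = deterministic, ~1 = creative
        "top_p": top_p,                    # nucleus sampling threshold
        "top_k": top_k,                    # 0 = top-k filtering disabled
        "repeat_penalty": repeat_penalty,  # >1 penalises repeated tokens
        "max_tokens": max_tokens,          # cap on new tokens generated
        "seed": seed,                      # -1 = random seed each run
    }

# e.g.  llm.create_completion(prompt, **sampling_kwargs(temperature=0.2))
```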

RoPE & Mirostat

Extends context via RoPE.
Replaces top-p/k sampling.
Target entropy (5.0 default).
Learning rate (0.1 default).
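RoPE scaling applies at model-load time, while Mirostat applies at sampling time and, when enabled, takes over from top-p/top-k. A sketch of how the two groups separate, using llama-cpp-python's parameter names as an assumption:

```python
def rope_mirostat_kwargs(rope_freq_scale=1.0, mirostat_mode=0,
                         mirostat_tau=5.0, mirostat_eta=0.1):
    """Split load-time (RoPE) and sample-time (Mirostat) settings."""
    load = {
        # Values < 1.0 stretch RoPE frequencies to extend the usable
        # context window, e.g. 0.5 roughly doubles it.
        "rope_freq_scale": rope_freq_scale,
    }
    sample = {
        "mirostat_mode": mirostat_mode,  # 0 = off, 1/2 = Mirostat v1/v2
        "mirostat_tau": mirostat_tau,    # target entropy (default 5.0)
        "mirostat_eta": mirostat_eta,    # learning rate (default 0.1)
    }
    return load, sample
```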

Performance Flags

Transformers / SafeTensors Parameters

Device & Precision

auto (default) — let transformers decide  |  cpu — CPU only  |  cuda:0 — single GPU. Enable FP32 CPU Offload above if 8-bit gives a "dispatched on CPU" error.
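A hedged sketch of how the device and 8-bit settings above would feed `AutoModelForCausalLM.from_pretrained` (assuming transformers with bitsandbytes installed; `llm_int8_enable_fp32_cpu_offload` is the flag behind the FP32 CPU Offload toggle):

```python
def transformers_load_kwargs(device_map="auto", load_in_8bit=False,
                             fp32_cpu_offload=False):
    """Kwargs for AutoModelForCausalLM.from_pretrained()."""
    kwargs = {"device_map": device_map}  # "auto", "cpu", or "cuda:0"
    if load_in_8bit:
        # Requires bitsandbytes. Enabling FP32 CPU offload avoids the
        # "dispatched on CPU" error when some modules don't fit on GPU.
        from transformers import BitsAndBytesConfig
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_enable_fp32_cpu_offload=fp32_cpu_offload,
        )
    return kwargs

# e.g.  AutoModelForCausalLM.from_pretrained(repo_id,
#           **transformers_load_kwargs(load_in_8bit=True,
#                                      fp32_cpu_offload=True))
```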

Tokenizer

left = causal LM standard (pad on the left for generation).

Generation

0 = disabled.
1 = greedy or sampling (no beam search).

Claude API

Requires an Anthropic account. Get your API key from the console.

Stored in autarch_settings.conf — keep it safe.
0–1. Claude default is 1.
Use with lower temp.
0 = disabled.
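A minimal sketch of how these fields become a Messages API request body, assuming the official anthropic Python package (the model ID below is a placeholder; "0 = disabled" for top-k means the parameter is simply omitted):

```python
def claude_request(prompt, model="claude-sonnet-4-0", temperature=1.0,
                   top_p=None, top_k=0, max_tokens=1024):
    """Build the request body for Anthropic's Messages API."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0-1; Claude's default is 1
        "messages": [{"role": "user", "content": prompt}],
    }
    if top_p is not None:
        body["top_p"] = top_p        # pair with a lower temperature
    if top_k > 0:
        body["top_k"] = top_k        # 0 in the UI = leave the param out
    return body

# With the anthropic package and an API key configured:
#   import anthropic
#   msg = anthropic.Anthropic().messages.create(**claude_request("Hello"))
```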

OpenAI API

Also compatible with any OpenAI-format endpoint: LiteLLM, Ollama (/v1), vLLM, LocalAI, etc. Just set the Base URL to your local server.

Leave blank for local servers (Ollama, vLLM, etc.).
Change to point to local LLM servers.
Type any model ID, or pick from suggestions.
0–2. 1 = default.
Reduce repetition.
Encourage new topics.
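Pointing the Base URL at a local server is all it takes to reuse the OpenAI client. A sketch, assuming the official openai Python package and an Ollama instance serving its OpenAI-compatible /v1 routes (the model name is a placeholder):

```python
def openai_client_config(api_key="", base_url="https://api.openai.com/v1"):
    """Client settings; set base_url to any OpenAI-format endpoint."""
    # Local servers (Ollama, vLLM, LocalAI) usually ignore the key,
    # but the client still requires a non-empty string.
    return {
        "api_key": api_key or "not-needed",
        "base_url": base_url,
    }

# e.g. against a local Ollama server:
#   from openai import OpenAI
#   client = OpenAI(**openai_client_config(base_url="http://localhost:11434/v1"))
#   client.chat.completions.create(model="llama3.1",
#                                  messages=[{"role": "user", "content": "Hi"}],
#                                  temperature=1.0,       # 0-2, default 1
#                                  frequency_penalty=0.0, # reduce repetition
#                                  presence_penalty=0.0)  # encourage new topics
```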

HuggingFace Inference API

Account

Get yours at huggingface.co/settings/tokens

Model

Browse HF
Full model repo ID (e.g. meta-llama/Llama-3.1-8B-Instruct).
Provider used for serverless inference. Some require separate accounts.
Overrides provider. Use for Dedicated Inference Endpoints.

Generation Parameters

0 = disabled.
-1 = random.
Tokens that will stop generation.
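A sketch of how the account and model fields above would configure a client, assuming huggingface_hub's `InferenceClient` (in recent versions an endpoint URL can be passed in place of a repo ID, which is how the Endpoint URL field overrides the provider):

```python
def hf_client_kwargs(repo_id, token, provider=None, endpoint_url=None):
    """Kwargs for huggingface_hub.InferenceClient."""
    kwargs = {"token": token}
    if endpoint_url:
        # A dedicated Inference Endpoint URL overrides the provider.
        kwargs["model"] = endpoint_url
    else:
        kwargs["model"] = repo_id          # e.g. "meta-llama/Llama-3.1-8B-Instruct"
        if provider:
            kwargs["provider"] = provider  # serverless inference provider
    return kwargs

# e.g.  InferenceClient(**hf_client_kwargs(repo, token)).text_generation(
#           prompt, max_new_tokens=512, temperature=0.8, seed=None)
```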
{% endblock %}