{% extends "base.html" %} {% block title %}LLM Settings - AUTARCH{% endblock %} {% block content %}

Active Backend

Configured backend: {{ llm_backend }} — select a tab, fill in its settings, click Save & Activate, then click Initialize LLM to load the backend.

Not loaded — click Initialize LLM to load the current backend.

Local GGUF models may take 10–60 s to load depending on size. The page will wait — check the Debug Log for live output.

Model Selection

Scans for .gguf, .ggml, .bin files and SafeTensors model directories.

llama.cpp Parameters

GPU / Compute Backend

CUDA
7.5 GB VRAM
NVIDIA GPU
n_gpu_layers = -1
Intel · Vulkan
15 GB RAM
Intel GPU / iGPU
n_gpu_layers = -1
CPU Only
24 GB RAM
Any CPU
n_gpu_layers = 0

Context & Threading

Token window. Higher = more memory.
Match physical cores.
-1 = all, 0 = CPU only.
Prompt processing batch.
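As a minimal sketch of how the backend cards and the context/threading settings above might map onto llama-cpp-python's `Llama(...)` constructor keywords (the keyword names come from llama-cpp-python; the exact AUTARCH wiring is an assumption):

```python
def llama_init_kwargs(backend: str, n_ctx: int = 4096,
                      n_threads: int = 8, n_batch: int = 512) -> dict:
    """Build Llama(...) keyword arguments for a given compute backend."""
    # -1 offloads all layers to the GPU (CUDA / Vulkan); 0 keeps everything on the CPU.
    n_gpu_layers = 0 if backend == "cpu" else -1
    return {
        "n_gpu_layers": n_gpu_layers,  # -1 = all layers on GPU, 0 = CPU only
        "n_ctx": n_ctx,                # token window; higher = more memory
        "n_threads": n_threads,        # match physical cores
        "n_batch": n_batch,            # prompt-processing batch size
    }

print(llama_init_kwargs("cuda"))
```

These kwargs would then be passed as `Llama(model_path=..., **llama_init_kwargs("cuda"))`.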

Sampling / Generation

0 = deterministic, 1 = creative.
Nucleus sampling threshold.
0 = disabled.
Penalises repeated tokens.
Max new tokens generated.
-1 = random.
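The sampling settings above can be collected and sanity-checked as in this sketch; the keyword names follow llama-cpp-python's generation call, and the validation ranges are assumptions based on the help text:

```python
def generation_params(temperature: float = 0.7, top_p: float = 0.95,
                      top_k: int = 40, repeat_penalty: float = 1.1,
                      max_tokens: int = 512, seed: int = -1) -> dict:
    """Validate and collect sampling settings for a generation call."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0 (0 = deterministic)")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if top_k < 0:
        raise ValueError("top_k must be >= 0 (0 = disabled)")
    return {
        "temperature": temperature,        # 0 = deterministic, 1 = creative
        "top_p": top_p,                    # nucleus sampling threshold
        "top_k": top_k,                    # 0 = disabled
        "repeat_penalty": repeat_penalty,  # penalises repeated tokens
        "max_tokens": max_tokens,          # max new tokens generated
        "seed": seed,                      # -1 = random
    }
```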

RoPE & Mirostat

Extends context via RoPE.
Replaces top-p/k sampling.
Target entropy (5.0 default).
Learning rate (0.1 default).
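To make the tau/eta settings concrete, here is a toy sketch of the Mirostat feedback rule they control: the truncation threshold mu is nudged toward the target entropy tau at learning rate eta after each sampled token (a simplified illustration of the algorithm, not AUTARCH's implementation):

```python
def mirostat_step(mu: float, observed_surprise: float,
                  tau: float = 5.0, eta: float = 0.1) -> float:
    """Move the truncation threshold mu toward the target entropy tau."""
    return mu - eta * (observed_surprise - tau)

mu = 10.0
for surprise in [8.0, 6.0, 5.5, 5.2]:  # surprise of each sampled token
    mu = mirostat_step(mu, surprise)   # mu converges toward tau
```

A larger eta adapts faster but oscillates more; the 5.0/0.1 defaults above are the commonly cited Mirostat settings.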

Performance Flags

Transformers / SafeTensors Parameters

Device & Precision

auto (default) — let transformers decide  |  cpu — CPU only  |  cuda:0 — single GPU. Enable FP32 CPU Offload above if 8-bit gives a "dispatched on CPU" error.
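A sketch of how these choices might translate into `from_pretrained(...)` keyword arguments; in real code the quantization dict would be a `transformers.BitsAndBytesConfig`, and `llm_int8_enable_fp32_cpu_offload` is the flag behind the "dispatched on CPU" workaround mentioned above (the helper itself is hypothetical):

```python
def transformers_load_kwargs(device_map: str = "auto",
                             load_in_8bit: bool = False,
                             fp32_cpu_offload: bool = False) -> dict:
    """Build model-loading kwargs for the device/precision settings."""
    kwargs = {"device_map": device_map}  # "auto", "cpu", or "cuda:0"
    if load_in_8bit:
        kwargs["quantization_config"] = {
            "load_in_8bit": True,
            # Enable when 8-bit loading fails with a "dispatched on CPU" error:
            "llm_int8_enable_fp32_cpu_offload": fp32_cpu_offload,
        }
    return kwargs
```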

Tokenizer

left = causal LM standard.

Generation

0 = disabled.
1 = greedy/sampling.

Claude API

Requires an Anthropic API key. AUTARCH calls Claude directly for chat, agent, and analysis tasks.

Stored in autarch_settings.conf — keep it safe.
Click Refresh to fetch available models from the API.
0–1. Claude default is 1.
Use with lower temp.
0 = disabled.
{% if llm_backend == 'claude' %}Active — {{ claude.model }}{% else %}Not active{% endif %}

Save stores settings + API key to encrypted vault. Reload re-tests the connection without changing settings. Activate saves + loads Claude as the active backend.
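For reference, a direct Claude call is a POST to the Anthropic Messages API. This stdlib sketch builds (but does not send) such a request; the endpoint, headers, and body shape follow Anthropic's public API, while the function itself is illustrative:

```python
import json
import urllib.request

def build_claude_request(api_key: str, model: str, prompt: str,
                         temperature: float = 1.0,
                         max_tokens: int = 1024) -> urllib.request.Request:
    """Build an Anthropic Messages API request (not sent here)."""
    body = json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0–1; Claude's default is 1
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "https://api.anthropic.com/v1/messages",
        data=body,
        headers={
            "x-api-key": api_key,  # the key stored by Save above
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )
```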

OpenAI API

Also compatible with any OpenAI-format endpoint: LiteLLM, Ollama (/v1), vLLM, LocalAI, etc. Just set the Base URL to your local server.

Leave blank for local servers (Ollama, vLLM, etc.).
Change to point to local LLM servers.
Type any model ID, or pick from suggestions.
0–2. 1 = default.
Reduce repetition.
Encourage new topics.
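The Base URL setting works because every OpenAI-format server exposes the same `/chat/completions` path. This sketch builds the URL and JSON body for such a call (field names follow the OpenAI Chat Completions API; the helper is hypothetical):

```python
def chat_request(model: str, prompt: str, *,
                 base_url: str = "https://api.openai.com/v1",
                 temperature: float = 1.0,
                 frequency_penalty: float = 0.0,
                 presence_penalty: float = 0.0):
    """Build the URL and body for an OpenAI-format chat completion."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,  # any model ID the server accepts
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,              # 0–2, default 1
        "frequency_penalty": frequency_penalty,  # reduce repetition
        "presence_penalty": presence_penalty,    # encourage new topics
    }
    return url, body
```

Pointing `base_url` at, for example, `http://localhost:11434/v1` would target a local Ollama server with no API key required.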

HuggingFace Inference API

Account

Get yours at huggingface.co/settings/tokens

Model

Browse HF
Full model repo ID (e.g. meta-llama/Llama-3.1-8B-Instruct).
Provider used for serverless inference. Some require separate accounts.
Overrides provider. Use for Dedicated Inference Endpoints.

Generation Parameters

0 = disabled.
-1 = random.
Tokens that will stop generation.
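A sketch of how the HF settings above might assemble into a serverless inference request: the URL pattern and `inputs`/`parameters` body shape follow the Hugging Face Inference API for text generation, while the omit-when-disabled logic mirrors the help text (the helper itself is an assumption):

```python
def hf_inference_request(repo_id: str, prompt: str, *,
                         max_new_tokens: int = 256, top_k: int = 0,
                         seed: int = -1, stop: tuple = ()):
    """Build the URL and JSON body for a HF serverless inference call."""
    url = f"https://api-inference.huggingface.co/models/{repo_id}"
    parameters = {"max_new_tokens": max_new_tokens}
    if top_k > 0:              # 0 = disabled
        parameters["top_k"] = top_k
    if seed >= 0:              # -1 = random
        parameters["seed"] = seed
    if stop:                   # tokens that will stop generation
        parameters["stop"] = list(stop)
    return url, {"inputs": prompt, "parameters": parameters}
```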

Agent Configuration

Configure the AI agent backends used by Agent Hal and Autonomy. Agents use the THOUGHT/ACTION/PARAMS loop to accomplish tasks with tools.

Uses the currently loaded LLM backend (configured above) for agent operations. Best for offline or privacy-sensitive work.

Maximum tool-use steps per task.
Show step-by-step agent reasoning.
Active agent backend: {{ agents.backend }} — the agent will use this backend when processing tasks.
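To illustrate the THOUGHT/ACTION/PARAMS loop named above, here is a minimal parser for one agent step; the exact line format and JSON-encoded PARAMS are assumptions about how the agent emits steps:

```python
import json
import re

def parse_agent_step(text: str) -> dict:
    """Extract the THOUGHT, ACTION, and PARAMS fields from one agent step."""
    m = re.search(
        r"THOUGHT:\s*(?P<thought>.*?)\s*"
        r"ACTION:\s*(?P<action>\S+)\s*"
        r"PARAMS:\s*(?P<params>\{.*\})",
        text, re.DOTALL)
    if not m:
        raise ValueError("not a THOUGHT/ACTION/PARAMS step")
    return {
        "thought": m["thought"],          # shown when verbose reasoning is on
        "action": m["action"],            # the tool to invoke
        "params": json.loads(m["params"]),  # tool arguments
    }
```

Counting parsed steps against the maximum tool-use steps setting would bound how long a task can run.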
{% endblock %}