Complete Setup Reference

Get up and running in three steps

Install the backend engine, connect it to NativeLab, then download a model that fits your hardware. The whole setup takes five to ten minutes.

Estimated setup time: 5-10 minutes
1. Install llama.cpp
2. Link in Server Tab
3. Download a Model

Watch the full setup in action

See every step demonstrated live — from downloading llama.cpp to running your first conversation locally. No cloud, no account, everything on your machine.

Video: NativeLab Pro — Complete Setup Walkthrough
Step 1 · Engine · Download llama.cpp

NativeLab uses llama.cpp as its inference engine. You need to download the pre-built release that matches your operating system and hardware.

Official source
Visit the GitHub releases page to get the latest build:

github.com/ggml-org/llama.cpp/releases
Latest release
b5605 · llama.cpp

Scroll down past the changelog to find the Assets section. Expand it if collapsed, then download the file matching your system below.

  • Windows AVX2: llama-b5605-bin-win-avx2-x64.zip
  • Windows AVX: llama-b5605-bin-win-avx-x64.zip
  • macOS Apple Silicon: llama-b5605-bin-macos-arm64.zip
  • macOS Intel: llama-b5605-bin-macos-x64.zip
  • Linux x64: llama-b5605-bin-ubuntu-x64.tar.gz
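If you prefer the command line, release assets can also be fetched directly. A minimal sketch using the b5605 macOS Apple Silicon build as an example (swap in the asset name for your platform, and a newer build number if one is available; the Assets list on the releases page links to the exact file if the URL pattern differs):

  # download and extract a release asset (example: b5605, macOS Apple Silicon)
  curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b5605/llama-b5605-bin-macos-arm64.zip
  unzip llama-b5605-bin-macos-arm64.zip -d llama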
Windows — Choose the right build
Download avx2 if your CPU is from 2013 or later (Intel Haswell+ / AMD Ryzen). Use avx for older machines. If you have an NVIDIA GPU, grab the cuda variant for much faster inference.
  1. Download llama-bXXXX-bin-win-avx2-x64.zip
  2. Extract the zip — you'll get a folder of .exe files
  3. Note the path to llama-cli.exe and llama-server.exe
  4. No installer needed — just remember where they are
macOS — Apple Silicon vs Intel
Download arm64 for any Mac with an M1/M2/M3/M4 chip. Download x64 for older Intel-based Macs. Apple Silicon chips have unified memory shared between the CPU and GPU, which generally means much better inference performance per GB than standard PC RAM.
  1. Download llama-bXXXX-bin-macos-arm64.zip (M-series) or x64 (Intel)
  2. Unzip the archive — you'll get a folder of binaries
  3. On first run, macOS may show a security warning — go to System Settings > Privacy & Security and click Allow Anyway (or clear the quarantine flag from Terminal, as sketched after this list)
  4. Note the path to llama-cli and llama-server
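If you'd rather skip the System Settings dialog, you can clear the quarantine flag from Terminal instead. A minimal sketch, assuming you extracted the archive to ~/Downloads/llama (adjust the path to wherever your binaries actually live):

  # remove the quarantine attribute Gatekeeper adds to downloaded files
  xattr -dr com.apple.quarantine ~/Downloads/llama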
Linux — Ubuntu / Debian recommended
Download the ubuntu-x64 build for most distros. If you have an NVIDIA GPU with CUDA drivers installed, grab the cuda variant for hardware acceleration.
  1. Download llama-bXXXX-bin-ubuntu-x64.tar.gz
  2. Extract: tar -xzf llama-bXXXX-bin-ubuntu-x64.tar.gz
  3. Make executable: chmod +x llama-cli llama-server
  4. Note the full path — e.g. /home/user/llama/llama-cli
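To confirm the binaries run before pointing NativeLab at them, a quick sanity check from the extraction folder (paths here are an example; recent llama.cpp builds print their build number with --version):

  cd /home/user/llama          # wherever you extracted the archive
  ./llama-cli --version        # should print a build number such as b5605
  ./llama-server --version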

Pick the right build for your machine

Your CPU generation determines which instruction set extensions (AVX, AVX2, AVX-512) it supports, and therefore which build to pick. When in doubt, AVX2 covers the vast majority of hardware made after 2013.

  • Pre-2013 · Legacy CPU: avx build (32-bit float), no GPU support. Core 2 Duo, early Core i series. Very slow for large models.
  • 2013–2018 · Modern CPU: avx2 (recommended). Haswell / Ryzen 1st gen. Great for 7B–13B models on CPU. Most common setup.
  • 2019–now · Newer CPU: avx2 (recommended); avx512 on server CPUs. Fast inference. Ryzen 3000+, 10th-gen Intel+, Apple Silicon.
  • NVIDIA GPU · CUDA acceleration: cuda12 variant, 5–20x faster. GTX 10xx or newer. Needs CUDA drivers installed separately.
Apple Silicon tip
M1/M2/M3/M4 Macs have unified memory — RAM and GPU memory are shared. A 16 GB M2 Mac can comfortably run a 13B model that would require a dedicated 16 GB GPU on PC. The macOS arm64 build already uses Metal acceleration automatically.
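Not sure which instruction sets your CPU supports? On Linux and Intel Macs you can check from a terminal; on Windows a tool such as CPU-Z shows the same information. A rough sketch (on Apple Silicon these checks come back empty, which is fine, since the arm64 build doesn't use AVX at all):

  # Linux: list AVX-family flags reported by the CPU
  grep -m1 -o -E 'avx512f|avx2|avx' /proc/cpuinfo | sort -u

  # macOS (Intel): search the CPU feature flags
  sysctl -a | grep -i -E 'avx2|avx512'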
Step 2 · App Configuration · Point NativeLab to the binaries

Open NativeLab Pro and navigate to the Server tab. You'll browse to the two binaries you extracted in Step 1.

In the Server tab you'll find two path fields (example values from a Windows install):
  • llama-cli path: C:\llama\llama-cli.exe
  • llama-server path: C:\llama\llama-server.exe
Status indicator: llama-server is running · Port 8080 · Ready

Server mode vs CLI mode

NativeLab can talk to llama.cpp in two ways; the Server tab shows which one is currently active. Here's what each mode means for your workflow.

Server Mode · llama-server (HTTP)
Starts a local OpenAI-compatible API on port 8080. NativeLab connects to it over HTTP. The server stays running between messages.
  • Persistent context window
  • Streaming tokens
  • OpenAI-compatible /v1/chat
  • Pipeline & API pass-through
  • Faster multi-turn conversations
CLI Mode · llama-cli (subprocess)
Spawns llama-cli as a subprocess per generation. Simpler setup — no server process needed. Slower for multi-turn chat.
  • No persistent context
  • Simpler — no port needed
  • No streaming support
  • Not compatible with pipeline
  • Good for single prompts
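For context, a single llama-cli generation run by hand looks roughly like this (a sketch with an assumed model path and prompt; it is not the exact command NativeLab builds internally):

  ./llama-cli -m ~/models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -p "Explain unified memory in one sentence." \
    -n 128    # limit the response to 128 tokens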
Server Mode is recommended for most users. When it's working, the status banner reads: llama-server running on port 8080 · NativeLab is connected · Streaming enabled · HTTP API ready.
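Because Server Mode exposes an OpenAI-compatible API on port 8080, you can verify it from any terminal once the status banner shows the server is up. A minimal sketch using curl against the default port:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{"role": "user", "content": "Say hello in five words."}],
      "stream": false
    }'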
If the server is not responding — Reload from the View menu
macOS View menu:
  • Toggle Sidebar (Cmd+B)
  • Zoom In (Cmd++)
  • Zoom Out (Cmd+-)
  • Reload / Restart Server (Cmd+R)
  • Force Reload (Shift+Cmd+R)
What this fixes:
  • Kills any stale llama-server process
  • Relaunches the server with the current model
  • Clears port 8080 if it was previously stuck
  • Re-reads your binary path settings
Check the Logs tab to confirm the restart.
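If Reload doesn't clear things up, the same cleanup can be done by hand on macOS or Linux. A rough sketch (process and port names assume the default setup described above):

  # see what is holding port 8080
  lsof -i :8080

  # stop a stale llama-server by name
  pkill -f llama-server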
How to browse
Click Browse next to each field. A file picker will open — navigate to the folder where you extracted llama.cpp and select the appropriate binary:
  • llama-cli — for single prompt / chat completions
  • llama-server — for the local HTTP server (OpenAI-compatible API)
Path examples by OS:
  • Windows: C:\Users\you\Downloads\llama\llama-cli.exe
  • macOS: /Users/you/Downloads/llama/llama-cli
  • Linux: /home/you/llama/llama-cli

After setting both paths, NativeLab will start llama-server automatically in the background. The green status indicator in the Server tab confirms it's running. You're now ready to load a model.
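To double-check outside the app, recent llama-server builds also expose a simple health endpoint. A one-line sketch assuming the default port 8080:

  curl http://localhost:8080/health    # returns an ok status once the model is loaded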

Step 3 · Models · Download a GGUF model

Switch to the Download tab inside NativeLab. Paste a HuggingFace repo ID, click Search, choose a quantization, and download directly to your models folder.

Download tab · HuggingFace GGUF Downloader
Enter any HuggingFace repo ID to browse its GGUF files and download them straight to your models folder. Network access required.

Example: searching the repository bartowski/Llama-3.1-8B-Instruct-GGUF

Found 8 GGUF file(s). Available files include:

  • Llama-3.1-8B-Instruct-Q2_K.gguf [Q2_K · Very compressed]
  • Llama-3.1-8B-Instruct-Q3_K_M.gguf [Q3_K_M · Compressed]
  • Llama-3.1-8B-Instruct-Q4_K_M.gguf [Q4_K_M · Balanced]
  • Llama-3.1-8B-Instruct-Q5_K_M.gguf [Q5_K_M · High quality]
  • Llama-3.1-8B-Instruct-Q8_0.gguf [Q8_0 · Near lossless]
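If you'd rather fetch a file outside NativeLab (or mirror it to another machine), the same GGUF can be pulled with Hugging Face's CLI. A sketch with an assumed target folder of ./models:

  pip install -U "huggingface_hub[cli]"
  huggingface-cli download bartowski/Llama-3.1-8B-Instruct-GGUF \
    Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir ./models

Note that NativeLab only auto-detects models in its own models folder (see "Where models are saved" below), so move manually downloaded files there.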

Which quant should you pick?

Quantization compresses model weights. Lower = smaller file and less RAM needed, but slightly lower quality. Q4_K_M is the sweet spot for most setups.

Format · Quality · RAM (7B model) · Best for
Q2_K · Low · ~3 GB · 8 GB RAM, speed only
Q3_K_M · Low-Med · ~3.9 GB · 8 GB RAM, better than Q2
Q4_K_M (recommended) · Good · ~4.8 GB · 16 GB RAM, best balance
Q5_K_M (recommended) · Great · ~5.7 GB · 16–32 GB RAM
Q6_K · Very good · ~6.6 GB · 32 GB RAM
Q8_0 · Near-lossless · ~8.7 GB · 32 GB+ RAM
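As a rough rule of thumb (an approximation, not an official formula), file size in GB is about the parameter count in billions times the bits per weight divided by 8; RAM use is a bit higher once the context and KV cache are added:

  # e.g. a 7B model at Q4_K_M, roughly 4.8 bits per weight on average
  echo '7 * 4.8 / 8' | bc -l    # ≈ 4.2 GB on disk, consistent with the ~4.8 GB RAM figure above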

Copy — Paste — Download

Click Copy on any repo ID below and paste it directly into NativeLab's Download tab search field.

8 GB RAM — Light models (3B–7B, Q3/Q4)
Runs on most laptops. Expect ~8–18 tokens/sec on a modern CPU.
bartowski/Llama-3.2-3B-Instruct-GGUF
Meta Llama 3.2 · 3B · Fast · General chat
TheBloke/Mistral-7B-Instruct-v0.2-GGUF
Mistral 7B · Excellent quality-to-size · General use
lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF
DeepSeek R1 Distill · Reasoning · Strong math & logic
Qwen/Qwen2.5-7B-Instruct-GGUF
Qwen 2.5 · Strong coding & multilingual
16 GB RAM — Mid-tier models (8B–13B, Q4/Q5)
The sweet spot. Can run 13B at Q4 or 8B at Q5/Q6 for very high quality output.
bartowski/Llama-3.1-8B-Instruct-GGUF
Meta Llama 3.1 · 8B · Excellent instruction following
lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF
DeepSeek R1 Distill · LLaMA base · Reasoning & coding
TheBloke/Llama-2-13B-Chat-GGUF
Llama 2 · 13B · Reliable general assistant
bartowski/gemma-3-12b-it-GGUF
Google Gemma 3 · 12B · Multimodal-trained, strong reasoning
TheBloke/CodeLlama-13B-Instruct-GGUF
Code Llama 13B · Coding specialist · Python/C++/JS
32 GB RAM — Large models (30B–34B, Q4/Q5)
High-end desktop or workstation territory. Near-GPT-4-level quality from some models.
bartowski/Llama-3.3-70B-Instruct-GGUF
Meta Llama 3.3 · 70B · Flagship open model · Use Q2/Q3 at 32GB
lmstudio-community/DeepSeek-R1-Distill-Qwen-32B-GGUF
DeepSeek R1 32B · Exceptional reasoning chain · Q3 fits in 32GB
Qwen/Qwen2.5-32B-Instruct-GGUF
Qwen 2.5 · 32B · Best-in-class coding at this size
64 GB+ RAM — Full 70B models (Q4/Q5)
Power workstations, Mac Studio / Mac Pro with 64GB unified memory. Best-in-class local inference.
bartowski/Llama-3.3-70B-Instruct-GGUF
Meta Llama 3.3 · 70B · Q4_K_M fits in ~42 GB — best open LLM available
bartowski/DeepSeek-R1-GGUF
DeepSeek R1 Full · 671B (use IQ2/IQ3 variants) · Reasoning powerhouse
Where models are saved
NativeLab downloads models to C:\NativeLabPro\localllm on Windows (or the equivalent path on Mac/Linux). Once downloaded, they appear automatically in the Models tab and can be loaded from there.
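Behind the scenes, loading a model is just pointing llama-server at the GGUF file. A sketch of roughly what NativeLab runs for you in the background (the path and context size are example values):

  ./llama-server \
    -m /path/to/localllm/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --port 8080 \
    -c 4096    # context window size in tokens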

You're all set

Start your first conversation

Switch to the Chat tab, type a message, and the model will respond. Everything runs locally — no cloud, no account, no data leaving your machine.
