For years, running frontier AI models locally was impractical — models were too large, hardware was too slow, and quantization quality was too degraded. That changed with Apple Silicon's unified memory architecture. The M4 Pro chip with 48 GB of unified memory keeps the entire model in fast, CPU-adjacent RAM instead of bouncing data between a discrete GPU and system RAM.
The practical result: a laptop can now run a 35-billion-parameter model like Qwen3 at speeds that are usable for real development work — not a curiosity, but a daily driver. The author cancelled three paid subscriptions in one week after making this switch.
Compare throughput across inference backends on M4 Pro hardware.
There is no single "correct" way to run local AI. Four backends serve different user profiles, from no-terminal beginners to Python scripters. Each exposes an OpenAI-compatible API endpoint so your tools (Cline, Claude Code, OpenHands) do not need to change — only the endpoint URL.
LM Studio provides a graphical interface to browse, download, and run quantized models. It downloads Qwen3 in Q4_K_M format and immediately exposes a local API on localhost:1234. No configuration files needed.
Ollama uses the MLX backend optimized for Apple Silicon. A single ollama run qwen3:30b-a3b-q4_K_M command downloads and starts the model. The author benchmarked 1,851 tokens/sec prefill and 134 tokens/sec decode on M4 Pro with int4 quantization.
oMLX is an advanced runner that uses SSD-backed KV caching. Long-context agent sessions no longer need to rebuild the KV cache from scratch on each call. The result: agent time-to-first-token drops from 30–90 seconds down to 1–3 seconds. This is the biggest practical win for coding agents that iterate on the same codebase repeatedly.
mlx-lm is a Python library that gives you direct programmatic access to models running on the MLX framework. It is the foundation for fine-tuning workflows using LoRA adapters. Use it when you need to embed local inference inside a Python script rather than calling an HTTP endpoint.
Cursor's core value is an AI that can read your codebase, propose edits, and run terminal commands — but all of that happens via cloud API calls. The same behavior is achievable locally by pointing a coding agent at your local backend instead.
Three agents work well for this. Cline is a VS Code extension. OpenHands is a browser-based agent with a sandboxed environment. Claude Code can be redirected to a local endpoint with an environment variable, allowing it to call Qwen3 instead of Claude on Anthropic's servers.
Click a capability to see which agents support it.
ComfyUI Desktop is a node-based image generation interface that runs entirely on your Mac. It supports Flux, SDXL, and Z-Image-Turbo models. On M4 Pro hardware, a 1024×1024 image generates in 25–35 seconds — slower than Midjourney's cloud rendering, but free, private, and with full control over the pipeline.
The node-based workflow in ComfyUI lets you chain models, apply ControlNet constraints, use IP-Adapter for style transfer, and build reproducible pipelines that cloud tools cannot match for repeatability.
Approximate render time for a 1024×1024 image at default steps. Click Animate to see the comparison.
* Cloud time varies; local runs are free and private regardless of speed.
Flux is the highest-quality open model for photorealistic and artistic generation. Flux.1-dev produces images with excellent prompt adherence but takes the longest at around 35 seconds. Flux.1-schnell is a distilled faster variant at the cost of some detail.
SDXL is a well-established model with a large ecosystem of LoRA fine-tunes, ControlNet adapters, and community checkpoints. The Lightning distilled variant runs in 8 steps, making it a fast middle ground between quality and speed.
Z-Image-Turbo is a heavily distilled model optimized for speed. At 4 steps it produces usable images in under 10 seconds. It trades fine detail and prompt precision for rapid iteration — good for quickly exploring composition ideas before committing to a slower model.
Beyond text and images, the M4 Pro handles several other AI workflows locally. Voice transcription via Whisper runs fully offline with no data leaving the device. RAG (retrieval-augmented generation) over private documents keeps sensitive data local. Vision models can analyze screenshots and images without uploading them.
OpenAI's Whisper model runs on Apple Silicon via whisper.cpp or faster-whisper. It transcribes audio in real time or from files. The M4 Pro runs the medium model at faster-than-realtime speeds, making it suitable for live meeting transcription without any network dependency.
RAG over private documents means your PDFs, notes, and internal docs never leave your machine. Tools like LlamaIndex or Chroma running locally embed your documents into a vector store. Queries retrieve relevant chunks and inject them into the context window before the local model answers.
Multimodal models like LLaVA and Qwen-VL accept images as input alongside text. On the M4 Pro you can feed a screenshot or diagram to the model and ask it to explain, debug, or extract information. Useful for analyzing UI bugs, reading charts, or interpreting error screenshots.
mlx-lm supports LoRA fine-tuning directly on Apple Silicon. You can adapt a base model to your codebase's style, your company's writing tone, or a specialized domain. A 35B model can be fine-tuned with 4-bit LoRA on 48 GB of unified memory in hours rather than days.
Click a state to start, then click valid next states to walk through a local RAG query lifecycle.
Qwen3-35B running locally does not match Claude Opus or GPT-4o on every task. The author is explicit about this: for complex reasoning, long multi-step planning, or tasks that genuinely require frontier model capability, cloud APIs still win. Local models excel at the 80% of daily coding tasks that do not require frontier reasoning.
The economic argument is also hardware-dependent. The M4 Pro configuration that makes this viable costs significantly more than a MacBook Air. The break-even point against subscription costs depends on how heavily you use AI tools and how long you keep the hardware.
Click a task type to see the recommendation.
Qwen3-35B at int4 quantization is competitive with GPT-3.5-class models on coding tasks and approaches GPT-4o on simpler requests. For complex agentic tasks requiring multi-hop reasoning, frontier cloud models still outperform. The gap is narrowing with each model generation.
At heavy usage — say, 500K tokens per day across coding, chat, and image generation — cloud subscriptions add up quickly. A developer running multiple AI tools could spend $60–$150/month on subscriptions. The M4 Pro premium pays off over 18–24 months of heavy local usage at those rates.
Every prompt sent to a cloud API is processed on external servers. Local inference means zero data exfiltration: your code, your documents, and your prompts stay on your hardware. For regulated industries or developers working on proprietary code, this is not optional — it is a hard requirement that cloud tools cannot satisfy.
With 48 GB of unified memory, your M4 Pro can run models that were cloud-only a year ago. Four models stand out in 2026 for different purposes — and unlike cloud APIs, you can keep all of them installed locally and switch based on the task at hand.
Click Animate to see how each model fits in your unified memory pool.
All four fit in 48 GB — you can run any one at a time, or mix smaller models simultaneously.
Qwen3-Coder is Alibaba's coding-specialized model, trained on 5.5 trillion code tokens. It beats GPT-4's HumanEval score and wins on SWE-bench for real-world agentic tasks. MoE architecture means only 3.3B parameters are active per token — fast inference despite the 30B total size.
Qwen3.6 is Alibaba's general-purpose successor to Qwen3. The 35B variant uses MoE with only 3B active parameters per token. Key advantage over qwen3-coder: it accepts image inputs alongside text, making it useful for analyzing screenshots, diagrams, and UI bugs. A good default if you want one model for everything.
GPT-OSS is OpenAI's own open-weight model, released under Apache 2.0. The 20B variant targets low-latency local use. Its strongest differentiator is native function calling and structured output support — ideal when your agent needs reliable JSON schemas or tool invocation. Shorter 128K context than the Qwen models, but the lightest on RAM of the four.
Google's Gemma 4 launched April 2026 and immediately became the strongest open model for reasoning in the sub-32B range. The 27B MoE variant scores 85.2% on MMLU Pro and 89.2% on AIME 2026 math reasoning — notably ahead of the Qwen models on pure reasoning benchmarks. Supports images across all variants, with edge models also supporting audio.
Pick your primary use case to get a recommendation.