
Goal
We’re building Luna, an affordable and intelligent local AI assistant: think ChatGPT or Claude, but running locally on a sub-$200 device.
Given the hardware’s inherent limitations, only Small Language Models can run, specifically models under 3 billion parameters.
In this mini-study, we set out to answer:
- Are current Small Language Models capable of executing agent tasks?
- How do the cutting-edge Giant Language Models (from OpenAI) perform?
Experimental Setup
- Agent framework: We use smolagents (https://github.com/huggingface/smolagents), which lets a model write and execute code to solve problems
- Models: We test a variety of Language Models
- Open-source models: anything from Smollm2:135M to Phi4
- Closed-source models: o3-mini & gpt-4.1
- Compute: LLM inference runs on a variety of services & hardware to balance cost & speed
- My personal laptop with an NVIDIA GTX 1050 Ti (free)
- Microsoft Azure VM (8 vCPUs + 32 GB RAM) (free with Startup Credits)
- Vast.ai ($10 spent on an NVIDIA RTX 4090)
- Ori.co ($10 spent on an NVIDIA A16)
- Microsoft Azure AI Foundry with GPT-4.1 & o3-mini (free with Startup Credits)
- Cloudflare AI Workers (free with Startup Credits)
- Evaluation: OpenAI o3-mini serves as the LLM-as-a-judge
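The agent loop behind smolagents can be sketched in miniature: the model emits Python source, the framework executes it, and the result is read back. The helper names and the stub model below are illustrative assumptions, not smolagents' real API; smolagents also sandboxes the execution step, which plain `exec` does not.

```python
from typing import Callable

def run_code_agent(generate_code: Callable[[str], str], task: str):
    """Toy version of a code agent: ask the 'model' for Python source,
    execute it, and read back a designated `result` variable.
    (smolagents sandboxes this step; plain exec is used here for brevity.)"""
    source = generate_code(task)
    namespace: dict = {}
    exec(source, namespace)
    return namespace.get("result")

# Stub standing in for an actual LLM call, for illustration only.
def fake_model(task: str) -> str:
    return "result = 2 ** 10"

print(run_code_agent(fake_model, "What is 2 to the 10th power?"))  # prints 1024
```

The appeal of this design for small models is that a short, verifiable code snippet is often easier to get right than a long free-form answer.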
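The evaluation step can be sketched the same way. The grading template and the 1–5 scale below are illustrative assumptions, not the exact prompt used in this study; in the real pipeline the filled prompt would be sent to o3-mini and the reply parsed for a score.

```python
import re
from typing import Optional

# Illustrative grading template; the real judge prompt and scale may differ.
JUDGE_TEMPLATE = """You are grading an AI agent's answer.

Task: {task}
Agent answer: {answer}
Reference answer: {reference}

Reply with a single line of the form "Score: N", where N is an integer from 1 to 5."""

def build_judge_prompt(task: str, answer: str, reference: str) -> str:
    # In the real run this filled prompt is sent to o3-mini.
    return JUDGE_TEMPLATE.format(task=task, answer=answer, reference=reference)

def parse_score(judge_reply: str) -> Optional[int]:
    # Extract the integer score from the judge's reply, or None if missing.
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(match.group(1)) if match else None

print(parse_score("Score: 4"))  # prints 4
```

Parsing a fixed "Score: N" line keeps the judge's output machine-readable, which matters when grading all 183 tasks automatically.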
Tasks
We prepared a set of 183 tasks across 8 categories:
- Mathematics & Quantitative Reasoning
- Science & Technical Knowledge
- Language & Communication