
Goal
We’re building Luna, an affordable and intelligent local AI assistant: think ChatGPT or Claude, but running locally on a sub-$200 device.
Given the hardware’s inherent limitations, only Small Language Models can run, specifically models under 3 billion parameters.
In this mini-study, we set out to answer:
- Are current Small Language Models capable of executing agent tasks?
- How do the cutting-edge Giant Language Models (from OpenAI) perform?
Experimental Setup
- Agent framework: We use smolagents (https://github.com/huggingface/smolagents), which lets a model write and execute code to solve problems
- Models: We test a variety of Language Models
- Open-source models: anything from Smollm2:135M to Phi4
- Closed-source models: o3-mini & gpt-4.1
- Compute: LLM inference runs on a variety of services & hardware to balance cost & speed
- My personal laptop with an NVIDIA GTX 1050 Ti (free)
- Microsoft Azure VM (8 vCPUs + 32 GB RAM) (free with Startup Credits)
- Vast.ai ($10 spent on an NVIDIA RTX 4090)
- Ori.co ($10 spent on an NVIDIA A16)
- Microsoft Azure AI Foundry with GPT-4.1 & o3-mini (free with Startup Credits)
- Cloudflare AI Workers (free with Startup Credits)
- Evaluation: OpenAI o3-mini serves as the LLM-as-a-judge
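The agent loop behind smolagents can be sketched in miniature: the model emits Python source, the framework executes it, and the result is read back. The helper names and the stub model below are illustrative assumptions, not smolagents' real API; smolagents also sandboxes the execution step, which plain `exec` does not.

```python
from typing import Callable

def run_code_agent(generate_code: Callable[[str], str], task: str):
    """Toy version of a code agent: ask the 'model' for Python source,
    execute it, and read back a designated `result` variable.
    (smolagents sandboxes this step; plain exec is used here for brevity.)"""
    source = generate_code(task)
    namespace: dict = {}
    exec(source, namespace)
    return namespace.get("result")

# Stub standing in for an actual LLM call, for illustration only.
def fake_model(task: str) -> str:
    return "result = 2 ** 10"

print(run_code_agent(fake_model, "What is 2 to the 10th power?"))  # prints 1024
```

The appeal of this design for small models is that a short, verifiable code snippet is often easier to get right than a long free-form answer.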
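The evaluation step can be sketched the same way. The grading template and the 1–5 scale below are illustrative assumptions, not the exact prompt used in this study; in the real pipeline the filled prompt would be sent to o3-mini and the reply parsed for a score.

```python
import re
from typing import Optional

# Illustrative grading template; the real judge prompt and scale may differ.
JUDGE_TEMPLATE = """You are grading an AI agent's answer.

Task: {task}
Agent answer: {answer}
Reference answer: {reference}

Reply with a single line of the form "Score: N", where N is an integer from 1 to 5."""

def build_judge_prompt(task: str, answer: str, reference: str) -> str:
    # In the real run this filled prompt is sent to o3-mini.
    return JUDGE_TEMPLATE.format(task=task, answer=answer, reference=reference)

def parse_score(judge_reply: str) -> Optional[int]:
    # Extract the integer score from the judge's reply, or None if missing.
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(match.group(1)) if match else None

print(parse_score("Score: 4"))  # prints 4
```

Parsing a fixed "Score: N" line keeps the judge's output machine-readable, which matters when grading all 183 tasks automatically.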
Tasks
We prepared a set of 183 tasks across 8 categories:
- Mathematics & Quantitative Reasoning
- Science & Technical Knowledge
- Language & Communication