Small Language Models: A Comparison of Qwen, Mistral, and Gemma

2025-03-21
ℹ️ Note on the source

This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
Gemma 3 27b vs. QwQ 32b vs. Mistral 24b vs. Deepseek r1 – Composio.

While attention is often focused on the largest language models, recent advances in smaller models are noteworthy. Models with 32 billion parameters, once considered substantial, can now run effectively on local hardware, offering developers an alternative to large AI providers and a potential way to cut costs.

Recently, Qwen, Mistral, and Google released QwQ 32B (a reasoning model), Mistral Small 3.1 24B, and Gemma 3 27B (a base model), respectively. Despite their differing architectures and sizes, their reported benchmark scores are comparable to those of the much larger Deepseek R1.

How do these models compare in practice?

Performance Overview

Here's a summary of the models' performance across different tasks:

  • QwQ 32B: Excels in coding, reasoning, and math, although it tends to be verbose.
  • Gemma 3: A solid model with good overall performance, but its license is restrictive.
  • Mistral Small 3.1: Capable of handling simple tasks but lags behind QwQ in overall performance.

For local hosting, QwQ 32B emerges as the preferred choice based on this comparison.
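As a rough illustration (not part of the original comparison), here is a minimal sketch of how one might run QwQ 32B locally with Hugging Face transformers. It assumes the Qwen/QwQ-32B checkpoint and enough GPU memory; on consumer hardware, a quantized build served through llama.cpp or Ollama is the more realistic route.

```python
# Minimal sketch: running QwQ 32B locally with Hugging Face transformers.
# Assumes the "Qwen/QwQ-32B" checkpoint and sufficient GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick bf16/fp16 based on the checkpoint
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "Write a function that checks whether a string is a palindrome."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models tend to be verbose, so leave room for long outputs.
output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```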

Model Highlights

QwQ 32B

Developed by Alibaba, this 32B-parameter model aims to rival Deepseek R1, which has 671B parameters. Benchmarks suggest that QwQ 32B is competitive, particularly in coding, math, and reasoning.

Gemma 3 27B

Google's open-weight model, based on Gemini 2.0, is available in multiple sizes. It is designed to run efficiently on resource-constrained setups, such as a single GPU or TPU, and supports multiple languages. Gemma 3 is primarily built for reasoning tasks.

Mistral Small 3.1 24B

This model offers multimodal understanding and an expanded context window of up to 128k tokens. It is claimed to outperform Gemma 3 27B and GPT-4o mini.

Coding Capabilities

In a test involving a JavaScript simulation of a rotating 3D sphere made of letters, QwQ produced impressive results, accurately implementing the animation, the spinning letters, and the color changes. Gemma 3 delivered a partially functional output, while Mistral Small failed to produce anything meaningful. On a LeetCode problem, QwQ provided a correct answer with appropriate time complexity, while Gemma 3 and Mistral Small fell short.

Reasoning Skills

When tested on reasoning questions, Gemma 3 swiftly provided accurate answers. QwQ also performed well, demonstrating an impressive thought process. Both models successfully answered questions involving common sense and pattern recognition.

Mathematical Prowess

Both QwQ and Gemma 3 correctly answered mathematical questions involving clock angles and word arrangements. This further reinforces their potential in tasks requiring logical and analytical skills.
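The exact prompts are not given in the source, but a typical clock-angle question reduces to a small calculation. A hypothetical sketch of the kind of problem tested:

```python
# Hypothetical example of a clock-angle question of the kind tested
# (the exact prompts are not given in the source).
def clock_angle(hour: int, minute: int) -> float:
    """Return the smaller angle in degrees between the hour and minute hands."""
    hour_angle = (hour % 12) * 30 + minute * 0.5  # hour hand: 30° per hour + 0.5° per minute
    minute_angle = minute * 6                     # minute hand: 6° per minute
    diff = abs(hour_angle - minute_angle)
    return min(diff, 360 - diff)

print(clock_angle(3, 15))  # 7.5 degrees
```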

Conclusion

Smaller language models are rapidly evolving, offering viable alternatives to larger models for specific applications. QwQ 32B stands out for its balance of size and performance. Gemma 3 is a solid base model, distilling the capabilities of larger models into a smaller footprint. Mistral Small shows promise but lags behind QwQ and Gemma 3 in overall performance.

  • QwQ 32B is recommended for coding tasks.
  • QwQ is also the preferred choice for reasoning and math, although Gemma performs respectably.
  • Gemma and Mistral Small offer image input support, which is a significant advantage for multimodal applications.
  • QwQ and Mistral Small are released under the Apache 2.0 license, while Gemma's license is more restrictive.

Overall, QwQ 32B provides performance comparable to Deepseek R1 in a more compact size. Is this the beginning of a shift towards more efficient and accessible language models?

