one year on
Meta releases Llama 3.2 with vision and on-device models at Connect 2024
Llama 3.2’s multimodal 11B and 90B models and lightweight 1B and 3B variants for edge and mobile devices debut, along with the first Llama Stack distributions, designed to simplify deployment.
At Meta Connect today, the company released Llama 3.2, its first Llama models to handle both text and images. The lineup includes 11B and 90B vision models, which Meta says are competitive with Claude 3 Haiku and GPT-4o mini on image understanding benchmarks, alongside 1B and 3B text-only models optimized for mobile and edge devices. The small models support a 128K-token context window, are enabled on day one for Qualcomm and MediaTek hardware, and are optimized for Arm processors.
To simplify deployment, Meta released the first Llama Stack distributions, which bundle standardized APIs for inference, tool use, and retrieval-augmented generation. The distributions are packaged with partners including AWS, Databricks, Fireworks, and Together AI for cloud, Ollama for single-node setups, and PyTorch ExecuTorch for on-device iOS. The stack is meant to simplify the way developers work with Llama models across single-node, on-prem, cloud, and on-device environments.
Meta says the models are available for download on llama.com and Hugging Face, and that it is working with partners to offer them on partner platforms from day one.
The record
One year later — open only if you can handle spoilers
Llama 3.2 vision models saw moderate adoption in the open-source community but never fully displaced proprietary multimodal models. The 1B and 3B models became popular for on-device summarization and tool calling, though the 'true AR glasses' vision remained years away from mass market.