Hugging Face introduced two new variants of its SmolVLM vision language models last week. The new artificial intelligence (AI) models are available in 256 million and 500 million parameter sizes, with the former claimed by the company to be the world's smallest vision language model. The new variants aim to retain the capability of the older two-billion-parameter model while significantly reducing its size. The company highlighted that the new models can run locally on constrained devices and consumer laptops, and could potentially even support browser-based inference.
Hugging Face Introduces Smaller SmolVLM AI Models
In a blog post, the company announced the SmolVLM-256M and SmolVLM-500M vision language models, in addition to the existing two-billion-parameter model. The release brings two base models and two instruction fine-tuned models in the aforementioned parameter sizes.
Hugging Face said that these models can be loaded directly with Transformers, MLX, and Open Neural Network Exchange (ONNX). Notably, these are open-source models available under an Apache 2.0 licence for both personal and commercial use.
With the new AI models, Hugging Face aims to bring multimodal models focused on computer vision to portable devices. The 256-million-parameter model, for instance, can run on less than 1GB of GPU memory and 15GB of RAM to process 16 images per second (with a batch size of 64).
Andrés Marafioti, a machine learning research engineer at Hugging Face, told VentureBeat, "For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs."
To reduce the size of the AI models, the researchers switched the vision encoder from the previous SigLIP 400M to a 93M-parameter SigLIP base patch model. Additionally, the tokenisation was optimised. The new vision models encode images at a rate of 4,096 pixels per token, compared to 1,820 pixels per token in the 2B model.
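As a rough illustration of what the higher encoding rate means in practice, the sketch below estimates the number of visual tokens a single image produces at each rate. This is back-of-the-envelope arithmetic based only on the pixels-per-token figures above; the actual encoder works on fixed-size patches, so real token counts may differ.

```python
import math

def visual_tokens(width, height, pixels_per_token):
    """Approximate visual token count for an image at a given encoding rate."""
    return math.ceil(width * height / pixels_per_token)

# A 512x512 image under both encoding rates.
old_rate = visual_tokens(512, 512, 1820)  # 2B model: ~145 tokens
new_rate = visual_tokens(512, 512, 4096)  # new small models: 64 tokens
print(old_rate, new_rate)
```

Roughly 2.25 times fewer tokens per image means proportionally less compute per inference, which is where the claimed cost savings come from.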
Notably, the smaller models lag marginally behind the 2B model in terms of performance, but the company said this trade-off has been kept to a minimum. As per Hugging Face, the 256M variant can be used for captioning images or short videos, answering questions about documents, and basic visual reasoning tasks.
Developers can use Transformers and MLX for inference and fine-tuning of the AI models, as they work with the older SmolVLM code out of the box. The models are also listed on Hugging Face.
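A minimal sketch of local inference with the Transformers library might look like the following. The model ID and chat-style prompt format follow Hugging Face's published SmolVLM usage, but the image, prompt text, and generation settings here are illustrative assumptions; the heavy imports sit inside the function so the prompt structure can be inspected without the dependencies installed.

```python
# SmolVLM interleaves an image placeholder with the text question
# in a chat-style message list.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]

def caption(image):
    """Caption a single PIL image (downloads model weights on first use)."""
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

Swapping `MODEL_ID` for the 500M variant requires no other code changes, which is the "out of the box" compatibility the company describes.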