Background
Hugging Face offers a leading open-source library for machine learning (ML) tasks, providing state-of-the-art pre-trained models, datasets, and metrics to support ML application development. It serves as a central repository for sharing and versioning these resources, and users can access and download them for a wide range of ML applications such as NLP, computer vision and reinforcement learning.
Data retrieval
The Hugging Face Hub Python API is used to retrieve models, metrics, and datasets from the Hugging Face Hub. The three main components of Hugging Face's repository are retrieved through the API:
Datasets: information retrieved for available datasets includes author, citation, description, number of downloads, and tags associated with each dataset. Additional fields such as language, task category, and data type can also be obtained for each dataset. A full list of the languages available on Hugging Face can be found here.
Metrics: information retrieved for available metrics includes a unique identification number and a description associated with each metric.
Models: information retrieved for available models includes the author (e.g. developer, company, etc.), a unique identification number, the model task, the number of parameters, tags, and a time stamp of when the model was uploaded to Hugging Face. While the models retrieved through this method have unique identification numbers, double counting cannot be excluded, since it is not possible to distinguish whether some models are derived, in whole or in part, from others. For instance, a model could be downloaded for fine-tuning and subsequently uploaded under a different unique identification number.
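The retrieval step described above can be sketched with the `huggingface_hub` client library. This is a minimal illustration, not the exact collection pipeline; the fields printed are assumptions based on the metadata listed above.

```python
from huggingface_hub import HfApi

api = HfApi()

# Retrieve metadata for a few models; each ModelInfo record carries
# fields such as the repository id, the task (pipeline_tag), and tags.
for model in api.list_models(limit=3):
    print(model.id, model.pipeline_tag, model.tags)
```

Analogous calls (`list_datasets`, and similar listing endpoints) return the dataset and metric metadata described above.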
Language models
Many of the models on Hugging Face are NLP-related and can be considered language models (LMs). While Hugging Face has no such label, this methodology considers a given model to be a language model if its task contains an NLP-related keyword, for instance "text-generation" or "token-classification". This distinction allows for comparative analysis of LMs against other model types available on Hugging Face.
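The keyword-based classification can be sketched as follows. The keyword set below is hypothetical; the methodology does not publish its exact list.

```python
# Hypothetical set of NLP-related task keywords (the methodology's
# exact list is not given in the text).
NLP_KEYWORDS = {
    "text-generation", "token-classification", "text-classification",
    "translation", "summarization", "question-answering", "fill-mask",
}

def is_language_model(task_tag):
    """Classify a model as a language model if its task tag
    contains any NLP-related keyword."""
    if not task_tag:
        return False
    return any(keyword in task_tag for keyword in NLP_KEYWORDS)

print(is_language_model("text-generation"))       # True
print(is_language_model("image-classification"))  # False
```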
Determining the language of a model is based on voluntary tags the authors insert upon publication. Language tags appear as 2- or 3-letter ISO codes, with no indication that they relate to languages. Unfortunately, many 3-letter ISO codes also have other meanings (e.g. 'new' is also the code for Newari). Therefore, only languages with 2-letter ISO codes are included.
As of February 2026, and counting only 2-letter ISO language codes, just over 25% of language models were found to have at least one language tag.
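The tag-filtering rule above can be sketched as follows. The sample set of ISO 639-1 codes is a hypothetical stand-in for the full list, which is not reproduced here.

```python
import re

# Small hypothetical sample of 2-letter ISO 639-1 codes; the full
# list would be used in practice.
ISO_639_1 = {"en", "fr", "de", "es", "zh", "ar"}

def language_tags(tags):
    """Keep only tags that are 2-letter ISO 639-1 language codes.

    3-letter codes (e.g. 'new' for Newari) are excluded because
    they collide with ordinary tag words."""
    return [t for t in tags if re.fullmatch(r"[a-z]{2}", t) and t in ISO_639_1]

print(language_tags(["en", "new", "fr", "pytorch", "transformers"]))
# → ['en', 'fr']
```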
Data updates
The AI models and datasets visualisations on OECD.AI that use data from Hugging Face are updated on a quarterly basis.