Hugging Face data

Methodological note

Hugging Face offers a leading open-source library for machine learning (ML) tasks, providing state-of-the-art pre-trained models, datasets, and metrics to support ML application development. It serves as a central repository for sharing and versioning these resources, which users can access and download for a wide range of ML applications such as natural language processing (NLP), computer vision, and reinforcement learning.

Data retrieval 

Models, metrics, and datasets, the three main components of Hugging Face’s repository, are retrieved from the Hugging Face Hub through its Python API:

Datasets: information retrieved for available datasets includes author, citation, description, number of downloads, and tags associated with each dataset. Additional fields such as language, task category, and data type can also be obtained for each dataset. A list of all languages present in Hugging Face is available here.

Metrics: information retrieved for available metrics includes a unique identification number and a description associated with each metric. 

Models: information retrieved for available models includes the author (e.g. developer or company), a unique identification number, the model task, the number of parameters, and a time stamp of when each model was uploaded to Hugging Face. Although every model retrieved this way has a unique identification number, double counting cannot be excluded, since it is not possible to tell whether a model is derived, in whole or in part, from another. For instance, a model could be downloaded, fine-tuned, and subsequently uploaded again under a different unique identification number.
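The model retrieval step can be sketched as follows. This is a minimal illustration, not the code used for this methodology: it assumes the `huggingface_hub` package and its `HfApi.list_models` call, and the helper names `model_record` and `fetch_model_records` are hypothetical. The number of parameters is not read from the listing in this sketch, since the methodology obtains it by downloading each model.

```python
from types import SimpleNamespace


def model_record(info):
    """Extract a few of the fields named in the text from a model-info object.

    getattr with a default is used because not every field is populated
    for every model on the Hub.
    """
    return {
        "id": getattr(info, "id", None),
        "author": getattr(info, "author", None),
        "task": getattr(info, "pipeline_tag", None),
        "created_at": getattr(info, "created_at", None),
    }


def fetch_model_records(limit=5):
    """Fetch a few records from the Hub (needs network access and huggingface_hub)."""
    # Imported lazily so that model_record can be used without the package.
    from huggingface_hub import HfApi

    return [model_record(m) for m in HfApi().list_models(limit=limit)]


# Offline stand-in for a single Hub entry; all values are hypothetical.
example = SimpleNamespace(
    id="some-org/some-model",
    author="some-org",
    pipeline_tag="text-generation",
    created_at=None,
)
record = model_record(example)
print(record)
```

Calling `fetch_model_records()` would query the live Hub; the offline example only shows the shape of one record.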

Language models 

Many of the models from Hugging Face are NLP-related and can be considered language models (LMs). While there is no such label in Hugging Face, this methodology considers a given model to be a language model if its model task contains any NLP-related keyword, for instance “text-generation” or “token-classification”. This distinction allows LMs to be compared with the other model types available on Hugging Face.
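The keyword rule above could be implemented along these lines. This is a sketch under stated assumptions: only “text-generation” and “token-classification” come from the text, while the remaining keywords and the function name `is_language_model` are illustrative and may differ from the list actually used.

```python
# Only the first two keywords are named in the methodology;
# the others are illustrative assumptions.
NLP_KEYWORDS = (
    "text-generation",
    "token-classification",
    "text-classification",
    "translation",
    "summarization",
    "question-answering",
    "fill-mask",
)


def is_language_model(task):
    """Return True if the model task contains any NLP-related keyword."""
    if not task:
        # Models without a task label cannot be classified by this rule.
        return False
    task = task.lower()
    return any(keyword in task for keyword in NLP_KEYWORDS)


for task in ("text-generation", "token-classification", "image-classification", None):
    print(task, is_language_model(task))
```

Because the rule is a substring match, a task such as “visual-question-answering” would also be counted; whether that is desirable depends on the keyword list chosen.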

Compute indicators from Hugging Face 

The variables related to the models in the data retrieval step include one key indicator supporting a better understanding of the amount of compute needed for model training: the number of parameters.  

It is acknowledged that Hugging Face users often fine-tune their models rather than train them from scratch. However, for the purposes of the training cost simulator, the methodology assumes that all models are trained from scratch, thus excluding fine-tuning scenarios. It should be noted that this likely leads to an overestimation of the cost of training AI models, as fine-tuning typically requires less computing power than training models from scratch.

Simulating the training costs of AI models 

A formula to estimate the cost of training AI models was co-developed with the OECD.AI Expert Group on AI Compute and Climate, and is shown below: 

Training cost = Training time in GPU-seconds * Dollar cost of a GPU hour / 3600 seconds per hour 

The formula inputs can be further broken down into several elements:  

Training time in GPU-seconds = Actual FLOPs for training / Number of GPU FLOPs 

Actual FLOPs for training = FLOPs for training / MFU 

FLOPs for training = 6 * Number of parameters * Number of tokens 
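
Substituting the three sub-formulas into the cost formula gives a single expression. The symbols N (number of parameters), D (number of tokens), F_GPU (number of GPU FLOPs, per second) and P (dollar cost of a GPU hour) are shorthand introduced here for readability and are not notation from the methodology:

```latex
\[
\text{Training cost}
  = \frac{6 \, N \, D}{\text{MFU} \cdot F_{\text{GPU}}} \cdot \frac{P}{3600}
\]
```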

The formula is applied to models for which a time stamp was available. To obtain the number of parameters, each model was downloaded individually. The formula can only be evaluated where the number of parameters is known, which was the case for only about 13% of the models with time stamps.

The key input variables for the formula are as follows: 

  • Number of parameters: The number of parameters in a Hugging Face model represents the adjustable components within the model. More parameters typically imply more capacity to understand and generate language. This is the primary input variable and is obtained by downloading each model.
  • Number of tokens: This methodology assumes that every model is trained on 1 “chinchilla” of data, that is, approximately 20 tokens per model parameter. This is the amount of text estimated to be necessary to train a model of a given size in an optimally efficient fashion. The Chinchilla ratio is typically used to determine the optimal number of tokens for decoder-only models like GPT-3, but the same approximation is applied here to other types of models, such as BERT.
  • Model FLOPs Utilization (MFU): This variable measures how efficiently training uses the available hardware. It is the share of the GPU’s peak throughput that is actually spent on the floating point operations (FLOPs) required for the model’s forward and backward passes. This methodology assumes an MFU value of 55% for the purposes of this simulation, a high degree of efficiency typically associated with training in an optimised fashion.
  • Number of GPU FLOPs: This variable is the peak throughput of a single GPU, measured in floating point operations per second. A value of 312 TFLOPs per second (312 × 10^12 operations per second) is considered for the A100-40GB GPU.
  • Dollar cost of a GPU hour: This variable is the cost in US dollars (USD) of using one GPU for one hour. For simplification purposes, a value of 1.10 USD is used, which is an approximation based on the 2023 prices available from commonly used cloud providers.
  • Number of Graphics Processing Units (GPUs): A GPU is a specialised electronic circuit designed to perform the rapid calculations and manipulations needed for rendering graphics and images. There is currently no data available for this input variable. As such, a value of 64 GPUs is assumed in the methodology, as this is estimated to be a common setting among cloud providers.
  • GPU type: The hardware used to train the model. There is currently no data available for this input variable. As such, the methodology assumes that an A100-40GB GPU was used, as this is estimated to be a common setting among cloud providers.
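
The formula and the default input values listed above can be combined in a short sketch. The function and argument names are hypothetical, and the wall-clock estimate rests on an assumption not stated in the methodology, namely that training scales perfectly across GPUs.

```python
def simulate_training_cost(
    n_params,
    tokens_per_param=20,          # 1 chinchilla: about 20 tokens per parameter
    mfu=0.55,                     # assumed Model FLOPs Utilization
    gpu_flops_per_second=312e12,  # A100-40GB value from the text (312 TFLOPs per second)
    usd_per_gpu_hour=1.10,        # approximate 2023 cloud price
    n_gpus=64,                    # assumed number of GPUs
):
    """Simulate the training cost of a model from its parameter count."""
    n_tokens = tokens_per_param * n_params
    training_flops = 6 * n_params * n_tokens
    actual_flops = training_flops / mfu
    gpu_seconds = actual_flops / gpu_flops_per_second
    cost_usd = gpu_seconds * usd_per_gpu_hour / 3600
    # Assumption for illustration only: perfect scaling across GPUs.
    wall_clock_hours = gpu_seconds / n_gpus / 3600
    return {
        "gpu_seconds": gpu_seconds,
        "cost_usd": cost_usd,
        "wall_clock_hours": wall_clock_hours,
    }


# Hypothetical model with 1 billion parameters.
result = simulate_training_cost(1e9)
print(f"GPU-hours: {result['gpu_seconds'] / 3600:.1f}")
print(f"Simulated cost: {result['cost_usd']:.2f} USD")
print(f"Wall-clock time on 64 GPUs: {result['wall_clock_hours']:.1f} hours")
```

With the default assumptions, a hypothetical 1-billion-parameter model comes out at roughly 214 USD. In the formula as written, the number of GPUs changes only the wall-clock time and not the cost, and because the number of tokens is tied to the number of parameters, the simulated cost grows with the square of the parameter count.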

This formula simulates the cost to train a model, while making various assumptions about key input variables. These include hardware setting and efficiency assumptions typically associated with training in an optimised fashion. The simulated values should therefore not be interpreted as true training costs, as they may differ from the real values. More precise and complete information on key variables like hardware settings (e.g. number of GPUs and GPU type) would help to improve the accuracy of the cost simulator.

Data updates 

The AI models and datasets visualisations on OECD.AI that use data from Hugging Face are updated on a quarterly basis. 

Acknowledgement 

This methodology was developed in collaboration with the OECD.AI Expert Group on AI Compute and Climate. Special thanks are extended to Steering Group member Jonathan Frankle (MosaicML) for contributions to the development of the formula, Expert Group member Sasha Luccioni (Hugging Face) for collaboration in data access and processing, Steering Group member David Kanter (MLCommons) for inputs into developing the methodology, and Co-Chair Keith Strier (NVIDIA) for oversight and guidance.