GitHub data

Methodological note

Background

Since its creation in 2007, GitHub has become the main provider of Internet hosting for software development and version control. Many technology organisations and software developers use GitHub as a primary place for collaboration. To enable collaboration, GitHub is structured into projects or “repositories”, which contain all of a project’s files and each file’s revision history. The analysis of GitHub data could shed light on relevant metrics about who is developing AI software, where, how fast, and using which development tools. These metrics could serve as proxies for broader trends in the field of software development and innovation.

Identifying AI projects

Arguably, a significant portion of AI software development takes place on GitHub. OECD.AI partners with GitHub to identify public AI projects – or “repositories” – following the methodology developed by Gonzalez et al. (2020). Using the 439 topic labels identified by Gonzalez et al. – as well as the topics “machine learning”, “deep learning”, and “artificial intelligence” – GitHub provides OECD.AI with a list of public projects containing AI code. GitHub updates the list of public AI projects on a quarterly basis, which allows OECD.AI to capture trends in AI software development over time.
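
The topic-based selection can be pictured as a set intersection between a project's topics and a list of AI topic labels. The sketch below is illustrative only. It uses the three topics named above rather than the full list of 439 labels, and both the label normalisation (lower case, hyphens treated as spaces) and the function name `is_ai_project` are assumptions, not GitHub's actual selection logic.

```python
# Illustrative subset: only the three topics named in the text.
# The 439 labels from Gonzalez et al. (2020) are not reproduced here.
AI_TOPICS = {"machine learning", "deep learning", "artificial intelligence"}


def normalise(topic: str) -> str:
    """Lower-case a topic and treat hyphens/underscores as spaces (assumed convention)."""
    return " ".join(topic.lower().replace("-", " ").replace("_", " ").split())


def is_ai_project(topics: list[str], ai_topics: set[str] = AI_TOPICS) -> bool:
    """Return True if at least one project topic matches an AI topic label."""
    wanted = {normalise(t) for t in ai_topics}
    return any(normalise(t) in wanted for t in topics)
```

For instance, a project tagged `["Deep-Learning", "python"]` would be selected under these assumptions, while one tagged only `["web", "css"]` would not.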

Obtaining AI projects’ metadata

OECD.AI uses GitHub’s list of public AI projects to query GitHub’s public API and obtain more information about these projects. Project metadata may include the individual or organisation that created the project; the programming language(s) (e.g. Python) and development tool(s) (e.g. Jupyter Notebooks) used in the project; and information about the contributions – or “commits” – made to it, including each commit’s author and timestamp. In practical terms, a contribution or “commit” is an individual change to a file or set of files. Additionally, GitHub automatically suggests topical tags for each project based on its content. These tags appear in the metadata only once the project owner(s) have confirmed or modified them.
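
As a rough illustration of this step, the sketch below builds the URL of GitHub's public REST endpoint for listing a repository's commits and reduces one commit record to the author and timestamp fields mentioned above. It is not OECD.AI's actual crawler. The helper names (`fetch_json`, `commits_url`, `parse_commit`) are hypothetical, and authentication, pagination and rate limiting are left out.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_json(url: str):
    """Fetch a public GitHub REST API URL and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def commits_url(owner: str, repo: str) -> str:
    """URL that lists the commits of a repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits"


def parse_commit(item: dict) -> dict:
    """Keep only the author and timestamp fields of one commit record."""
    git_author = item["commit"].get("author") or {}
    account = item.get("author") or {}  # null when no GitHub account matches
    return {
        "login": account.get("login"),
        "email": git_author.get("email"),
        "timestamp": git_author.get("date"),
    }
```

A call such as `[parse_commit(c) for c in fetch_json(commits_url(owner, repo))]` would then return one small record per commit on the first page of results.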

Mapping contributions to AI projects to a country

Contributions to public AI projects are mapped to a country based on location information at the contributor level and at the project level.

a) Location information at the contributor level:

GitHub’s “Location” field: contributors can provide their location in their GitHub account. Because the field accepts free text, the locations provided are not standardised and may refer to different geographic levels (e.g. suburban, urban, regional, or national). To allow cross-country comparisons, Mapbox is used to standardise all available locations to the country level. The latest location available at the time of crawling is used.
Top-level domain: where the location field is empty or the location is not recognised, a contributor’s location is assigned based on the top-level domain of their email address (e.g. “.fr”, “.us”). The email address attached to the commit is used.

b) Location information at the project level:

Project information: where no location information is available at the contributor level, information at the repository or project level is exploited. In particular, contributions from contributors with no location information to projects created or owned by a known organisation are automatically assigned the organisation’s country (i.e. the country where its headquarters are located). For example, contributions from a contributor with no location information to an AI project owned by Microsoft will be assigned to the United States.

If all of the above fail, the contributor’s location is left blank.
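
The order of fallbacks described in this section can be sketched as a simple cascade. The three lookup tables below are hypothetical stand-ins: a real pipeline would call a geocoding service for the free-text location and use a curated list of organisations and their headquarters. The function name `resolve_country` is likewise an assumption made for illustration.

```python
# Hypothetical lookup tables used only for illustration.
GEOCODED = {"paris, france": "FR", "bay area": "US"}   # stands in for geocoding
TLD_TO_COUNTRY = {"fr": "FR", "us": "US", "de": "DE"}  # country-code domains only
ORG_COUNTRY = {"microsoft": "US"}                      # organisation headquarters


def resolve_country(location, email, project_owner):
    """Apply the three fallbacks in order; return None if all of them fail."""
    # 1. Free-text location field, standardised to the country level.
    if location:
        country = GEOCODED.get(location.strip().lower())
        if country:
            return country
    # 2. Top-level domain of the email address attached to the commit.
    if email and "." in email:
        tld = email.rsplit(".", 1)[-1].lower()
        if tld in TLD_TO_COUNTRY:
            return TLD_TO_COUNTRY[tld]
    # 3. Country of the organisation that owns the project.
    if project_owner:
        return ORG_COUNTRY.get(project_owner.lower())
    return None
```

Generic domains such as “.com” carry no country information, so in this sketch they fall through to the project-level step.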

As of October 2021, 71.2% of the contributions to public AI projects were mapped to a country using this methodology. However, the share of AI projects for which a location can be identified shows a decreasing trend over time, indicating a possible lag in location reporting.

Measuring contributions to AI projects

Collaboration on a given public AI project is measured by the number of contributions – or “commits” – made to it.

To obtain a fractional count of contributions by country, each AI project is given a weight of one, which is divided equally among the contributions made to it. A country’s fractional count for a project is therefore the share of that project’s contributions made from that country, and a country’s total contributions to AI projects is the sum of its fractional counts across all AI projects. In relative terms, the share of contributions to public AI projects made by a given country is the ratio of that country’s total fractional count to the total fractional count of all countries.
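
The fractional counting described above can be sketched in a few lines. The sketch assumes that each project is represented by the list of countries of its mapped commits (commits without a country are assumed to have been dropped beforehand). The function names are illustrative, not part of any OECD.AI code.

```python
from collections import Counter, defaultdict


def fractional_counts(commit_countries):
    """Split one project (weight 1) across countries in proportion to commits."""
    counts = Counter(commit_countries)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


def country_shares(projects):
    """Sum fractional counts over projects and express them as shares of the total."""
    totals = defaultdict(float)
    for commit_countries in projects.values():
        for country, weight in fractional_counts(commit_countries).items():
            totals[country] += weight
    grand_total = sum(totals.values())
    return {country: value / grand_total for country, value in totals.items()}


# Two hypothetical projects: one with three commits, one with two.
projects = {
    "project-a": ["FR", "US", "PK"],
    "project-b": ["US", "US"],
}
shares = country_shares(projects)
```

In this made-up example each country receives one third of project-a, and the United States receives all of project-b, so its share of total contributions is (1/3 + 1) / 2 = 2/3.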

In future iterations, OECD.AI plans to include additional measures of contribution to AI software development, such as issues raised, comments, and pull requests.

Where no commits are detected, or no location is detected for the users making the commits, the location of the repository owner is used as a fallback. In most cases, this simply results in one additional commit being assigned to the owner of the repository.

Based on commits, fractional counts are used to assign an AI project or repository to one or more countries. For example, the location of a project with one commit from France, one from the US, and one from Pakistan will be split equally between these countries (0.33 each). The repository’s creation date is used to situate the project in time. Available filters (e.g. popularity and impact) are properties of the AI project and not of the commits.

Identifying the programming languages and development tools used in AI projects

GitHub uses the file extensions found in a project to automatically tag it with one or more programming languages and/or development tools. A given AI project can therefore be associated with more than one programming language or development tool.
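
A minimal sketch of extension-based tagging is shown below. The mapping is a small hand-written sample chosen for illustration, not GitHub's actual table, and `detect_languages` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Small illustrative mapping; GitHub's real table is far larger.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".ipynb": "Jupyter Notebook",
    ".r": "R",
    ".cpp": "C++",
    ".js": "JavaScript",
}


def detect_languages(file_paths):
    """Return the set of languages or tools implied by the files' extensions."""
    detected = set()
    for path in file_paths:
        extension = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(extension)
        if language:
            detected.add(language)
    return detected
```

A project containing both `src/train.py` and `notebooks/eda.ipynb` would be tagged with two entries here, which is why a single project can count towards several languages or tools.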

Measuring the quality of AI projects

Two quality measures are used to classify public AI projects:

Project impact: the impact of an AI project is given by the number of managed copies (i.e. “forks”) made of that project.
Project popularity: the popularity of an AI project is given by the number of followers (i.e. “stars”) received by that project.

Filtering by project impact or popularity could help identify the countries that contribute the most to high-quality projects.
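
Both measures can be read from a repository's metadata as returned by GitHub's REST API, where they appear as `forks_count` and `stargazers_count`. The sketch below shows one way such a filter could look. The function names and the threshold values are assumptions chosen for illustration and do not reflect the cut-offs used by OECD.AI.

```python
def quality_measures(repo: dict) -> dict:
    """Extract impact (forks) and popularity (stars) from repository metadata."""
    return {
        "impact": repo.get("forks_count", 0),
        "popularity": repo.get("stargazers_count", 0),
    }


def filter_projects(repos, min_impact=0, min_popularity=0):
    """Keep the repositories that meet both (hypothetical) thresholds."""
    kept = []
    for repo in repos:
        measures = quality_measures(repo)
        if measures["impact"] >= min_impact and measures["popularity"] >= min_popularity:
            kept.append(repo)
    return kept
```

Because both measures are properties of the project, a filter like this selects whole projects, and the fractional counts by country are then computed over the projects that remain.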

Measuring collaboration

Two countries are said to collaborate on a specific public AI software development project if there is at least one contributor from each country with at least one contribution (i.e. “commit”) to the project. Domestic collaboration occurs when two contributors from the same country contribute to a project.
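
The two definitions can be sketched as follows for a single project. The input is assumed to be a mapping from contributor identifier to country, restricted to contributors with at least one commit and a known country. The function name `collaborations` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def collaborations(contributor_countries):
    """Return (international country pairs, countries with domestic collaboration)."""
    per_country = Counter(contributor_countries.values())
    # Every unordered pair of distinct countries present in the project.
    international = set(combinations(sorted(per_country), 2))
    # A country collaborates domestically when it has two or more contributors.
    domestic = {country for country, n in per_country.items() if n >= 2}
    return international, domestic


# Hypothetical project with two contributors from France and one from the US.
pairs, domestic = collaborations({"alice": "FR", "bob": "FR", "carol": "US"})
```

In this made-up example, `pairs` contains the single pair France and United States, and `domestic` contains France only.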