OpenAlex

Methodological note

Introduction

The OpenAlex Dataset is a comprehensive, open-source bibliographic database offering extensive information on academic publications (Priem et al., 2022). The successor to the Microsoft Academic Graph (MAG) (Sinha et al., 2015; Wang et al., 2019), it is maintained by the nonprofit organisation OurResearch. It includes over 245 million research publications, including journal articles, conference papers and workshop papers.

The dataset encompasses a wide range of bibliographic data, including information about authors, institutions and their countries, journals, conferences, and fields of study. It also includes citation data, allowing for the analysis of citation networks and the impact of individual publications. The data records are tagged with a set of 65 000 topics from Wikidata, covering a range of different subjects.

The OpenAlex Dataset is also designed to be interoperable with other data sources, making it a valuable resource for data integration projects. Its comprehensive coverage of academic publications, combined with its open-source nature, makes it a unique and valuable resource for researchers and data scientists alike.

Determining Artificial Intelligence papers for OECD.AI

The visualizations in the OECD AI Policy Observatory (OECD.AI) utilize a subset of the OpenAlex Dataset, specifically papers related to AI. A paper is considered to be about AI if it is tagged during the concept detection operation with a field of study that is categorised in either the “artificial intelligence” or the “machine learning” fields of study in the OpenAlex taxonomy. Results from other fields of study, such as “natural language processing”, “speech recognition”, and “computer vision” are only included if they also belong to the “artificial intelligence” or the “machine learning” fields of study. As such, the results are likely to be conservative.

Collaboration between countries or institutions

OECD.AI showcases research collaborations between various entities, including institutions and countries, adhering to the “OECD Guidelines on the use of country names and codes”. This is done by assigning each paper to the relevant institutions and countries on the basis of the authors’ institutional affiliations. Information about an author’s institutional affiliations is available for only about 51% of AI publications in OpenAlex, so collaboration statistics may be underestimated. OECD and CSIC (2016) define collaboration as “co-authorship involving different institutions. International collaboration refers to publications co-authored among institutions in different countries…National collaboration concerns publications co-authored by different institutions within the reference country. No collaboration refers to publications not involving co-authorship across institutions. No collaboration includes single-authored articles, as long as the individual has a single affiliation, as well as multiple-authored documents within a given institution.” Conversely, institutional measures of collaboration may overestimate actual collaboration in countries where double affiliations are common practice (OECD and CSIC, 2016).

To avoid double counting, collaborations are considered to be binary: either an entity collaborates on a paper (value=1) or it does not (value=0). The shared paper counts as one toward the number of collaborations between two entities. The following rules apply:

For between-country collaborations: papers written by authors from more than one institution in the same country only count as one collaboration for that country.
For between-institution collaboration: papers written by more than one author from the same institution only count as one collaboration for that institution.
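The binary counting rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OECD.AI implementation: each paper is assumed to be represented as a list with one entity label (a country or an institution) per author, and the function name `count_collaborations` and the sample papers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(papers):
    """Count pairwise collaborations between entities.

    `papers` is a list of papers; each paper is a list holding one entity
    label per author (hypothetical input format). A shared paper adds
    exactly 1 to each pair of distinct entities that appear on it.
    """
    counts = Counter()
    for entities in papers:
        # Binary rule: several authors from the same entity count once.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample: country of affiliation for each author of three papers.
papers = [
    ["US", "US", "FR"],  # two US authors still yield one US-FR collaboration
    ["US", "FR", "CN"],
    ["FR", "FR"],        # a single entity, so no collaboration
]
collabs = count_collaborations(papers)
```

Sorting the deduplicated labels keeps each pair in one canonical order, so a US-FR paper and an FR-US paper increment the same key.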

Counting of publications: quantity measure

In absolute terms, each publication counts as one unit towards an entity (a country or an institution). To avoid double counting, a publication written by multiple authors from different institutions is split equally among its authors, and each author's share is attributed to that author's entity. For example, if a publication has four authors from institutions in the US, one author from an institution in China and one author from a French institution, then 4/6 are attributed to the US, 1/6 to China and 1/6 to France. Similar logic is applied when counting citations. We provide this normalised measure as an additional indicator alongside the raw counts of publications and citations, under the suffix “(fractional count)”.
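The fractional count can be reproduced with a short Python sketch. It assumes each author carries exactly one country label; the function name `fractional_counts` is hypothetical, and the handling of authors with several or no affiliations is not described in this note and is therefore left out.

```python
from collections import Counter
from fractions import Fraction


def fractional_counts(author_countries):
    """Split one publication equally among its authors and sum by country.

    `author_countries` holds one country label per author (assumed format).
    """
    n_authors = len(author_countries)
    shares = Counter()
    for country in author_countries:
        # Each author contributes an equal 1/n share to their country.
        shares[country] += Fraction(1, n_authors)
    return dict(shares)


# The example from the text: four US authors, one in China, one in France.
shares = fractional_counts(["US"] * 4 + ["CN", "FR"])
```

Exact fractions are used instead of floats so that the shares of a publication always sum to exactly one.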

The “Publications per capita” checkbox allows the user to normalise the number of publications per unit of population for countries with a population of at least one million.

Counting of publications: quality measure

Although by no means a perfect measure, citations are used to estimate the ‘quality’ of a publication, with a decay factor to adjust for time. For each publication, the normalised quality score is:

Quality score = # Citations / [(upcoming year) – (year of publication)]
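As a minimal sketch, the formula can be written as a Python function. It assumes that “upcoming year” means the calendar year following the year in which the score is computed; the function and parameter names are hypothetical.

```python
def quality_score(citations, publication_year, current_year):
    """Citations divided by the number of years to the upcoming year.

    Assumption: the upcoming year is `current_year + 1`, so a paper
    published in the current year has a denominator of 1, never 0.
    """
    upcoming_year = current_year + 1
    return citations / (upcoming_year - publication_year)


# Hypothetical paper: 30 citations, published in 2020, scored in 2024.
score = quality_score(citations=30, publication_year=2020, current_year=2024)
```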

Based on this score, publications are categorised as low, medium or high quality. To determine the category boundaries, we calculate the average number of citations per year in the field of AI and normalise it using the same formula as above. We use the distribution of these normalised scores to assign categories: any publication below the first quartile is tagged as low quality, publications between the first and third quartiles are considered medium quality, and publications above the third quartile are considered high quality. The specific thresholds are shown in the table below:

Normalized score of a paper	Category
score≤0.828	Low
0.828<score≤0.946	Medium
score>0.946	High

Table of publication score ranges and their classification into low, medium and high quality.
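The thresholds in the table translate directly into a lookup function. The sketch below assumes the input is already a normalised score on the same scale as the table; the constant and function names are hypothetical.

```python
# Thresholds taken from the table above.
LOW_MAX = 0.828
MEDIUM_MAX = 0.946


def quality_category(score):
    """Map a normalised score to Low, Medium or High using the table's bounds."""
    if score <= LOW_MAX:
        return "Low"
    if score <= MEDIUM_MAX:
        return "Medium"
    return "High"
```

Both boundaries are inclusive on the lower category, matching the “≤” signs in the table.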

Policy areas

Classification of scientific publications by policy area

A list of the most relevant fields of study from the OpenAlex taxonomy was created for each policy area. Policy areas include agriculture, competition, corporate governance, development, digital economy, economy, education, employment, environment, finance and insurance, health, industry and entrepreneurship, innovation, investment, public governance, science and technology, social and welfare issues, tax, trade, and transport. An AI-related publication from OpenAlex must contain at least one of the relevant OpenAlex topics for a given policy area to be classified in that policy area.
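The “at least one relevant topic” rule amounts to a set intersection, as the following Python sketch shows. The topic lists in `POLICY_AREA_TOPICS` are invented placeholders, since the actual lists are not reproduced in this note, and the case-insensitive matching is likewise an assumption.

```python
def classify_policy_areas(paper_topics, policy_area_topics):
    """Return every policy area sharing at least one topic with the paper."""
    topics = {topic.lower() for topic in paper_topics}
    return sorted(
        area
        for area, relevant in policy_area_topics.items()
        if topics & relevant  # non-empty intersection means a match
    )


# Hypothetical topic lists for two policy areas (illustrative only).
POLICY_AREA_TOPICS = {
    "health": {"medicine", "public health"},
    "transport": {"traffic engineering", "autonomous vehicle"},
}

areas = classify_policy_areas(["Machine learning", "Medicine"], POLICY_AREA_TOPICS)
```

Because the function collects every matching area, a single publication can be classified under several policy areas at once.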

AI Compute and environmental sustainability

AI research publications in compute and environmental sustainability were analysed by selecting publications that match the following concepts, keywords and subtopics using Wikidata as the knowledge base: “computer cluster”, “computer graphics”, “computer hardware”, “networking hardware”, “central processing unit”, “cloud computing”, “computing platform”, “Microsoft Azure”, “Amazon web services”, “Google cloud platform”, “Cloud computing”, “Oracle cloud”, “HPC”, and “HPCC”. Publications related to “environmental sustainability” were matched against the following concepts, keywords and subtopics: “ecosystem”, “digital twin”, “efficiency”, “environmental sustainability”, and “sustainable development”.

Additional metrics

Additional metrics are used to construct the y-axis of the “AI publications vs GDP per capita by country, region, in time” chart. These indicators include GDP, population and GDP per capita.

Visualisations also make use of gross domestic expenditures on research and development (R&D), expressed as a percent of GDP. They include both capital and current expenditures in the four main sectors: Business enterprise, Government, Higher education and Private non-profit. R&D covers basic research, applied research, and experimental development. Source: UNESCO Institute for Statistics (uis.unesco.org).

Limitations

The drop in bibliographic data from OpenAlex in 2024 resulted from a combination of internal adjustments in data processing and external factors affecting data availability. These were mostly one-time shifts or temporary breaks in data availability aimed at improving data quality, rather than a long-term reduction in comprehensiveness. Contributing factors include metadata inaccuracies; changes in data sources and ingestion practices; the transition to new data formats and snapshot updates; discrepancies in document type classifications; challenges in author disambiguation; and external factors affecting data availability.

Users are encouraged to stay informed about ongoing updates and improvements to OpenAlex’s data infrastructure.

References

Priem, J.; Piwowar, H.; and Orr, R. (2022), OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, Institutions, and Concepts. arXiv:2205.01833 [cs]. DOI: http://dx.doi.org/10.48550/arXiv.2205.01833.

OECD (2019a), Recommendation of the Council on Artificial Intelligence, OECD/LEGAL/0449, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449.

OECD (2019b), Revised outline for practical guidance for the Recommendation of the Council on Artificial Intelligence, https://one.oecd.org/document/DSTI/CDEP(2019)4/REV2/en.

OECD and SCImago Research Group (CSIC) (2016), Compendium of Bibliometric Science Indicators, OECD Publishing, Paris, http://oe.cd/scientometrics.

Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.; and Wang, K. (2015), An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. DOI: http://dx.doi.org/10.1145/2740908.2742839.

Wang, K.; Shen, Z.; Huang, C.; Wu, C.; Eide, D.; Dong, Y.; Qian, J.; Kanakia, A.; Chen, A.; and Rogahn, R. (2019), A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data, https://doi.org/10.3389/fdata.2019.00045.