OpenAlex

Methodological note

Introduction

The OpenAlex Dataset is a comprehensive, open-source bibliographic database offering extensive information on academic publications (Priem et al., 2022). The successor to the Microsoft Academic Graph (MAG) (Sinha et al., 2015; Wang et al., 2019), it is maintained by the non-profit organisation OurResearch. It covers over 245 million research publications, including journal articles, conference papers, and workshop papers.

The dataset encompasses a wide range of bibliographic data, including information about authors, institutions and their countries, journals, conferences, and fields of study. It also includes citation data, allowing for the analysis of citation networks and the impact of individual publications. The data records are tagged with a set of 65 000 topics from Wikidata, covering a range of different subjects.

The OpenAlex Dataset is also designed to be interoperable with other data sources, making it a valuable resource for data integration projects. Its comprehensive coverage of academic publications, combined with its open-source nature, makes it a unique and valuable resource for researchers and data scientists alike.

Determining Artificial Intelligence papers for OECD.AI 

The visualizations in the OECD AI Policy Observatory (OECD.AI) use a subset of the OpenAlex Dataset, namely papers related to AI. A paper is considered to be about AI if, during concept detection, it is tagged with a field of study that falls under either the “artificial intelligence” or the “machine learning” field of study in the OpenAlex taxonomy. Results from other fields of study, such as “natural language processing”, “speech recognition”, and “computer vision”, are included only if they also belong to the “artificial intelligence” or the “machine learning” field of study. As such, the results are likely to be conservative.
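The selection rule above can be sketched as a walk up the field-of-study hierarchy. This is a minimal illustration, not the OECD.AI implementation: the function name, the `parents` mapping and the toy taxonomy are assumptions made for the example, and the real OpenAlex taxonomy is far larger.

```python
# The two root fields named in the text.
AI_ROOTS = {"artificial intelligence", "machine learning"}

# Toy taxonomy (invented for illustration): field -> parent fields.
TOY_PARENTS = {
    "deep learning": {"machine learning"},
    "image segmentation": {"computer vision"},
    "computer vision": {"computer science"},
}


def is_ai_paper(tags, parents):
    """Return True if any tag is, or descends from, one of the AI root fields."""
    seen = set()
    stack = [tag.lower() for tag in tags]
    while stack:
        field = stack.pop()
        if field in AI_ROOTS:
            return True
        if field in seen:
            continue  # guard against cycles in the hierarchy
        seen.add(field)
        stack.extend(parents.get(field, ()))
    return False


# Tagged only with a computer-vision field: excluded.
vision_only = is_ai_paper(["Image segmentation"], TOY_PARENTS)
# Also tagged with a machine-learning field: included.
vision_and_ml = is_ai_paper(["Image segmentation", "Deep learning"], TOY_PARENTS)
```

In this toy taxonomy, a paper tagged only with a computer-vision field is left out, which is why the resulting counts are conservative.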

Collaboration between countries or institutions 

OECD.AI showcases research collaborations between entities, including institutions and countries, adhering to the “OECD Guidelines on the use of country names and codes”. Each paper is assigned to the relevant institutions and countries on the basis of the authors’ institutional affiliations. Information about an author’s institutional affiliations is available for only about 51% of AI publications in OpenAlex, so collaboration statistics may be underestimated. OECD and CSIC (2016) define collaboration as “co-authorship involving different institutions. International collaboration refers to publications co-authored among institutions in different countries…National collaboration concerns publications co-authored by different institutions within the reference country. No collaboration refers to publications not involving co-authorship across institutions. No collaboration includes single-authored articles, as long as the individual has a single affiliation, as well as multiple-authored documents within a given institution.” Institutional measures of collaboration may overestimate actual collaboration in countries where double affiliations are common practice (OECD and CSIC, 2016).

To avoid double counting, collaborations are considered to be binary: either an entity collaborates on a paper (value=1) or it does not (value=0). The shared paper counts as one toward the number of collaborations between two entities. The following rules apply: 

  • For between-country collaborations: papers written by authors from more than one institution in the same country only count as one collaboration for that country. 
  • For between-institution collaboration: papers written by more than one author from the same institution only count as one collaboration for that institution. 
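The binary counting rules above can be illustrated with a short sketch. The function name, the record layout (one dictionary per author affiliation) and the sample papers are assumptions made for this example; they do not reflect the OpenAlex schema or the production pipeline.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(papers, key):
    """Count the papers shared by each pair of entities.

    papers: list of papers, each a list of author-affiliation dicts.
    key: "country" or "institution", the level at which to count.
    """
    pair_counts = Counter()
    for affiliations in papers:
        # A set makes participation binary: an entity appears at most once
        # per paper, however many of its authors or institutions are listed.
        entities = sorted({aff[key] for aff in affiliations})
        for pair in combinations(entities, 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical sample: institutions A and B are in the US, C in CN, D in FR.
papers = [
    [{"institution": "A", "country": "US"},
     {"institution": "B", "country": "US"},
     {"institution": "C", "country": "CN"}],
    [{"institution": "A", "country": "US"},
     {"institution": "A", "country": "US"},
     {"institution": "D", "country": "FR"}],
]

by_country = count_collaborations(papers, "country")
by_institution = count_collaborations(papers, "institution")
```

In the first paper, two US institutions still yield a single US–CN collaboration; in the second, two authors from institution A yield a single A–D collaboration.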

Counting of publications: quantity measure 

In absolute terms, each publication counts as one unit towards every entity (country or institution) with which it is affiliated. To avoid double-counting, a fractional measure is also provided: a publication written by multiple authors from different institutions is split equally among its authors, and each author’s share is attributed to that author’s entity. For example, if a publication has four authors from institutions in the US, one author from an institution in China and one author from a French institution, then 4/6 of the publication is attributed to the US, 1/6 to China and 1/6 to France. The same logic is applied when counting citations. This normalized measure is provided in addition to the raw counts of publications and citations and is labelled with the suffix “(fractional count)”.
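The worked example above can be reproduced with a few lines of code. This is a minimal sketch under the assumption that each author carries exactly one country; the function name is illustrative only.

```python
from collections import Counter
from fractions import Fraction


def fractional_counts(author_countries):
    """Split one publication equally among its authors, then sum by country."""
    share = Fraction(1, len(author_countries))
    counts = Counter()
    for country in author_countries:
        counts[country] += share
    return counts


# Four US authors, one Chinese author, one French author.
counts = fractional_counts(["US", "US", "US", "US", "CN", "FR"])
# US: 4/6 (stored as 2/3), CN: 1/6, FR: 1/6; the shares sum to one publication.
```

Exact fractions are used here so that the shares of a publication always sum to exactly one; a floating-point version would behave the same up to rounding.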

The “Publications per capita” checkbox allows the user to normalise the number of publications per unit of population for countries with a population of at least one million.

Counting of publications: quality measure 

Although by no means a perfect measure, citations are used to estimate the ‘quality’ of a publication, with a decay factor to adjust for time. For each publication, the normalised quality score is:

Quality score = # Citations / [(upcoming year) – (year of publication)]
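As a sketch of the formula above, the score can be computed as follows. The text does not define “upcoming year”; it is read here, as an assumption, as the calendar year following the current one, and is passed in explicitly so that the example does not depend on the date on which it is run.

```python
def quality_score(citations, publication_year, upcoming_year):
    """Citations divided by the number of years up to the upcoming year."""
    years = upcoming_year - publication_year
    if years <= 0:
        raise ValueError("publication_year must be earlier than upcoming_year")
    return citations / years


# A paper published in 2020 with 30 citations, scored with upcoming year 2025.
score = quality_score(30, 2020, 2025)  # 30 / 5 = 6.0
```

Under this reading, the denominator is at least one even for papers published in the current year, so the score is always defined.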

Based on this score, publications are categorised as low, medium, or high quality. To determine the appropriate category, we calculate the average number of citations per year in the field of AI and normalise it using the same formula as above. The distribution of these normalised scores is then used to assign categories: publications below the first quartile are tagged as low quality, publications between the first and the third quartile as medium quality, and publications above the third quartile as high quality. The specific thresholds are shown in the table below:

Normalized score of a paper    Category
score ≤ 0.828                  Low
0.828 < score ≤ 0.946          Medium
score > 0.946                  High
Table of publication scores ranges and their classification into low, medium and high quality.
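The thresholds in the table translate directly into a classification function. This is an illustrative sketch only: the constant and function names are assumptions, and a score exactly on a boundary is assigned to the lower category, consistent with the “≤” and “>” signs in the table.

```python
LOW_MAX = 0.828   # scores up to and including this value are "Low"
HIGH_MIN = 0.946  # scores strictly above this value are "High"


def quality_category(score):
    """Map a normalized quality score to Low, Medium or High."""
    if score <= LOW_MAX:
        return "Low"
    if score <= HIGH_MIN:
        return "Medium"
    return "High"


categories = [quality_category(s) for s in (0.5, 0.9, 1.2)]
# ["Low", "Medium", "High"]
```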

Policy areas

Classification of scientific publications by policy area

A list of the most relevant fields of study from the OpenAlex taxonomy was created for each policy area. Policy areas include agriculture, competition, corporate governance, development, digital economy, economy, education, employment, environment, finance and insurance, health, industry and entrepreneurship, innovation, investment, public governance, science and technology, social and welfare issues, tax, trade, and transport. An AI-related publication from OpenAlex must be tagged with at least one of the relevant OpenAlex topics for a given policy area to be classified in that policy area.
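The “at least one relevant topic” rule amounts to a set intersection, as the following sketch shows. The topic lists are invented placeholders, not the lists actually used by OECD.AI, and the function name is hypothetical.

```python
# Placeholder topic lists (invented for illustration), one per policy area.
AREA_TOPICS = {
    "health": {"medicine", "epidemiology"},
    "transport": {"traffic engineering", "public transport"},
}


def policy_areas_for(paper_topics, area_topics):
    """Return the policy areas whose topic list overlaps the paper's topics."""
    topics = {topic.lower() for topic in paper_topics}
    return sorted(
        area for area, relevant in area_topics.items() if topics & relevant
    )


areas = policy_areas_for(["Epidemiology", "Machine learning"], AREA_TOPICS)
# ["health"]
```

A publication can fall into several policy areas at once, or into none if it carries no relevant topic.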

AI Compute and environmental sustainability

AI research publications in compute and environmental sustainability were analysed by selecting publications that match the following concepts, keywords and subtopics using Wikidata as the knowledge base: “computer cluster”, “computer graphics”, “computer hardware”, “networking hardware”, “central processing unit”, “cloud computing”, “computing platform”, “Microsoft Azure”, “Amazon web services”, “Google cloud platform”, “Cloud computing”, “Oracle cloud”, “HPC”, and “HPCC”. Publications related to “environmental sustainability” were matched against the following concepts, keywords and subtopics: “ecosystem”, “digital twin”, “efficiency”, “environmental sustainability”, and “sustainable development”.

Additional metrics

Additional metrics are used to construct the y-axis of the “AI publications vs GDP per capita by country, region, in time” chart. These indicators include: 

  • GDP: GDP at purchaser’s prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without deducting the depreciation of fabricated assets or the depletion and degradation of natural resources. Data are in current US dollars. Dollar figures for GDP are converted from domestic currencies using single year official exchange rates. For a few countries where the official exchange rate does not reflect the rate effectively applied to actual foreign exchange transactions, an alternative conversion factor is used. Sources: World Bank national accounts data and OECD National Accounts data files (data.worldbank.org/). 
  • GDP per capita: GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. Sources: World Bank national accounts data and OECD National Accounts data files (data.worldbank.org/). 
  • Population: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. Sources: United Nations Population Division, World Population Prospects: 2019 Revision; Census reports and other statistical publications from national statistical offices; Eurostat: Demographic Statistics; United Nations Statistical Division, Population and Vital Statistics Report; U.S. Census Bureau: International Database; and Secretariat of the Pacific Community: Statistics and Demography Programme (data.worldbank.org/). 
  • R&D expenditure (% of GDP): Gross domestic expenditures on research and development (R&D), expressed as a percent of GDP. They include both capital and current expenditures in the four main sectors: Business enterprise, Government, Higher education and Private non-profit. R&D covers basic research, applied research, and experimental development. Source: UNESCO Institute for Statistics (uis.unesco.org). 

For these metrics, values are interpolated for years in which no data are available. If the value for the most recent year is missing for an indicator, the value of the latest available year is used.
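The gap-filling rule can be sketched as follows. The text does not state the interpolation method, so linear interpolation between the nearest observed years is assumed here; the function name and the sample values are illustrative only.

```python
def fill_gaps(observed, years):
    """Fill missing years in an indicator series.

    observed: dict mapping year -> observed value.
    years: iterable of years to report.
    Gaps between two observations are linearly interpolated (an assumption);
    years after the last observation reuse the latest available value.
    """
    known = sorted(observed)
    filled = {}
    for year in years:
        if year in observed:
            filled[year] = observed[year]
            continue
        earlier = [y for y in known if y < year]
        later = [y for y in known if y > year]
        if earlier and later:
            y0, y1 = earlier[-1], later[0]
            weight = (year - y0) / (y1 - y0)
            filled[year] = observed[y0] + weight * (observed[y1] - observed[y0])
        elif earlier:
            filled[year] = observed[earlier[-1]]  # carry the latest value forward
        # Years before the first observation are left unfilled.
    return filled


series = fill_gaps({2018: 100.0, 2020: 120.0}, range(2018, 2023))
# 2019 is interpolated to 110.0; 2021 and 2022 reuse the 2020 value, 120.0.
```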

References

Priem, J.; Piwowar, H.; and Orr, R. (2022), OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, Institutions, and Concepts. arXiv:2205.01833 [cs]. DOI: http://dx.doi.org/10.48550/arXiv.2205.01833.

OECD (2019a), Recommendation of the Council on Artificial Intelligence, OECD/LEGAL/0449, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449

OECD (2019b), Revised outline for practical guidance for the Recommendation of the Council on Artificial Intelligence, https://one.oecd.org/document/DSTI/CDEP(2019)4/REV2/en.

OECD and SCImago Research Group (CSIC) (2016), Compendium of Bibliometric Science Indicators, OECD Publishing, Paris, http://oe.cd/scientometrics.

Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.; and Wang, K. (2015), An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. DOI: http://dx.doi.org/10.1145/2740908.2742839.

Wang, K.; Shen, Z.; Huang, C.; Wu, C.; Eide, D.; Dong, Y.; Qian, J.; Kanakia, A.; Chen, A.; and Rogahn, R. (2019), A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data, https://doi.org/10.3389/fdata.2019.00045.