Microsoft Academic Graph data 

Methodological note

Introduction

The Microsoft Academic Graph (MAG) is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study (Sinha et al., 2015; Wang et al., 2019). 

MAG employs advances in machine learning, semantic inference and knowledge discovery to explore scholarly information. It is a semantic search engine (i.e. not keyword-based), which means that it employs natural language processing (NLP) to understand and remember the knowledge conveyed in each document. 

Information is indexed by paper type, author, time of publication, field of study, publication outlet and institution. The graph is updated on a bi-weekly basis. In 2019, close to one million new papers were added every month (Wang, 2019). 

Each paper is automatically categorised into a field of study following a machine-learned taxonomy. An algorithm detects the concepts present in each paper and identifies, or “learns”, the hierarchy of the different fields of study. (In discrete mathematics, graphs are mathematical structures used to model pairwise relations between objects. Graphs are made up of vertices, also called nodes or points, which are connected by edges, also called links or lines.) This concept detection operation is performed every time the graph is updated, i.e. bi-weekly, and is applied to every new article, which is tagged accordingly. The taxonomy itself is adjusted every six months. Because top-level disciplines are important and visible, the top two levels of the taxonomy are manually reviewed against Wikipedia’s hierarchy of topic classifications (Shen et al., 2018).

Determining Artificial Intelligence papers for OECD.AI 

The visualisations provided in the OECD AI Policy Observatory (OECD.AI) use a subset of the MAG comprised of papers related to AI. A paper is considered to be about AI if it is tagged during the concept detection operation with a field of study that is categorised in either the “artificial intelligence” or the “machine learning” fields of study in the MAG taxonomy. Results from other fields of study, such as “natural language processing”, “speech recognition”, and “computer vision” are only included if they also belong to the “artificial intelligence” or the “machine learning” fields of study. As such, the results are likely to be conservative. 
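
As an illustration of this inclusion rule, the sketch below applies it to a toy taxonomy fragment. The field names, the parent mapping and the one-level parent lookup are assumptions made for exposition, not the actual MAG schema.

```python
# Illustrative sketch of the AI-paper filter; the taxonomy fragment and
# the one-level parent lookup are assumptions, not the actual MAG schema.

# Hypothetical mapping from a field of study to the parent fields under
# which it is categorised in the taxonomy.
PARENTS = {
    "natural language processing": {"artificial intelligence", "machine learning"},
    "speech recognition": {"artificial intelligence"},
    "computer vision": {"artificial intelligence"},
    "signal processing": {"electrical engineering"},
}

AI_ROOTS = {"artificial intelligence", "machine learning"}

def is_ai_paper(fields_of_study):
    """A paper is about AI if any detected field is an AI root field
    or is categorised under one of the AI root fields."""
    return any(
        field in AI_ROOTS or PARENTS.get(field, set()) & AI_ROOTS
        for field in fields_of_study
    )

print(is_ai_paper({"speech recognition"}))  # True
print(is_ai_paper({"signal processing"}))   # False
```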

Collaboration between countries or institutions 

OECD.AI displays research collaborations between different entities, either institutions or countries (country names and codes in OECD.AI abide by the “OECD Guidelines regarding the use of the list of names of countries and territories”). This is done by assigning each paper to the relevant institutions and countries on the basis of the authors’ institutional affiliations (information about an author’s institutional affiliations is available for about 51% of AI publications in MAG; collaboration statistics may therefore be underestimated). OECD and CSIC (2016) define collaboration as “co-authorship involving different institutions. International collaboration refers to publications co-authored among institutions in different countries…National collaboration concerns publications co-authored by different institutions within the reference country. No collaboration refers to publications not involving co-authorship across institutions. No collaboration includes single-authored articles, as long as the individual has a single affiliation, as well as multiple-authored documents within a given institution.” Institutional measures of collaboration may overestimate actual collaboration in countries where it is common practice to hold a double affiliation (OECD and CSIC, 2016). 

To avoid double counting, collaborations are considered to be binary: either an entity collaborates on a paper (value=1) or it does not (value=0). The shared paper counts as one toward the number of collaborations between two entities. The following rules apply: 

  • For between-country collaborations: papers written by authors from more than one institution in the same country only count as one collaboration for that country. 
  • For between-institution collaborations: papers written by more than one author from the same institution only count as one collaboration for that institution. 
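
The sketch below illustrates these rules, assuming a simplified input format of one list of (institution, country) affiliation pairs per paper; the institution-level count works the same way on the institution names.

```python
# Minimal sketch of the binary collaboration counting rules above.
# The input format is an illustrative assumption.
from collections import Counter
from itertools import combinations

def collaboration_counts(papers):
    """Count country-pair collaborations, one per shared paper."""
    pair_counts = Counter()
    for affiliations in papers:
        # Deduplicate: several authors or institutions from the same
        # country count only once for that country on a given paper.
        countries = {country for _, country in affiliations}
        for pair in combinations(sorted(countries), 2):
            pair_counts[pair] += 1  # each paper contributes 0 or 1
    return pair_counts

papers = [
    [("MIT", "USA"), ("Stanford", "USA"), ("Tsinghua", "CHN")],
    [("MIT", "USA"), ("INRIA", "FRA")],
]
print(collaboration_counts(papers))
# Counter({('CHN', 'USA'): 1, ('FRA', 'USA'): 1})
```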

Matching institutions to their countries 

The Global Research Identifier Database (GRID) was used to match institutions in MAG to a country. Information about an institution’s geographical coordinates, city, and country from GRID allowed for the geolocation of 72% of the institutions in MAG. The remaining institutions were matched manually to their respective countries. 
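
A minimal sketch of this matching step, with assumed (not actual) record layouts for MAG and GRID:

```python
# Illustrative sketch of GRID-based country matching with a manual
# fallback; all identifiers and field names are assumptions.

GRID_COUNTRIES = {           # grid_id -> country, from the GRID dataset
    "grid.0001.a": "United Kingdom",
    "grid.0002.b": "United States",
}

MANUAL_MATCHES = {           # institutions resolved by hand
    "mag:123456789": "Japan",
}

def institution_country(institution):
    """Return the country of a MAG institution, preferring GRID data."""
    grid_id = institution.get("grid_id")
    if grid_id in GRID_COUNTRIES:
        return GRID_COUNTRIES[grid_id]
    return MANUAL_MATCHES.get(institution["mag_id"])  # None if unresolved
```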

An artificial entity called “international organisations” was created to reflect papers written by international organisations. This avoids counting these papers as originating in the country where the relevant international organisation is headquartered. 

Following the same logic, papers from multinational enterprises were attributed to the country in which the company’s headquarters are located, regardless of the country in which the actual research was conducted. 

Moreover, information from GRID was used to classify institutions by type, including company, education, government, healthcare, archive, facility, non-profit and others. Heuristics were defined to classify institutions for which this information was missing in GRID. Where heuristics did not work, institution types were input manually.

Type of papers 

The MAG classifies papers into the following types depending on the publication outlet: conference; book; book chapter; repository; patent; journal; and other. The “Repository” type refers to archival sites, including arXiv, bioRxiv, and SSRN. There may be several versions of “Repository” papers, including some that may be published in conventional journals. “Other” is a category comprising papers from journals or conferences whose quality is unknown. This includes one-off workshops, new journals, or venues that no longer exist. 

Arguably, patents are conceptually different from the other categories in this list (MAG includes patent applications as publications since they “fit well into the model of publication entity…because they all have authors, affiliations, topical contents, etc., and can receive citations” [Wang et al., 2019]). Therefore, for simplicity, the category “Research publications” – the default setting for most MAG data visualisations on OECD.AI – comprises all paper types except patents. The drop-down menu “Publication type” allows selecting and viewing results for patents only (note that results for patents come with considerable lag, as “publication [of a patent] generally only takes place 18 months after the first filing. As a result, patent data are publicly available for most countries across the world, often in long time series” [OECD, 2009]). 

Counting of publications: quantity measure 

In absolute terms, each publication counts as one unit towards an entity (a country or an institution). To avoid double counting, a publication written by multiple authors from different institutions is split equally among its authors. For example, if a publication has four authors from institutions in the US, one author from an institution in China and one author from a French institution, then 4/6 of the publication is attributed to the US, 1/6 to China and 1/6 to France. The same logic applies to institutional collaborations. 
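
The fractional counting rule can be sketched as follows (the input format is an assumption for illustration):

```python
# Minimal sketch of the fractional counting rule described above.
from collections import Counter

def fractional_counts(author_countries):
    """Split one publication equally across its authors' countries."""
    share = 1 / len(author_countries)
    counts = Counter()
    for country in author_countries:
        counts[country] += share
    return counts

# Four US-based authors, one in China, one in France:
print(fractional_counts(["USA"] * 4 + ["CHN", "FRA"]))
# USA gets 4/6 (about 0.67); CHN and FRA get 1/6 (about 0.17) each.
```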

The “Publications per capita” checkbox allows the user to normalise the number of publications per unit of population for countries with a population of at least one million.

Counting of publications: quality measure 

MAG assigns a rank to each publication to indicate its relevance. Since papers may be published in different formats and venues, in some cases different versions of the same paper exist in MAG. These papers are grouped under a unique identifier called “family ID”, which borrows the value of the “paper ID” of the main paper in the family. MAG ranks its publications by family ID. It does so using a dynamic eigencentrality measure that ranks a publication highly if it impacts highly ranked publications, is authored by highly ranked scholars from reputable institutions, or is published in a highly regarded venue; the measure also takes into account the competitiveness of the field. The eigencentrality measure can be thought of as the likelihood that a publication would be evaluated as highly impactful if a survey were posed to the entire scholarly community. For this reason, MAG calls this measure the “saliency” of the publication. Saliency rankings differ from traditional citation counts in that the latter treat each citation as equal and perpetual, whereas the former weights each citation based on the factors mentioned above. While citation counts could be altered with relative ease, boosting the saliency of an article would require persuading many well-established authors publishing at reputable venues to cite it frequently. Similarly, the saliency of an author, institution, field or publication venue is the sum of the saliencies of the respective publications. 

To adjust for temporal bias – i.e. older publications having more citations than more recent ones because they have been in circulation longer – MAG models saliency as an autoregressive stochastic process. This means that the saliency of a publication decays over time if the publication does not receive continuing acknowledgments, or if its authors, publication venue and fields do not maintain their saliency levels. Reinforcement learning is used to estimate the rate of decay and to adapt the saliency to best predict future citation behaviours. By leveraging the scale of Microsoft’s web crawler in Bing, MAG observes tens of millions of citations each week, which serve as feedback from the entire scholarly community on its saliency assessments. 
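
As a stylised illustration only – MAG’s actual model is more elaborate and is not published in this form – an autoregressive decay of this kind could be written as:

```latex
% Stylised AR(1)-type decay, for illustration only: \rho is a learned
% decay rate (0 < \rho < 1) and f_t the new citation feedback in week t.
\[
  s_t = \rho\, s_{t-1} + f_t,
  \qquad
  f_t = \sum_{c \in C_t} w_c,
\]
% where C_t is the set of citations received in week t and w_c weights
% each citation by the saliency of the citing paper, its authors and venue.
```

Under a form like this, a paper whose citation feedback dries up sees its saliency shrink geometrically at the decay rate, which matches the behaviour described above.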

OECD.AI uses the saliency rankings from MAG as a measure of quality. To provide fairer intertemporal comparisons, publication ranks are normalised according to the publication year. 

OECD AI Principles 

Classification of scientific publications by AI Principle 

A pool-based active learning algorithm was developed and trained to classify publications under each of the OECD AI Principles (“AI Principles”; see OECD [2019a] for more information). Active learning is a type of iterative supervised learning that interactively queries the user to obtain the desired labels for new data points; by choosing the data from which it learns, an active learning algorithm is designed to achieve greater accuracy with fewer training labels than traditional supervised learning (Settles, 2010). A subset of the AI publications in MAG was selected to train the active learning algorithm. This was accomplished by estimating a semantic similarity score between a publication’s title and abstract and information on each of the AI Principles from the following resources:

  • OECD Recommendation of the Council on Artificial Intelligence (OECD, 2019a).
  • Practical implementation guidance for the OECD AI principles (OECD, 2019b).
  • List of keywords purposely created for each AI Principle, assigning a specific relevance level to each term (i.e., either high or standard relevance).

A similarity score was determined using the following methodology:

  • A “count score” was calculated by counting the total number of high relevance and standard relevance keywords in a publication’s keywords, abstract and title using the following formula:
    • Count score = (count of high relevance keywords) + 0.3*(count of standard relevance keywords)
  • A “cosine similarity score” was calculated between the publications and the three abovementioned resources using the following formula: 
    • Cosine similarity score = (cosine similarity between publications and the list of keywords for each AI Principle) + (cosine similarity between publications and each AI Principle’s section from the practical implementation guidance)
  • The similarity score is defined as the sum of the count and the cosine similarity scores:
    • Similarity score = (Count score) + (Cosine similarity score)
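
The sketch below illustrates how such a score could be computed. TF-IDF vectors stand in for the unspecified text representation, and the keyword lists are invented; none of this is the exact OECD.AI implementation.

```python
# Illustrative sketch of the similarity score; TF-IDF is an assumed
# text representation and the keyword lists are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

HIGH_RELEVANCE = {"accountability", "auditability"}      # assumed keywords
STANDARD_RELEVANCE = {"governance", "responsibility"}

def count_score(text):
    """High-relevance hits count 1, standard-relevance hits count 0.3."""
    tokens = text.lower().split()
    return (sum(t in HIGH_RELEVANCE for t in tokens)
            + 0.3 * sum(t in STANDARD_RELEVANCE for t in tokens))

def similarity_score(paper_text, keyword_text, guidance_text):
    """Count score plus the two cosine-similarity terms defined above."""
    vectoriser = TfidfVectorizer()
    X = vectoriser.fit_transform([paper_text, keyword_text, guidance_text])
    cosine = (cosine_similarity(X[0], X[1])[0, 0]
              + cosine_similarity(X[0], X[2])[0, 0])
    return count_score(paper_text) + cosine
```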

The 10 000 publications with the highest similarity scores were included in the training dataset for each AI Principle. Training and refinement of the active learning classifier are expected to continue throughout 2020. 

Selection of related recent scientific research by AI Principle 

After using the active learning classifier to identify publications relevant to each of the AI Principles, publications are sorted based on their MAG saliency rank. The highest-ranking publications from the last six months are then selected to be shown in the OECD.AI platform. 

Policy areas 

Classification of scientific publications by policy area 

A list of the most relevant fields of study from the MAG taxonomy was created for each policy area. Policy areas include agriculture, competition, corporate governance, development, digital economy, economy, education, employment, environment, finance and insurance, health, industry and entrepreneurship, innovation, investment, public governance, science and technology, social and welfare issues, tax, trade, and transport. An AI-related publication from MAG must contain at least one of the relevant MAG topics for a given policy area to be classified in that policy area. 
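
A minimal sketch of this classification rule, with invented field lists standing in for the curated ones:

```python
# Illustrative sketch of policy-area tagging; the field lists are
# invented examples, not the curated OECD.AI lists.
POLICY_AREA_FIELDS = {
    "health": {"medicine", "public health", "epidemiology"},
    "transport": {"autonomous vehicle", "traffic flow"},
}

def policy_areas(paper_fields):
    """Return every policy area whose field list the paper intersects."""
    return [area for area, fields in POLICY_AREA_FIELDS.items()
            if paper_fields & fields]

print(policy_areas({"autonomous vehicle", "machine learning"}))  # ['transport']
```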

Selection of related recent scientific research by policy area 

For each policy area, relevant AI-related publications are sorted based on their MAG saliency rank. The highest-ranking publications from the last six months are then selected to be shown in the OECD.AI platform. 

Additional metrics 

Additional metrics are used to construct the y-axis of the “AI publications vs GDP per capita by country, region, in time” chart. These indicators include: 

  • GDP: GDP at purchaser’s prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without deducting the depreciation of fabricated assets or the depletion and degradation of natural resources. Data are in current US dollars. Dollar figures for GDP are converted from domestic currencies using single year official exchange rates. For a few countries where the official exchange rate does not reflect the rate effectively applied to actual foreign exchange transactions, an alternative conversion factor is used. Sources: World Bank national accounts data and OECD National Accounts data files (data.worldbank.org/). 
  • GDP per capita: GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. Sources: World Bank national accounts data and OECD National Accounts data files (data.worldbank.org/). 
  • Population: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. Sources: United Nations Population Division, World Population Prospects: 2019 Revision; Census reports and other statistical publications from national statistical offices; Eurostat: Demographic Statistics; United Nations Statistical Division, Population and Vital Statistics Report; U.S. Census Bureau: International Database; and Secretariat of the Pacific Community: Statistics and Demography Programme (data.worldbank.org/). 
  • R&D expenditure (% of GDP): Gross domestic expenditures on research and development (R&D), expressed as a percent of GDP. They include both capital and current expenditures in the four main sectors: Business enterprise, Government, Higher education and Private non-profit. R&D covers basic research, applied research, and experimental development. Source: UNESCO Institute for Statistics (uis.unesco.org). 

For these metrics, data are interpolated for years in which no data are available. If an indicator’s value for the most recent year is missing, the value of the latest available year is carried forward. 
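
A minimal sketch of this gap-filling rule using pandas (the series values are invented):

```python
# Illustrative sketch of the interpolation and carry-forward rules.
import pandas as pd

# Hypothetical indicator series with gaps in 2017 and 2019.
gdp = pd.Series([1.0, None, 1.4, None], index=[2016, 2017, 2018, 2019])

filled = gdp.interpolate()  # linear interpolation fills interior gaps: 2017 -> 1.2
filled = filled.ffill()     # carry the latest available value forward: 2019 -> 1.4

print(filled)
```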

COVID-19 research 

The visualisations included under the “COVID-19 research” tab use MAG data and follow a similar methodology to the one described for AI research (see sections Determining Artificial Intelligence papers for OECD.AI, Collaboration between countries or institutions, Matching institutions to their countries, Type of papers, Counting of publications: quantity measure, and Counting of publications: quality measure). 

In contrast to the AI definition above, a paper is considered to be related to coronavirus if it is tagged during the concept detection operation with any of the fields of study below – or their sub-concepts – from the MAG taxonomy: 

  • Coronavirus 
  • Coronavirus disease 2019 (COVID-19) 
  • Severe acute respiratory syndrome 
  • Middle East respiratory syndrome coronavirus 

References

OECD (2019a), Recommendation of the Council on Artificial Intelligence, OECD/LEGAL/0449, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449

OECD (2019b), Revised outline for practical guidance for the Recommendation of the Council on Artificial Intelligence, https://one.oecd.org/document/DSTI/CDEP(2019)4/REV2/en.

OECD and SCImago Research Group (CSIC) (2016), Compendium of Bibliometric Science Indicators, OECD Publishing, Paris, http://oe.cd/scientometrics.

Settles, B. (2010), Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, http://burrsettles.com/pub/settles.activelearning.pdf.

Shen, Z.; Ma, H.; and Wang, K. (2018), A Web-scale system for scientific knowledge exploration, arXiv, https://arxiv.org/abs/1805.12216.

Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.; and Wang, K. (2015), An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. DOI: http://dx.doi.org/10.1145/2740908.2742839.

Wang, K.; Shen, Z.; Huang, C.; Wu, C.; Eide, D.; Dong, Y.; Qian, J.; Kanakia, A.; Chen, A.; and Rogahn, R. (2019), A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data, https://doi.org/10.3389/fdata.2019.00045.