Stack Overflow methodology 

Stack Overflow is a question-and-answer platform for computer programmers. Over 100 million people visit the site every month to ask questions, learn, and share technical knowledge. The site functions as a collection of detailed, high-quality information relating to computer programming. 

Stack Overflow has become a key resource for AI practitioners around the world, including data scientists and machine learning experts, and has emerged as one of the main online platforms where computer programmers convene to discuss a variety of AI-related topics. 

Identifying AI questions and answers 

OECD.AI uses a publicly available dataset, periodically updated by Stack Overflow, to analyse the evolution of AI-related questions and answers over time. To identify questions and answers relating to AI, OECD.AI filters the data using the tags made available by Stack Overflow. Question askers use tags to target appropriate responses, and respondents use them to categorise and identify the subject of answers. Stack Overflow users also aggregate and categorise specific tags to identify the subjects relating to a major area, such as AI and machine learning. OECD.AI uses the list of AI-related tags from Stack Overflow to identify concepts related to machine learning (e.g., unsupervised-learning, neural-network, computer-vision) as well as symbolic AI (e.g., fuzzy logic, expert systems). Over 60 Stack Overflow tags have been identified as being related to AI (Table 1). To colour-code these tags, OECD.AI directly uses the categories from Stack Overflow to show the applications of each topic.
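The tag-based filter described above can be sketched as follows. This is a minimal illustration rather than OECD.AI's actual pipeline: the `AI_TAGS` subset, the `is_ai_post` helper, and the representation of a post as a dictionary with a `tags` list are all assumptions made for the example.

```python
# Hypothetical subset of the AI tag list; the full list appears under "AI tags".
AI_TAGS = frozenset({
    "machine-learning",
    "neural-network",
    "computer-vision",
    "expert-system",
    "fuzzy-logic",
})


def is_ai_post(tags):
    """Return True if at least one of the post's tags is in AI_TAGS."""
    return any(tag.lower() in AI_TAGS for tag in tags)


# Assumed post structure: a dict with an "id" and a list of tag strings.
posts = [
    {"id": 1, "tags": ["python", "neural-network"]},
    {"id": 2, "tags": ["javascript", "css"]},
    {"id": 3, "tags": ["fuzzy-logic"]},
]

# Keep only the posts that carry at least one AI-related tag.
ai_posts = [post for post in posts if is_ai_post(post["tags"])]
ai_post_ids = [post["id"] for post in ai_posts]
```

Storing the tags in a set keeps each membership check constant-time, which matters when the filter runs over millions of posts.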

Location analysis 

Contributions to AI questions and answers are mapped to a country based on location information at the user level. 

  • Stack Overflow location: users have the option of providing their location through their Stack Overflow profiles. The location provided by contributors is not standardised and may refer to different geographic levels (e.g., suburban, urban, regional, or national). To allow cross-country comparisons, Mapbox is used to standardise all available locations to the country level. 
  • Top-level domain: where the location field is empty or the location is not recognised, a contributor’s location is assigned based on the top-level domain of their email address (e.g., “.fr”, “.us”). 

If both approaches fail, the contributor’s location is left blank. 
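The two-step fallback above can be sketched as follows. This is an illustration under stated assumptions, not the production logic: `KNOWN_LOCATIONS` is a hypothetical lookup table standing in for the Mapbox geocoding step, `TLD_TO_COUNTRY` is an assumed mapping from country-code top-level domains to country codes, and all function names are invented for the example.

```python
# Hypothetical lookup table standing in for a geocoding service such as Mapbox.
# A real pipeline would query the geocoder instead of this dictionary.
KNOWN_LOCATIONS = {
    "paris, france": "FR",
    "lyon": "FR",
    "berlin": "DE",
}

# Assumed mapping from country-code top-level domains to country codes.
TLD_TO_COUNTRY = {"fr": "FR", "de": "DE", "us": "US"}


def country_from_location(location):
    """Step 1: standardise a free-text profile location to a country code."""
    if not location:
        return None
    return KNOWN_LOCATIONS.get(location.strip().lower())


def country_from_email_domain(domain):
    """Step 2: fall back to the top-level domain of the email domain."""
    if not domain:
        return None
    tld = domain.strip().lower().rsplit(".", 1)[-1]
    return TLD_TO_COUNTRY.get(tld)


def resolve_country(location, email_domain):
    """Apply both steps in order; return None when neither succeeds."""
    return country_from_location(location) or country_from_email_domain(email_domain)
```

For example, `resolve_country("Lyon", None)` resolves through the profile location, `resolve_country("", "example.de")` falls back to the top-level domain, and a generic domain such as `example.com` with an unrecognised location yields `None`, which corresponds to a blank location.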

As of June 2022, about 25% of questions and answers could be mapped to a location using this methodology. However, the share of AI questions and answers for which a location can be identified has been decreasing over time, which may indicate evolving user reporting preferences or a lag in location reporting. 

Questions, answers, and accepted answers 

OECD.AI analysis presents the overall number of questions, answers, and accepted answers posted to Stack Overflow to highlight different activities and trends on the platform. An accepted answer is the answer selected as most relevant by the user asking the question. In other words, users posing questions usually choose one answer to highlight the response that most adequately answered their question. 

Measuring the quality of questions and answers 

OECD.AI ranks the quality of questions and answers by their ‘score’, i.e., the net number of votes they received. Stack Overflow allows users to vote positively or negatively on any question or answer, and the overall score of a question or answer is the aggregation of all its votes. OECD.AI uses this score to assess the ‘quality’ of questions and answers. The quality indicator is divided into three categories: low, medium, and high. Based on the distribution of votes, OECD.AI defines ‘low’ quality as a question or answer with a score of zero or below (i.e., as many negative votes as positive votes, or more negative than positive votes). ‘Medium’ quality is attributed to a question or answer with a score of more than zero and less than five (e.g., 10 positive votes and 7 negative votes). ‘High’ quality is attributed to a question or answer with a score greater than five (e.g., 15 positive votes and 2 negative votes). 
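The thresholds above can be expressed as a small classification function. This is a sketch with hypothetical function names. Note one assumption: the text defines medium as less than five and high as greater than five, so a score of exactly five is assigned to ‘medium’ here purely for illustration.

```python
def score_from_votes(upvotes, downvotes):
    """Net score: positive votes minus negative votes."""
    return upvotes - downvotes


def quality_category(score):
    """Map a net vote score to 'low', 'medium', or 'high'.

    Assumption: a score of exactly 5 is treated as 'medium', because the
    text defines medium as below five and high as above five.
    """
    if score <= 0:
        return "low"
    if score <= 5:
        return "medium"
    return "high"


# The two worked examples from the text.
medium_example = quality_category(score_from_votes(10, 7))  # net score 3
high_example = quality_category(score_from_votes(15, 2))    # net score 13
```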

Country income level 

OECD.AI uses World Bank groupings to categorise countries by income level. Economies are currently divided into four groups: low, lower-middle, upper-middle, and high. Income is measured using gross national income (GNI) per capita in U.S. dollars, converted from local currency using the World Bank Atlas method. The World Bank updates this categorisation yearly, so a country’s income group may change from one year to the next. 

International knowledge flows 

Stack Overflow enables AI-related international knowledge flows. A knowledge flow occurs when a user from one country answers a question posed by a user in another country. By mapping user location, OECD.AI enables the analysis of international knowledge flows by country over time. OECD.AI visualises interactions from two perspectives: the country asking questions (incoming knowledge flow by country) and the country answering questions (outgoing knowledge flow by country).
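The counting of knowledge flows can be sketched as follows. This is a minimal illustration under assumptions: each question-answer interaction is represented as a hypothetical `(asker_country, answerer_country)` pair, pairs with an unknown location are skipped, and the function names are invented for the example.

```python
from collections import Counter


def knowledge_flows(pairs):
    """Count cross-border flows, keyed as (answering country, asking country)."""
    flows = Counter()
    for asker, answerer in pairs:
        if asker is None or answerer is None:
            continue  # location could not be identified for one of the users
        if asker == answerer:
            continue  # same country, so not an international flow
        flows[(answerer, asker)] += 1
    return flows


def outgoing_by_country(flows):
    """Total flows per answering country."""
    totals = Counter()
    for (answerer, _asker), count in flows.items():
        totals[answerer] += count
    return totals


def incoming_by_country(flows):
    """Total flows per asking country."""
    totals = Counter()
    for (_answerer, asker), count in flows.items():
        totals[asker] += count
    return totals


# Assumed sample data: (asker_country, answerer_country) pairs.
pairs = [("FR", "US"), ("FR", "US"), ("IN", "FR"), ("US", "US"), (None, "DE")]
flows = knowledge_flows(pairs)
outgoing = outgoing_by_country(flows)
incoming = incoming_by_country(flows)
```

In this sample, the United States answers two questions from France and France answers one from India, while the same-country pair and the pair with a missing location are excluded.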

AI tags

machine-learning 
data-science 
neural-network 
conv-neural-network 
recurrent-neural-network 
convolutional-neural-network 
karas 
pytorch 
mxnet 
classification 
supervised-learning 
regression 
cluster-analysis 
unsupervised-learning 
reinforcement-learning 
pca 
support-vector-machines 
svm 
nearest-neighbor 
knn 
k-means 
bayesian-networks 
mixture-model 
decisiontrees 
genetic-algorithm 
simulated-annealing 
hidden-markov-models 
gaussian-process 
kalman-filter 
kalman 
particle-filter 
ensemble-learning 
q-learning 
computer-vision 
face-recognition 
ocr 
image-recognition 
speech-recognition 
voice-recognition 
nlp 
spam-filtering 
anomaly-detection 
recommendation-engine 
machine-translation 
libsvm 
weka 
orange 
shogun 
scikit-learn 
pybrain 
mahout 
rapidminer 
knime 
azure-machine-learning 
nltk 
caffe 
tensorflow 
theano 
keras 
opennmt 
xgboost 
catboost 
stanford-nlp 
deep-learning 
reinforcement-learning 
computer-vision 
robotics 
artificial-intelligence 
automation 
expert-system 
fuzzy-logic