Stack Overflow
Methodological note
Background
Stack Overflow is a question-and-answer platform for computer programmers. Over 100 million people visit the site every month to ask questions, learn, and share technical knowledge. The site functions as a collection of detailed, high-quality information relating to computer programming.
Stack Overflow has become a key resource for AI practitioners around the world, including data scientists and machine learning experts, and has emerged as one of the main online platforms where computer programmers convene to discuss a variety of AI-related topics.
Identifying AI questions and answers
OECD.AI uses a publicly available dataset, periodically updated by Stack Overflow, to analyse how AI-related questions and answers evolve over time. To identify questions and answers relating to AI, OECD.AI filters the data using the tags made available by Stack Overflow. Question askers use tags to target appropriate responses, and respondents use them to categorise and identify the subject of their answers. Stack Overflow users also aggregate specific tags into major subject areas, such as AI and machine learning. OECD.AI uses the list of AI-related tags from Stack Overflow to identify concepts related to machine learning (e.g., unsupervised-learning, neural-network, computer-vision) as well as symbolic AI (e.g., fuzzy logic, expert systems). Over 60 Stack Overflow tags have been identified as being related to AI (Table 1). To colour-code these tags, OECD.AI uses the categories defined by Stack Overflow to show the applications of each topic.
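The tag-based filter described above can be sketched as follows. This is a minimal illustration, not the OECD.AI pipeline: the function names are hypothetical, `AI_TAGS` holds only a small subset of the tag list, and the raw tag string is assumed to be delimited either as `<a><b>` or as `|a|b|`. It is also assumed that tags are stored on questions, so an answer would be classified through the question it belongs to.

```python
import re

# Illustrative subset of the AI tag list (assumption: the full list is used in practice).
AI_TAGS = {
    "machine-learning",
    "unsupervised-learning",
    "neural-network",
    "computer-vision",
    "fuzzy-logic",
    "expert-system",
}


def parse_tags(raw: str) -> set[str]:
    """Split a raw tag string such as '<python><nlp>' or '|python|nlp|' into tag names."""
    return {tag for tag in re.split(r"[<>|]+", raw) if tag}


def is_ai_post(raw_tags: str, ai_tags: set[str] = AI_TAGS) -> bool:
    """Return True when at least one tag of the question is in the AI tag list."""
    return bool(parse_tags(raw_tags) & ai_tags)
```

For example, `is_ai_post("<python><machine-learning>")` is true because one of the two tags appears in the list, while a question tagged only `<python><pandas>` is excluded.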
Location analysis
Contributions to AI questions and answers are mapped to a country based on location information at the user level.
- Stack Overflow location: users have the option of providing their location through their Stack Overflow profiles. The location provided by contributors is not standardised and could belong to different levels (e.g., sub-urban, urban, regional, or national). To allow cross-country comparisons, Mapbox is used to standardise all available locations to the country level.
- Top-level domain: where the location field is empty or the location is not recognised, a contributor’s location is assigned based on the top-level domain of their email address (e.g., “.fr”, “.us”).
If the above fails, a contributor’s location field is left blank.
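The fallback order described above (profile location first, then email top-level domain, otherwise blank) can be sketched as below. This is an illustration under stated assumptions: `geocode` is a hypothetical callable standing in for the Mapbox standardisation step and is not the Mapbox API, `TLD_TO_COUNTRY` is a small made-up lookup table, and all function names are invented for the example.

```python
from typing import Callable, Optional

# Hypothetical lookup from country-code top-level domains to country codes.
TLD_TO_COUNTRY = {"fr": "FR", "us": "US", "de": "DE"}


def country_from_email(email: Optional[str]) -> Optional[str]:
    """Derive a country from the top-level domain of an email address, if known."""
    if not email or "@" not in email:
        return None
    domain = email.rsplit("@", 1)[1]
    tld = domain.rsplit(".", 1)[-1].lower()
    return TLD_TO_COUNTRY.get(tld)


def resolve_country(
    location: Optional[str],
    email: Optional[str],
    geocode: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Apply the two steps in order; None means the location is left blank."""
    if location and location.strip():
        country = geocode(location.strip())  # stand-in for the standardisation step
        if country:
            return country
    return country_from_email(email)
```

Passing the geocoder as a parameter keeps the sketch independent of any particular geocoding service and makes the fallback logic easy to check with a stub.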
As of June 2022, about 25% of questions and answers could be mapped to a location using this methodology. However, the share of AI questions and answers for which a location can be identified has been decreasing over time, which may indicate evolving user reporting preferences or a lag in location reporting.
Questions, answers, and accepted answers
OECD.AI analysis presents the overall number of questions, answers, and accepted answers posted to Stack Overflow to highlight different activities and trends on the platform. An accepted answer is the answer selected as most relevant by the user asking the question. In other words, users posing questions usually choose one answer to highlight the response that most adequately answered their question.
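The three counts can be illustrated with a short sketch. It assumes each post is a dictionary using the field names of the public Stack Overflow data dump (`Id`, `PostTypeId` with 1 for questions and 2 for answers, and `AcceptedAnswerId` on questions); the function name `count_posts` is hypothetical and the code is not the OECD.AI implementation.

```python
def count_posts(posts: list[dict]) -> dict[str, int]:
    """Count questions, answers, and accepted answers in a list of post records."""
    questions = [p for p in posts if p["PostTypeId"] == 1]
    answers = [p for p in posts if p["PostTypeId"] == 2]
    # Ids of the answers that question askers marked as accepted.
    accepted_ids = {
        q["AcceptedAnswerId"]
        for q in questions
        if q.get("AcceptedAnswerId") is not None
    }
    accepted = [a for a in answers if a["Id"] in accepted_ids]
    return {
        "questions": len(questions),
        "answers": len(answers),
        "accepted_answers": len(accepted),
    }
```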
Measuring the quality of questions and answers
OECD.AI ranks the quality of questions and answers by their ‘score’, i.e., the net number of votes they received. Stack Overflow allows users to vote positively or negatively on any question or answer, and the overall ‘score’ of a question or answer is the number of positive votes minus the number of negative votes. OECD.AI uses this score to assess the ‘quality’ of questions and answers. The quality indicator is divided into three categories: low, medium, and high. Based on the distribution of votes, ‘low’ quality is attributed to a question or answer that has a negative or zero score (i.e., more negative than positive votes, or the same number of each). ‘Medium’ quality is attributed to a question or answer that has a score of more than zero and less than five (e.g., 10 positive votes and 7 negative votes). ‘High’ quality is attributed to a question or answer that has a score greater than five (e.g., 15 positive votes and 2 negative votes).
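The three quality categories can be expressed as a small function. This is a sketch with hypothetical names, not the OECD.AI code. One boundary is an assumption: the text does not assign a category to a score of exactly five, and the sketch treats it as ‘high’.

```python
def net_score(positive_votes: int, negative_votes: int) -> int:
    """Score of a post: positive votes minus negative votes."""
    return positive_votes - negative_votes


def quality(score: int) -> str:
    """Map a score to 'low', 'medium', or 'high'."""
    if score <= 0:
        return "low"      # negative or zero score
    if score < 5:
        return "medium"   # more than zero and less than five
    return "high"         # assumption: a score of exactly five counts as high
```

With the examples from the text, `quality(net_score(10, 7))` gives `"medium"` (score 3) and `quality(net_score(15, 2))` gives `"high"` (score 13).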
Country income level
OECD.AI uses World Bank groupings to categorise countries by income level. Economies are currently divided into four groups: low, lower-middle, upper-middle, and high. Income is measured using gross national income (GNI) per capita in U.S. dollars, converted from local currency using the World Bank Atlas method. The World Bank updates this categorisation yearly, so a country's income group may change from one year to the next.
International knowledge flows
Stack Overflow enables AI-related international knowledge flows. Knowledge flows happen when a user from one country answers a question posed by a user in another country. By mapping user location, OECD.AI enables the analysis of international knowledge flows by country over time. OECD.AI visualizes interactions from two perspectives: the country asking questions (incoming knowledge flow by country) and the country answering questions (outgoing knowledge flow by country).
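Counting knowledge flows from mapped question and answer pairs could look like the sketch below. The input format, a sequence of `(asker_country, answerer_country)` pairs, and all function names are assumptions made for illustration. Pairs with an unmapped location, or with both users in the same country, are skipped because they do not represent an international flow.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (asker_country, answerer_country)


def knowledge_flows(pairs: Iterable[Pair]) -> Counter:
    """Count flows keyed by (answering country, asking country)."""
    flows: Counter = Counter()
    for asker, answerer in pairs:
        if asker and answerer and asker != answerer:
            flows[(answerer, asker)] += 1
    return flows


def incoming(flows: Counter, country: str) -> int:
    """Answers received by askers in `country` from other countries."""
    return sum(n for (_, dst), n in flows.items() if dst == country)


def outgoing(flows: Counter, country: str) -> int:
    """Answers given by users in `country` to askers in other countries."""
    return sum(n for (src, _), n in flows.items() if src == country)
```

Keying each flow by the answering and asking country keeps both perspectives available from a single count: summing over the second element gives incoming flows, and summing over the first gives outgoing flows.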
Table 1. AI tags
machine-learning
data-science
neural-network
conv-neural-network
recurrent-neural-network
convolutional-neural-network
karas
pytorch
mxnet
classification
supervised-learning
regression
cluster-analysis
unsupervised-learning
reinforcement-learning
pca
support-vector-machines
svm
nearest-neighbor
knn
k-means
bayesian-networks
mixture-model
decisiontrees
genetic-algorithm
simulated-annealing
hidden-markov-models
gaussian-process
kalman-filter
kalman
particle-filter
ensemble-learning
q-learning
computer-vision
face-recognition
ocr
image-recognition
speech-recognition
voice-recognition
nlp
spam-filtering
anomaly-detection
recommendation-engine
machine-translation
libsvm
weka
orange
shogun
scikit-learn
pybrain
mahout
rapidminer
knime
azure-machine-learning
nltk
caffe
tensorflow
theano
keras
opennmt
xgboost
catboost
stanford-nlp
deep-learning
reinforcement-learning
computer-vision
robotics
artificial-intelligence
automation
expert-system
fuzzy-logic