Stack Overflow Survey

Methodological note

Background

Stack Overflow is a question-and-answer platform for programmers. Over 100 million people visit the site every month to ask questions, learn, and share technical knowledge. The site functions as a detailed, high-quality collection of information relating to programming.

The website has become a key resource for AI practitioners, data scientists and machine learning experts around the world and has emerged as one of the main online platforms where programmers convene to discuss a variety of AI-related topics.

Stack Overflow Survey

Since 2011, Stack Overflow has conducted an annual survey to collect information from its users. The purpose of the survey is to better understand the Stack Overflow user base. As respondents have only been able to identify themselves as ‘Data scientist’ or ‘Machine learning expert’ since 2015, this analysis examines surveys from 2015 onwards. About 70,000 Stack Overflow users participate each year, and over 100 countries are represented.

Survey respondents are recruited through onsite messaging, blog posts, email lists, banner ads, and social media posts. They self-report their information anonymously and voluntarily, and are likely to be highly engaged in the Stack Overflow community.

OECD.AI leverages aggregate demographic information from survey respondents to build indicators and identify trends related to the profession, country, salary, education, gender, and age of AI developers.

Identifying AI Developers

OECD.AI uses the ‘Developer Type’ field from the survey to identify respondents who work in the domain of artificial intelligence. For the purpose of this analysis, AI developers are respondents who identify themselves as either a ‘Data scientist’ or a ‘Machine learning expert’. The ‘Other’ category includes all respondents who selected neither option.
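As an illustration of this rule, the sketch below classifies a single ‘Developer Type’ response. It assumes the field arrives as a semicolon-separated string of selected options and that the labels match the two names quoted above; the function name and the constants are hypothetical, and the actual label wording may differ between survey years.

```python
from typing import Optional

# Assumed labels, taken from the wording in this note (compared case-insensitively).
AI_LABELS = {"data scientist", "machine learning expert"}
AI_GROUP = "Data scientist/machine learning expert"


def classify_profession(developer_type: Optional[str]) -> str:
    """Return the AI group if any selected option is an AI label, else 'Other'."""
    if not developer_type:
        return "Other"  # assumption: a blank response counts as 'Other'
    selected = {part.strip().lower() for part in developer_type.split(";")}
    return AI_GROUP if selected & AI_LABELS else "Other"
```

A respondent who selects several developer types is placed in the AI group as soon as one of the selections matches.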

Visualisations

Stack Overflow survey visualisations on OECD.AI focus on two areas from 2015 to the present:

  • Salary and education breakdown by demographic characteristic, including region, country, gender, and age
  • Popular and trending programming languages.

Our country breakouts only include countries that have over 500 overall respondents in any given year. This is to ensure that we have a sufficiently large sample size for each country, which can provide more reliable and meaningful insights.
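The threshold can be sketched as a simple count per year and country. This is a minimal example that assumes the rule is applied separately for each survey year and that each response is available as a (year, country) pair; the function and parameter names are hypothetical.

```python
from collections import Counter

MIN_RESPONDENTS = 500  # threshold stated in this note


def countries_to_show(rows, min_respondents=MIN_RESPONDENTS):
    """rows: iterable of (year, country) pairs, one per respondent.

    Returns {year: set of countries with more than min_respondents responses}.
    """
    counts = Counter(rows)
    eligible = {}
    for (year, country), n in counts.items():
        if n > min_respondents:
            eligible.setdefault(year, set()).add(country)
    return eligible
```

With this reading, a country can appear in the breakout for one year and be absent in another.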

The following charts focus on the relationship between self-reported demographic information (location, gender, age) and salary/education. Filtering is available for all views by year, profession (‘Data scientist/machine learning expert’ vs ‘Other’), region, and country.

Chart 1: AI Demographics by country

  1. The pie chart on the left displays a breakdown of survey respondents by country. It is possible to highlight any country in the dataset through the ‘Find your country’ feature under the chart. Clicking on a country in this view will filter the entire visualisation.
  2. The ‘Salary breakdown by education level’ table presents the distribution of respondents by salary level (rows) within each education group (columns). Thus, for each education level, the cells show the percentage of respondents in each salary group.
  3. The bars below and to the right of the table show the overall distribution of education and salary levels. Clicking on the column headers for ‘education type’ will filter the ‘Country and region breakdown’ pie chart.
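The percentages in the salary-by-education table are shares within each education column. A minimal sketch of that calculation, assuming each respondent is reduced to an (education, salary band) pair; the band labels used in any example are made up for illustration.

```python
from collections import Counter, defaultdict


def salary_share_by_education(rows):
    """rows: iterable of (education, salary_band) pairs, one per respondent.

    Returns {education: {salary_band: percent}}, where the percentages
    within each education group sum to 100.
    """
    counts = defaultdict(Counter)
    for education, salary_band in rows:
        counts[education][salary_band] += 1

    table = {}
    for education, bands in counts.items():
        total = sum(bands.values())
        table[education] = {band: 100.0 * n / total for band, n in bands.items()}
    return table
```

Because each column is normalised on its own, values can be compared down a column but not directly across columns of very different size.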

Chart 2: AI Demographics by age

  1. The bar chart on the left shows a breakdown of survey respondents by age group. Clicking on a group in this view will filter the entire visualisation.
  2. The ‘Salary breakdown by education level’ table presents the distribution of respondents by salary level (rows) within each education group (columns). Thus, for each education level, the cells show the percentage of respondents in each salary group.
  3. The bar charts below and to the right of the table show the overall distribution of education and salary level. Clicking on the column headers for ‘education type’ will filter the ‘Country and region breakdown’ pie chart.

Chart 3: AI Demographics by gender

  1. The pie chart on the left shows a breakdown of survey respondents by gender. Clicking on a gender will filter the entire visualisation.
  2. The ‘Salary breakdown by education level’ table presents the distribution of respondents by salary level (rows) within each education group (columns). Thus, for each education level, the cells show the percentage of respondents in each salary group.
  3. The bars below and to the right of the table show the overall distribution of education and salary levels. Clicking on the column headers for ‘education type’ will filter the ‘Country and region breakdown’ pie chart.

Chart 4: Programming languages of today and tomorrow

This visualisation highlights the 15 most popular programming languages by use and interest. It is possible to filter this view by year, country, and profession.

  • The x-axis corresponds to the percentage of respondents who included a given programming language in their response.
  • Colours represent the share of developers who are interested in using a language compared to the share of developers who have used a language in the past year.
    • Blue shapes correspond to respondents who want to use a particular language in the next year.
    • Orange shapes correspond to respondents who have used a particular language extensively in the past year.
  • Stars highlight programming languages for which the percentage of respondents interested in using the language in the next year is higher than the percentage who used it in the past year.
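The star rule amounts to comparing two percentages per language. The sketch below assumes the two shares have already been computed and are passed in as dictionaries keyed by language; the function name is hypothetical and a language missing from one dictionary is treated as having a share of zero.

```python
def starred_languages(used_share, wanted_share):
    """Return, in alphabetical order, the languages whose 'want to use next
    year' share is strictly higher than their 'used in the past year' share."""
    languages = used_share.keys() | wanted_share.keys()
    return sorted(
        lang
        for lang in languages
        if wanted_share.get(lang, 0.0) > used_share.get(lang, 0.0)
    )
```

For example, with illustrative shares of 5% used and 20% wanted, a language would be starred, while one with 60% used and 55% wanted would not.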

Indicator breakdown

It is important to note that many survey fields have changed since 2015. The structure and content of the survey have been more stable since 2018. This analysis uses the following standardised fields to present information over time:

Country

  • Country data refers to the country of residence of the respondent at the time of the survey and is derived from responses to the question ‘In which country do you currently live?’.

Education

  • The Education field refers to the formal education of the respondent. OECD.AI buckets responses by: Associates & less than bachelors, Bachelors, and Advanced degree (which includes Master’s degrees and above). As this is a multiple-choice field, the analysis captures the highest attained formal education level.
  • In 2015, the options for bachelor’s, master’s, and PhD degrees were only specified for computer science and related fields. There were no categories for general degrees not related to computer science. Thus, for this year, the following mappings were made:
    • Bachelor of science in computer science (or related field) → ‘Bachelors’
    • Master’s degree in computer science (or related field) → ‘Advanced degree’
    • PhD in computer science (or related field) → ‘Advanced degree’

If a respondent indicated ‘Other’, the response was filtered out, as this could include non-computer science degrees.
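The 2015 mapping can be sketched as a lookup followed by a "highest level wins" step. This example covers only the three mappings listed above. It assumes the selected options arrive as a list of strings, and that a response with no mappable option (for instance only ‘Other’) is dropped by returning None. How a response combining ‘Other’ with a listed degree was handled is not stated here, so ignoring the unmapped option is an assumption of this sketch.

```python
from typing import List, Optional

# The three 2015 mappings listed in this note.
EDUCATION_2015 = {
    "Bachelor of science in computer science (or related field)": "Bachelors",
    "Master’s degree in computer science (or related field)": "Advanced degree",
    "PhD in computer science (or related field)": "Advanced degree",
}

# Ordering used to pick the highest attained level.
RANK = {"Associates & less than bachelors": 0, "Bachelors": 1, "Advanced degree": 2}


def highest_education_2015(selected: List[str]) -> Optional[str]:
    """Return the highest mapped bucket, or None if no option can be mapped."""
    buckets = [EDUCATION_2015[s] for s in selected if s in EDUCATION_2015]
    if not buckets:
        return None  # e.g. only 'Other' was selected: filtered out
    return max(buckets, key=lambda bucket: RANK[bucket])
```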

Salary

  • For 2015 and 2016, users reported their yearly salary directly as a range in USD.
  • For 2017-2022, users were asked to identify their salary, currency, and payment frequency (weekly, monthly, or yearly). Stack Overflow converted these into yearly salaries, assuming 12 working months and 50 working weeks.
  • Exchange rates were calculated by Stack Overflow based on the dates of the survey.
  • For 2017-2022, the exchange rates were taken from: June 2017, January 2018, February 2019, February 2020, June 2021, and May 2022.
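The annualisation arithmetic can be illustrated as follows. The conversion itself was performed by Stack Overflow, so this is only a sketch of the stated assumptions (12 working months, 50 working weeks); the function name, the `usd_per_unit` parameter, and any exchange rate passed to it are hypothetical.

```python
# Pay periods per year under the assumptions stated in this note.
PERIODS_PER_YEAR = {"weekly": 50, "monthly": 12, "yearly": 1}


def annual_salary_usd(amount, frequency, usd_per_unit):
    """Convert a reported salary to a yearly USD figure.

    amount: salary in the respondent's currency, per pay period
    frequency: 'weekly', 'monthly', or 'yearly' (case-insensitive)
    usd_per_unit: USD value of one unit of the respondent's currency
    """
    return amount * PERIODS_PER_YEAR[frequency.lower()] * usd_per_unit
```

For example, a monthly salary of 4,000 in a currency worth 1.25 USD per unit would be annualised as 4,000 × 12 × 1.25 = 60,000 USD.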

Gender

  • The gender field allows users to select one or more options and is structured slightly differently from year to year. This analysis groups gender into three categories: ‘Woman’, ‘Man’, and ‘Other’.
  • For 2015 and 2016, the ‘Transgender’ option did not exist. In the following years, the ‘Transgender’ option is not structured consistently. For this reason, our analysis does not provide granular data for this group. As gender is a multiple-choice field, combined responses are mapped to the three categories as follows:
    • Transgender;male → Man
    • Transgender;female → Woman
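One way to express this mapping is sketched below. It rests on several assumptions that this note does not confirm: responses are semicolon-separated, labels are compared case-insensitively, both ‘Man’/‘Woman’ and ‘male’/‘female’ spellings occur, and any other combination (including a blank response or ‘Transgender’ on its own) falls into ‘Other’. The function name is hypothetical.

```python
MAN_LABELS = {"man", "male"}
WOMAN_LABELS = {"woman", "female"}


def gender_category(raw: str) -> str:
    """Collapse a multiple-choice gender response into Man, Woman, or Other."""
    parts = {p.strip().lower() for p in raw.split(";") if p.strip()}
    parts.discard("transgender")  # not reported as a separate group
    if parts and parts <= MAN_LABELS:
        return "Man"
    if parts and parts <= WOMAN_LABELS:
        return "Woman"
    return "Other"
```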

Age

  • The age field is mapped from the question ‘What is your age?’. Depending on the year, users can either select an age group, or input an integer value. All integer values are converted to the following age groups for consistency:
    • 24 and Under
    • 25-34
    • 35-44
    • 45-54
    • 55-64
    • 65 years or older
  • In 2015, users were asked to select their age from a pre-defined age group bucket that did not align with the previously defined groupings. This analysis used the following age buckets for this year:
    • 24 and Under
    • 25-34
    • 35-39
    • 40-50
    • 51-60
    • 60 years or older
  • In 2017, the survey did not provide a field for age. This analysis does not include age data for this year.
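For years in which age is entered as an integer, the conversion to the standard age groups can be sketched as a chain of upper bounds. The function name is hypothetical; the group labels are those listed above. The 2015 buckets are not covered, since they are used as provided.

```python
def age_group(age: int) -> str:
    """Map an integer age to the standard age groups used in this analysis."""
    if age <= 24:
        return "24 and Under"
    if age <= 34:
        return "25-34"
    if age <= 44:
        return "35-44"
    if age <= 54:
        return "45-54"
    if age <= 64:
        return "55-64"
    return "65 years or older"
```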

Programming Languages

  • This analysis uses two survey questions to show respondents’ preferred programming languages. These questions cover the programming languages in which respondents have done extensive development work over the past year, and the languages they want to work in over the next year. Each is a multiple-choice field with a write-in option.
  • The list of languages offered in the multiple-choice list varies by year, as survey designers curate a collection of programming languages to include based on responses from the previous year. To facilitate comparability over time, this analysis shows only the 15 most frequently chosen languages.
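Selecting the most frequently chosen languages from a multiple-choice field can be sketched as below. The example assumes each response is a semicolon-separated string, counts each language at most once per respondent, and keeps blank responses in the denominator; these choices, and the function name, are assumptions rather than the documented method.

```python
from collections import Counter


def top_languages(responses, n=15):
    """responses: list of semicolon-separated language strings, one per respondent.

    Returns a list of (language, percent of respondents) for the n most
    frequently chosen languages, most popular first.
    """
    total = len(responses)
    counts = Counter()
    for raw in responses:
        # A set ensures a language is counted once per respondent.
        counts.update({lang.strip() for lang in raw.split(";") if lang.strip()})
    return [(lang, 100.0 * c / total) for lang, c in counts.most_common(n)]
```

The same function can be applied to the ‘used in the past year’ and ‘want to use next year’ questions separately to obtain the two shares per language.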

Number of respondents

The number of respondents depends on the year of the survey. As the popularity and structure of the survey change over time, caution is advised when comparing results across years. It is important to note that the structure of the survey changed significantly after 2017, leading to an increase in participants and more standardisation among fields and responses.

From 2018 onwards, the number of overall respondents ranges approximately between 70,000 and 100,000. For 2016 and 2017, the number of respondents is around 55,000. 2015 has significantly fewer responses, with only about 26,000 overall respondents.

Additionally, our default views present a subset of overall responses, focusing on respondents who self-identify as Data Scientists or Machine Learning Experts. In recent years, this subset comprises about 10% of the total respondents in any given year. From 2018 onwards, this subset ranges between 5,500 and 8,500. In 2016 and 2017, these numbers are closer to 3,500. In 2015, this number is even more restricted, at fewer than 1,000 respondents.

                                                    2015             2016-2017   2018-current year
Count of overall respondents                        26,000           ~55,000     70,000-100,000
Count of data scientists/machine learning experts   less than 1,000  ~3,500      6,000-9,000

It is important to keep in mind that the survey and its data represent only a subset of employees in the artificial intelligence field: those who are active in online data science and programming communities.

As such, caution should be exercised when interpreting the results of the data, and it should be considered as one of many sources of information about the field of data science and machine learning.