OECD.AI uses aggregated job postings published on Adzuna, a platform indexing millions of job postings worldwide, to estimate the demand for AI skills in 16 countries: Australia, Austria, Brazil, Canada, France, Germany, India, Italy, the Netherlands, New Zealand, Poland, Russia, Singapore, South Africa, the United Kingdom and the United States. To construct the visualisations and keep them up to date, OECD.AI processes around 125 000 job vacancies per week. Job vacancies are an important element for understanding labour market dynamics: they reveal companies’ preferences for skills as expressed through online demand. As such, job postings indicate residual – not absolute – demand for skills.
The dataset used contains multilingual job postings from the 16 countries. Only roles in IT occupations are considered, irrespective of the company’s industry. For vacancies in the UK and the US, Adzuna uses a machine learning algorithm to estimate the job type; for all other countries, it relies on the job type pre-defined by recruitment entities.
Each job posting is made up of a title (e.g. “Software developer”, “Data scientist”) and a short job description.
The relevant skills included in job postings were identified and translated to make them comparable across countries and languages. This was done using a “Wikification” approach, which consists of annotating large bodies of text with relevant Wikipedia concepts (Brank et al., 2017). The advantages of this approach are manifold. First, Wikipedia is a freely available source of information that is continuously updated and covers a wide range of topics. Second, Wikipedia has a rich internal structure, where each concept is associated with a semi-structured textual document (i.e. the contents of the corresponding Wikipedia article) that facilitates the semantic annotation process. Third, Wikipedia’s cross-language links identify pages that refer to the same concept in different languages, making it easier to support multilingual and cross-lingual annotation.
Each job posting was transformed into a list of up to ten Wikipedia concepts – each representing a skill – ranked by relevance based on their PageRank value. (PageRank is an algorithm used by Google Search to measure the importance of web pages. It counts the number and quality of links to a page to produce a rough estimate of that page’s importance, on the assumption that more important websites are likely to receive more links from other websites.) Generic concepts (e.g. analysis, business) were identified and removed. Approximately 800 concepts are included in the database.
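As an illustration only – the link graph, concept names and damping factor below are invented, not OECD.AI’s data or code – PageRank can be computed by power iteration over links between Wikipedia concepts, and the top-ranked concepts retained:

```python
# Toy sketch: rank hypothetical Wikipedia concepts by PageRank, keep the top ten.
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page receives a base "teleport" share, plus shares from inlinks.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:
                        new_rank[target] += share
        rank = new_rank
    return rank

# Invented link graph between concepts found in one job posting.
links = {
    "Machine learning": ["Artificial intelligence", "Python (programming language)"],
    "Artificial intelligence": ["Machine learning"],
    "Python (programming language)": ["Machine learning"],
    "SQL": ["Database"],
    "Database": ["SQL"],
}
ranks = pagerank(links)
top_concepts = sorted(ranks, key=ranks.get, reverse=True)[:10]
```

In this toy graph, “Machine learning” ranks first because it receives links from two other concepts.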
Translating concepts into skill demand over time
Word embeddings – a language modelling technique – are used to group similar concepts according to their relative semantic distance. This consists of mapping every “wikified” concept to a vector of real numbers. Through the embedding process, each job posting thus becomes a vector of up to ten values, corresponding to the wikified concepts that compose it. Summing all job posting vectors in a day across all concepts gives the daily demand for skills.
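A minimal sketch of the daily aggregation step, with made-up dates, concepts and relevance weights:

```python
from collections import defaultdict

# Each posting carries up to ten wikified concepts with relevance weights
# (values invented for illustration); summing them per day gives daily demand.
postings = [
    ("2023-05-01", {"Machine learning": 0.9, "Python (programming language)": 0.7}),
    ("2023-05-01", {"Machine learning": 0.4, "SQL": 0.8}),
    ("2023-05-02", {"SQL": 0.6}),
]

daily_demand = defaultdict(lambda: defaultdict(float))
for date, concepts in postings:
    for concept, weight in concepts.items():
        daily_demand[date][concept] += weight
```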
A three-month exponential moving average is used to smooth the time series. This method was chosen because it places a greater weight on the most recent data points, thus giving higher importance to more recent hiring trends.
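A simple sketch of such smoothing. The series and the 90-observation span are illustrative (the actual window corresponds to three months of data), and the span-to-alpha conversion used here is the common convention, not necessarily OECD.AI’s exact parameterisation:

```python
def exponential_moving_average(series, span):
    """Smooth a series; recent points receive exponentially more weight."""
    alpha = 2.0 / (span + 1)  # common span-to-alpha conversion
    smoothed = []
    for value in series:
        if smoothed:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
        else:
            smoothed.append(value)  # seed with the first observation
    return smoothed

# e.g. roughly three months of daily data -> span of about 90 observations
smoothed = exponential_moving_average([10, 12, 11, 15, 30], span=90)
```

A spike in the raw series (the final value of 30) is damped in the smoothed series but still pulls it upward, which is the intended emphasis on recent trends.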
Bundling similar job postings
A Latent Dirichlet Allocation (LDA) model is used to bundle similar skills into categories. This bundling exercise makes it possible to analyse skill demand across different categories of IT skills. Compared to other topic modelling techniques, LDA has the advantage that a single skill can belong to several categories. The optimal number of categories, estimated by testing the model against several metrics, was 50 (Griffiths et al. 2004; Cao et al. 2009; Arun et al. 2010; Deveaud et al. 2014).
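One of the cited selection metrics can be sketched as follows: Cao et al. (2009) favour the number of topics that minimises the average pairwise cosine similarity between topic-word distributions. The toy matrices below stand in for fitted LDA models and are not real outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cao_metric(topic_word):
    """Mean pairwise cosine similarity between rows of a topic-word matrix;
    lower means better-separated topics (Cao et al. 2009)."""
    k = len(topic_word)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(cosine(topic_word[i], topic_word[j]) for i, j in pairs) / len(pairs)

# Invented topic-word matrices for two hypothetical fits (rows sum to 1);
# the 3-topic fit contains two near-duplicate topics, so it scores worse.
candidates = {
    2: [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    3: [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
}
best_k = min(candidates, key=lambda k: cao_metric(candidates[k]))
```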
The LDA model returns two probabilities: the probability that a skill belongs to a cluster, and that same probability normalised across clusters. The normalisation curbs the importance of skills that appear frequently across clusters (e.g. data) and increases the relevance of underrepresented – but significant – concepts (e.g. TensorFlow).
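The normalisation can be sketched with made-up probabilities: dividing each skill’s per-cluster probability by its total across clusters boosts concentrated skills relative to ubiquitous ones:

```python
# Invented P(skill | cluster) values for two hypothetical clusters: "data"
# appears everywhere, while "tensorflow" is concentrated in one cluster.
raw = {
    "data":       {"cluster_a": 0.30, "cluster_b": 0.30},
    "tensorflow": {"cluster_a": 0.10, "cluster_b": 0.01},
}

normalised = {}
for skill, probs in raw.items():
    total = sum(probs.values())  # normalise each skill across clusters
    normalised[skill] = {c: p / total for c, p in probs.items()}
```

After normalisation, “tensorflow” carries more weight in cluster_a than the ubiquitous “data”, even though its raw probability was lower.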
Following a process similar to Wu et al. (2016), a Long Short-Term Memory (LSTM) neural network is used to improve the accuracy of the clustering. LSTM models have the advantage of being able to process complex sequences with non-linear relations. The LSTM model uses the job posting vectors and the probability matrices from the LDA to produce more accurate skill cluster representations, assigning each skill 50 values that indicate how strongly it contributes to each of the 50 clusters, or categories, identified by the LDA model.
However, for data visualisation purposes, 50 skill categories are far too many to allow for meaningful interpretation. Hierarchical clustering showed that some of the initial 50 categories were similar enough to be grouped, and this additional clustering exercise resulted in 16 higher-level categories of IT skills. (Note: the pairwise cosine similarity – a widely used measure of the similarity between two non-zero vectors – of the embedding matrices of the 50 skill categories was calculated and min-max normalised; hierarchical clustering techniques were then used to identify similar categories.) When combined, each underlying category – or “subcategory” – was assigned an equal weight (e.g. if four categories were combined into one, each of their concepts received a weight of 0.25).
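The note’s procedure can be sketched with toy category embeddings. The vectors, names and grouping threshold below are invented, and a single similarity cut-off is a crude stand-in for true hierarchical clustering, which merges categories iteratively:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented embedding vectors for three skill categories.
categories = {
    "web_dev":  [0.9, 0.1, 0.0],
    "ui":       [0.8, 0.2, 0.0],
    "security": [0.0, 0.1, 0.9],
}
names = list(categories)

# Pairwise cosine similarities, min-max normalised to [0, 1].
sims = {(a, b): cosine(categories[a], categories[b])
        for i, a in enumerate(names) for b in names[i + 1:]}
lo, hi = min(sims.values()), max(sims.values())
norm_sims = {pair: (s - lo) / (hi - lo) for pair, s in sims.items()}

# Group category pairs whose normalised similarity exceeds a cut-off.
groups = [pair for pair, s in norm_sims.items() if s > 0.9]
```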
The analysis thus produced 16 categories of IT skills and five Artificial Intelligence subcategories, each statistically significantly different from the others.
The resulting categories of IT skills were labelled as follows using expert input:
- Applications & networks
- Database administration
- Open source programming
- Data management
- Web development
- Systems programming & robotics
- Database architecture
- User interface
- Data processing
- Artificial intelligence
- Microsoft tools & cloud
- Enterprise services
- Digital security
- Source code management
- Testing & quality
The “Artificial intelligence” category is made up of the following five subcategories, which were labelled using expert input:
- AI software development
- AI research & methodology
- Machine learning tools
- AI data management
- AI web development
Data visualisations on OECD.AI exploit the above categorisation of skills to portray demand by AI skill, skill category, subcategory, country and over time.
Brank, J.; Leban, G.; and Grobelnik, M. (2017). Annotating Documents with Relevant Wikipedia Concepts. Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017), Ljubljana, Slovenia, 9 October 2017. http://ailab.ijs.si/dunja/SiKDD2017/Papers/Brank_Wikifier.pdf
Cao, J.; Tian, X.; Li, J.; Zhang, Y.; and Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011
Deveaud, R.; SanJuan, E.; and Bellot, P. (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17, 1: 61–84. http://doi.org/10.3166/dn.17.1.61-84
Griffiths, T.; and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101
Arun, R.; Suresh, V.; Veni Madhavan, C.E.; and Narasimha Murthy, M.N. (2010). On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Advances in Knowledge Discovery and Data Mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43