What Changed from 2021 to 2022 in ML and DS? — Kaggle Survey

Sandhya Krishnan
9 min read · Nov 24, 2022

This blog looks at what changed in Data Science and Machine Learning according to the Kaggle Survey 2022, with particular attention to women survey takers.

Photo by Lukas: https://www.pexels.com/photo/close-up-photo-of-survey-spreadsheet-590022/

Every year since 2017, Kaggle has conducted an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. This blog explains what changed from 2021 to 2022 in Machine Learning and Data Science. Over that year, women participants in the survey increased by only 3.2%. The insights are taken from my Kaggle notebook and my Tableau dashboard.

Overview of Countries and Survey takers in 2021 and 2022.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Country Wise Analysis

India and the US remain the top two countries by contribution to the Kaggle Survey, with increases of 8.02% and 1.97% respectively. The rise in women contributors is 2.7% in the former and 0.89% in the latter.

https://public.tableau.com/app/profile/sandhya.krishnan8275/viz/KaggleSurvey2022/KaggleSurvey2022

Japan and China were in 3rd and 4th position in 2021, but dropped to 6th and 7th position in 2022, each with a decrease in contribution of more than 1%. Brazil, Nigeria, and Pakistan, which were in 5th, 7th, and 9th position in 2021, are in 3rd, 4th, and 5th position in 2022, with changes in contribution of around 0.5% or less. Although Nigeria's overall contribution is significantly lower than that of the top two countries, the contribution of women from Nigeria is increasing.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Cameroon and Zimbabwe are two new countries that took part in the survey.

Ten countries (Austria, Belarus, Denmark, Greece, Iraq, Kazakhstan, Norway, Sweden, Switzerland, and Uganda) contributed last year but have no participants this year.
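
The country comparison above can be sketched in a few lines of Python. This is only an illustration, not the code from the notebook: the function names and the toy responses are hypothetical, and it assumes that each quoted change is the difference in a country's share of survey takers (in percentage points) between the two years.

```python
from collections import Counter


def share_by_group(values):
    """Return each group's share of all responses, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


def share_change(values_prev, values_curr):
    """Percentage-point change in share per group between two years."""
    prev = share_by_group(values_prev)
    curr = share_by_group(values_curr)
    # A group missing in one year is treated as having a 0% share that year.
    groups = set(prev) | set(curr)
    return {g: curr.get(g, 0.0) - prev.get(g, 0.0) for g in groups}


# Hypothetical toy responses, NOT the real survey data.
countries_2021 = ["India", "India", "USA", "Japan"]
countries_2022 = ["India", "India", "India", "USA"]

change = share_change(countries_2021, countries_2022)
new_countries = set(countries_2022) - set(countries_2021)
missing_countries = set(countries_2021) - set(countries_2022)
```

The same helper works for any single-choice column, such as age group, and the two set differences show how countries that joined or dropped out between the years could be found.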

Age Wise Analysis

Overall contribution decreased in the 22–24, 25–29, 30–34, and 45–49 age groups. The 30–34 age group has the largest decrease, at 1%. The contribution of women, however, increased in all age groups. Even though the 22–24 age group's share of overall survey takers decreased, women's contribution in this group has the largest increase, at 0.6%.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

The popularity of Data Science Courses

Coursera remains in first position, as it was last year. University courses resulting in a university degree have moved from 4th to 2nd rank. It seems that even though the internet has all the information on Data Science and Machine Learning, people are more focused on organized learning and gaining a university degree. Kaggle Learn Courses moved from 2nd rank to 3rd overall and to 4th among women. Udemy moved to 4th rank overall and to 5th among women.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Most Helpful Platform to study Data Science

The most helpful platforms for studying data science are ranked the same by overall survey takers and by women survey takers. Online courses (Coursera, EdX, etc), Video platforms (YouTube, Twitch, etc), and Kaggle (notebooks, competitions, etc) rank 1, 2, and 3 as the most helpful for people starting out in data science. As learning through university courses with a university degree is trending upward, it's worth watching whether its rank moves up next year.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Paper Published and Research with ML

Among survey takers, 21.9% have published papers in theoretical research, applied research, or both.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

702 survey takers published both Applied and Theoretical ML research papers, of whom 140 are women. 847 survey takers published only Theoretical ML research papers, of whom 226 are women. 1761 survey takers published only Applied ML research papers, of whom 348 are women.
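
To put these counts in proportion, the share of women in each category can be worked out directly from the numbers quoted above. The snippet below is a small arithmetic sketch; the variable names are only illustrative.

```python
# Counts quoted in the text: (all authors, women authors) per category.
published = {
    "both applied and theoretical": (702, 140),
    "theoretical only": (847, 226),
    "applied only": (1761, 348),
}

# Share of women within each category, in percent.
women_share = {
    category: round(100 * women / total, 1)
    for category, (total, women) in published.items()
}

# Share of women across all survey takers who published any ML paper.
total_authors = sum(total for total, _ in published.values())
total_women = sum(women for _, women in published.values())
overall_women_share = round(100 * total_women / total_authors, 1)

print(women_share)
print(total_authors, total_women, overall_women_share)
```

Women account for roughly a fifth of the authors in the "both" and "applied only" categories, and a little over a quarter in the "theoretical only" category.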

Coding Experience and Preferred Programming Language

The share of survey takers who have never written any code increased by 4.8%. All other categories show a slight decrease.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Python and SQL remain in ranks 1 and 2 in both 2021 and 2022. R has moved to 3rd rank among overall survey takers and remains in 3rd among women survey takers.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Comparing Python, SQL, and R across coding experience, Python is the most preferred language in every category, from less than 1 year to more than 20 years. R is most popular among survey takers with 20+ years of coding experience. SQL usage is about the same across all levels of coding experience, although survey takers with 10–20 years of coding experience use it more than others.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Preference for Integrated development environments (IDEs)

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Even though Jupyter Notebook is a highly used IDE (57%), its usage declined by 5.7% compared to last year. PyCharm usage dropped by 3.3%, RStudio by 2.4%, and Spyder by 2.6%. VS Code usage, in contrast, increased by 2.9%, although its overall usage is only 38.7%.

Hosted Notebook

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Colab Notebooks is the most used (37.2%), although its usage decreased by 0.5% overall. Kaggle Notebooks is the second most used at 31.16%, but it has the largest decrease in usage from last year, at 5.44%. Preference for IBM Watson Studio, Amazon EMR Notebooks, Code Ocean, and Azure Notebooks also decreased.
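
As a rough worked example, the 2021 usage implied by these figures can be recovered from the 2022 values and the quoted changes. The sketch assumes that each change is a percentage-point difference (not a relative change), which the text does not state explicitly.

```python
# 2022 usage (%) and change from 2021, as quoted in the text.
# Assumption: each change is a percentage-point difference.
hosted_notebooks = {
    "Colab Notebooks": (37.2, -0.5),
    "Kaggle Notebooks": (31.16, -5.44),
}

# Implied 2021 usage = 2022 usage minus the change.
implied_2021 = {
    name: round(usage_2022 - change, 2)
    for name, (usage_2022, change) in hosted_notebooks.items()
}

# Gap between the two platforms in each year, in percentage points.
gap_2021 = round(
    implied_2021["Colab Notebooks"] - implied_2021["Kaggle Notebooks"], 2
)
gap_2022 = round(
    hosted_notebooks["Colab Notebooks"][0]
    - hosted_notebooks["Kaggle Notebooks"][0],
    2,
)

print(implied_2021, gap_2021, gap_2022)
```

Under that assumption, the two platforms were about one point apart in 2021 and about six points apart in 2022.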

Data Visualisation Libraries and Machine Learning Framework

There is a slight decrease in the usage of all popular visualization libraries. Is it because survey contributors are using Highcharter, Pygal, and Dygraphs?

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

As with visualization libraries, there is a slight decrease in the usage of popular ML frameworks, while the `Other` framework option shows an increase.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Current Job Title

The most popular job title among survey takers is Data Scientist.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Among women survey takers, the most popular job title is Data Analyst. Is it that women Data Scientists are less active on Kaggle and so missed the survey?

Male and Female Job Title Visualization

Job Title Vs Programming Language

Irrespective of job title, Python is the most commonly used language.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Apart from Python, the other languages most commonly used (by more than 20%) for each job title are:

  • Data Administrator and Data Analyst — SQL and R
  • Data Architect — SQL, Java, C#
  • Data Engineer — SQL, BASH
  • Data Scientist — SQL, R
  • Developer Advocate — SQL, Java, C++
  • Engineer (non-software) — SQL
  • Machine Learning/ MLops Engineer — SQL, BASH, C++
  • Manager — SQL, R
  • Research Scientist — R, MATLAB, C++
  • Software Engineer — SQL, JavaScript, Java, C++, C
  • Statistician — R, SQL
  • Teacher/professor — SQL, C, R, MATLAB, Java
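
A list like the one above can be produced by counting, for each job title, how many respondents selected each language and keeping those above the 20% threshold. The following is a minimal sketch under that assumption; `common_languages` and the toy responses are hypothetical and are not taken from the notebook.

```python
from collections import Counter, defaultdict


def common_languages(responses, threshold=0.20, exclude=("Python",)):
    """Map each job title to the languages used by more than `threshold`
    of its respondents, skipping the languages listed in `exclude`."""
    respondents = Counter()
    usage = defaultdict(Counter)
    for job_title, languages in responses:
        respondents[job_title] += 1
        # set() guards against a language being counted twice per person.
        usage[job_title].update(set(languages))
    return {
        job: sorted(
            lang
            for lang, n in usage[job].items()
            if lang not in exclude and n / respondents[job] > threshold
        )
        for job in respondents
    }


# Hypothetical toy responses: (job title, languages selected).
responses = [
    ("Data Engineer", ["Python", "SQL", "Bash"]),
    ("Data Engineer", ["Python", "SQL"]),
    ("Data Engineer", ["Python", "SQL"]),
    ("Data Engineer", ["Python", "SQL"]),
    ("Data Engineer", ["Python", "Java"]),
    ("Statistician", ["R", "SQL"]),
    ("Statistician", ["R"]),
]

result = common_languages(responses)
```

In the toy data, Bash and Java are each selected by exactly 20% of the Data Engineers, so they fall just short of the "more than 20%" cut-off and only SQL is kept.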

Job Title Vs ML algorithm Usage

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Linear or Logistic Regression is frequently used in every job title except Data Administrator, Data Analyst, Developer Advocate, and Engineer (non-software). Data scientists regularly use Linear or Logistic Regression, Decision Trees or Random Forests, and Gradient Boosting Machines. Bayesian Approaches, a very important family of ML algorithms, are used by only 22% of Statisticians, whereas 46% of them use Decision Trees or Random Forests. Research scientists have the highest usage of Bayesian Approaches, but it is only 27%, which is less than their usage of CNN (48%) and DNN (30%). Machine Learning/MLOps Engineers have the highest usage of CNN at 63%, followed by DNN at 41%, Transformer Networks at 38%, and RNN at 33%.

Yearly Compensation

As the most popular job title among survey takers is Data Scientist, and among women is Data Analyst, the individual slabs with the percentage of overall and women survey takers in each slab are shown below. If you are interested in other job titles, visit the notebook.

Data Scientist Compensation Slab Details

Data Analyst Compensation Slab Details

The 0–999 slab has the highest number of survey takers, both overall and among women, irrespective of job title, so this slab is ignored to get a clearer picture.

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Overall, the highest slab is Data Architect [150,000–199,999] with 11.58%, followed by Data Administrator [1,000–1,999] with 11.43%. For women, the highest slab is Data Administrator [1,000–1,999] with 5.71%, followed by Teacher/professor [1,000–1,999] with 2.52%.

Overall compensation is best for Data Architect [150,000–199,999], Data Administrator [1,000–1,999 and 50,000–59,999], Teacher/professor [1,000–1,999], and Manager (Program, Project, Operations, Executive-level, etc) [150,000–199,999]. Women's compensation is best for Data Administrator [1,000–1,999 and 60,000–69,999], Teacher/professor [1,000–1,999], Statistician [150,000–199,999], and Data Architect [70,000–79,999].

Money spent on machine learning and/or cloud computing services at home or at work

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

76.41% of survey takers have not spent any money on learning ML or on cloud computing services, perhaps because almost everything is available for free and the only requirement is internet access. Less than 1% have spent more than 100,000 USD on ML and/or cloud computing services.

Business Intelligence Tool

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

Tableau has moved from 2nd rank to 1st rank overall, beating Microsoft Power BI, which was leading last year. It is interesting to see that women have chosen Tableau as their favorite Business Intelligence tool since 2021. Google Data Studio remains in 3rd rank. Amazon QuickSight, which was 7th last year, has moved to 4th overall.
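
Rank movements like these can be derived by ranking the options by how often they were selected in each year and comparing the two positions. The sketch below shows one way to do that; the helper names, tool names, and counts are hypothetical placeholders, not survey data.

```python
def rank_items(counts):
    """Rank items by count, 1 = most selected (ties broken alphabetically)."""
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: position for position, item in enumerate(ordered, start=1)}


def rank_shift(counts_prev, counts_curr):
    """Return (previous rank, current rank) for items present in both years."""
    prev = rank_items(counts_prev)
    curr = rank_items(counts_curr)
    return {item: (prev[item], curr[item]) for item in curr if item in prev}


# Hypothetical selection counts, NOT the real survey data.
tools_2021 = {"Tool A": 120, "Tool B": 100, "Tool C": 40}
tools_2022 = {"Tool A": 110, "Tool B": 130, "Tool C": 45}

shifts = rank_shift(tools_2021, tools_2022)
```

Here `shifts["Tool B"]` is `(2, 1)`, the same kind of move from 2nd to 1st rank described above. Running the same comparison on a subset of respondents (for example, women only) gives the per-group rankings.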

Favorite media sources that report on data science topics

https://www.kaggle.com/sandhyakrishnan02/what-changed-from-2021-to-2022-in-ml-and-ds

YouTube (Kaggle YouTube, Cloud AI Adventures, etc), Kaggle (notebooks, forums, etc), and Blogs (Towards Data Science, Analytics Vidhya, etc) rank 1 to 3. Course Forums (forums.fast.ai, Coursera forums, etc), Twitter (data science influencers), and Journal Publications (peer-reviewed journals, conference proceedings, etc) follow in 4th to 6th.
