California Data Science Job Market Analysis

Be a data-driven job seeker

Navid Mashinchi
Towards Data Science


Clip by author

Objective:

The goal was to examine the current data science job market in California. Which city in CA has the most job posts and therefore offers the most opportunities? As a current Data Science Master’s student, I also wanted to find out which jobs a master’s degree holder would be eligible to apply for and which skills are most in demand. To answer these questions, I collected job-posting data and conducted an Exploratory Data Analysis (EDA) of California's market.

Methods:

Photo by Franki Chamaki on Unsplash

Data Collection: The first step was to collect the data. I decided to web scrape the biggest job search website, “Indeed.” In the “What” search bar I typed “Data Scientist”, and in “Where” I typed “CA”. I collected the information from all job posts from the last 14 days as of December 27th, 2020, because I wanted to look at the most recent postings. In that time frame, 340 data science jobs were posted, so the data consisted of 340 rows and 5 features (Title, Company, Location, Summary, Job Description).

Data Cleaning/Feature Engineering: I loaded this dataset into Jupyter and cleaned it mainly with Pandas. The goal was to look at entry-level and mid-level data science positions, so I had to remove all the job postings that were meant for senior positions. I iterated over all the job titles and looked for the following keywords:

  • ‘Senior’
  • ‘SENIOR’
  • ‘SR’
  • ‘Sr’
  • ‘Sr.’
  • ‘Director’
  • ‘DIRECTOR’
  • ‘Manager’
  • ‘MANAGER’
  • ‘Data Analyst’
  • ‘DATA ANALYST’
  • ‘Lead’
  • ‘lead’
  • ‘LEAD’

Any job posting whose title included one of the above keywords was removed. I also added Data Analyst to the list since we are strictly looking for Data Science jobs. After removing these rows, the dataset went from 340 rows to 197. In other words, 58% of the jobs that I scraped from Indeed fit our target.
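The title filter described above can be sketched roughly as follows. This is a minimal illustration and not the exact code used for this article: the function name is_excluded and the sample titles are made up, and the check is a plain case-sensitive substring match against the keyword list.

```python
# Keywords from the list above; a title containing any of them is dropped.
EXCLUDED_KEYWORDS = [
    "Senior", "SENIOR", "SR", "Sr", "Sr.",
    "Director", "DIRECTOR", "Manager", "MANAGER",
    "Data Analyst", "DATA ANALYST", "Lead", "lead", "LEAD",
]


def is_excluded(title: str) -> bool:
    """Return True if the job title contains any excluded keyword (case-sensitive)."""
    return any(keyword in title for keyword in EXCLUDED_KEYWORDS)


# Hypothetical sample titles, only for illustration.
titles = [
    "Data Scientist",
    "Senior Data Scientist",
    "Sr. Machine Learning Engineer",
    "Data Analyst",
    "Data Scientist, Marketing",
]

# Keep only the titles that do not match any keyword.
kept = [title for title in titles if not is_excluded(title)]
print(kept)
```

With a pandas DataFrame that has a Title column, the same function could be applied with something like df[~df["Title"].apply(is_excluded)]. Note that substring matching is blunt: a keyword such as ‘lead’ would also match a title containing “leadership”.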

To figure out what the top skills are, I had to create indicator columns (with value True/False) for the following tools and skills:

Tools: ‘python’, ‘r’, ‘pytorch’, ‘sql’, ‘pyspark’, ‘aws’, ‘spark’, ‘sas’, ‘nosql’, ‘salesforce’, ‘tableau’, ‘pandas’, ‘scikitlearn’, ‘sklearn’, ‘matlab’, ‘scala’, ‘keras’, ‘tensorflow’, ‘scipy’, ‘numpy’, ‘matplotlib’, ‘spss’, ‘linux’, ‘azure’, ‘cloud’, ‘mongodb’, ‘mysql’, ‘oracle’, ‘snowflake’, ‘kafka’, ‘javascript’, ‘jupyter’, ‘perl’, ‘bigquery’, ‘unix’, ‘react’, ‘scikit’, ‘powerbi’, ‘lambda’, ‘ssrs’, ‘django’, ‘seaborn’, ‘github’, ‘git’, ‘splunk’, ‘rapidminer’, ‘jquery’, ‘nodejs’, ‘d3’, ‘plotly’, ‘bokeh’, ‘xgboost’, ‘rstudio’, ‘shiny’, ‘dash’, ‘hadoop’, ‘angular’, ‘nltk’, ‘flask’, ‘node’, ‘firebase’, ‘php’, ‘rpython’, ‘unixlinux’, ‘postgressql’, ‘postgresql’, ‘postgres’, ‘ruby’, ‘tensor’, ‘dplyr’, ‘ggplot2’, ‘esquisse’, ‘bioconductor’, ‘shiny’, ‘lubridate’, ‘knitr’, ‘mlr’, ‘quanteda’, ‘rcrawler’, ‘caret’, ‘rmarkdown’, ‘plotly’, ‘stringr’, ‘swirl’.

Skills: ‘statistics’, ‘chatbot’, ‘cleaning’, ‘blockchain’, ‘causality’, ‘correlation’, ‘bandit’, ‘anomaly’, ‘kpi’, ‘dashboard’, ‘geospatial’, ‘ocr’, ‘econometrics’, ‘pca’, ‘gis’, ‘svm’, ‘svd’, ‘tuning’, ‘hyperparameter’, ‘hypothesis’, ‘salesforcecom’, ‘segmentation’, ‘biostatistics’, ‘unsupervised’, ‘supervised’, ‘exploratory’, ‘recommender’, ‘recommendations’, ‘research’, ‘sequencing’, ‘probability’, ‘reinforcement’, ‘graph’, ‘bioinformatics’, ‘knn’, ‘outlier’, ‘normalization’, ‘classification’, ‘optimizing’, ‘prediction’, ‘forecasting’, ‘clustering’, ‘cluster’, ‘optimization’, ‘visualization’, ‘nlp’, ‘regression’, ‘logistic’, ‘boosting’, ‘recurrent’, ‘convolutional’, ‘bayesian’, ‘bayes’, ‘random forest’, ‘natural language processing’, ‘machine learning’, ‘decision tree’, ‘deep learning’, ‘experimental design’, ‘time series’, ‘nearest neighbors’, ‘neural network’, ‘support vector machine’, ‘computer vision’, ‘machine vision’, ‘dimensionality reduction’, ‘text analytics’, ‘power bi’, ‘a/b testing’, ‘ab testing’, ‘chat bot’, ‘data mining’.
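As a rough sketch of how such True/False indicators could be built for one job description, consider the snippet below. The function name skill_indicators, the shortened term list and the sample sentence are assumptions for illustration. The boundary check around each term is also my own choice here, so that a one-letter term like ‘r’ does not match inside other words; it is not necessarily how the original notebook handled it.

```python
import re

# A small subset of the tools and skills listed above.
TERMS = ["python", "r", "sql", "machine learning", "statistics"]


def skill_indicators(description: str, terms=TERMS) -> dict:
    """Map each term to True/False depending on whether it occurs in the text."""
    text = description.lower()
    flags = {}
    for term in terms:
        # Require a non-alphanumeric character (or the text edge) on both sides,
        # so short terms such as "r" do not match inside longer words.
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        flags[term] = re.search(pattern, text) is not None
    return flags


# Hypothetical job description snippet.
example = "We need strong Python and SQL skills plus machine learning experience."
flags = skill_indicators(example)
print(flags)
```

Each key of the returned dictionary would become one indicator column in the dataset, with one row per job posting.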

For the minimum degree requirement, I created a point system from 1–3:

  • 1 → Bachelor’s Degree
  • 2 → Master’s Degree
  • 3 → PhD

I created a dictionary that maps the different spellings of each degree name to its point value.

Image by author

Using that dictionary, I split each job description into words and checked whether any of the dictionary's keys appeared in the resulting list. This gives a list of points for each row in the dataset, from which I selected the minimum. For example, consider the following section of a job description:

  • “Master’s Degree or Ph.D. required.”
  • We split the above line → [“Master’s”, “Degree”, “or”, “Ph.D.”, “required.”]
  • Based on the point system, we would get the following list [2,3].
  • We grab the minimum number 2.

If none of the above keys are inside the list, we add the value 0. Finally, we would change the values of 0–3 to the following:

0 → “Not Specified”

1 → “Bachelor’s Degree”

2 → “Master’s Degree”

3 → “Doctoral Degree”
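The point system above can be sketched in a few lines. The spelling variants in DEGREE_POINTS are assumptions, since the actual dictionary is only shown as an image, and the function name minimum_degree is made up for this example.

```python
# Hypothetical spelling variants mapped to points (1 = Bachelor's, 2 = Master's, 3 = PhD).
DEGREE_POINTS = {
    "Bachelor's": 1, "Bachelors": 1, "BS": 1,
    "Master's": 2, "Masters": 2, "MS": 2,
    "Ph.D.": 3, "PhD": 3, "Doctorate": 3,
}

# Final labels, including 0 for postings that mention no degree at all.
LABELS = {
    0: "Not Specified",
    1: "Bachelor's Degree",
    2: "Master's Degree",
    3: "Doctoral Degree",
}


def minimum_degree(description: str) -> str:
    """Return the label of the lowest degree mentioned in a job description."""
    tokens = description.split()
    points = [DEGREE_POINTS[token] for token in tokens if token in DEGREE_POINTS]
    # No degree keyword found: fall back to 0, i.e. "Not Specified".
    return LABELS[min(points, default=0)]


print(minimum_degree("Master's Degree or Ph.D. required."))
```

Because the text is split on whitespace only, a token with trailing punctuation such as “PhD,” would not match any key in this sketch, so the dictionary has to contain every variant you expect to see.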

The next aspect of feature engineering was to add the population of each location to the dataset so that I could calculate the number of job posts per 100,000 residents. I used the pandas function pd.read_html, which reads HTML tables into a list of DataFrame objects. I used the following website to get the population information:

Last but not least, I had to add the coordinates of each location to the dataset. For this, I used the Nominatim geocoder from geopy.geocoders (part of the geopy library), which lets you look up OpenStreetMap data for your target location.

Data Visualization: Concerning the plots, I used seaborn, matplotlib and folium. The first folium plot, which is at the top of the article, displays each data point based on the minimum degree requirement. A popup text has also been added for each data point. Once you click on a data point, the job title, company name, location and minimum education requirement will be displayed. The second folium plot shows which jobs a Master’s student is eligible to apply for.

Results — EDA:

Between December 13th and December 27th, most of the data science job posts came from the following locations:

  1. San Francisco: 39 Jobs
  2. San Diego: 18 Jobs
  3. Santa Clara: 17 Jobs
Image by author

However, the results are different if we account for each city's population and consider the number of job posts per 100,000 residents. The count of job posts per location was divided by the matching city’s population and multiplied by 100,000. As a result, the top 3 cities were:

  1. Westlake Village: 48 Jobs per 100k
  2. Menlo Park: 46 Jobs per 100k
  3. Palo Alto: 15 Jobs per 100k
Image by author
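The per-100k rate is simple arithmetic, sketched below. The function name posts_per_100k is made up. The population of 8,280 for Westlake Village is the figure used in this article, while the count of 4 postings is an assumption chosen only to illustrate the calculation.

```python
def posts_per_100k(job_posts: int, population: int) -> float:
    """Number of job posts per 100,000 residents."""
    return job_posts * 100_000 / population


# Population from the article; the posting count of 4 is an assumed example value.
westlake_rate = posts_per_100k(job_posts=4, population=8280)
print(round(westlake_rate, 1))
```

The example shows why small cities can jump to the top of this ranking: a handful of postings divided by a very small population produces a large rate.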

As expected, when we look at the map below, most of the job opportunities come from Northern California (Bay Area) and Southern California (LA & San Diego). There are not a lot of opportunities in Central California.

Image by author

Below we can see that 50.25% of all the job postings haven’t specified a minimum degree requirement. 2.03% require a bachelor’s degree, 22.34% a master’s degree and 25.38% a doctoral degree.

Image by author

Looking at location and minimum degree requirement together, San Francisco has the highest number of job postings where a degree hasn’t been specified. Menlo Park has the highest number of job postings where the minimum requirement is a master’s degree. Santa Clara and San Diego have the most job postings where a doctoral degree is the minimum requirement. San Francisco is also the city with the most job postings where a bachelor’s degree is the minimum requirement. See below:

Image by author

Out of all 197 data science jobs from the past 2 weeks, the breakdown below shows the number of job postings per degree:

Minimum Education    Job Postings
Bachelor's Degree    4
Doctoral Degree      50
Master's Degree      44
Not Specified        99

An applicant holding a bachelor’s degree can apply to 52.28% of all the job postings ((4+99)/197). Holding a master’s degree makes you eligible to apply for roughly 74.6% of the jobs ((4+99+44)/197), and a Ph.D. holder meets the minimum education requirement for all job postings. Below we can see the plot of a master’s degree holder's eligibility status.

Image by author
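The eligibility shares can be recomputed from the counts in the breakdown above. The sketch below assumes that a degree holder qualifies for every posting whose minimum requirement is at or below their own level, plus the postings with no stated requirement. The names ORDER and eligibility_share are made up for this example.

```python
# Posting counts per minimum education level, taken from the breakdown above.
counts = {
    "Not Specified": 99,
    "Bachelor's Degree": 4,
    "Master's Degree": 44,
    "Doctoral Degree": 50,
}

# Levels ordered from lowest to highest requirement.
ORDER = ["Not Specified", "Bachelor's Degree", "Master's Degree", "Doctoral Degree"]


def eligibility_share(counts: dict, highest_degree: str) -> float:
    """Percentage of postings whose minimum requirement is at or below highest_degree."""
    total = sum(counts.values())
    eligible_levels = ORDER[: ORDER.index(highest_degree) + 1]
    eligible = sum(counts[level] for level in eligible_levels)
    return 100 * eligible / total


for level in ORDER:
    print(f"{level}: {eligibility_share(counts, level):.2f}%")
```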

I have also added a folium plot that shows which jobs a Master’s degree holder can apply for, which I think gives a nice overview.

Clip by author

Finally, when we take a look at the top 5 skills from all the job postings, we get the following result:

  • Python: 148 counts
  • Machine Learning: 148 counts
  • Research: 109 counts
  • Statistics: 89 counts
  • SQL: 71 counts
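A ranking like the one above could be derived from the True/False indicators by counting how many postings have each skill flagged. The snippet is only a sketch: the three postings are invented, and top_skills is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical indicator rows, one dictionary per job posting.
postings = [
    {"python": True, "sql": True, "research": False},
    {"python": True, "sql": False, "research": True},
    {"python": True, "sql": True, "research": False},
]


def top_skills(rows, n=5):
    """Count the postings in which each skill is flagged and return the n most common."""
    counts = Counter()
    for row in rows:
        counts.update(skill for skill, present in row.items() if present)
    return counts.most_common(n)


print(top_skills(postings, n=2))
```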

Conclusions:

The city of San Francisco leads by a wide margin when it comes to the total number of job postings. It has more than twice as many job postings as the 2nd ranked city, San Diego (39 vs. 18). Not only does it lead in the number of total posts, but it also leads in the category of job postings where the minimum degree requirement hasn’t been specified. If you don’t have a degree and didn’t come through the traditional education system, I would focus on San Francisco, even though you will be competing with everyone else there. This would be a good strategy for a Bootcamp graduate, for example. If you hold a Master’s degree, I would pay attention to Menlo Park since it has the most jobs that list an MS degree as the minimum requirement, plus it ranked 2nd in California if we consider the number of job postings per 100,000 people.

Even though San Francisco leads in the raw counts mentioned above, if we consider the 100k rate, San Francisco ranks 12th and San Diego 26th. Relative to their population, Menlo Park and Palo Alto provide the most opportunities since they lead that ranking. Even though Westlake Village is ranked first in the “Job Posts per 100k by Location in CA” plot, the city has been identified as an outlier since its population of 8,280 is very small compared to all the other locations. As expected, most of the job opportunities come from Northern California and Southern California. There is little demand for Data Scientists in Central California.

Regarding the minimum education requirement, we can conclude that many companies don't specify a minimum level of education: 50.25% of the job posts haven’t listed a minimum degree requirement. The graduate degree levels, master’s degree and Ph.D., are very close (22.34% and 25.38%, respectively), followed by 2.03% of posts where a bachelor’s degree is the minimum requirement. Given that about half of the job postings don’t list a minimum education level, we can conclude that people coming from a non-traditional education background are not necessarily shut out of landing a job as a Data Scientist. However, due to the competition, one would think that an applicant who holds a graduate degree (MS or Ph.D.) would have a much higher chance to succeed. As of today, if you are looking to apply for a new job in California in the new year and you choose the Indeed job search website, your eligibility for the jobs depends on your educational background. Below you can see the breakdown:

  • People from a non-traditional educational background (Bootcamp grad, Certification, etc.) meet the minimum education requirement of 50.25% of all job postings.
  • Bachelor’s Degree holders meet the minimum education requirement of 52.28% of all job postings.
  • Master’s Degree holders meet the minimum education requirement of roughly 74.6% of all job postings.
  • Ph.D. holders are eligible to apply for all of them.

Finally, the top 5 skills in demand are Python, machine learning, research, statistics and SQL. The results aren’t shocking to me. However, it was interesting to see research in 3rd place. If you think about it, it makes sense that research skills are in high demand. At the end of the day, a data scientist’s ability to search for, locate, extract, organize, evaluate, and use or present information relevant to the problem they are trying to solve is an essential skill to have under your belt.

I hope this article was insightful and gives you a better understanding of how California's job market is shaping up. If you are looking to conduct the same analysis for a different location, please look at my GitHub page. If you are currently on the job hunt, remember to be a smart and data-driven job seeker. Let me know if you have any questions on this topic or have some feedback. If you enjoyed this article, I’d be very grateful if you would share it on any social media platform. Thank you and until next time! ✌️

Data Scientist at Kohl’s | Adjunct Professor at University of Denver | Data Science Mentor at SharpestMinds