What’s It Like to Be a Data Scientist in Silicon Valley?

Growth & Learning that benefit every aspiring Data Scientist

Li Miao
Towards Data Science


Preface

Search for “Data Scientist” on Google today, or for the skills needed to become one, and you will be overwhelmed by information: Medium, LinkedIn, news, private coaching sites, and so on. Everybody tells us that Data Scientist is a widely needed profession in the 21st century, and that becoming one requires mastery of statistics, programming, and machine learning. But does a lot of information really bring us a lot of value? What kind of work do real industry data scientists do?

Work is a combination of compensation and self-realization. Every choice is an attempt to find the point that maximizes our satisfaction in the space defined by these two dimensions. Whether you want to become a Data Scientist, or a better Data Scientist, depends on what kind of value you can provide in your work and how you can achieve self-realization in your career.

The Germans say that the world is concrete. Today I will talk specifically about the Silicon Valley tech industry I am familiar with, covering the different requirements and skills for Data Scientist work, how to grow in this position, and what Charlie Munger told us:

“All I want to know is where I’m going to die so I’ll never go there.”

Myself as a Data Scientist

After graduating from UIUC, I joined Microsoft’s Silicon Valley office, where I have now worked for 4 years. Our group is mainly responsible for the Language Model, the core component of the Speech Recognition service provided by Microsoft’s cloud platform, Azure.

As a Data Scientist, what impact and value have I created for the company in the past few years?

  • Achieved a 26.2% improvement in Disfluency Tagging F1-score in Teams Live Caption & Office Word Dictation and a 2.54 BLEU score improvement in English-to-German translation by designing and building the Deep Learning Disfluency Tagging system
  • Improved Bing Voice Search Speech Recognition accuracy and user experience, with a 7% Surprise Metric reduction and a 2% Word Error Rate reduction, by establishing the first Deep-Learning-powered denoised N-gram Language Model
  • Piloted multilingual Neural Network Language Model pretraining across 26 European locales and built an all-in-one subword tokenizer
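For readers unfamiliar with Word Error Rate, one of the metrics mentioned above, here is a minimal Python sketch of how it is commonly computed: the word-level edit distance between a reference transcript and the recognizer’s output, divided by the number of reference words. The function name and the example sentences are my own illustration, not the evaluation code used in production.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word is missing
                curr[j - 1] + 1,     # insertion: an extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words was dropped.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.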

My work leans towards machine learning in Natural Language Processing, commonly referred to as NLP, which is just one branch of Data Science. In Silicon Valley, the Data Scientist roles and the skills required by different companies and groups are horses of different colors. In the next section, I will talk about the kinds of Data Scientist that Silicon Valley demands.

What Are the Types of Data Scientist in Silicon Valley?

In general, there are three types of Data Scientist positions.

  • Data Analyst
  • Data Engineer
  • Machine Learning Engineer

First of all, the skills required for these three positions are different.

  • Data Analyst. Responsible for using SQL and other languages to process data, summarize data, visualize data statistics, derive business insights, and complete reports based on data analysis. There is a Data Scientist Analytics track at Facebook, mainly responsible for designing statistical experiments such as A/B tests. For example, suppose we have designed a brand-new news recommendation system. How do we know whether this new system can boost user stickiness and help us increase the number of subscribers to grow revenue? This is where online evaluation comes in, through a series of A/B testing experiment designs and statistical analyses.
  • Data Engineer. Strictly speaking, this is a branch of Software Engineering, focused mainly on the design and construction of large-scale data infrastructure. For example, on Instagram, every browse, every click, and even the interval between viewing two products is real-time user feedback that gets collected into the data system. This data helps us construct user profiles and push personalized product recommendations more accurately. The storage, processing, querying, and maintenance of all this large-scale online data is the job of the data engineer.
  • Machine Learning Engineer. Responsible for the design and development of large-scale machine learning systems. The role requires a strong understanding of machine learning and deep learning, excellent programming skills, and, in my opinion the most important pillar, the ability to transform a business problem into a machine learning problem. That means designing quantitative metrics to define the problem, collecting and processing data at scale, and automating machine decision-making so that overall performance improves through iterative optimization of the algorithms. The recommendation systems that are ubiquitous in daily life are a typical application of machine learning, such as YouTube’s video recommendations, Spotify’s daily playlists, and Amazon’s product recommendations. Machine learning systems are also behind the voice recognition and human-computer interaction of Google Assistant and Alexa, machine translation, intelligent driving assistance, and online advertising.
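To make the A/B testing question in the Data Analyst description more concrete, here is a minimal Python sketch of one common way to analyze such an experiment: a two-proportion z-test comparing the subscription rate of users on the old system (group A) with users on the new one (group B). The function name and all the numbers are hypothetical, and a real experiment would also involve choices this sketch leaves out, such as sample size planning and guardrail metrics.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rate between two groups."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10,000 users per group, subscriptions as conversions.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in subscription rate is unlikely to be due to chance alone; whether the lift is large enough to matter for revenue is a separate business judgment.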

Secondly, in terms of compensation, engineering-related positions generally pay more than analytics positions, which mainly comes down to the basic economics of supply and demand. On the supply side, the learning curve for chart visualization with commercial software is gentler than that for machine learning and programming, and people with real industry experience in the latter are scarce. On the demand side, the penetration of cloud computing platforms into many industries has made it possible to digitize large-scale data and keep it flowing, and the rapid expansion of data intelligence has increased the demand for talent in this area.

Machine Learning Engineer >= Data Engineer >> Data Analyst

Of course, Software Development Engineers (SDEs) are what Silicon Valley needs the most. In many products, machine learning is the icing on the cake. For example, we all use Zoom to hold video conferences. Some machine-learning-powered features help improve user experience and customer stickiness. However, thinking from first principles, what we need first is low-latency, barrier-free video communication software. To be sure, I am aware that many new products and services are built entirely on data intelligence. They establish the engineering system, collect real-time user behavioral data, improve the intelligent system through iterative machine learning, attract more users, gather more data, and push the flywheel of Data -> Model -> Product; TikTok is one example. In general, machine learning problems in industry are essentially engineering problems, not purely scientific ones. I will give some specific examples of this in a future blog post.

What Skills Does a Data Scientist Need to Have?

Image by Li Miao

The skill tree above contains the hard skills that I consider must-haves for a Data Scientist, with different areas of focus depending on the needs of each position. Of course, hard skills are not the full story of thriving in the real world. To truly grow, soft skills are even more important: how to communicate with colleagues, how to collaborate, how to write e-mails, how to tell a good story about what you did, how to show leadership, how to manage yourself and even your direct manager, how to expand your influence, and so on. These are the areas that I keep reflecting on and learning about every day. I will share more of my ideas in future blog posts.

In interviews, candidates are generally evaluated on the following four aspects:

  • Coding: Python (Algorithm/Data Structure) + SQL
  • Machine Learning System Design
  • A/B Testing Design + Statistics
  • Resume Projects

What Shouldn’t You Do as a Data Scientist?

1. Don’t think the algorithm is the one and only answer.

In many mature products that already use Machine Learning, it is unlikely that product performance can be suddenly improved in the short term by devising a brand-new algorithm. Such opportunities do exist in the transition from traditional Machine Learning to Deep Learning, especially in products with huge data scale. And in the long run, we can look forward to an algorithmic breakthrough every five years or so. But in many real-world cases, it is the data that really improves system performance: new information, correctly processed, is what brings in the value.

We are living in an era where computing power and general-purpose algorithms are infrastructure, like water and electricity. With cloud services, everyone can easily leverage machine learning and build their own data products. Precise, novel, digitized data that has not been mined before is the core of your work.

2. Don’t overlook real business needs!

We can spend a lot of time building a deep learning model that improves an offline metric by 1%. However, this offline 1% may not carry over to online evaluation and may be of no help to real business needs. How to translate real business needs into a data + machine learning solution, and how to align our model training target with the final business objective, are the first things we need to think about in our work.

3. Don’t Ignore the Whole Picture.

If you are trapped in your own project and do not have a grasp of the product’s big picture, you will first lose the opportunity to discover new areas of growth, and secondly, the marginal return on your effort will be greatly reduced. If it takes six months to improve a certain offline metric by 1%, and no one pays much attention to that 1% improvement, then the return on investment of the project is very low. Constantly thinking about the overall direction empowers us to find unexplored territory, so that we can lead from 0 to 1.

What’s Next

This is the first post in my series of reflections on the role of a Data Scientist. The following posts will discuss more about machine learning algorithms, what a real machine learning system looks like, my (painful) LeetCode practice, and the new things I learn at work every day.

You are welcome to follow my personal blog, www.thelimiao.com! I will share more of my learning and growth in the future.

STAY TUNED
