I am frequently asked about the differences between data-related positions, and I often see confusion about them online. I therefore decided to write a brief guide to the roles and the skills each position requires.
Positions
Data Engineer (analogous to big data software engineer)
Typical Education: B.A./B.S.
Common Tools: Spark, Flink, Hadoop, NoSQL
Languages: Java, Scala, Python
Where they are hired: Very large companies, mid-sized tech companies, and startups.
Required Skills: Distributed systems (important), data structures/algorithms (very important), databases (important), programming (very important)
Data engineers, or big data software engineers, generally set up, develop, and monitor the organization’s data infrastructure. They also integrate or productionize the models designed by data scientists. More specifically, data engineers set up pipelines that let data scientists easily experiment with data, and they build the production pipelines for services. For instance, data engineers might set up a data lake and a Spark cluster that data scientists then pull data from and submit jobs to. If the data science team then creates a new model, the data engineering team optimizes it and deploys it into production in conjunction with the engineering team.
Data Scientist
Typical Education: M.S. or PhD
Common Tools: Scikit-learn, Pandas, NumPy, XGBoost
Languages: SQL, R, Python
Where they are hired: Large and mid-sized organizations and tech startups
Skills: Statistics (important), databases (somewhat important), programming (important), linear algebra (somewhat important), business knowledge (somewhat important), distributed systems (somewhat important), feature extraction, data visualization
The definition of a data scientist varies widely between organizations. At some places a data scientist is closer to a data engineer, and at others closer to a research scientist. In general, data scientists try to answer business questions and propose possible solutions. They often begin with a vague question like "how do we increase user retention," figure out what data they need and how to collect it, analyze it, and then propose a solution. Data scientists frequently use machine learning techniques in their solutions. For instance, to retain users, data scientists might build a model that predicts which users are most likely to leave the site, then use those predictions to target those users with a specific enticement to stay.
Unlike research scientists, they generally don’t specialize in any one area of predictive modeling and instead use whatever tool is best for the job, whether it’s trees, deep learning, or simple regression.
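To make the retention example above a little more concrete, here is a minimal sketch of a churn model in Python. Everything in it is an assumption for illustration: the two features, the toy history, the user names, and the SCALE constant are made up, and the model is a hand-rolled logistic regression using only the standard library so the snippet runs anywhere. In practice a data scientist would more likely reach for a library such as scikit-learn and a far larger dataset.

```python
import math

# Hypothetical history: (days_since_last_visit, sessions_last_month) -> 1 if the user left.
HISTORY = [
    ((30, 1), 1), ((25, 2), 1), ((20, 0), 1),
    ((2, 20), 0), ((1, 15), 0), ((4, 25), 0),
]
SCALE = 30.0  # assumed scaling constant; keeps features roughly in [0, 1]


def scale(features):
    """Rescale raw features so gradient descent behaves well."""
    return [value / SCALE for value in features]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def churn_probability(model, features):
    """Estimated probability that a user with these features leaves."""
    weights, bias = model
    x = scale(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(history, lr=0.5, epochs=2000):
    """Fit a logistic regression with plain batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(history)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, left in history:
            error = churn_probability((weights, bias), features) - left
            for j, xi in enumerate(scale(features)):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


model = train(HISTORY)

# Hypothetical current users to score.
current_users = {"alice": (28, 1), "bob": (2, 19)}
scores = {name: churn_probability(model, f) for name, f in current_users.items()}

# Users most likely to leave come first; these would get the retention offer.
at_risk = sorted(scores, key=scores.get, reverse=True)
for name in at_risk:
    print(f"{name}: {scores[name]:.2f}")
```

The ranked list is the part the business acts on: the users at the top are the ones who would be targeted with the enticement to stay.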
Data Analyst
Common Tools: Excel, Access, Tableau
Languages: SQL, VBA
Skills Required: Basic SQL/database knowledge, basic programming, Microsoft products.
Where they are hired: Organizations of all sizes in all industries
Data analysts are similar to data scientists in their job goals; however, they often have a more limited scope and toolset. Data analysts generally produce basic reports and visualizations for specific problems and present the results. They generally do not do much predictive modeling or detailed statistics.
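As a rough illustration of the kind of basic report described above, the sketch below runs a single SQL aggregation and prints the result. The orders table, its columns, and the figures are invented for the example, and Python's built-in sqlite3 module is used only as a convenient way to run the query; an analyst might just as well run the same SQL against a company database and chart the output in Excel or Tableau.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 120.0), ("East", 80.0), ("West", 50.0)],
)

# The report itself: order count and revenue per region, largest first.
report = conn.execute(
    """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()
conn.close()

for region, order_count, revenue in report:
    print(f"{region:<6}{order_count:>4}{revenue:>10.2f}")
```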
Research Scientist
Typical education: PhD
Common Tools: Caffe, Torch, TensorFlow, NumPy
Languages: MATLAB, Python
Skills/Knowledge: linear algebra/calculus (very important), statistics (important), programming (somewhat important).
Where they are hired: large tech companies and data/ml startups
Research scientists usually specialize in a specific area such as natural language processing (NLP) or computer vision (CV). As the name suggests, they are most concerned with research and publication. They mainly work on finding novel methods within their field and publishing the results. Although they may sometimes work on business problems, their primary priority is research in their field of expertise.
Research Engineer
Typical education: B.S./M.S.
Languages: C, C++, Python, CUDA
Skills/Knowledge: programming (very important)
Where they are hired: Very large tech companies, specialized data startups
A research engineer is to a research scientist as a data engineer is to a data scientist. Research engineers tend to support research scientists by implementing and testing the algorithms they develop. They write code, usually in C or C++, to create optimized computational platforms and implementations of machine learning algorithms. They are usually found only at very large companies like Google and Facebook.