One of the things that can be quite overwhelming about joining the field of Data Science is the amount of terminology you have to become familiar with to advance in the field.
We can divide the field’s terminology into several subcategories: general vocabulary, the vocabulary of your area of focus, tool vocabulary, and workflow vocabulary.
For a newcomer, the sheer number of terms you need to memorize and understand can be very discouraging and downright confusing, especially when you need to look up specific information but don’t know exactly what to search for.
I was there once, too. I was a beginner who got super lost in the sea of terminology, not to mention the actual implementation of different ideas. So, I decided to write this article – and maybe a few more – about the different terms in the field and what they mean, to help those who are new to the field or thinking of joining in.
This article focuses on clarifying the terminology for the different steps of a data science workflow. I realize that the workflow may vary slightly based on the type of project; however, some common, basic steps occur in most data science projects.
So, let’s get right to it…
Data Exploration
Data science is all about the data: how and why was the data collected? How is it structured, and what story does it tell? Answering these questions is the first and most essential step of any data science project.
To answer them, the data scientist performs a mix of manual and automated analysis techniques to better understand what the data represents. Data exploration can also be done by plotting the data using different tools; doing so helps reveal patterns and trends within the data, leading to more meaningful analysis.
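To make this concrete, here is a minimal exploration sketch using pandas and matplotlib. The file name "sales.csv" and its columns are hypothetical; any real dataset will need its own checks.

```python
# A minimal exploration sketch, assuming a hypothetical CSV file "sales.csv"
# with some numeric columns; real datasets and column names will differ.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

# Quick automatic summaries: shape, data types, and basic statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# A simple visual check: histograms of every numeric column to spot
# skew, outliers, and obvious patterns before deeper analysis.
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()
```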
Data Mining
Data mining is the process of structuring, analyzing, and processing raw data in order to find patterns and anomalies using mathematical and computational algorithms. In other words, data mining is a technique for gathering actionable insights from a dataset and using them to build something useful.
To gain information and insights from data, it needs to be clean, structured, and organized; all of that falls under the term data mining. Data mining is significant because it helps make more sense of the data and assists with decisions that could be useful for future data gathering.
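As one small, illustrative example of finding patterns, the sketch below groups similar records with k-means clustering. The file "customers.csv" and its columns are made up, and clustering is just one of many mining techniques.

```python
# A hedged sketch of mining patterns from tabular data with k-means;
# the data source and column names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")
features = df[["annual_spend", "visits_per_month"]].dropna()

# Scale the features so both contribute equally to the distance metric.
scaled = StandardScaler().fit_transform(features)

# Group similar customers together and attach the cluster labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
features["cluster"] = kmeans.fit_predict(scaled)

# Inspect each cluster's average behaviour to interpret the pattern found.
print(features.groupby("cluster").mean())
```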
Data Pipelines
When we work on any data science project, the data needs to go through a set sequence of processes in order to produce valuable results. This sequence of processes is called a data pipeline. In a data pipeline, each process’s output is the input of the process after it, starting with the raw data and ending with the desired output.
Often, these pipelines consist of three elements: sources, processing steps, and a destination. These three elements depend on the application in question and the results we are after.
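Here is a toy sketch of that chaining, where each step’s output feeds the next. The data and step names are invented for illustration; real pipelines would read from and write to actual systems.

```python
# A toy pipeline sketch: source -> processing step -> destination.

def extract():
    # Source: in a real pipeline this might read from an API or database.
    return [{"city": "Oslo", "temp_c": "21"}, {"city": "Cairo", "temp_c": "35"}]

def transform(records):
    # Processing step: convert types and derive a new field.
    return [{**r, "temp_f": float(r["temp_c"]) * 9 / 5 + 32} for r in records]

def load(records):
    # Destination: here we just print, but this could write to a warehouse.
    for r in records:
        print(f'{r["city"]}: {r["temp_f"]:.1f}F')

# Each process's output is the input of the next.
load(transform(extract()))
```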
Data Wrangling
Data wrangling is an umbrella term for collecting, selecting, and transforming data to answer an analytical question. It is also referred to as data cleaning or data munging. The whole purpose of data wrangling is to make the values consistent across all datasets.
Data wrangling often takes about 80% of the project’s time, while modeling and exploring the data takes up the remaining 20%.
When data scientists wrangle data, they often aim to transform it into one of four structures: an analytic base table (the most common), denormalized transactions, time series, or document libraries.
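The sketch below shows the "make values consistent" part of wrangling on a made-up DataFrame whose country labels and numeric fields disagree; the actual cleaning rules always depend on the data at hand.

```python
# A hedged wrangling sketch on an invented DataFrame with inconsistent values.
import pandas as pd

raw = pd.DataFrame({
    "country": ["USA", "U.S.A.", "usa", "Norway", None],
    "revenue": ["1,200", "950", None, "700", "300"],
})

clean = raw.copy()

# Standardize the categorical labels so "USA", "U.S.A.", and "usa" match.
clean["country"] = (
    clean["country"]
    .str.upper()
    .str.replace(".", "", regex=False)
    .fillna("UNKNOWN")
)

# Turn the revenue strings into numbers and fill in missing values.
clean["revenue"] = pd.to_numeric(clean["revenue"].str.replace(",", "", regex=False))
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())

print(clean)
```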
ETL
The ETL process is divided into three sub-processes: Extract, Transform, and Load. ETL is typically applied to data that is not yet in a form ready for analysis, in order to optimize it for analytics. Although the name of the process implies three fixed steps, the steps often vary based on the ETL tool used.
Performing ETL is important for obtaining accurate analysis results. The results of your analysis are only as good as your input data; that’s why ETL is a key step in getting the data into the best structure for analysis and modeling.
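A minimal ETL sketch might look like the following: extract rows from a hypothetical CSV file, transform them into an analysis-friendly shape, and load them into a local SQLite table. Real ETL tools vary widely in how they implement each stage.

```python
# A minimal ETL sketch; the file "orders.csv" and its columns are assumptions.
import sqlite3
import pandas as pd

# Extract: pull the raw data from its source.
orders = pd.read_csv("orders.csv")  # assumed columns: order_id, price, quantity

# Transform: derive the fields the analysis actually needs.
orders["total"] = orders["price"] * orders["quantity"]
orders = orders[["order_id", "total"]]

# Load: write the cleaned table to the destination database.
with sqlite3.connect("analytics.db") as conn:
    orders.to_sql("order_totals", conn, if_exists="replace", index=False)
```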
Web Scraping
While all the terms we have mentioned so far describe steps performed on data that has already been gathered, web scraping is the process of searching and scraping the web to gather the data we need.
During web scraping, the developer or data scientist writes scripts to automatically gather information about a specific topic and fetch it for analysis and modeling. Many tools can be used to collect data from the web, such as BeautifulSoup and Scrapy in Python, and Cheerio for JavaScript.
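Here is a small scraping sketch with requests and BeautifulSoup. The URL and the tags it looks for are placeholders; real pages need their own selectors, and you should always check that a site permits scraping.

```python
# A hedged scraping sketch; the URL and tag choices are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # placeholder page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> element as a stand-in for article headlines.
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
for title in headlines:
    print(title)
```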
Takeaways
Getting into data science can be overwhelming for many reasons, from the vast amount of resources online to the many, many terms one must know to fully immerse oneself in the field and start building meaningful projects.
One of the things that helped me get a better grasp of the field and know exactly what to look for and read was building a dictionary of the field’s basic terminology that I could refer to whenever I got lost or confused.
This article is a part of this dictionary, and I am planning to write more articles about focus-specific topics (such as Machine Learning and statistical tools) in the future to help anyone lost or new to the field.