While browsing through Data Science articles and tutorials, both online and on Medium, I noticed most are geared toward intermediate developers and beyond. Even the beginner-friendly tutorials use a series of buzzwords that not everyone may know. So, the goal of this article is to review that terminology and add some clarity to the wide world of Data Science.
This won’t be a fully comprehensive list; there are simply too many terms to cover them all. But it will include some of the important ones. If there are any vocabulary terms you think should be included, feel free to add them in the comments.
Data Science
It’s only fitting to define the whole reason we’re here first. Data Science is the analysis of data, usually in large amounts. The goal is to provide meaning to that information to solve problems or make decisions. It encompasses the actual studying of the data, the data itself, the visualization, the prediction, the decision-making process, and so on.
Big Data
The reason I include Big Data on the list is that "big data" sounds like it could mean any large amount of data. So how big is Big Data? Big Data refers to any data too large to be stored or processed on a single computer, which also means it’s too big for tools like a single SQL database or Excel. It takes more effort to extract meaningful information from it, because single queries over slow processing could take weeks.
But it’s not just the size of the data; it’s also how quickly data is generated. The growth of Big Data is often compared to Moore’s Law, the observation that transistor counts, and with them computing power, roughly double every two years.
Data Mining
With data this large, there must be some way of extracting meaningful information from it. That’s where Data Mining comes in. Data Mining refers to determining the relationships between variables and their outcomes, given a set of data. Typically, this is done by machines at a large scale. It also refers to cleaning and organizing that data. Ideally, it aids decision-making with data that would have been too large to sort and describe by hand. Such tasks include regression and classification, which we’ll get to later.
While finding meaning in data, Data Mining may search for frequent patterns, correlations, clusters, associations, or any predictive analysis that can be made.
Data Warehouse
A Data Warehouse is a repository for data, whether it’s current or even historical. It’s the environment for data mining or decision support to occur. It’s the system that allows the quick analysis of data.
Data Modeling
When you’re making predictions and analyzing data, you need some way to represent that data. Data Modeling is describing that data in a written or documented format; in other words, you create a model based on the data. These models help to explain the outcome of processes.
Data Visualization
Data Modeling talks about creating a model; Data Visualization coincides with it and really details that model. For example, Data Visualization could be viewing the data in a bar chart, a pie chart, a histogram, and so on. Even cartooning based on use cases could be considered Data Visualization. It encompasses all forms of visualizing data.
Data Governance
When you’re dealing with so much data, there must be some set of rules to ensure both integrity and security. That’s why Data Governance exists. As it sounds, Data Governance is the managing, or governing, of data. This consists of the rules, regulations, policies, or even actions set around data that aid in managing it in some way. These rules must comply with both legislation and the policies of the company.
The rules from Data Governance help secure the data, ensure its integrity, and even keep the data relevant.
Data Wrangling
Sorting through Big Data quickly sounds like a great idea. But how can data move easily through the predictions and classifications you’ve created? If you build a system of organization, you need to ensure the data can actually be processed by that system. That’s where you need to wrangle, or control, your data. Simply put, Data Wrangling is "taming" or formatting the raw data. This process helps the data fit into the workflow so meaningful information can be made from it.
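As a minimal sketch of what that "taming" can look like in plain Python, here we trim whitespace, normalize case, convert types, and drop incomplete rows. The records and field names are invented for illustration:

```python
raw_records = [
    {"name": "  Alice ", "age": "34"},
    {"name": "BOB", "age": ""},        # missing age
    {"name": "carol", "age": "29"},
]

def wrangle(records):
    """Tame raw records: trim whitespace, normalize case,
    convert types, and drop rows with missing values."""
    cleaned = []
    for row in records:
        if not row["age"]:             # skip rows missing required fields
            continue
        cleaned.append({
            "name": row["name"].strip().title(),
            "age": int(row["age"]),
        })
    return cleaned

print(wrangle(raw_records))
# Two tidy rows remain; the row with no age is dropped.
```

Real projects usually lean on a library like pandas for this, but the idea is the same: reshape the raw input until it fits your workflow.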
Data Pipelines
This is the last term starting with "data". At some point, you’re going to need to get data from one place to another. For example, you may have your data imported into one database but find you need it in a different one. The series of functions or scripts that pass that data along is the Data Pipeline.
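A pipeline can be sketched as a chain of plain functions, each handing its result to the next. The stage names and stand-in "databases" below are invented for illustration:

```python
def pull_from_source():
    # Stand-in for reading rows out of the first database
    return ["10", "20", "x", "30"]

def clean(rows):
    # Keep only rows that parse as numbers
    return [int(r) for r in rows if r.isdigit()]

def push_to_target(rows, target):
    # Stand-in for writing rows into the second database
    target.extend(rows)
    return target

def run_pipeline(target):
    # Each stage feeds the next: source -> clean -> target
    return push_to_target(clean(pull_from_source()), target)

destination = []
run_pipeline(destination)
print(destination)  # the numeric rows arrive in the new "database"
```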
Sample vs Population
This one may simply be a reminder from math class. A subset of the data, or just a small portion of the overall data, is considered a Sample. The Sample size is usually predetermined, either to help summarize what the remaining data could look like or to feed a Supervised Learning method. We’ll briefly discuss that in just a minute.
If you’re looking at the entire dataset, which is every piece of information you have, you would be looking at the Population.
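The distinction is easy to see in code. Here is a small sketch using the standard library, with an invented population of 100 values and an assumed sample size of 10:

```python
import random

population = list(range(1, 101))        # every piece of data we have

random.seed(42)                         # fixed seed so the draw is repeatable
sample = random.sample(population, 10)  # predetermined sample size of 10

print(len(population), len(sample))
# Every sampled value comes from the population
print(all(x in population for x in sample))
```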
Algorithm
Another more mathematical definition, but one we’ll need for the next term. An Algorithm is a repeatable set of instructions with the goal of processing data. For our purposes, it refers to the steps a machine uses to accomplish a task.
Machine Learning
Machine Learning is the process where a machine analyzes data using an Algorithm to make predictions about outcomes. But it’s not just about predictions; it’s also about classifying, categorizing, clustering, or grouping the data, or at least gaining more understanding of it. Of course, this is a very brief simplification, and there are many articles devoted entirely to explaining Machine Learning if you want to read more about it in particular.
Supervised Learning
For Machine Learning to use an Algorithm, it must first pick a learning method. The first one we’ll talk about is Supervised Learning. With Supervised Learning, you need a Sample of the data whose outcomes are already labeled. This Sample is used to gain a general understanding of which similarities define categories, or which patterns predict outcomes. The method requires this training set to learn what kinds of data there are before processing the whole data set.
Unsupervised Learning
The next learning method is Unsupervised Learning. Unlike Supervised Learning, Unsupervised Learning does not use a labeled Sample. Instead, it processes the full Population of data and learns more about it as it goes. This makes the method much more adaptable to new cases, since it can rearrange its procedure as new data is discovered.
Reinforcement Learning
The final Machine Learning method we’ll review is Reinforcement Learning. If you’ve ever owned a dog, you know there are two ways to teach it right from wrong. Positive reinforcement provides something desirable, such as giving your dog a treat. Negative reinforcement involves taking something away instead. In the case of Machine Learning, it would be like removing a condition that causes an incorrect result.
If you want to learn more in-depth about Supervised, Unsupervised, and Reinforcement Learning, I wrote an article you can check out.
Unstructured Data
As it sounds, Unstructured Data does not fit inside a predetermined model. It’s any data that may be difficult to classify.
Structured Data
Structured Data is the opposite of Unstructured Data. It fits the database model or any other data model that has been predefined.
Classification
Classification is a method inside Supervised Learning. It’s a way to categorize data based on how similar it is to other data points. The learning method compares common traits to "classify" data points as being more like each other. Those categories can then be applied to any new data with similar traits.
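A nearest-neighbor classifier is one of the simplest ways to see this idea in code: a new point gets the label of the most similar known point. The labeled points, feature values, and category names below are all invented:

```python
labeled_points = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def classify(point):
    """Assign the label of the most similar (closest) known point."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = min(labeled_points, key=lambda item: distance(point, item[0]))
    return nearest[1]

print(classify((1.1, 0.9)))  # lands near the "small" group
print(classify((8.5, 8.8)))  # lands near the "large" group
```

In practice you would reach for a library such as scikit-learn, but the principle is the same: shared traits put new data into an existing category.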
Regression
Another Supervised Learning method is Regression. This method determines how one tracked value affects the values of other fields in the dataset. An example would be how the square footage of a house affects its price. These variables are typically continuous, meaning they can take any value in a range rather than falling into discrete categories. In the example, square footage can increase or decrease between houses, as can the associated price.
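The square-footage example can be fit with ordinary least squares in a few lines of plain Python. The footage and price figures are invented, and deliberately lie on a perfect line so the fitted slope is easy to check:

```python
sqft  = [1000, 1500, 2000, 2500, 3000]
price = [200_000, 300_000, 400_000, 500_000, 600_000]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
         / sum((x - mean_x) ** 2 for x in sqft))
intercept = mean_y - slope * mean_x

print(slope, intercept)          # here: 200 dollars per square foot, 0 intercept
print(slope * 1800 + intercept)  # predicted price for an 1800 sq ft house
```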
Clustering
We’ve looked at Supervised Learning types, but what about Unsupervised Learning types? Clustering is an Unsupervised Learning method where data points are compared by how similar they are. Similar points can be grouped together: the more similar they are, the closer the data points lie, while dissimilar points are placed further apart. This is highly adaptable to new data, but it can also grow in complexity as more features are added.
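A tiny k-means sketch shows the grouping idea: points are assigned to their nearest centroid, then each centroid moves to the mean of its group. The one-dimensional points, starting centroids, and choice of two clusters are all assumptions for illustration:

```python
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]

def k_means(data, centroids, steps=10):
    clusters = []
    for _ in range(steps):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids)  # settles near the two obvious groups
print(clusters)
```

No labels were given; the algorithm discovered the two groups purely from how close the points lie to one another.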
Correlation
Correlation refers to how values are related to one another; they typically move either in the same direction or in opposite directions. In a positive correlation, when one value increases, the other also increases. An example is the number of hours spent studying tracking, or directly following, test scores. In a negative correlation, when one value increases, the other decreases. For example, if you split a hotel tab equally with a group, the more people there are, the less the room costs you.
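Both examples can be checked with the Pearson correlation coefficient, computed here by hand. The study-hours, test-score, and hotel-cost numbers are invented for illustration:

```python
def correlation(xs, ys):
    """Pearson correlation: covariance scaled to the range -1..+1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours  = [1, 2, 3, 4, 5]
scores = [55, 63, 71, 80, 92]
print(correlation(hours, scores))      # near +1: more study, higher scores

people    = [1, 2, 3, 4, 5]
cost_each = [500, 250, 167, 125, 100]  # a 500-dollar room split evenly
print(correlation(people, cost_each))  # negative: more people, lower cost each
```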
Neural Networks
Loosely based on the neural connections in the brain, a Neural Network is a Machine Learning method that organizes a system into layers connected by nodes. It takes input, produces output, and has one or more hidden layers where most of the decision-making occurs.
Deep Learning
Deep Learning is a Machine Learning method that uses Neural Networks. Its goal is to act closer to the human brain: it makes informed decisions and tries to learn from each decision it makes. Although it must start with simple problems or patterns, over time it learns more and can move on to more complex issues, trying to improve its accuracy as it goes. In terms of Artificial Intelligence (we’ll get a brief description of that too), Deep Learning is used for speech recognition, image recognition, translation, and so on.
Decision Trees
Decision Trees are a part of Machine Learning that allows you to visualize decisions being made. From the tree’s branches, paths split depending on the decision made, usually Yes or No results that divide the data. Eventually, when you drill down, you reach a leaf, which represents the result. However, Decision Trees tend to overfit the data, which means their answers may not always be the most accurate.
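A decision tree reads naturally as nested Yes/No questions. The sketch below hand-writes one; the fruit features and thresholds are invented, and a real tree would be learned from data rather than written by hand:

```python
def classify_fruit(weight_grams, is_red):
    """A tiny hand-written decision tree: each `if` is a branch,
    each `return` is a leaf."""
    if weight_grams > 120:      # branch: is it heavy?
        if is_red:              # branch: is it red?
            return "apple"      # leaf
        return "pear"           # leaf
    return "cherry"             # leaf

print(classify_fruit(150, True))    # heavy and red
print(classify_fruit(150, False))   # heavy, not red
print(classify_fruit(8, True))      # light
```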
Back Propagation
Back Propagation is a little harder to define in a summary. In a Neural Network, the predicted output is compared to the actual output. In Back Propagation, if the error is high, the weights and biases are updated to minimize the loss (the measured error) and increase accuracy.
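That update loop can be sketched with a single "neuron" (one weight, one bias, no activation) trained by gradient descent. The training pairs, which follow y = 2x + 1, and the learning rate are assumptions for illustration:

```python
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # samples of y = 2x + 1
w, b = 0.0, 0.0                               # initial weight and bias
lr = 0.1                                      # learning rate

for _ in range(500):
    for x, y_true in data:
        y_pred = w * x + b         # forward pass: predict
        error = y_pred - y_true    # compare prediction to actual output
        # Backward pass: gradients of the squared error
        # nudge w and b to shrink the loss
        w -= lr * 2 * error * x
        b -= lr * 2 * error

print(round(w, 2), round(b, 2))   # settles near w = 2, b = 1
```

In a real network the same idea repeats layer by layer, with the error propagated backwards from the output, which is where the name comes from.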
ETL
ETL stands for Extract, Transform, Load. It’s what Data Warehouses use to make raw data usable. It refers to extracting data from sources, transforming that data into models while enforcing its quality, and loading it into a format that appropriately presents the data. It’s a major part of Data Science because ETL processes are used throughout businesses, not only for Machine Learning and Big Data.
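The three stages can be sketched with the standard library: extract rows from a CSV source, transform them while enforcing quality, and load them into a target structure. The file contents and field names are invented:

```python
import csv
import io

raw_csv = """name,revenue
north,1200
south,not_a_number
east,900
"""

def extract(source):
    # Extract: read raw rows from the source
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    # Transform: enforce quality by keeping only valid numeric revenue
    return [{"region": r["name"].title(), "revenue": int(r["revenue"])}
            for r in rows if r["revenue"].isdigit()]

def load(rows, warehouse):
    # Load: write the clean rows into the target structure
    for r in rows:
        warehouse[r["region"]] = r["revenue"]
    return warehouse

warehouse = load(transform(extract(raw_csv)), {})
print(warehouse)   # {'North': 1200, 'East': 900}
```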
Web Scraping
Web Scraping is pulling data from the source code of a website so you can collect or filter the information you’re looking for. Typically, you write code to gather the text from the website, then create a script to sort out what you need or simply print the results you found.
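Here is a minimal sketch using only the standard library’s HTML parser. Real scraping would first fetch the page (with `urllib` or a library like `requests`); the HTML below is an invented string so the example stays self-contained:

```python
from html.parser import HTMLParser

page_source = """
<html><body>
  <h1>Store</h1>
  <p class="price">$10</p>
  <p>About us</p>
  <p class="price">$25</p>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collect the text of every <p class="price"> tag."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())

scraper = PriceScraper()
scraper.feed(page_source)
print(scraper.prices)   # ['$10', '$25']
```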
Business Intelligence
Business Intelligence (BI) is a descriptive way to analyze business metrics. Even though it’s business-oriented, the work itself is still technical: it involves generating reports and finding important trends. Describing that data relies on Data Visualization.
Artificial Intelligence (AI)
We’ll stay a little vague on this one. Artificial Intelligence refers to machines that use Machine Learning to adapt and learn as they work. Chatbots, for example, are considered AI, and they learn the more they are used. But AI can also be built to solve problems or make predictions. It’s a large field that encompasses a variety of machines, from non-playable video game characters that learn how you play during a boss fight to self-driving cars.
Augmented Reality (AR)
Again, this will be a little vague, but there are articles dedicated to describing AR if you want to learn more. Augmented Reality is basically enhancing the reality you live in, meaning you add embedded information to the world around you. For example, I want to build a smart mirror. As I brush my teeth in the morning, a smart mirror would display the weather or even news headlines. AR would enhance daily life by providing more information at a glance.
Conclusion
We covered a lot of terms. A few of them were common buzzwords, and a few others are used less often, but I tried to cover at least some of the important ones. Hopefully, having this list to review any time you want is helpful, and the definitions made the Data Science terminology a little easier to understand. Again, this is not a fully comprehensive list; there’s a lot more I likely missed, but it’s at least a good place to get you started. If you think I missed any important terms, feel free to add them in the comments. Until next time, cheers!
Check out some of my recent articles:
Something I Learned This Week – Entity Framework Is Picky About Primary Keys
Why Data Science Is Important for All Developers