Data Mining in Brief

Sidath Munasinghe
Towards Data Science
4 min read · Dec 23, 2017

Data mining is a very popular topic nowadays. Unlike a few years ago, almost everything is now bound up with data, and we are capable of handling data at this scale well.

By collecting and inspecting data, people are able to discover patterns. Even if a data set looks like junk as a whole, it can contain hidden patterns, and combining multiple data sources can extract them to provide valuable insights. This is called data mining.

Data mining often combines various sources of data. These include enterprise data that an organization keeps secured and that raises privacy issues, and sometimes several sources are integrated, such as third-party data, customer demographics and financial data. The amount of data available is a critical factor here. Since we are going to discover patterns and correlations in sequential or non-sequential data, and we need to judge whether the data obtained is of good quality, the more data we have, the better.

Let’s start with an example. Assume we have the login logs of a web application. Taken as a whole, this set of data has no obvious value. It may contain the username of each user, the login timestamp, the time spent before logging out, the activities performed and so on.

At first glance it is a complete mess. But we can analyze it to extract some useful information.

For example, this data can be used to find the regular habits of a particular user. Further, it will help to find the peak hours of the system. This extracted information can be used to increase the efficiency of the system and make it more user-friendly.
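As a rough illustration of the login example, here is a minimal Python sketch. The records, the field layout and the function names (`login_hour`, `peak_hour`, `usual_hour`) are all made up for this post; a real log would need its own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical login records: (username, login timestamp, minutes until logout).
LOGINS = [
    ("alice", "2017-12-18 09:05", 42),
    ("alice", "2017-12-19 09:12", 38),
    ("alice", "2017-12-20 21:40", 5),
    ("bob", "2017-12-18 13:30", 15),
    ("bob", "2017-12-19 09:47", 20),
    ("carol", "2017-12-19 09:55", 60),
]


def login_hour(timestamp):
    """Return the hour of day (0-23) of a 'YYYY-MM-DD HH:MM' timestamp."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour


def peak_hour(records):
    """Hour of day with the most logins across all users."""
    hours = Counter(login_hour(ts) for _, ts, _ in records)
    return hours.most_common(1)[0][0]


def usual_hour(records, user):
    """Hour of day at which the given user logs in most often."""
    hours = Counter(login_hour(ts) for name, ts, _ in records if name == user)
    return hours.most_common(1)[0][0]


print("Peak hour:", peak_hour(LOGINS))
print("Alice usually logs in at hour:", usual_hour(LOGINS, "alice"))
```

With this toy data both lines print 9: most logins happen in the 9 o'clock hour, and that is also when the hypothetical user alice usually logs in.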

However, data mining is not a simple task. It takes a certain amount of time and it requires a special procedure as well.

Data Mining Steps

The basic steps of data mining are as follows:

  1. Data Collection
  2. Data Cleaning
  3. Data Analysis
  4. Interpretation

In more detail:

  1. Data collection: The first step is to collect data. The more information we have, the easier the analysis will be later. We also have to make sure that the source of the data is reliable.
  2. Data cleaning: Since we are getting a large amount of data, we need to make sure that we keep only the necessary data and remove the rest. Otherwise, the unwanted data may lead us to false conclusions.
  3. Data analysis: As the name says, this is where the data is analyzed and patterns are found.
  4. Interpretation: Finally, the analyzed data is interpreted to draw important conclusions, such as predictions.
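The four steps can be sketched as a tiny Python pipeline. This is only an illustration under simple assumptions: the records are hard-coded, a repeated row is treated as a duplicate, and the function names and the 30-minute threshold are invented for the example.

```python
from statistics import mean


def collect():
    """Step 1: collect raw records (hard-coded here; normally read from a source)."""
    return [
        {"user": "alice", "minutes": "42"},
        {"user": "alice", "minutes": "42"},  # exact duplicate
        {"user": "bob", "minutes": ""},      # missing value
        {"user": "bob", "minutes": "15"},
        {"user": "carol", "minutes": "-3"},  # impossible value
        {"user": "carol", "minutes": "60"},
    ]


def clean(raw):
    """Step 2: drop duplicates and records with missing or impossible values."""
    seen, cleaned = set(), []
    for row in raw:
        key = (row["user"], row["minutes"])
        if key in seen:
            continue
        seen.add(key)
        try:
            minutes = int(row["minutes"])
        except ValueError:
            continue  # not a number, e.g. an empty field
        if minutes < 0:
            continue  # a session cannot have negative length
        cleaned.append({"user": row["user"], "minutes": minutes})
    return cleaned


def analyze(rows):
    """Step 3: compute the average session length per user."""
    per_user = {}
    for row in rows:
        per_user.setdefault(row["user"], []).append(row["minutes"])
    return {user: mean(values) for user, values in per_user.items()}


def interpret(averages, threshold=30):
    """Step 4: conclude which users count as heavy users."""
    return sorted(user for user, avg in averages.items() if avg >= threshold)


print(interpret(analyze(clean(collect()))))
```

Skipping the cleaning step here would feed the empty and negative values into the analysis, which is exactly how unwanted data leads to false conclusions.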

Data Mining Models

There are different kinds of models associated with data mining:

  1. Descriptive modeling
  2. Predictive modeling
  3. Prescriptive modeling

Descriptive Modeling detects the similarities in the collected data and the reasons behind them. This is very important in constructing the final conclusion from the data set.

Predictive Modeling analyzes past data to predict future behavior. Past data gives some hint about the future.
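One of the simplest ways to let past data hint at the future is to fit a straight line through it and extend the line. The sketch below uses ordinary least squares for that; it is just one possible technique, and the daily login counts are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical past data: number of logins on each of five days.
days = [1, 2, 3, 4, 5]
logins = [100, 110, 120, 130, 140]

slope, intercept = fit_line(days, logins)
print(predict(slope, intercept, 6))  # prediction for day 6
```

Since the made-up counts grow by exactly 10 per day, the line predicts 150.0 logins for day 6. Real data is rarely this regular, so a prediction like this should be treated as a hint rather than a guarantee.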

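Prescriptive Modeling goes a step further and recommends what action to take based on those predictions. With the significant growth of the web, text mining has been added as a related discipline to data mining. The data has to be processed, filtered and analyzed properly to create such predictive models.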

Applications of Data Mining

http://slideplayer.com/6218639/20/images/21/Data+Mining+Applications.jpg

Data mining is useful in many ways. It can be applied effectively in marketing: by analyzing the behavior of customers, we can target advertising more closely to them.

It helps to identify customer trends for goods in the market, and it allows the retailer to understand the purchase behavior of a buyer.
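One simple way to look at purchase behavior is to count which items are bought together. The following Python sketch does that for a handful of invented shopping baskets; the data and the `pair_counts` helper are assumptions made for this example, not a full market-basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per purchase.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort first so the same pair always appears in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(BASKETS)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

In this toy data, bread and butter appear together in three of the four baskets, which a retailer could use, for instance, when deciding which products to place or promote together.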

In the education domain, we can identify the learning behaviors of students, and learning institutions can upgrade their modules and courses accordingly.

We can use data mining to help deal with natural disasters as well. If we can collect enough information, we can use it to predict things like landslides, rainfall and tsunamis.

There are many more applications of data mining nowadays. They range from very simple things like marketing to very complex domains like predicting environmental disasters.

Special Remarks

  • Data mining should not be used when a complete, accurate solution to a particular problem is possible. When such a solution is not possible, we can use data mining techniques with lots of data to characterize the problem as an input-output relationship.
  • We need to analyze the properties of the problem to determine whether it is a Classification problem (discrete output, e.g. True or False) or an Estimation problem (continuous output, e.g. real numbers between 0 and 1).
  • The inputs should carry sufficient information to produce an accurate output. Otherwise, accuracy will inevitably decrease.
  • There should be enough data to produce an accurate result, and we need to select a proper algorithm according to the input data. Some algorithms need a large amount of data to reach good accuracy, while others get there quickly.
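
To make the difference between Classification and Estimation concrete, here is a small Python sketch. It uses a k-nearest-neighbours approach purely as an example algorithm, and the data points, function names and the choice of k=3 are assumptions made for illustration.

```python
from collections import Counter


def nearest(train, x, k):
    """Return the outputs of the k training points whose input is closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [output for _, output in ranked[:k]]


def classify(train, x, k=3):
    """Classification: discrete output, chosen by majority vote."""
    return Counter(nearest(train, x, k)).most_common(1)[0][0]


def estimate(train, x, k=3):
    """Estimation: continuous output, the mean of the neighbours' values."""
    values = nearest(train, x, k)
    return sum(values) / len(values)


# Hypothetical data: minutes spent on a site -> did the user buy? (discrete)
bought = [(2, False), (4, False), (5, False), (20, True), (25, True), (30, True)]
# Hypothetical data: minutes spent on a site -> purchase probability (continuous)
probability = [(2, 0.05), (4, 0.10), (5, 0.15), (20, 0.70), (25, 0.80), (30, 0.90)]

print(classify(bought, 22))        # a True/False answer
print(estimate(probability, 22))   # a number between 0 and 1
```

The same inputs drive both functions; only the kind of output differs. For a visitor who spends 22 minutes, the classifier answers True, while the estimator returns a value of roughly 0.8.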

Thanks for reading…

Cheers!

Senior Tech Lead | AWS Certified | Technical Content Writer | MSc in Data Science | AWS Community Builder