
Data Science Tactics – A new way to approach data science

An efficient way to think about data science

A tactic is a sequence of actions aimed at achieving a certain goal. The word tactic originates from military warfare and is supposed to date back to 1626. One of the classic tactics is the Oblique Order, used in Greek warfare. The first recorded use of the Oblique Order tactic was in 371 BCE at the Battle of Leuctra, in Greece, when the Thebans defeated the Spartans. The sequence in this tactic is as follows:

  • The attacking army focusses its forces on a single flank, making its left flank much stronger than its right
  • The strong left flank advances rapidly, while the right flank avoids conflict as much as possible
  • The enemy flank facing the attack is outnumbered and defeated
  • The left flank then turns to defeat the rest of the enemy line
Oblique Order tactic

In chess, a tactic refers to a sequence of moves that limits the opponent’s options and may result in a tangible gain. In football, a counter-attack at high speed is an example of a tactic.

In business, one useful tactic is offering discounts in order to promote products.

Tactics are a very important part of any domain, be it a game or a business. They help beginners understand what good and bad tactics look like. They help managers, such as football managers, plan the team based on the tactics chosen for the game.

Advantages of Tactics

So why have tactics at all? They give a way to think and plan before executing, a way to approach the war or the game. One cannot start waging a war or playing football without first thinking about the approach. There is a lot at stake, so it is necessary to think of an approach first.

Tactics also give everyone involved a way to act and communicate. In football, once the game starts, it is not easy for players to communicate, given the speed of the game, the distance between the players, and the crowd noise. But with a tactic in place, the players can take positions very rapidly based on the situation of the game.

A tactic is also a way for a leader to communicate with the team. Before any football game, the manager holds a briefing session where she or he explains the tactics to the players. Imagine a situation with no such briefing before a match: the players would be confused about how to play, and it would be a disaster for the team.

One of the most important ways a tactic helps is in communicating with the outside world. People who are not directly involved in a match or a war are still interested in how the war was fought or how the game was won. Various books and articles have documented tactics, and these documents help us understand which tactics won and which did not. These lessons from the past are very useful resources for future leaders in the field.

What does a tactic mean in the data science world

The concept of a tactic is not much used in the world of data science. Even though the concept is very useful, the word tactic is not prevalent among data scientists. One reason is that no real attempt has been made to bring tactical thinking into the domain of data science. However, bringing tactics to data science could be useful both for data scientists and for the data science domain in general.

Let us first see what tactics could mean for the world of data science. As mentioned in the section above, tactics give a way to develop an approach before executing. For a data scientist, tactics can help in deciding an approach to a business problem instead of jumping directly to algorithms. In many cases, data scientists start coding before thinking about how to approach the problem. This may not lead to optimal results or efficient use of time. Knowledge of tactics brings a way to think at a "higher level" and to explore different possibilities.

Tactics can also help avoid the problem of algorithm obsession. If you are algorithm-obsessed, you will always try to use the latest trend in algorithms to solve every problem, without first thinking of an approach. Thinking first of tactics and then of algorithms also helps in selecting the type of algorithm suitable for the problem.

As in football or chess, there is a way to know how past games were won: by describing the tactic used in the game. This knowledge is extremely useful for current coaches and analysts, who study past games to decide on their tactics. In the data science world, there is no such high-level description of a solution. The reason is that there is no universal way of describing a data science solution. Every data scientist describes a solution in her or his own way, and most of the time the description is just a notebook and code.

With the concept of tactics, there is a way to develop a universal definition for describing any data science solution. Such a universal, high-level description can help in understanding a data science solution from an approach perspective rather than by going through code or notebooks.

Tactics in data science illustrated with segmentation

Segmentation, or clustering, is one of the most important data science techniques. It aims to find data records which are similar and assign them to a cluster or segment. A group of similar data records is called a cluster or a segment; both terms are used quite often.

This ability to group many data records into a few clusters or segments is very useful in various domains. In marketing, for example, one can group millions of customers into a few segments. A clothing retailer can segment its customer base into groups such as fashion addicts, price-sensitive shoppers, and discount lovers, and then design specific marketing campaigns for each segment.

Let’s look at different tactics for segmentation

Tactic 1: Identify Segment Formation Visually

Tactic 2: Segmentation with pre-defined number of segments

Tactic 1: Identify Segment Formation Visually

Technically, any data can be clustered. However, it is better to have well-formed clusters which are separate from other clusters. Well-formed, separate clusters help in giving meaning to a cluster. If the clusters are too close together or overlapping, it is difficult to interpret them.

The objective of this tactic is to verify the existence of well-formed and separate clusters. It is advisable to make this verification to determine whether clustering makes sense at all. It can also help decide what kind of clustering algorithm to apply.

Dataset to illustrate the tactic

We will look at an automobile dataset which has information about cars. The dataset is available at the UCI Machine Learning Repository. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)

This dataset consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates.

A snapshot of the dataset is shown here.

Automobile dataset

Tactic Sequence

The tactic sequence is shown here and explained in the following sections.

tactic sequence

Cluster Objective

A clustering exercise needs an objective. The objective helps in selecting relevant features, giving meaning to the clusters, and using the clustering results for business purposes. In this example, let us say that our objective is to group cars by their technical characteristics. This would help in determining how many clusters or groups could be formed based on technical characteristics. It would also help in finding which cars are similar in terms of their technical characteristics.

Feature Elimination

With the objective fixed, we only need data related to technical characteristics and do not need features related to insurance or losses. So we should first remove all features which are not related to the objective before running the clustering algorithm.

Also, we are going to use the Principal Component Analysis (PCA) algorithm for dimension reduction. Since PCA works only on numeric data, we will also remove the categorical features.

So let us keep only the features which are relevant for this objective and which are continuous, namely: num-of-doors, curb-weight, num-of-cylinders, engine-size, city-mpg, highway-mpg, wheel-base, length, width, height, bore, stroke, compression-ratio, horsepower, peak-rpm

We will remove all features which are not related to the clustering objective, such as insurance or losses. We also remove all categorical features such as make, fuel-type, and body-style.
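In pandas, this elimination is just a column selection. A minimal sketch, using a hypothetical two-row slice of the dataset (the real UCI data has around 200 rows and 26 columns):

```python
import pandas as pd

# Hypothetical slice of the automobile dataset, for illustration only
df = pd.DataFrame({
    "make": ["audi", "bmw"],            # categorical, not kept
    "normalized-losses": [164, 192],    # loss-related, not kept
    "curb-weight": [2337, 2395],
    "engine-size": [109, 108],
    "horsepower": [102, 101],
})

# Keep only continuous technical features relevant to the objective
technical_features = ["curb-weight", "engine-size", "horsepower"]
df_technical = df[technical_features]
```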

Standardisation

All the numeric features have different units. For example, the unit of num-of-doors is not the same as that of engine-size. The dimension reduction technique, explained below, is sensitive to scale and can give wrong results if the variables are not in similar units. So we first express all numeric variables in terms of their standard deviations.

Shown here is an example of original values and scaled values after standardisation.

standardisation
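A minimal sketch of this scaling with scikit-learn's StandardScaler; the toy values below are illustrative, not taken from the actual dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. num-of-doors and engine-size
X = np.array([[2.0, 109.0],
              [4.0, 136.0],
              [2.0, 164.0]])

# Subtract each column's mean and divide by its standard deviation
X_scaled = StandardScaler().fit_transform(X)
```

After this step every column has mean 0 and unit standard deviation, so no single feature dominates the distance calculations.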

Dimension Reduction

Even after removing some features, we still have about 15 features. With such a high number of features, it is not feasible to plot any visualization; as humans, we can visualize data in at most three dimensions.

So in this step, we use a dimension reduction technique to reduce the data to two dimensions without much loss of information. Algorithms such as PCA or t-SNE are useful for dimension reduction. Here we will use PCA.

With PCA in two dimensions, we see that the first principal component (let’s call it PC0) captures 40% of the variance and the second principal component (PC1) captures 16%. So with two dimensions, we capture 56% of the variance. As this is more than 50%, it is acceptable, since the first two principal components should capture most of the variance.

pca components
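The reduction step can be sketched with scikit-learn. The random matrix below merely stands in for the standardised feature table, so the variance percentages will of course differ from the article's 40% and 16%:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 15))   # stand-in: 200 cars x 15 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)      # coordinates on PC0 and PC1

# Fraction of the total variance captured by the two components
variance_captured = pca.explained_variance_ratio_.sum()
```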

Let us also see which features in the dataset influence the principal components the most. The influence of each feature is given by its loading on the component, i.e. the corresponding entry of the eigenvector. The loadings for each feature and principal component are shown below.

pc0 eigenvalues
pc1 eigenvalues

As we can see, the first principal component is impacted most in the positive direction by curb-weight and in the negative direction by highway-mpg. Similarly, the second principal component is impacted in the positive direction by peak-rpm and in the negative direction by height.

Visual Cluster Analysis

We can now transform the dataset to two dimensions based on the two principal components. Shown here is the dataset, which has 15 features, now plotted on a 2D scatter plot.

scatter plot

Analysing visually, we see that the dataset could have three possible clusters. There are also some data points which can be considered outliers.

Tactic 2: Segmentation with pre-defined number of segments

In this tactic, we will see how to do segmentation when the number of segments (or clusters) is known. The number of clusters could be given by the business, for example when the need is to segment into a fixed number of segments. Alternatively, the number of clusters may have been determined visually with the tactic given above.

Dataset to illustrate the tactic

We will use the automobile dataset from the previous tactic.

Tactic Sequence

The tactic sequence is shown here and explained in the following sections.

tactic sequence

Cluster Objective

Same as in the previous tactic.

Outlier Removal

As we will use clustering algorithms to make segments, it is important to note that clustering algorithms can be very sensitive to outliers. If extreme outliers are present, the clustering results can be very strange. So it is preferable to remove outliers before using clustering algorithms.

As we saw in the previous tactic (Identify Segment Formation Visually), there are some outliers in the dataset which can be identified visually. Shown below is the PCA plot used in the previous tactic, with the outliers marked with their data record numbers.

removal of outliers

The outliers identified visually can be removed from the dataset based on their data record numbers. The record numbers can be located in the dataset, as shown below, and those records removed.

outlier records
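Removing records by index is a one-liner in pandas. A sketch with toy data; the record numbers below are hypothetical stand-ins for those spotted on the PCA plot:

```python
import pandas as pd

# Toy data standing in for the automobile dataset
df = pd.DataFrame({"curb-weight": [2337, 2395, 4066],
                   "engine-size": [109, 108, 326]})

outlier_records = [2]            # hypothetical record numbers from the plot
df_clean = df.drop(index=outlier_records)
```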

Feature Elimination

With the objective fixed, we only need data related to technical characteristics and do not need features related to insurance or losses. So we should first remove all features which are not related to the objective before running the clustering algorithm.

In the previous tactic, we removed the categorical features because PCA works only on numeric data. In clustering, however, we can keep the categorical features. They have to go through a special treatment called one-hot encoding, which is explained below.

So the features we keep for this objective, both numeric and categorical, are the following: fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, wheel-base, length, width, height, curb-weight, engine-type, num-of-cylinders, engine-size, fuel-system, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg

Correlation

In this step, we will find out which features are correlated. The reason is that we will use these features to analyze the clustering results. As shown in a later step, the cluster analysis is made using a scatter plot. Features which are highly correlated, either positively or negatively, can be used as the axes of the scatter plot. A scatter plot is very effective for cluster analysis if its axes are correlated variables.

Shown here is the feature correlation as a heatmap.

correlation heatmap

A very light color indicates variables which are highly positively correlated, and a very dark color indicates variables which are highly negatively correlated. There are many boxes with very light or very dark colors. So let us select a few such pairs of variables:

  • length and width, which are positively correlated
  • highway-mpg and width, which are negatively correlated
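Computing the correlation matrix behind such a heatmap is straightforward in pandas. A sketch with synthetic data constructed so that the pairs above show the same sign of correlation as in the article:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
length = rng.normal(170.0, 10.0, size=200)
df = pd.DataFrame({
    "length": length,
    "width": 0.4 * length + rng.normal(0, 1, size=200),         # moves with length
    "highway-mpg": -0.3 * length + rng.normal(0, 2, size=200),  # moves against it
})

corr = df.corr()
# seaborn.heatmap(corr, annot=True) would render a heatmap like the article's
```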

Standardisation

Same as in the previous tactic (Identify Segment Formation Visually).

One Hot Encoding

Categorical variables need to be converted into numeric values. This is done by one-hot encoding.

categorical to one hot encoding

For example, the categorical feature body-style is converted into multiple features such as body-style_convertible, body-style_hardtop, body-style_hatchback, body-style_sedan, and body-style_wagon. A 1 is put in the column for the corresponding style; all other values are 0.
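In pandas this is done with get_dummies, which creates one indicator column per distinct value. A minimal sketch on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"body-style": ["convertible", "sedan", "hatchback", "sedan"]})

# One new 0/1 indicator column per distinct body-style value
encoded = pd.get_dummies(df, columns=["body-style"])
```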

Clustering

There are different clustering algorithms. As the assumption here is that the number of clusters is already known, we can use the K-Means algorithm. Based on the tactic Identify Segment Formation Visually for the same dataset, we concluded that three clusters would be a good choice.

In this step, we use the K-Means algorithm to cluster all data points into three clusters. Here are the results, where each data point is assigned to a cluster. The 2D scatter plot shown below is based on the two principal components obtained after dimensionality reduction with PCA (as explained in the previous tactic, Identify Segment Formation Visually). The colors of the points are based on the cluster assignment.

pca clustering
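The clustering step itself can be sketched with scikit-learn's KMeans. The synthetic blobs below stand in for the preprocessed car data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in (0.0, 5.0, 10.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster number (0, 1 or 2) per data point
```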

The clusters are well formed and separate. There is some overlap between clusters, but it is minimal. Cluster 0 and Cluster 1 are more compact, while Cluster 2 seems a bit more spread out. The number of points per cluster is illustrated with the bar graph shown here.

cluster size

Another way of visualising is a scatter plot whose axes are features which are highly correlated (positively or negatively). As we saw in the Correlation step, the following are possible choices of correlated variables:

  • length and width, which are positively correlated
  • highway-mpg and width, which are negatively correlated

The scatter plot based on these features, with a different color for each cluster, is illustrated below.

k means clustering

Giving meaningful labels to clusters

One of the important tasks in segmentation is giving a meaningful name to each cluster or segment. Once the segments are created, it is convenient to call them by meaningful names rather than segment 0, segment 1, and so on. It is also easier to communicate the results of segmentation to business users with useful names. In this tactic, we will see how to give meaning to a cluster.

One possibility is to use the scatter plot shown above to interpret the clusters.

We can see that Cluster 1 contains cars with small length and width, so we can label it the small car segment. Cluster 2 has cars whose length is neither small nor large, so we can label it the medium car segment. Cluster 0 has cars with larger length and width, so we can label it the large car segment.
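In code, this interpretation usually becomes a simple mapping from cluster number to business name. A sketch with hypothetical cluster assignments:

```python
# Hypothetical mapping from cluster number to a business-friendly name,
# based on the visual interpretation above
segment_names = {1: "small car", 2: "medium car", 0: "large car"}

cluster_labels = [0, 1, 2, 1]                       # toy cluster assignments
named_segments = [segment_names[c] for c in cluster_labels]
```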

It also helps to look at a few actual members of each cluster in order to better understand the results. Here we can see photos of a few cars in each cluster. Based on the photos, you will observe that Cluster 1 has cars with a hatchback body style, Cluster 2 generally has sedans, and Cluster 0 has many wagons.

In addition to the scatter plot, there are other ways to determine the meaning of a segment, which we will explore in other tactics.

So here you saw examples of two tactics for clustering or segmentation.

Example of tactics

Once you start mastering tactics rather than algorithms, you will be in better control of the data science process. Thinking of tactics before algorithms will help you develop a better data science approach. It will also help you become less algorithm-obsessed and develop a broader view. You will develop a clear thinking process about data science rather than getting bogged down by an infinite number of algorithms.

Additional resources

Website

You can visit my website to do analytics with zero coding: https://experiencedatascience.com

Please subscribe to stay informed whenever I release a new story.


You can also join Medium with my referral link.


YouTube channel: https://www.youtube.com/c/DataScienceDemonstrated

