
Data Science is an interdisciplinary field that touches many different domains. However, learning data science has mainly two sides. One is theoretical knowledge and the other one is software tools.
Without proper tools, we won’t be able to put our ideas into action. They allow us to extract insights from data. As the size of data increases, the complexity and variety of tools increase as well.
Unfortunately, there is not a single tool that meets all our needs. In order to stay competitive and perform efficient operations, we should adopt new tools. In this article, I will explain my strategy to learn new tools.
Let’s start with two highly popular tools in the data science ecosystem: Pandas and SQL. If you plan to work as a data scientist and a data analyst, you should master them.
I have a dataset that includes several features about some houses in Melbourne along with their prices. It is from the Melbourne housing dataset on Kaggle.

Consider a case where we need to calculate the average price for different types of houses.
Here is how we do it with Pandas:
melb.groupby("Type").agg(Avg_price = ("Price","mean")).round()

We can do the same operation with SQL as follows:
SELECT type, ROUND(AVG(price)) AS Avg_price FROM melb GROUP BY type;
type | avg_price
------+-----------
t | 933735
h | 1242665
u | 605127
This is a very simple example that demonstrates how I practice to learn new tools. I try to compare them with the ones I already know. The comparison involves doing several examples.
There are two main advantages of this way of learning:
- Since I do a comparison based on a tool I already know, I start off with learning the operations I frequently do. It saves me a lot of time. It is much more efficient than learning all at once.
- It gives me the chance to make a high level comparison. I obtain a comprehensive understanding of how the software tools are designed and how certain operations are laid out.
The comparison also makes it easier to get familiar with the syntax. Although different software tools are likely to have different syntax, the structure of the syntax for common operations follow a similar pattern.
Let’s do a slightly more complicated example. This time we will add another tool which is the data table package of R.
We will calculate both the average price and distance for different types of houses. We will also add a filtering operation to only include houses that have more than 3 rooms.
Pandas:
melb[melb.Rooms > 3].groupby("Type").agg(
Avg_price = ("Price","mean"),
Avg_distance = ("Distance","mean")
).round()

SQL:
SELECT
type,
ROUND(AVG(price)) AS Avg_price,
ROUND(AVG(distance)) AS Avg_distance
FROM melb
WHERE rooms > 3
GROUP BY type;
type | avg_price | avg_distance
------+-----------+--------------
t | 1211218 | 11
h | 1548914 | 12
u | 1067696 | 10
Data table:
melb[Rooms > 3,
.(Avg_price = mean(Price), Avg_distance = round(mean(Distance))),
by=Type]
Type Avg_price Avg_distance
1: h 1548914 12
2: t 1211218 11
3: u 1067696 10
They all provide efficient ways to do the same task. How they execute certain parts like filtering, grouping, and aggregating might be different. There exist slight differences between the syntax as well.
You may argue that we don’t need to learn all these tools if they can perform the same operations. The choice may not always be up to you. If your company does development with the R Programming language, then you will probably need to use the data table package instead of Pandas.
In some cases, a particular tool provides certain advantages over others. For instance, if a company stores the data in relational databases, performing data analysis and manipulation with SQL is the most viable solution.
There is a rich selection of tools in the data science ecosystem. When you work with large-scale data, traditional tools like Pandas and SQL may not be the most optimal choice. In such cases, you may have to use tools that allow for distributed computing such as Spark.
Conclusion
Data science has gained tremendous popularity in recent years. One of the reasons that contribute to this popularity is the advancement in software tools. Thanks to these tools, we are able to handle data very easily and quickly.
As a result of this rich selection of tools, data scientists face the challenge to learn new tools in no time. I think learning by comparison is a great method to adopt new tools.
Thank you for reading. Please let me know if you have any feedback.