
My Best Method to Learn New Tools in Data Science

And how I practice it.

Photo by Element5 Digital on Unsplash

Data science has experienced tremendous growth in recent years. Advancements in data collection, storage, and processing have contributed to this growth.

The potential to create value using data has attracted many industries. More and more businesses have adopted data-centric strategies and processes in their operations.

The ever-growing demand has also motivated developers and the open-source community to create new tools for data science. As a result, people who work in the field of data science have many libraries, frameworks, and tools to do their work.

Some of these tools are designed to perform the same tasks just in a different programming language. Some are more efficient than others. Some focus on a particular task. The undeniable truth is we have many tools to choose from.

You may argue that it is better to stick to one tool for a particular task. I, however, prefer to have at least a couple of options. I also like to compare different tools with each other.

In this article, I will try to explain how I learn new tools. My strategy is based on comparison. I focus on how a given task can be accomplished with different tools.

The comparison allows me to see the differences as well as the similarities between them. Furthermore, it helps me build an intuition about how the creators of such tools approach particular problems.

Let’s say I’m comfortable with the Pandas library in Python and want to learn the dplyr library in R. I try to perform the same tasks with both libraries.

Suppose we have the following dataset about a marketing campaign.

marketing (image by author)

I would like to create a new column that contains the ratio of the amount spent to the salary. Here is how it can be done using both Pandas and dplyr.

#pandas
marketing['spent_ratio'] = marketing['AmountSpent'] / marketing['Salary']
#dplyr
mutate(marketing, spent_ratio = AmountSpent / Salary)
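
As a side note on the comparison, Pandas also has a method that behaves more like mutate: assign returns a new DataFrame instead of modifying the existing one. The sketch below uses a few hypothetical rows (only the column names AmountSpent and Salary come from the dataset above), so treat it as an illustration rather than the actual data.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names come from the article.
marketing = pd.DataFrame({
    "AmountSpent": [500.0, 1200.0, 300.0],
    "Salary": [50000.0, 60000.0, 30000.0],
})

# assign() returns a new DataFrame and leaves `marketing` unchanged,
# which is close to how dplyr's mutate() behaves.
with_ratio = marketing.assign(
    spent_ratio=lambda df: df["AmountSpent"] / df["Salary"]
)

print(with_ratio)
```

Seeing that both libraries offer a "return a modified copy" style makes it easier to carry habits from one to the other.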

Let’s do another example that compares Pandas and SQL. Suppose we have a dataset that contains groceries and their prices.

(image by author)

We want to calculate the average item price for each store. This task can be accomplished with both Pandas and SQL as follows.

#Pandas
items[['store_id','price']].groupby('store_id').mean() 

             price                 
store_id  
-------------------                            
   1       1.833333                 
   2       3.820000                 
   3       3.650000
#SQL
mysql> select store_id, avg(price) 
    -> from items
    -> group by store_id;
+----------+------------+
| store_id | avg(price) |
+----------+------------+
|        1 |   1.833333 |
|        2 |   3.820000 |
|        3 |   3.650000 |
+----------+------------+
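
One way to practice this kind of comparison without leaving Python is to run both versions against the same rows and check that they agree. The sketch below is only an illustration under a few assumptions: it uses an in-memory SQLite database instead of MySQL, and the rows are hypothetical because the full items table is not shown here.

```python
import sqlite3

import pandas as pd

# Hypothetical rows; the article's actual items table is not reproduced here.
items = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "price": [1.5, 2.5, 3.0, 5.0, 4.0],
})

# Pandas version: group by store and average the price.
pandas_avg = items.groupby("store_id")["price"].mean()

# SQL version: load the same rows into SQLite (standing in for MySQL)
# and run the equivalent GROUP BY query.
conn = sqlite3.connect(":memory:")
items.to_sql("items", conn, index=False)
sql_avg = pd.read_sql_query(
    "SELECT store_id, AVG(price) AS price "
    "FROM items GROUP BY store_id ORDER BY store_id",
    conn,
    index_col="store_id",
)["price"]
conn.close()

# Both results are indexed by store_id, so they can be compared directly.
same = bool(((pandas_avg - sql_avg).abs() < 1e-9).all())
print(same)
```

If the two results ever disagree, that usually points to a detail worth learning, such as how each tool treats missing values.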

This method makes it easier for me to learn the syntax and grasp the concepts. It also helps me practice what I already know while learning a new tool.

One of the challenges with learning software libraries and frameworks is not memorizing the syntax but knowing when to apply which method or function. By comparing a new tool with what I already know, selecting the functions that fit a given task becomes easier for me.

I use this method to learn pretty much any tool. Let’s also make a comparison between two different data visualization libraries for Python. I already know Seaborn and would like to learn Altair.

Recall the marketing dataset from the beginning of the article. As an example, we will create a scatter plot of the amount spent against salary, with points colored by gender.

Here is how it is done with Seaborn.

import seaborn as sns
sns.relplot(
    data=marketing, x='AmountSpent', y='Salary', 
    kind='scatter', hue='Gender', aspect=1.5
)
(image by author)

Here is how it is done with Altair.

import altair as alt
alt.Chart(marketing).mark_circle().encode(
     x='AmountSpent', y='Salary', color='Gender'
  ).properties(height=300, width=450)
(image by author)

If you keep learning in this way, you will notice that the tools are more similar than you might expect. After a while, learning a new one starts to become fun instead of a challenge.

I have also done comparisons between SQL and NoSQL. SQL (Structured Query Language) is used by most relational database management systems to manage databases that store data in tabular form. NoSQL refers to non-relational database design. It still provides an organized way of storing data, but not in tabular form.

I have a simple table in MySQL and a collection in MongoDB with the same data about cars and their prices. For each brand, I want to calculate the average price of cars made in 2019.

Here is how we can perform this calculation with SQL.

mysql> select make, avg(price)
    -> from car
    -> where year = "2019"
    -> group by make;
+---------+------------+
| make    | avg(price) |
+---------+------------+
| BMW     | 53000.0000 |
| ford    | 42000.0000 |
| hyundai | 41000.0000 |
+---------+------------+

The NoSQL version with MongoDB is as follows.

> db.car.aggregate([
... { $match: { year: "2019" }},
... { $group: { _id: "$make", avg_price: { $avg: "$price" }}}
... ])
{ "_id" : "BMW", "avg_price" : 53000 }
{ "_id" : "ford", "avg_price" : 42000 }
{ "_id" : "hyundai", "avg_price" : 41000 }
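
To see why the two queries line up, it can help to write the same two steps in plain Python. The sketch below is not how MySQL or MongoDB work internally. It only mirrors the logic of filter-then-group, and the documents in it are hypothetical (the field names follow the queries above).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; the field names follow the queries in the article.
cars = [
    {"make": "BMW", "year": "2019", "price": 50000},
    {"make": "BMW", "year": "2019", "price": 56000},
    {"make": "ford", "year": "2019", "price": 42000},
    {"make": "ford", "year": "2018", "price": 30000},
]

# Step 1: filter, like WHERE in SQL or $match in MongoDB.
matched = [car for car in cars if car["year"] == "2019"]

# Step 2: group and average, like GROUP BY with avg() or $group with $avg.
prices_by_make = defaultdict(list)
for car in matched:
    prices_by_make[car["make"]].append(car["price"])

avg_price = {make: mean(prices) for make, prices in prices_by_make.items()}
print(avg_price)
```

Both the SQL statement and the MongoDB pipeline express these same two steps, just with different syntax.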

Conclusion

My best method to learn a new tool is comparison. I challenge myself to use the new tool for tasks I can already do easily with a tool I know. This method has been quite effective for me.

After a while, it becomes fun to try out new libraries and frameworks. I steadily build my selection of tools for different tasks. However, having a preferred set does not prevent me from trying new ones.

I strongly recommend trying this method for at least one new tool. I think it will make your learning journey easier and more fun.

Thank you for reading. Please let me know if you have any feedback.

