Well, saying that I don’t use libraries for my work as a data scientist would be like saying I own a bike but travel only on foot. It’s theoretically possible but not practical or efficient in most cases. Just as with modes of transportation, we have multiple libraries in Python that make our job easier and faster. Knowing these libraries will save you the energy, time, and mental effort of implementing every algorithm from scratch.
With data science booming, there are often multiple libraries for implementing the same algorithm, and it’s completely okay to feel overwhelmed by all these options. In this article, I will list some of the essential libraries and their use cases, to help you find the appropriate library for your own use case.
NumPy
Well, I am stating the obvious here. NumPy is one of the most essential and frequently used libraries in our community. It is the go-to library for working with arrays, structured data, scientific computation, statistical analysis (e.g., mean, median, linear algebra), and more, and it is very efficient in terms of computation speed.
You can learn more about NumPy here.
Try this out for hands-on experience:
- Generate two arrays of size (10, 10) named array1 and array2, and fill them with random integers.
- Find out the positions in array1 where the value is greater than 10 and make them 0
- Find out the positions in array2 where the value is less than 5 and make them 10
- Multiply these two arrays element-wise
- Save the resultant array in a ".npy" file
Try to implement these in as few lines as possible to get a flavor of NumPy’s efficiency; one possible solution is sketched below.
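Here is a minimal sketch of one way to do it (the integer range 0–20 is an arbitrary choice for illustration):

```python
import numpy as np

# Two 10x10 arrays of random integers (range chosen arbitrarily)
array1 = np.random.randint(0, 20, size=(10, 10))
array2 = np.random.randint(0, 20, size=(10, 10))

# Boolean masks handle the conditional updates in one line each
array1[array1 > 10] = 0
array2[array2 < 5] = 10

# Element-wise multiplication, then save the result to a .npy file
result = array1 * array2
np.save("result.npy", result)
```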
Pandas
Pandas is mainly used to read, analyze, manipulate, and save tabular data. It essentially provides two data structures (a short example follows the list).
- Series – A pandas Series is a one-dimensional labeled data structure; each value is paired with an index label, much like the key-value pairs of a Python dictionary.
- DataFrame – A pandas DataFrame is a two-dimensional, tabular data structure. You can think of it as a collection of Series that share a common index.
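A minimal sketch of both structures (the names and values are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional, with an index acting like dictionary keys
ages = pd.Series({"alice": 34, "bob": 29, "carol": 41})

# A DataFrame: two-dimensional, effectively several Series sharing an index
people = pd.DataFrame(
    {"age": [34, 29, 41], "city": ["Delhi", "Mumbai", "Pune"]},
    index=["alice", "bob", "carol"],
)

print(ages["bob"])          # 29
print(people.loc["alice"])  # the full row for alice
```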
Whenever I am given data in a CSV file, the first thing that comes to my mind is to analyze it using pandas. It’s an essential library for data analytics, manipulation, filling in missing values, and more.
You can learn more about pandas here.
Try this out for hands-on experience (a possible solution is sketched after the list):
- Download the CSV file from here and use it for all experiments.
- Read the CSV file and drop duplicate rows
- For a column with integer/float values, get statistical parameters like mean, median, variance, etc.
- Write the dataframe back to a CSV file
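A minimal sketch of these steps, assuming the downloaded file is saved as data.csv and has a numeric column named price (both names are placeholders):

```python
import pandas as pd

# Read the CSV and drop duplicate rows
df = pd.read_csv("data.csv")  # placeholder filename
df = df.drop_duplicates()

# Statistical parameters for a numeric column ("price" is a placeholder)
print(df["price"].mean())
print(df["price"].median())
print(df["price"].var())
print(df.describe())  # summary statistics for all numeric columns at once

# Write the dataframe back to a CSV file
df.to_csv("cleaned.csv", index=False)
```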
Matplotlib
Matplotlib is the go-to library for data visualization. From visualizing data for our own understanding to making beautiful visualizations for a presentation, matplotlib does it all. I think you can understand how essential matplotlib is in a data scientist’s life.
You can learn more about matplotlib here.
Try these out for hands-on experience (a possible solution is sketched after the list):
- Draw a standard line graph from two series of data.
- Add a title to the graph and labels for the x-axis and y-axis
- Add a legend to the graph
- Save the graph to a file
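A minimal sketch with made-up data (the series names and values are placeholders):

```python
import matplotlib.pyplot as plt

# Two made-up series for illustration
weeks = [1, 2, 3, 4, 5]
sales = [10, 12, 9, 15, 18]
costs = [8, 9, 7, 11, 12]

plt.plot(weeks, sales, label="sales")
plt.plot(weeks, costs, label="costs")

# Title and axis labels
plt.title("Sales vs. Costs")
plt.xlabel("Week")
plt.ylabel("Amount")

# Legend, then save the figure to a file
plt.legend()
plt.savefig("graph.png")
```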
Scikit-Learn
Scikit-learn is a valuable library for implementing various traditional machine learning algorithms, both supervised and unsupervised. You can use its implementations of decision trees, random forests, KNN, k-means, and more. The library also provides various data pre-processing and post-processing utilities, such as normalization and converting labels to one-hot encoding. You will find this library in many courses and books because of its wide range of implementations. So whenever you have to implement a standard machine learning algorithm, first check whether scikit-learn already provides it.
You can learn more about scikit-learn here.
Try these out for hands-on experience:
- Download the housing data from here and read it using pandas
- Fill in the missing values using pandas and visualize the data using matplotlib
- Normalize the data using sklearn’s scaler module and then fit a linear regression model using sklearn.
The above exercise shows how multiple libraries come together for a basic machine learning problem; a sketch of one possible solution follows.
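This sketch assumes the file is saved as housing.csv, with feature columns area and rooms and a target column price; all of these names are placeholders to adapt to the actual dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder filename and column names; adapt them to the actual housing dataset
df = pd.read_csv("housing.csv")

# Fill missing numeric values with each column's median (one common choice)
df = df.fillna(df.median(numeric_only=True))

# Quick visualization of the assumed target column
df["price"].hist()
plt.savefig("price_hist.png")

# Assumed feature and target columns
X = df[["area", "rooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features, then fit a linear regression model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("R^2 on the test split:", model.score(X_test_scaled, y_test))
```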
OpenCV
If you are working with images or video, it’s almost impossible to find a library that matches OpenCV’s range of functionality. It provides a wide range of traditional image processing algorithms like Canny edge detection, SIFT, SURF, the Hough transform, etc. It also provides implementations of deep learning based algorithms for image classification, object detection, segmentation, text detection in images, and more. I work a lot with images and video, and I can tell you that I end up using OpenCV at least once in almost every task.
You can learn more about OpenCV here.
Try this out for hands-on experience (a possible solution is sketched after the list):
- Download any image, and then read the image using OpenCV
- Convert the image to grayscale
- Use the bilateral filter to reduce noise in the image, then apply erosion and dilation filters.
- Find edges in the image using the Canny edge detector
- Save the new image.
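A minimal sketch of these steps (input.jpg and the filter parameters are placeholder choices):

```python
import cv2
import numpy as np

# Read the image (placeholder filename) and convert it to grayscale
img = cv2.imread("input.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Bilateral filter: reduces noise while keeping edges sharp
# (diameter 9 and color/space sigmas of 75 are common starting values)
denoised = cv2.bilateralFilter(gray, 9, 75, 75)

# Erosion followed by dilation with a small square kernel
kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(denoised, kernel, iterations=1)
dilated = cv2.dilate(eroded, kernel, iterations=1)

# Canny edge detection; the two thresholds are tunable
edges = cv2.Canny(dilated, 50, 150)

# Save the new image
cv2.imwrite("edges.jpg", edges)
```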
NLTK
Just as OpenCV is the essential library for images and video, NLTK is the essential library for text. You can use NLTK for tasks like tokenization, stemming, lemmatization, generating embeddings, visualization, etc. Some essential deep learning algorithms are also implemented in the NLTK library.
You can learn more about NLTK here.
Try this out for hands-on experience (a possible solution is sketched after the list):
- Use this corpus. Convert the whole corpus to lowercase
- Tokenize the corpus into words and then stem the resulting tokens.
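A minimal sketch, using a placeholder string in place of the actual corpus:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt")  # tokenizer models; only needed once

# Placeholder text standing in for the downloaded corpus
corpus = "Data scientists are analyzing texts with NLTK's tokenizers and stemmers."

# Lowercase, word-tokenize, then stem each token
corpus = corpus.lower()
tokens = word_tokenize(corpus)
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```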
PyTorch
Although I am writing about it last, I can’t thank the developers of PyTorch enough. It’s my go-to library for implementing any custom neural network or deep learning based method. Whether it’s audio, text, image, or tabular data, you can use PyTorch to write a neural network and train a model on the given data. PyTorch also plays an essential role in deploying deep learning based methods on GPUs, reducing inference time through parallelization.
You can learn more about PyTorch here.
Try this out for hands-on experience (a possible solution is sketched after the list):
- Download MNIST dataset
- Write a data loader for the data as well as an optimizer
- Write a custom fully connected neural network
- Train the neural network for the MNIST dataset
- Evaluate the model on the validation dataset
- Save the model.
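A minimal sketch, assuming torchvision is available for the MNIST download (the network size, learning rate, and epoch count are arbitrary choices):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Download MNIST and wrap it in data loaders
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
val_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)

# A small custom fully connected network
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Train for a couple of epochs, evaluating after each one
for epoch in range(2):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
    print(f"epoch {epoch}: validation accuracy {correct / len(val_set):.4f}")

# Save the model weights
torch.save(model.state_dict(), "mnist_mlp.pt")
```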
Conclusion
I have included all the libraries I use in my data science routine. These are by no means the only options; each library mentioned here has alternatives available. But in general, I have found the range of functionality provided by these seven libraries to be the best among their competitors. If you feel more libraries should be included in this list, do let me know in the comments.