Data Science is one of the most exciting fields at the moment and the demand for specialists is growing. There are many Data Science courses available online. The difficulty with learning Data Science is that it takes a lot of practice to become comfortable with real-life Data Science projects.
Over the last few months, I have been learning Data Science and exploring this area myself. It is worth mentioning that I’m not a Data Scientist (my main area is Web Development), but I love all things programming and I wanted to try it out and find out a little bit more about various Data Science techniques and algorithms.
I wanted to find out if Data Science is really as exciting and powerful as people say. I will share what I think about it in another article, but for now I want to share six projects that I worked on. If you are new to Data Science or just want to explore this area a little bit more, these projects will be an excellent way to do that.
In these six projects, you will find the most popular problems you may face when working on Data Science projects: from data cleaning, normalisation and standardisation, dimensionality reduction and feature engineering to regression, Computer Vision, Natural Language Processing (NLP) and Neural Networks, using popular Python libraries like Pandas, NumPy, scikit-learn, TensorFlow, Keras, TextBlob, etc. After completing all projects from this list you will have hands-on experience of popular Data Science techniques and algorithms.
If you have never worked on Data Science projects, there are also a couple of introductory articles that will help you set up your computer with everything necessary to work on these projects and show you how to work with Git and GitHub so you can store your projects there.
What you need to know
A basic understanding of Python would be useful. I’m not covering the Python programming language in any of these articles. If you don’t know Python, I recommend familiarising yourself with at least the basics before you start working on these projects. You can do this, for example, by completing the course Programming for Everybody (Getting Started with Python) on Coursera.
Once you know the basics of Python you are ready to start working on these projects. Prior Data Science knowledge is helpful but not necessary. All projects contain an explanation of the algorithms, concepts and Python Data Science libraries that they use. I will explain the code and every step of the project so you can understand what you have to do in each project and why.
Although Jupyter Notebooks for all projects are available on GitHub and you are welcome to work with those, I recommend that you write the code yourself rather than copy and paste from the notebooks. This way you will learn much more and retain more information.
Introduction and setup
(1) How to set up your computer for Data Science
Before you can start working on Data Science projects, there are a few things you need to set up on your computer. Luckily, there are free and open-source tools that make this process very simple.
(2) Introduction to Git and GitHub
In this tutorial, I will explain the essential steps that will enable you to create your GitHub repository, add and commit your local files to Git and push them to an online repository on GitHub.
Basic Data Science Projects
(1) Analysing pharmaceutical sales data
In the first part of this project, you will learn how to load a data set from a file into Pandas (a Python data manipulation and analysis library), how to perform statistical analysis and how to find specific information in Pandas Data Frames.
In the second part of this project, you will learn how to use regression (a technique for finding a relationship between independent and dependent variables) to predict future sales based on historical sales data. You will use three different regression algorithms: Linear Regression, Polynomial Regression and Support Vector Regression (SVR). Before you train your regression models, you will scale the data and split it into training and testing data. Both are very common and important Data Science techniques. Scaling can improve model performance, and splitting the data lets you train the model on one set of data and then calculate its score on another set that it has not seen. Because you are using different regression models you can also use VotingRegressor for better results. VotingRegressor is an ensemble method that fits several regressors and averages the individual predictions to form a final prediction.
You will use the popular Matplotlib library to visualise the data and the regression predictions. Practising Pandas data manipulation and Matplotlib visualisation is important because both are used in many Data Science projects to manipulate data and visualise results. Regression is also a very common and useful technique in many Data Science projects.
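To make the scale, split, fit, score and vote steps a little more concrete, here is a minimal sketch in plain Python. It is not the code from the project, which uses scikit-learn (train_test_split, StandardScaler, LinearRegression and VotingRegressor). The sales figures are made up, and the helper names (standardise, fit_line, r2_score, average_predictions) are my own assumptions for illustration.

```python
import random
from statistics import mean, pstdev


def train_test_split(xs, ys, test_ratio=0.25, seed=0):
    """Shuffle the paired samples and split them into training and testing parts."""
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_ratio))
    return pairs[n_test:], pairs[:n_test]


def standardise(values, mu, sigma):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mu) / sigma for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1.0 means a perfect fit."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def average_predictions(*prediction_lists):
    """Voting for regression: average the predictions of several models."""
    return [mean(group) for group in zip(*prediction_lists)]


# Made-up monthly sales (month number -> units sold), for illustration only.
months = list(range(1, 13))
sales = [3 * m + 10 for m in months]

train, test = train_test_split(months, sales)
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Fit the scaler on the training data only, then reuse it for the test data.
mu, sigma = mean(train_x), pstdev(train_x)
slope, intercept = fit_line(standardise(train_x, mu, sigma), train_y)

predictions = [slope * x + intercept for x in standardise(test_x, mu, sigma)]
print("R2 on unseen data:", r2_score(test_y, predictions))

# Voting: average the line's predictions with a naive "always the mean" model.
baseline = [mean(train_y)] * len(test_y)
print(average_predictions(predictions, baseline))
```

The detail worth noticing is that the scaling statistics come from the training data only. Computing them on the full data set would leak information about the test set into training.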
Here are the project resources:
Medium article: https://towardsdatascience.com/analysing-pharmaceutical-sales-data-in-python-6ce74da818ab
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/1-Analysing-Pharmaceutical-Sales-Data
(2) Predicting Titanic Survivors Using Data Science and Machine Learning
In this project, you will use a data set of Titanic passengers to build a model that predicts who survived and who died in the Titanic disaster based on the passengers’ features like sex, age, passenger class, etc. After loading the data from the file into a Pandas Data Frame you will perform exploratory data analysis. Exploratory data analysis helps us understand what features we have in our data set, how they are distributed and whether there are any missing values. Having a better understanding of the data will help us with data preprocessing and feature engineering. During the preprocessing phase, you will clean the data and fill any missing values. You will also extract some new features from existing ones (by using data binning among other techniques) and remove features that you don’t need because they have no impact on the performance of the model. To train the model you will use two classifier models: KNeighborsClassifier and DecisionTreeClassifier. You will then compare the performance of these models. You will also learn the K-Fold Cross-Validation technique while working on this project. This technique makes better use of the data, reduces bias and gives us a better understanding of the performance of the model.
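Two of the ideas above, data binning and K-Fold Cross-Validation, are easy to sketch in plain Python. This is only an illustration and not the project code: the age bin edges, the bin_age and k_fold_indices helper names and the sample ages are all assumptions of mine.

```python
def bin_age(age):
    """Map a numeric age to a coarse category (the bin edges are arbitrary)."""
    if age < 13:
        return "child"
    if age < 20:
        return "teenager"
    if age < 60:
        return "adult"
    return "senior"


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Made-up ages, for illustration only.
ages = [4, 15, 22, 35, 47, 61, 70, 18, 29, 8]
print([bin_age(a) for a in ages])

# Every sample is used for validation exactly once across the folds.
for train_idx, valid_idx in k_fold_indices(len(ages), k=5):
    print("validate on", valid_idx, "train on", train_idx)
```

In a real run you would train the model on each training split, score it on the matching validation split and average the scores over the folds.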
Here are the project resources:
Medium article: https://towardsdatascience.com/data-science-titanic-challenge-solution-dd9437683dcf
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/5-Titanic-Challenge
(3) Introduction to Computer Vision with MNIST
In this project, you will start working with two very important Data Science concepts: Computer Vision and Neural Networks. MNIST is a database of images of handwritten digits. You will build and train a Neural Network to recognise images of handwritten digits. You will use Keras, which is a Python library built specifically for Neural Networks. You will look at different types of Neural Network layers and activation functions, as well as other functionality and configuration of the Neural Network. You will also learn how to save your trained model to a file and load it back. You will also use the Keras function to_categorical, which converts integer class labels to a binary class matrix, a label format that works better for training the Neural Network. In this exercise, you will learn how to create, train and use a simple and effective Neural Network with Keras and evaluate its performance.
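If the idea of a binary class matrix sounds abstract, the following plain-Python sketch shows what this kind of conversion (often called one-hot encoding) produces. It is an illustration only: to_one_hot and from_one_hot are hypothetical helper names, and the Keras function returns an array of floats rather than lists of integers.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    matrix = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        matrix.append(row)
    return matrix


def from_one_hot(matrix):
    """Recover the integer label of each row (the position of the largest value)."""
    return [row.index(max(row)) for row in matrix]


digits = [3, 0, 9]
encoded = to_one_hot(digits, num_classes=10)
for row in encoded:
    print(row)

print(from_one_hot(encoded))
```

The second helper mirrors how you read a prediction: the network outputs ten values for each image and the position of the largest one is the predicted digit.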
Here are the project resources:
Medium article: https://medium.com/swlh/introduction-to-computer-vision-with-mnist-2d31c6f4d9a6
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/3-Introduction-to-Computer-Vision-with-MNIST
(4) Recognizing Cats and Dogs Using Neural Networks With Tensorflow
In this project, you will continue working with Computer Vision and Neural Networks and you will build a slightly more complicated network using Keras and TensorFlow. Your task in this project is to build, train and test a Neural Network that recognises and categorises pictures of cats and dogs. You will have three sets of images of cats and dogs: train, valid and test. You will use train to train the Neural Network model, valid to validate the model during training and test to test the trained model. Each set contains different images of cats and dogs.
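If you ever need to create the three sets yourself, the split can be sketched in a few lines of plain Python. This is my own illustration and not part of the project: the file names, the ratios and the split_three_ways helper are assumptions.

```python
import random


def split_three_ways(items, valid_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle the items and cut them into disjoint train, valid and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    n_test = int(len(shuffled) * test_ratio)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Hypothetical image file names, for illustration only.
images = [f"cat.{i}.jpg" for i in range(50)] + [f"dog.{i}.jpg" for i in range(50)]

train, valid, test = split_three_ways(images)
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters here because the list starts with all the cats followed by all the dogs. Without the shuffle, the small sets would contain only one class.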
Here are the project resources:
Medium article: https://medium.com/swlh/recognising-cats-and-dogs-using-neural-networks-with-tensorflow-6f366ad30dbf
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/9-Cats-and-Dogs
(5) Image Face Recognition in Python
In this project, you will approach a different but also quite common and interesting Computer Vision problem: face recognition. You will use the face_recognition Python library for face recognition and the Python Imaging Library (PIL) for image manipulation. You will not only recognise known faces in the test image but also mark and label those faces on the image with PIL.
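To give a feel for how a known face can be matched, here is a rough plain-Python sketch of the comparison step. As far as I understand it, the face_recognition library turns each face into a vector of numbers (an encoding) and treats two faces as a match when the Euclidean distance between their encodings is within a tolerance of 0.6 by default. The short three-number vectors, the names and the best_match helper below are made up for illustration. Real encodings are much longer.

```python
import math


def face_distances(known_encodings, candidate):
    """Euclidean distance from the candidate encoding to every known encoding."""
    return [math.dist(known, candidate) for known in known_encodings]


def best_match(known_names, known_encodings, candidate, tolerance=0.6):
    """Return the name of the closest known face, or 'Unknown' if none is close enough."""
    distances = face_distances(known_encodings, candidate)
    best = min(range(len(distances)), key=distances.__getitem__)
    return known_names[best] if distances[best] <= tolerance else "Unknown"


# Hypothetical, heavily shortened encodings, for illustration only.
names = ["Alice", "Bob"]
encodings = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]

print(best_match(names, encodings, [0.15, 0.2, 0.25]))
print(best_match(names, encodings, [5.0, 5.0, 5.0]))
```

A lower tolerance makes the matching stricter, which reduces false matches at the cost of missing some genuine ones.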
Here are the project resources:
Medium article: https://medium.com/an-idea/image-face-recognition-in-python-30b6b815f105
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/4-Face-Recognition
(6) Twitter Sentiment Analysis in Python
In this project, you will look at another important concept of Data Science: Natural Language Processing (NLP). Using the Python NLP library TextBlob, you will perform sentiment analysis of a number of recent tweets from a selected Twitter account. To access tweets you will first set up a Twitter Developer API account. Then you will create a virtual environment and install the libraries required for the project. You will use the Tweepy Python library to authenticate with the Twitter Developer API and download tweets. You will then clean the tweets and perform some basic NLP. You will calculate subjectivity and polarity for each tweet and label each record as positive or negative. You will then calculate the percentage of positive tweets for the account and visualise the classes of tweets on a graph. At the end, you will generate a word cloud to see the themes and most common words used in the tweets you analysed.
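The cleaning and labelling steps can be sketched with the standard library alone. This is an illustration rather than the project code: the cleaning rules, the helper names and the polarity scores are assumptions of mine. In the project the scores come from TextBlob, whose polarity ranges from -1 (negative) to 1 (positive). I have also added a neutral label for a polarity of exactly zero.

```python
import re


def clean_tweet(text):
    """Strip retweet markers, links, mentions and '#' signs from a tweet."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)  # leading "RT @user:" marker
    text = re.sub(r"https?://\S+", "", text)    # links
    text = re.sub(r"@\w+", "", text)            # mentions
    text = text.replace("#", "")                # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()


def label_polarity(polarity):
    """Turn a polarity score into a class label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def percent_positive(polarities):
    """Share of scores labelled positive, as a percentage."""
    labels = [label_polarity(p) for p in polarities]
    return 100 * labels.count("positive") / len(labels)


print(clean_tweet("RT @user: Loving #Python! https://t.co/abc"))

# Hypothetical polarity scores, for illustration only.
scores = [0.5, -0.2, 0.0, 0.8]
print([label_polarity(s) for s in scores])
print(percent_positive(scores))
```

Cleaning before scoring keeps links and user names from adding noise to the sentiment analysis and to the word cloud.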
Here are the project resources:
Medium article: https://towardsdatascience.com/twitter-sentiment-analysis-in-python-1bafebe0b566
Project on GitHub: https://github.com/pjonline/Basic-Data-Science-Projects/tree/master/8-Twitter-Sentiment-Analysis
To summarise, you will learn and practise the following Data Science techniques, algorithms and concepts:
- Pandas
- Matplotlib
- Python Imaging Library (PIL)
- Data Preprocessing
- Feature Engineering
- Feature Scaling
- Train/Test data split
- Data binning
- Statistics
- Voting Regressor
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Support Vector Regression
- K-nearest neighbours classifier (KNN)
- Decision Tree classifier
- Neural Networks with Keras
- Face Recognition (Computer Vision)
- PCA (Principal component analysis)
- K-Fold Cross-Validation
- Performance validation using accuracy_score metric
- Tweepy
- WordCloud
- TextBlob (NLP)
- Stemming (NLP)
- Tokenization (CountVectorizer)
I hope this list of basic Data Science projects is useful and helps you learn more and practise your Data Science skills.
Happy coding!