Everyone’s running sprints, try a marathon

1. Introduction
Launched in late 2020, the IBM Data Science course is one of the few data science courses released by IBM to help graduates and working professionals seeking to pivot and break into data science. Other courses from IBM include the IBM Data Analyst Professional course and IBM Data Engineering Professional course.
Each course comprises several modules, and each module awards a digital badge and certificate upon completion. Learners can complete individual modules or all of them to earn the full course certification.
2. Course Overview
The Data Science course comprises ten modules, including:
- What is Data Science
- Tools for Data Science
- Data Science Methodology
- Python for Data Science & AI
- Python Project for Data Science
- Databases & SQL for Data Science
- Data Analysis with Python
- Machine Learning with Python
- Capstone Project
The first four modules are relatively light in content and can easily be completed within a week if one sets aside eight hours a day for learning. In the next four modules, learners are introduced to Python, SQL, and IBM’s cloud services (for data storage, execution of notebooks, and cloud computing). The course culminates in a capstone project where learners apply the data science skills covered across the modules: problem definition, data collection, data cleaning, exploratory data analysis, feature engineering, and model training and evaluation.
3. The good
For the uninitiated, the first four modules provide a good and rather concise introduction to data science. Teaching assistants are quite responsive and prompt in providing clarifications to learners.
SQL is taught in the IBM Cloud environment, so learners need not download or install additional software. The SQL module is shared with the Data Engineering course, and there is optional content (under the "Honours" certificate) covering more complex SQL operations.
For Python data analysis, in addition to matplotlib and seaborn, two other libraries are also covered: Plotly (for interactive plots) and Dash (for serving interactive analytic applications).
For the capstone project, learners are introduced to Folium (a data visualization library for geospatial data) and the Foursquare API. Notebook exercises are provided to guide learners on using these for the capstone.
4. Areas of Improvement
The course could be improved in the following areas:
Typos in notebooks. Several notebooks contain minor spelling errors, though none serious enough to affect readability or code execution.
IBM Cloud account set-up instructions. The set-up instructions could have stated that not all regions offer free cloud services. I initially selected "Japan" instead of the default "East Dallas" to host my services and found that some features incurred additional costs. I resolved this by creating another account and following the set-up steps as given.
IBM Cloud environment availability. Connectivity to the hands-on labs was at times hindered by unstable server availability. Fortunately, the lab exercises can be completed offline and then uploaded to the labs for completion; having Anaconda installed on a local machine to run Jupyter notebooks is very helpful here.
Coverage of Machine Learning content. The course provides a good introduction to machine learning with Python and to the various algorithms. It would be better if the following were also covered.
Data preparation for machine learning. Specifically, three areas could have been covered better:
- Splitting the data into train and test sets before scaling. This was mentioned in passing but not actually implemented in the notebook exercises. Doing the train-test split after scaling leads to data leakage, because the scaler's statistics (e.g. mean and standard deviation) are then computed from the full data set, test rows included. The consequence is that the model indirectly ‘sees’ the test set to a certain extent when it is trained on the train data.
- Stratifying on the dependent variable in the train-test split. Stratification ensures the training and test subsets have the same proportion of class labels.
- Techniques for dealing with class imbalance. Real-world data sets rarely have near-equal proportions of class labels. Additional modules on techniques to mitigate class imbalance, such as the Synthetic Minority Over-sampling Technique (SMOTE), would be a good addition to the curriculum.
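To make the first two points concrete, here is a minimal sketch of how I would order the steps with scikit-learn. It is not taken from the course notebooks; the synthetic data set from `make_classification`, the 80/20 split, and the choice of `StandardScaler` are my own assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (roughly 90% class 0, 10% class 1).
# Illustrative only; any labelled data set would do.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)

# 1. Split first. stratify=y keeps the class proportions
#    (almost) identical in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Fit the scaler on the training rows only, then reuse the
#    fitted scaler to transform the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("minority share, train:", y_train.mean())
print("minority share, test: ", y_test.mean())
```

The key detail is that `fit_transform` is only ever called on the training rows, so no statistics from the test rows reach the model. SMOTE is not shown here because it lives in the separate imbalanced-learn package rather than in scikit-learn.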
Dash assignment starter code. The indentation in the provided starter code made it more confusing than it should be, and the code could have been better refactored. I skipped ahead to the next module and revisited this assignment later, while working on another assignment in parallel.
Model hyperparameter tuning. Learners are introduced to the concept of hyperparameters for model building. There are select sections where learners are taught how to find the optimal hyperparameter (e.g. optimal k-value for k-means clustering). Coverage of scikit-learn’s GridSearchCV or RandomizedSearchCV for parameter search would certainly help build learners’ awareness and competency in model building.
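As a rough idea of what such coverage could look like, below is a minimal GridSearchCV sketch. The iris data set, the k-nearest neighbours classifier, the candidate values for `n_neighbors`, and the 5-fold setting are all my own choices for illustration, not something from the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every cross-validation fold, which avoids leakage.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Candidate values to try (assumed for this example).
# The "knn__" prefix routes the parameter to the pipeline step named "knn".
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:  ", search.best_params_)
print("cv accuracy:  ", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```

RandomizedSearchCV follows the same pattern, but takes `param_distributions` and an `n_iter` budget instead of an exhaustive grid, which tends to be more practical when there are many hyperparameters to search over.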
Deep Learning. For learners seeking to get their hands on deep learning, go for IBM’s AI Engineering Professional course.
5. Closing
Would I recommend this course? I enrolled with the intent of uncovering blind spots while doing my own revision, and also to add a certificate to my CV. The course fee was also on my mind, which is why I set an aggressive learning schedule for myself. The first seven days of enrollment are free, which gave me all the more incentive to make good progress. After I realized that I wasn’t able to complete the course within a week, I paced myself to complete it within a month.
For beginners and fellow data science enthusiasts seeking to add a certificate to their CV, this is a decent course to consider. For more advanced ML topics (e.g. deep learning), look for other specialized courses.
Courses and certifications are one means of developing proficiency. For building competencies for a career in data science, I would also recommend targeted side projects, joining data science communities, subscribing to data science journals, and keeping a curious mindset. I’m glad I took the course, as it helped me uncover blind spots within the allocated period. These have branched off into side projects for deeper exploration and study. I hope this review helps others who are deciding on a course, if a course is the route they have chosen.