Before you get into Data Science

A journey in learning data science from a non-IT, non-engineering background

Cakramurti Ocha
Towards Data Science

--

Photo by Chase Chappell on Unsplash

Hi there!

If you’re reading this, I am assuming that you are a student with a non-IT or non-engineering degree, or a professional outside IT/engineering, who wants to know more about data science, or maybe you just want to hang around. Either way, thank you for coming to my story.

A little background about me: my name is Ocha. I have an accounting undergraduate degree and am currently taking a Master’s degree in IT at The University of Melbourne. I started learning data science in 2018, so it has been more than two years now. In this post, I would like to share my journey and what I have learned, or wish I had known, before committing to the path of the 21st century’s sexiest job, according to the famous Harvard Business Review article¹.

What is Data Science?

In my own definition:

“Data Science is solving problems using the data available: aggregating it, running experiments, and eventually making predictions that lead to solutions.”

We have been hearing a lot lately that “data is the new oil”. The problem is that we are talking about terabytes or even petabytes of data. One of a data scientist’s responsibilities is to aggregate that data and find insights or patterns, which are then turned into something actionable.

Things get more complicated when those insights have to be communicated to less technical audiences, including executives. A good data scientist knows how to present the techniques behind the results in an easy-to-understand manner, for example, explaining why product P should be marketed to 18-year-olds without having to explain unsupervised clustering.

So you can see that a data scientist is a combination of someone who can use various tools, such as Scala, R, and Python, with knowledge of math and statistics to find all possible patterns, alongside business expertise. A high aptitude for learning is crucial, since gaining expertise in just one of these areas is hard enough, let alone all three.

Without further ado, here is my journey in pursuing those skills.

Phase 1: Getting to know

1. Learn the tools (programming language)

The first time I started to learn a programming language was in 2016. It was just to satisfy my curiosity back then, thinking how cool it would be if I could code while working as a consultant. My first online course was Coursera’s Python for Everybody by the University of Michigan, with Dr Chuck (Charles Severance) as the instructor. I really enjoyed his way of teaching, which is aimed at beginner audiences. But I dropped out of that course at some point (I think it was when I was learning about data structures) because it was a lot to take in at the time.

I resumed the course in January 2018, when I was working as a risk consultant at one of the Big 4 accounting firms and realised I needed Python to analyse data. This time, I finished the whole specialisation (barely). I struggled a lot when Dr Chuck explained BeautifulSoup and using SQL from Python, and that happened because of my lack of hands-on coding experience. Coding is about practice: the more you code, the better you get. So for fellow non-IT/engineers, I highly recommend starting with this course if you are interested in data science. Dr Chuck will not only teach you the basics of Python but also guide you through projects like using the Twitter API and the Google Maps API.

2. Bertelsmann Udacity Data Science Scholarship

This is where things started to get exciting. Back in May 2018, I joined the Bertelsmann and Udacity data science scholarship challenge. The challenge was to learn basic statistics, basic Python (again), and SQL while being active in the forum. This time, the SQL part used Mode Analytics embedded in Udacity’s course. Basic statistics and Python were really fun, since I had learned statistics during my accounting undergraduate degree, and for Python I just kept practising what I already had. Then came SQL. The first part, covering selecting data, aggregation, and joins, was easy enough to follow, until the subquery mania arrived. Subqueries made my head boil so much that an egg on top of it would have been cooked. Nevertheless, I managed to finish the SQL part in about a month.
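To give a flavour of the subqueries that made my head boil, here is a minimal sketch using Python’s built-in sqlite3 module (the course itself used Mode Analytics; the table, columns, and values here are made up for illustration):

```python
import sqlite3

# Build a tiny in-memory database with made-up order data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", 100.0), (2, "A", 300.0), (3, "B", 50.0), (4, "B", 250.0)],
)

# A subquery: find orders larger than the overall average order amount.
# The inner SELECT runs first and its single value feeds the outer WHERE.
rows = conn.execute("""
    SELECT account, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY amount DESC
""").fetchall()

print(rows)  # [('A', 300.0), ('B', 250.0)]
```

The trick that eventually clicked for me is reading the inner query on its own first, then treating its result as a plain value or table in the outer query.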

After three months of doing the challenge, I was surprised to find I was eligible for the second phase, in which I was awarded a free Data Analyst Nanodegree from Udacity. This is where things started to get really interesting. In the early phase, I honed my skills in pandas with a project analysing bike-share data. Learning new things is always hard: I did not have enough familiarity with pandas and had to search for functions and documentation frequently.
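To show the kind of pandas work the bike-share project involved, here is a minimal sketch with made-up trip data (the column names, stations, and values are hypothetical, not the project’s actual dataset):

```python
import pandas as pd

# A hypothetical slice of bike-share trip data.
trips = pd.DataFrame({
    "start_station": ["Main St", "Main St", "Park Ave", "Park Ave", "Park Ave"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"],
    "duration_min": [12.0, 35.0, 8.0, 15.0, 40.0],
})

# The kind of question such a project asks: how many trips does each
# user type take, and how long do they last on average?
summary = trips.groupby("user_type")["duration_min"].agg(["count", "mean"])
print(summary)
```

Most of my time went into exactly this pattern: load a CSV, group by some category, aggregate, and then dig into the documentation whenever a method did not behave as expected.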

Done with pandas, I was ready for the most exciting part of the nanodegree: A/B testing. Here, the Udacity course first explains the statistical theory behind A/B testing, which is basically hypothesis testing. The project was to analyse A/B test data for a website. I learned how to do bootstrap sampling to build a sampling distribution and then determine whether the new page was significantly better than the old one.
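As a rough sketch of the bootstrap idea, assuming made-up conversion data rather than the project’s actual dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical conversion data: 1 = visitor converted, 0 = did not.
old_page = rng.binomial(1, 0.11, size=5000)   # simulated old-page visitors
new_page = rng.binomial(1, 0.12, size=5000)   # simulated new-page visitors

observed_diff = new_page.mean() - old_page.mean()

# Bootstrap: resample each group with replacement many times to build a
# sampling distribution of the difference in conversion rates.
diffs = []
for _ in range(2000):
    boot_old = rng.choice(old_page, size=old_page.size, replace=True)
    boot_new = rng.choice(new_page, size=new_page.size, replace=True)
    diffs.append(boot_new.mean() - boot_old.mean())
diffs = np.array(diffs)

# A 95% bootstrap confidence interval for the difference; if it excludes 0,
# the new page looks significantly different from the old page.
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"observed diff: {observed_diff:.4f}, 95% CI: ({low:.4f}, {high:.4f})")
```

The appeal of the bootstrap for a beginner is that it replaces formula-heavy derivations with a simple simulation: resample, recompute, and look at the spread of the results.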

After the A/B test project, I was faced with a new programming language: R. The project was to perform exploratory data analysis in R on a red wine dataset. Not done with that, data wrangling in Python came next. And finally, visualisation in Tableau, the first BI tool I learned. All of this took around 4–6 months while working, spending almost every weekend on it.

If you would like to learn more about the projects, I am in the process of publishing my Data Analyst Nanodegree projects on my GitHub at https://github.com/cakraocha/udacity-bertelsmann-dataanalyst-nd-ica, so feel free to visit.

To summarise the getting-to-know phase, I managed to pick up the basic tools and math skills through just two MOOC platforms: Coursera and Udacity. This is only the start, though, because the challenging part is using those tools in the real world.

Phase 2: Looking for real-world application

Thanks to Udacity, I managed to change my career from business consultant to data analyst at another Big 4 firm. I was super excited at first because I was going to apply, and even learn more about, the data science process. I was wrong.

The first thing I was asked to do was build an automation system for the tax team using SSIS and Microsoft SQL Server. Although I had learned SQL at Udacity, it was MySQL, and apparently its syntax differs from Microsoft’s dialect, T-SQL (for example, MySQL limits results with LIMIT while T-SQL uses SELECT TOP). The different syntax was not a big problem, though; it was the other functionality T-SQL offers that I really had no idea about. I did not even know how to import data because of my unfamiliarity with SQL Server’s graphical user interface (GUI).

Not done with that, the firm at the time was using Power BI as its BI tool. Switching from Tableau to Power BI was a very different experience. One example of the difference is how data is imported: Tableau uses SQL-like joins to relate data, while Power BI uses relationships with a cardinality that has to be defined for each pair of tables. Another example is the syntax for complex aggregation: Tableau uses calculated-field expressions that are quite easy to learn, while Power BI uses a whole language called DAX, similar to Excel functions but with more syntax.

I spent my first weeks getting more familiar with T-SQL and Power BI. Although I managed to learn all of it, in all honesty, at some point I really wanted to give up because I felt I was not good enough for the job. I was saved by a really exciting project that demanded my analytical skills and made me feel I could contribute more.

During my time as a data analyst, there were many questions I was not able to answer. For example, how can I apply unsupervised learning to a survey with more than 100 features? How can I run statistical experiments on a survey that lead to a solution? Is experiment A worth the time? What if nothing comes out of it? In business, things move fast, and we have to decide whether an approach is worth pursuing or not. Moreover, the analysis involved other business functions as well, adding more complexity. More often than not, I knew the important questions to solve but did not know which approach suited the problem. That is when I realised I needed to learn more technical skills, especially math, statistics, and programming.

Phase 3: Getting back to learn

After all that learning while working, I decided I needed to accelerate my learning on the analytics path by going back to school. I applied to the Master of IT at The University of Melbourne and am currently on my way to finishing the course. I have learned so many technical skills, from Algorithms and Complexity to AI Planning and Statistical Machine Learning. It is definitely not easy: coming from an accounting background, I have to work extra hard to understand the math and the new terms in IT (for example, I only learned what Manhattan distance was when I started the course). I can assure you that this course is not for everyone, especially those coming from a completely different background. I recommend having these three things prepared:

  • Strong will to learn Math
  • Strong will to learn Programming
  • Time

Feel free to reach out to me if you have any questions about the degree and how I cope with it.

And that’s it!

My last recommendations to beginners:

Learn programming first

In my opinion, there are two ways to learn programming: the safe way and the hard way.

The safe way is, obviously, to learn Python. Python is a high-level language and quite easy to follow. Once you have mastered it enough, you can move straight on to pandas, NumPy, and even TensorFlow or PyTorch. The disadvantage of starting with Python is that object-oriented programming (OOP) and software engineering in general can be harder to grasp because of how permissive the language is.

The hard way is starting with C++. The advantages:

  • You can learn OOP right away.
  • You can learn a lot while studying and implementing algorithms.
  • Nowadays, C++ matters even in advanced deep learning, because the cores of TensorFlow and PyTorch are written in C++.
  • Once you master C++, you can learn other languages easier and faster.

Note that it is high-risk, because starting with C++ is super hard and it may well discourage you from learning programming altogether.

Thank you for reading about my journey in pursuing data science skills! I hope it leaves you thrilled to go through your own journey.

Reference

  1. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
