Want To Be A Valuable Data Scientist? Then Don’t Focus On Creating Intricate Models


So you decided to become a data scientist.

Congrats! As a fellow data scientist, I can say that my career is fulfilling and rewarding. That being said, job expectations are different from reality.

I get a lot of questions from aspiring data scientists asking what they should focus on. The answers I hear range from Deep Learning Udacity Nanodegrees, to Advanced Statistical Analysis on Coursera, to visualization tutorials on Tableau’s training website, to software engineering documentation about data pipelines and Spark. While these are all important, that’s a lot of material to cover. I wasn’t a master of all of this when I first got hired; I learned only what I needed to know for the task at hand. Yes, it required sacrificing some weekends and downtime to learn a particular technology. But self-learning only when I absolutely need the information keeps me from getting bogged down by the vast amount of content out there.

So yes, curiosity to learn new tools and concepts is a must. As a data scientist/software engineer, you’re already aware of how quickly the software world changes. Every month, a package your tool depends on releases an update. Every six months, new software tools come out that address the problems you were trying to solve before.

That being said, I believe there’s another skill that every data scientist should master: the ability to analyze the data.

Wait a minute. Shouldn’t a data scientist do more sophisticated work such as building machine learning models?

Not exactly. Building a machine learning model is pretty straightforward. Say I want to extract Medium blog texts to build my own NLP classifier: a tagging system that determines whether a blog post is about politics, sports, business, entertainment, etc. All I really need is one line of code to read in the texts and associated labels, one line to vectorize the texts (convert them from text to numerical format), one line to build a Naive Bayes classifier trained on those texts and labels, and one line to deploy the classifier to an endpoint (SageMaker, etc.). That’s four lines of code to have a model in production.
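To make that concrete, here’s a minimal sketch of such a pipeline using scikit-learn. The posts and labels are made up stand-ins for scraped Medium texts, and the deployment step is omitted since it’s platform-specific (SageMaker, etc.):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for scraped blog texts and their tags
texts = ["election vote senate", "goal match referee", "stocks market merger"]
labels = ["politics", "sports", "business"]

vectorizer = CountVectorizer()          # text -> word-count features
X = vectorizer.fit_transform(texts)     # vectorize the corpus
model = MultinomialNB().fit(X, labels)  # train the Naive Bayes classifier

print(model.predict(vectorizer.transform(["senate vote"])))  # ['politics']
```

The real version would read texts from disk or an API and push the fitted model to an endpoint, but the core training loop really is this short.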

Granted, there are some data scientists who can build their own neural networks using PyTorch. That does require graduate-level math and statistics. But those jobs are rare, and they’re usually at FAANG-scale companies with an established data infrastructure for research work. Many data scientists stick to simple machine learning models and focus on feeding them the right data. Determining the right data requires analyzing what data they have and extracting the parts that are useful.

But what if we want to improve our prediction speed? Shouldn’t we need a complex machine learning model to achieve that?

Perhaps. Microsoft has built an impressive gradient boosting framework called LightGBM. I have tested it and compared it against XGBoost, which offers one of the fastest scikit-learn-compatible classifiers. LightGBM is lightweight, so its predictions are faster than XGBoost’s. LightGBM also supports parallel and GPU learning, further optimizing for speed. However, there are cases where LightGBM is not recommended: its documentation suggests at least 10,000 training data points for it to be effective. Otherwise, it is prone to overfitting.

Furthermore, it is unwise to choose an algorithm for speed alone if you don’t fully understand how it works. Take our NLP classifier from the previous example. Why did I use Naive Bayes instead of a boosting algorithm? Why did I select Naive Bayes instead of a decision tree algorithm? The reason is that Naive Bayes is straight-up math: a prediction is just multiplying probabilities, which is about as fast as you can get. True, this assumes your features are conditionally independent within each class/label. But Naive Bayes outperforms any boosting algorithm in terms of speed. Boosting and decision tree algorithms are slower because they need to traverse a tree path to get the optimal prediction. Furthermore, Naive Bayes performs well on smaller data sets, which decision trees tend to overfit on.
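The “straight-up math” claim can be shown in a few lines. This toy sketch (with made-up priors and word likelihoods) computes a Naive Bayes prediction as a single pass of additions in log space, with no tree to traverse:

```python
import math

# Made-up priors and per-class word likelihoods for two labels
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihood = {
    "sports":   {"goal": math.log(0.30), "vote": math.log(0.05)},
    "politics": {"goal": math.log(0.05), "vote": math.log(0.30)},
}

def predict(words):
    # score(label) = log P(label) + sum of log P(word | label)
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            log_likelihood[label].get(w, math.log(1e-6)) for w in words
        )
    return max(scores, key=scores.get)

print(predict(["goal", "goal"]))  # sports
print(predict(["vote"]))          # politics
```

Prediction cost is linear in the number of words in the document; there is no per-tree traversal as in boosting or random forests.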

Take a step back and think of NLP in a broad sense. When building an algorithm, you need features for your model. In NLP, those features end up being the unique words of the text. That can be over 2,000 features in a single blog post, even after filtering out stop words and other unnecessary tokens! Imagine how long it would take to build a random forest with 2,000 features: hours on end to train on a single CPU, and possibly a few seconds to classify a new blog post. Boosting improves on decision trees in terms of prediction speed, but Naive Bayes will still outperform any tree-based algorithm. Accuracy, however, is another story: decision tree algorithms are often more accurate than Naive Bayes, but they can overfit depending on your dataset.
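The “every unique word becomes a feature” point can be sketched with the standard library alone. The stopword list and posts below are made up for illustration:

```python
import re

# Tiny illustrative stopword list (real lists have hundreds of entries)
STOPWORDS = {"the", "a", "is", "and", "of"}

def feature_vocabulary(texts):
    """Collect the unique non-stopword words across a corpus.
    Each word in this set would become one model feature."""
    vocab = set()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                vocab.add(word)
    return vocab

posts = ["The game is a classic", "A classic game of chess"]
print(sorted(feature_vocabulary(posts)))  # ['chess', 'classic', 'game']
```

Run this over one real blog post instead of two toy sentences and the vocabulary easily climbs into the thousands, which is exactly the feature count a tree ensemble would have to split on.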

If you’re building an NLP classifier for your company, you have to understand what tradeoff you are fine with. This is all dependent on the data you have, and you’ll have to analyze it to determine which algorithm works best.

Note: If you want to learn more on the details behind these algorithms, I recommend StatQuest to learn more on statistics and different machine learning algorithms.

Fair point, but isn’t that what a data analyst already does? Is a data scientist really nothing more than a glorified analyst?

Yes. Data scientists just have more technical chops (software engineering, algorithm design, cloud development) than data analysts. That might change in the future as the tools get easier to use.

Ok, so why can’t I let the analyst do that job while I focus on the cool, complex model?

You can, but that just hinders your development as a data scientist. Alluding to my prior point, it’s always better to feed a simple model with clean data than a complex model with bad data. Getting clean data requires analyzing data on your end so you can design a pipeline to effectively build and train your model.

To illustrate, I’ll share a work example with you. At my job, our team was building an NLP classifier for patient medical records. Our client wanted an automated tagging system so that they could go through 1,000-page medical records and see what each page is about. We had 50+ classification labels, ranging from Heart Conditions to Brain Injuries.

We were also given limited training data for each classification: 5 PDFs per label, each 20–1,000 pages long. I can’t share the details of how we solved that problem, but we were able to get models to 90% accuracy and above.

Our team wondered if we could publish those models to GitHub. We wanted some sort of version history to track the changes we made to improve model accuracy. The problem is that we were analyzing medical records, and we had to make sure there was no trace of Protected Health Information (PHI) in any of our code, scripts, or models. It doesn’t matter that the GitHub repository would be private: we would risk exposing PHI if GitHub ever suffered a data leak.

For those unfamiliar, PHI includes a patient’s first name, last name, SSN, address, date of birth, and so on. This information theoretically won’t be part of our model features, and we had removed all traces of it. However, patients’ names are tricky when hyphens are involved. Take hailey-hailey: that’s the name of a skin disease (Hailey-Hailey disease), not a person’s last name, so it would be a relevant feature for our models. So there are edge cases for when we would keep a hyphenated name.

I was double-checking model features for our Back Injuries model and came across an interesting feature. (Note: I can’t list the actual patient name for PHI reasons, so I’m using a fictional character’s name, Emma Geller-Green.)

In this case, a patient’s full name had appeared as a model feature. We were perplexed as to how it got there, for two reasons:

  • The Back Injuries training data should not have picked up a person’s name as an important feature. A person’s name usually appears about 5 times in a 400-page medical record, so its frequency would be too low for the Back Injuries model to pick up on. Furthermore, a person’s name is rarely mentioned on a page that describes back injuries.
  • Our stopwords list contains first names like emma. Since we didn’t have logic to address hyphenated last names, geller-green on its own should have been what was left over; emma should have been removed.
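The hyphen edge case described earlier can be sketched as a simple token filter. Both lists below are made up for illustration, not our actual production logic:

```python
# Hyphenated disease names we want to keep as features
MEDICAL_HYPHENATED = {"hailey-hailey"}
# Hyphenated patient surnames we must scrub as PHI
PATIENT_SURNAMES = {"geller-green"}

def keep_token(token: str) -> bool:
    """Decide whether a token survives PHI scrubbing."""
    t = token.lower()
    if "-" not in t:
        return True                      # non-hyphenated tokens pass through
    if t in MEDICAL_HYPHENATED:
        return True                      # relevant medical feature, keep it
    return t not in PATIENT_SURNAMES     # drop known hyphenated surnames

tokens = ["hailey-hailey", "geller-green", "lumbar"]
print([t for t in tokens if keep_token(t)])  # ['hailey-hailey', 'lumbar']
```

The hard part in practice is populating those two lists, which is exactly the kind of data analysis this article argues for.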

We next double-checked how the optical character recognition (OCR) tool read the text of these medical records. It turned out the OCR was reading geller-greenemma as one word. OCR tools like Tesseract are impressive and fairly accurate at reading messy PDFs, but they are far from perfect.
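A toy sketch (hypothetical tokens) shows why a fused OCR token slips right past a stopword filter: the filter matches whole tokens, and geller-greenemma matches nothing in the list:

```python
# Hypothetical stopword entries covering the first name and the surname
stopwords = {"emma", "geller-green"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

clean_scan = ["geller-green", "emma", "back", "injury"]
fused_scan = ["geller-greenemma", "back", "injury"]  # OCR merged two words

print(remove_stopwords(clean_scan))  # ['back', 'injury']
print(remove_stopwords(fused_scan))  # ['geller-greenemma', 'back', 'injury']
```

With a clean scan, both name parts are stripped; with the fused token, the full name survives and is free to become a model feature.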

So that explains why emma wasn’t removed. But it still doesn’t explain why the Back Injuries model picked up the full name as a key feature. We went back to the 5 training PDFs for Back Injuries and opened a 40-page one with almost every page classified as Back Injury. To our surprise, that PDF was from the 1980s, and every single page had Geller-Green Emma in big, bolded letters in the header.

A machine learning model doesn’t know what back injuries are. It just notices patterns and makes assumptions. The fact that Geller-Green Emma was on every training page labeled Back Injury was enough for the model to assume that this name indicated that particular specialty. Naturally, our team added logic to handle those 1980s PDFs and remove the hyphenated patient names from them. Instead of creating our own PyTorch model to handle this exception, we just cleaned up the data set. This approach was easier for us to test and quickly deploy to production.

A model in production is always predicting on new, unseen data, and it could very well make the same mistake on a different name. That’s why analyzing and cleaning your data is so important before deploying to production.

Plus, I would hate to be diagnosed with back problems just because I told my doctor, “I think Emma Geller-Green’s mother looks cute”.

Thanks for reading! If you want to read more of my work, view my Table of Contents.

If you’re not a Medium paid member, but are interested in subscribing to Towards Data Science just to read tutorials and articles like this, click here to enroll in a membership. Enrolling in this link means I get paid for referring you to Medium.
