
Learning data science on your laptop presents challenges that serve as a useful foundation for understanding how the tools of data science work. Moving from laptop to enterprise, however, introduces several new challenges that the learning data scientist is rarely exposed to. I certainly experienced this in my first data science job, when I was handed a common data science problem: "Please develop micro-segments for our customer base so we can market to them more precisely." My plan was clear: engineer some features from the data, reduce those features to a manageable set, and then cluster away to a solution. Unfortunately, the problems began immediately, when I was left scratching my head over how to access the data in the first place.
Sure, colleagues shared code snippets containing connection parameters for the databases where the data lived. All well and good, but those snippets didn't solve my problem because they assumed I already had access to the data. And so, before I had even had a chance to flex my enterprise data science muscles, I was faced with one of the most prevalent issues enterprise data scientists deal with: security.
In this article, I will examine the top five challenges enterprise data scientists face and offer a few tips for overcoming them.

Challenge #1: Understanding the Role of Security
As my anecdote above illustrates, most data scientists are never exposed to enterprise security protocols. Making things worse, security is usually implemented in layers, meaning there are often many gates to pass through before you can access data, especially if those data live on different servers.
Keep in mind that enterprise security's goals include (1) protecting data at rest and in transit, (2) establishing identity and access management (IAM) controls, (3) disaster recovery, (4) education on vulnerabilities and social engineering, and (5) monitoring server endpoints for anomalous traffic.
The biggest concern from that list for data scientists? IAM. Data scientists are most often faced with IAM controls that are difficult to traverse when trying to access data for their duties. Here is a brief example of how this is operationalized in the enterprise setting. Each employee is given a unique user identity (ID). Those IDs are then assigned to different roles. At its most basic, a role allows a user to do certain things, such as READ, WRITE, and/or UPDATE a database. Individuals are also placed in groups, which let the security team grant access rights for multiple data sources to everyone who shares a similar job function.
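To make the roles-and-groups idea concrete, here is a toy Python sketch of how an IAM access check behaves. Every role, group, user ID, and data source name below is invented for illustration; real enterprise IAM is far more elaborate, but the mental model is the same:

```python
# Toy model of enterprise IAM: roles say WHAT you can do,
# groups say WHICH data sources you can do it to.
# All names here are hypothetical.

ROLE_PERMISSIONS = {
    "analyst_read": {"READ"},
    "engineer_rw":  {"READ", "WRITE"},
    "dba_full":     {"READ", "WRITE", "UPDATE"},
}

GROUP_DATA_SOURCES = {
    "marketing_analytics": {"customer_db", "campaign_db"},
    "data_engineering":    {"customer_db", "warehouse"},
}

# Each employee ID is assigned roles and groups.
USERS = {
    "jdoe": {"roles": {"analyst_read"}, "groups": {"marketing_analytics"}},
}

def can_access(user_id: str, source: str, action: str) -> bool:
    """Return True if the user's roles and groups permit `action` on `source`."""
    user = USERS.get(user_id)
    if user is None:
        return False
    allowed_actions = set().union(*(ROLE_PERMISSIONS[r] for r in user["roles"]))
    allowed_sources = set().union(*(GROUP_DATA_SOURCES[g] for g in user["groups"]))
    return action in allowed_actions and source in allowed_sources

print(can_access("jdoe", "customer_db", "READ"))   # True
print(can_access("jdoe", "customer_db", "WRITE"))  # False -- you need a new role, not new code
```

Notice the last line: when a query fails with a permissions error, the fix is usually an access request, not a code change. Knowing which role or group to request is exactly what meeting with your security team teaches you.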
Tips for dealing with security: The best way to overcome this challenge is to meet with your security team. Almost no one ever does, and most will be genuinely excited that you have taken the initiative to learn how their security protocols work. Most relevant is understanding how to request access by learning the systems that have been put in place (e.g., roles and groups and the ways to identify them). Moreover, meeting with security starts to build trust and shows that you, as a developer, are responsible enough to learn the proper channels for accessing data. When people trust you, they tend to be more open to giving you access to things 😊

Challenge #2: SELECT *; Understanding Big Data
Going from a laptop, with its limited compute resources and the small, canned data sets that are great for learning data science techniques, does not prepare the newly hired data scientist for the massive data available at enterprise scale. Thus, another common challenge I see young data scientists struggle with is how to properly sample colossal data sets into smaller, more manageable subsets that allow for effective experimentation and discovery.
Tips for dealing with Big Data: It is important to understand that even large corporations have limits on the compute resources available for data science work. Therefore, we need to be strategic in how we subset the data to identify smaller sets that allow for experimentation. Common variables to subset large data sets by include date ranges, lines of business, and customer segments. Then test model viability on the smaller set to see whether training on the larger data is even worth the effort. For example, if your classification model only reaches 60% accuracy but the business requires closer to 90%, simply leveraging a lot more data is unlikely to get you there, and another approach would be required.
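As a concrete illustration, here is a hedged sketch of that subset-then-test workflow in Python. The table, columns, connection string, and accuracy numbers are all illustrative assumptions, not a prescription:

```python
# Sketch: pull a business-meaningful subset, then run a cheap viability check.
# Table, column, and connection details are hypothetical -- adapt to your warehouse.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

engine = create_engine("postgresql://user:pass@warehouse-host/db")  # placeholder DSN

# Subset by variables the business understands: one quarter of one
# line of business, rather than SELECT * over years of history.
query = """
    SELECT *
    FROM customer_events            -- hypothetical table
    WHERE event_date BETWEEN '2021-01-01' AND '2021-03-31'
      AND line_of_business = 'retail'
"""
sample = pd.read_sql(query, con=engine)

# Assumes features are already numeric; real pipelines need more preparation.
X = sample.drop(columns=["churned"]).select_dtypes("number")  # hypothetical label column
y = sample["churned"]

# Viability check: if a baseline model is far below the business requirement
# on the sample, more rows alone probably won't close the gap.
baseline = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print(f"Sample accuracy: {scores.mean():.2f} (business requires ~0.90)")
```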

Challenge #3: Using Version Control
Although the tide is shifting as more and more students of data science are exposed to services like GitHub during their education, there is still a sizable number of budding data scientists who do not fully understand how, or why, we use version control systems like Git in the enterprise. For a more detailed treatment of Git for data science see here; in short, Git is powerful when collaborating with other developers, providing an efficient, traceable way to share and update code.
Tips for dealing with version control: The quickest tip is to learn how to use GitHub. When I was learning, my biggest challenge was the distinction between my local repository (the one that sits on your laptop or in your personal development environment) and the remote repository that represents the latest and greatest code for a particular solution. Changes made locally are not reflected in the remote repo unless you perform specific Git actions to update it (e.g., git commit to record the change locally, then git push to send it to the remote). There is a lot more to learn, but the sooner you start storing your own code on GitHub, the sooner you will internalize how Git works.
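For reference, that basic local-to-remote loop looks something like this (the repository URL, branch, and file names are placeholders):

```bash
# Clone the remote repo to create your local copy.
git clone https://github.com/your-org/churn-model.git
cd churn-model

# Work on a branch rather than directly on main.
git checkout -b feature/new-segmentation

# ...edit code...

git add segmentation.py
git commit -m "First pass at micro-segmentation features"   # recorded locally only
git push -u origin feature/new-segmentation                 # now the remote can see it
```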

Challenge #4: Understanding How to Scale
Building on the compute limitations most learning data scientists face is the enterprise concern of how to scale data science solutions for production. Say we train a model on 200,000 customers that can identify the likelihood that any one customer will cancel a subscription. Now the enterprise wants to run your model on all 12 million of its subscribers, every time the data refreshes, every hour. As the sweat begins to bead on your forehead just reading that scenario, know that this is a significant enterprise problem with several possible solutions. Knowing the solutions available can greatly inform your development efforts, which is why this challenge is so important to consider early in your data science career.
Tips for dealing with scale: There are many scalable data science frameworks, and each fits slightly different use cases with some overlap. The way I think about scale is to consider whether your production solution needs to be transactional, where you score a single customer's data in real time as it arrives, or batch, where you score millions of customers at once (like the scenario above). Transactional data science products are useful for building intelligent applications, and because those applications have users, they need to be responsive to the user experience; that means they must be lightweight and quick. Batch data science products do not need to return results as quickly, but we also don't want them taking days on end to complete. This distinction is a slight oversimplification, as many use cases fall somewhere in between, but the frameworks for each are different and can be combined in those cases.
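To make the transactional case concrete, here is a minimal sketch of a scoring service using Flask, assuming a scikit-learn-style model has already been trained and pickled; the file name, route, and feature payload are illustrative:

```python
# Sketch of a transactional scoring service: one customer in, one score out.
# Assumes a trained model saved as model.pkl (hypothetical).
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # load once at startup, not per request

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON body like {"features": [0.2, 1.0, 3.5]} for one customer.
    features = request.get_json()["features"]
    proba = model.predict_proba([features])[0][1]
    return jsonify({"churn_probability": float(proba)})

if __name__ == "__main__":
    app.run(port=8080)  # in production: containerize and scale horizontally
```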
To overcome this challenge more generally, understand that transactional data science typically involves containers that can be scaled horizontally (adding more lightweight machines as demand increases) using services like Kubernetes, Elastic Container Service in the cloud, or cloud functions. Batch data science requires frameworks that manage and orchestrate the splitting of large data sets across multiple cores (vertically) and multiple machines (horizontally). Frameworks for these very large data sets include Spark, Dask, and Ray. These latter frameworks are particularly good for data scientists to learn because they also enable distributed model training and can themselves be packaged in containers, further improving scalability for complex models that operate on transactional data.
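And for the batch case, here is a hedged sketch of scoring millions of rows with Dask; the paths, columns, and model file are again illustrative assumptions:

```python
# Sketch of batch scoring with Dask: split the subscriber base into
# partitions and score each one, across cores or across a cluster.
import pickle
import dask.dataframe as dd

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # same hypothetical trained model as above

# Partitioned parquet lets Dask parallelize the read and the scoring.
subscribers = dd.read_parquet("s3://bucket/subscribers/*.parquet")  # placeholder path
feature_cols = ["tenure_months", "monthly_spend", "support_tickets"]  # hypothetical

def score_partition(pdf):
    # Each partition arrives as an ordinary pandas DataFrame.
    pdf["churn_probability"] = model.predict_proba(pdf[feature_cols])[:, 1]
    return pdf

# Tell Dask the output schema explicitly so it need not infer it.
meta = {col: str(dtype) for col, dtype in subscribers.dtypes.items()}
meta["churn_probability"] = "float64"

scored = subscribers.map_partitions(score_partition, meta=meta)
scored.to_parquet("s3://bucket/scored/")  # triggers the distributed computation
```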

Challenge #5: Communicating Data Science to Business Stakeholders
The final challenge I commonly see young data scientists face is dealing with business stakeholders. While we data scientists speak of probabilities, specificity, precision, and ROC curves, business stakeholders think in terms of key performance indicators (KPIs), business rules, and financial impacts. In other words, there is a disconnect between the language of data science and the language of the business stakeholders who use our products to inform their decisions.
Tips for overcoming this challenge: The best way to overcome this challenge is to learn your enterprise's KPIs. Work to align your data science efforts with those business KPIs. Deliver results in terms of business decisions, and tell the business story of the consequences of each decision. Simulate how using your product moves KPIs relative to a simulation that does not use your product.
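One way to run that comparison is with a toy simulation. The sketch below uses entirely made-up numbers (churn rate, offer effectiveness, model precision) just to show the shape of the argument: translate model performance into the KPI the business cares about.

```python
# Toy simulation: retention campaign with vs. without the model.
# Every number here is a hypothetical placeholder -- substitute your own.
import numpy as np

rng = np.random.default_rng(0)
n_customers = 100_000
base_churn_rate = 0.10        # hypothetical monthly churn
offer_save_rate = 0.30        # hypothetical: offers retain 30% of reached churners
budget = 10_000               # we can only contact 10k customers

churners = rng.random(n_customers) < base_churn_rate

# Without the model: contact a random 10k customers.
random_contacted = rng.choice(n_customers, size=budget, replace=False)
saved_random = churners[random_contacted].sum() * offer_save_rate

# With the model: suppose its top 10k contains churners at ~30% precision
# (three times the base rate -- substitute your real lift curve).
model_precision = 0.30
saved_model = budget * model_precision * offer_save_rate

print(f"Churners saved, random targeting: {saved_random:.0f}")
print(f"Churners saved, model targeting:  {saved_model:.0f}")
```

Notice the stakeholder never hears "precision" directly; they hear "customers saved," a number they can multiply by customer lifetime value.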
Bonus tip: If there is one final piece of advice I can offer, it is that metrics are motivating, and moving metrics are even more motivating. Demonstrating through visualizations how business metrics move in relation to your data science efforts keeps your business consumers focused on your value, rather than on trying not to look stupid because they have no idea what your model is doing.