Minimum viable domain knowledge in data science

Abhishek Mukherjee
Towards Data Science
9 min read · Apr 15, 2019

Image by Gerd Altmann from Pixabay

When I was finishing my PhD in physics almost a decade ago, there was a steady stream of physicists moving to finance, biology and other fields where quantitative skills were in high demand. Such moves were, in large part, motivated by the belief, tinged with a hint of arrogance, that the modeling expertise that one acquires while doing a doctorate in physics is transferable to other domains. Some of these moves were successful, some were not. To the best of my knowledge, the physicists who ended up making successful moves were the ones who had the humility to recognize that productive modeling depends quite crucially on the semantics of a field, and had the tenacity to learn the relevant domain knowledge.

In the years after I finished my PhD, the stream turned into a wave. There is now a mass migration from quantitative fields into various industry and research sectors. At some point I too joined this mass of domain immigrants, as a data scientist. From my own experience and from those of my peers, I am quite sure that what was true for my physicist peers a decade ago is also true for data scientists: it is simply not possible to do useful data science without sufficient domain knowledge.

This is not a new realization. The earliest archetype of a data scientist comes from Drew Conway, who paints a data scientist as someone well versed in mathematics/statistics, programming (hacking skills) and domain knowledge (substantive expertise). However, while there are numerous books, articles, courses and self-help guides dedicated to the mathematical/statistical and programming aspects of data science, very little is written about the crucial domain aspect. In this piece I will present my own opinionated view on the topic, informed by my experience and that of my peers.

Since “data scientist” was declared to be the sexiest job of the 21st century, data science has come to mean different things to different people. I think of data science as the act of building computational models for complex systems while leveraging moderate to large amounts of data. Domain knowledge provides the context for building these models. There are three main distinguishable, yet interrelated, aspects to domain knowledge: the problem context, the information context, and the data collection mechanism.

The problem context

One has very little chance of solving a problem if one does not understand what one needs to solve. But what does it really mean to understand the problem context in data science? Consider the example of a recommender system. No one wants to build a recommender system just for the sake of building one; perhaps a business wants to increase its revenue, and believes that building a recommender system will help towards that goal.

The first step is to specify and formalize the goal, e.g. increase revenue. The model that a data scientist builds is ultimately used to calculate some quantity, and the data scientist needs the domain knowledge to clearly articulate the domain-specific assumptions that relate the problem goal to that calculated quantity. In the recommender system example, the model might calculate the affinity that a user has towards a product. Using this calculation one can show a user the products towards which (s)he has the most affinity, as calculated by the model. The assumption here is that if users are shown the products they have the most affinity towards, then they have a higher likelihood of buying them, hence increasing revenue for the business.
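
As a concrete illustration, here is a minimal sketch of how such a calculated quantity might be turned into recommendations, assuming some upstream model has already produced per-product affinity scores for a user; the scores, product names and the `recommend_top_k` helper are all made up for this example.

```python
import numpy as np

def recommend_top_k(affinity, product_ids, k=5):
    """Return the k products with the highest model-predicted affinity for one user.

    `affinity` is a 1-D array of scores produced by some upstream model;
    how those scores are computed is exactly the domain-specific part.
    """
    top = np.argsort(affinity)[::-1][:k]
    return [product_ids[i] for i in top]

# Toy usage: made-up affinity scores for five hypothetical products
scores = np.array([0.12, 0.87, 0.45, 0.90, 0.33])
products = ["p1", "p2", "p3", "p4", "p5"]
print(recommend_top_k(scores, products, k=2))  # ['p4', 'p2']
```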

More often than not, it is not possible to directly measure whether a model is achieving its goal. Rather, one needs to make this judgement based on proxies: the evaluation metrics. A data scientist needs to be able to provide a reasoning (preferably quantitative and data driven) as to why the chosen evaluation metric is a suitable one. Typically, this boils down to a question of attribution. For example, one might use the click-through rate as an evaluation metric for an online product recommender system. One then needs a plausible model for attributing a part of the revenue to the clicks generated by the recommendations. This, in turn, requires the data scientist to understand how users browse webstores.
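
To make the attribution assumption explicit, here is a small sketch; the last-click style attribution and the `attribution_weight` parameter are illustrative assumptions, not a recommendation of any particular attribution model.

```python
def click_through_rate(impressions, clicks):
    """Fraction of recommendation impressions that were clicked."""
    return clicks / impressions if impressions else 0.0

def attributed_revenue(orders, attribution_weight=1.0):
    """Sum revenue from orders that followed a recommendation click.

    `orders` is a list of dicts like {"revenue": 30.0, "via_recommendation_click": True}.
    `attribution_weight` encodes the (domain-specific!) assumption about how much
    of such an order the recommender actually caused.
    """
    return attribution_weight * sum(
        o["revenue"] for o in orders if o["via_recommendation_click"]
    )

orders = [
    {"revenue": 30.0, "via_recommendation_click": True},
    {"revenue": 12.0, "via_recommendation_click": False},
]
print(click_through_rate(impressions=1000, clicks=37))      # 0.037
print(attributed_revenue(orders, attribution_weight=0.5))   # 15.0
```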

Given a model and a dataset, one can calculate quantities only up to a certain level of accuracy. Typically, it is relatively easy to build models with moderate to good accuracy, but large investments are required to push accuracy beyond that point. On the other hand, the level of accuracy that is actually required depends very heavily on the problem at hand. A data scientist should have a very clear picture of the value generated by incremental improvements in accuracy. In many cases the value vs accuracy graph looks like an S: there is a threshold below which the model generates almost zero or negative value, then a range where small increments in accuracy result in proportional or exponential returns in value, and finally a threshold beyond which any further increase in accuracy yields diminishing returns. Other profiles of the value vs accuracy graph are also possible. See this excellent piece that elaborates on this topic. Only when a data scientist has a good understanding of the payoff, i.e. the additional value generated by increased accuracy, can (s)he make an educated decision about what level of accuracy is good enough.
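
For concreteness, one could sketch an S-shaped value vs accuracy curve with a logistic function; the threshold, steepness and monetary scale below are entirely made-up numbers, chosen only to show the shape of the payoff.

```python
import numpy as np

def value_of_accuracy(acc, threshold=0.7, steepness=25.0, max_value=100_000):
    """Hypothetical S-shaped value-vs-accuracy curve.

    Below `threshold` the model is worth roughly nothing; around it small
    accuracy gains pay off disproportionately; far above it returns flatten out.
    All numbers here are invented for illustration.
    """
    return max_value / (1.0 + np.exp(-steepness * (acc - threshold)))

for acc in [0.60, 0.68, 0.72, 0.80, 0.95]:
    print(f"accuracy {acc:.2f} -> value ~{value_of_accuracy(acc):,.0f}")
```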

A data science problem is almost never an unconstrained optimization problem — there are always constraints. Some of those constraints will be technical in nature, such as an upper bound on the amount of computational resources available. Other constraints will be domain related, such as constraints related to fairness and privacy or constraints related to user experience. A data scientist needs to have the domain knowledge to understand the bounds set by these constraints, and ensure that the model stays within those bounds — otherwise there is little hope of the model ever seeing the light of day in production.

A data scientist would have understood the problem context if (s)he could

  • formalize the problem goal and relate it to the calculated quantities and the evaluation metrics of the model
  • draw at the very least a semi-quantitative version of the value vs accuracy graph
  • show that the model is consistent with the problem’s constraints

Information context

The problem context deals with the output of the model. The information context deals with its input. The hype surrounding big data and machine learning has given rise, in some quarters, to the idea that one can simply throw raw data at one end of an algorithm and generate meaningful insights at the other. That algorithm does not exist — deep learning or otherwise.

One domain where deep learning has been quite successful is image recognition (although one should be careful to not overstate the case given the susceptibility of deep neural networks to adversarial attacks). Nowadays we even have pre-trained models for image recognition that one can download and use with very little to no fine-tuning.

Let us consider for a moment the reasons behind this success. Images always consist of pixels. The pixels have the attributes of color and intensity, and they are always arranged on a regular grid. For the purposes of image recognition, we expect the answer to be unchanged under certain transformations such as translations, rotations and changes of lighting. None of these pieces of information were obtained from math/stat or hacking. Rather, they are part of the domain knowledge of images. The machine learning models that are successful in image recognition, such as those that use convolutional neural networks, are built on top of this domain knowledge.
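
A small sketch, using nothing but NumPy, of the domain knowledge that convolutional networks bake in: convolving an image commutes with translating it (away from the borders). The naive `conv2d_valid` helper is written out only for illustration; real code would use an optimized library.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (strictly, cross-correlation)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))

shifted = np.roll(img, shift=1, axis=1)   # translate the image one pixel to the right
out_img = conv2d_valid(img, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# Away from the border, the feature map of the shifted image is just the shifted
# feature map of the original image: translation equivariance.
print(np.allclose(out_img[:, :-1], out_shifted[:, 1:]))  # True
```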

In some sense, image recognition is an “easy” domain. Images are geometric objects, and hence it is easier to formalize the domain knowledge about them. Natural language processing has also seen some recent success in building universal models. Once again, although written text is not as structured as images, one can still identify words as the basic entities of a language, with a document being an ordered list of words; i.e. there is some natural structure to the data.
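
A minimal sketch of that natural structure for text, assuming a crude regex tokenizer is good enough for illustration: a document becomes an ordered list of word tokens.

```python
import re

def tokenize(document):
    """Turn a document into an ordered list of word tokens
    (crude illustrative tokenizer: lowercase, keep alphanumerics and apostrophes)."""
    return re.findall(r"[a-z0-9']+", document.lower())

print(tokenize("Context is king! Domain knowledge provides the context."))
# ['context', 'is', 'king', 'domain', 'knowledge', 'provides', 'the', 'context']
```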

Unfortunately, the data in most domains do not come with such natural structure. Neither are they big or interesting enough for large multinational companies or research institutions to invest in building domain-specific algorithms. In these domains, it is up to the data scientists working in the trenches to determine the information context of the data.

One might think that the information context is nothing but the structure in the data. But it is much more than that. The training data for an algorithm and the inputs of a model, be they sets of vectors, tensors, or ordered lists of tokens, all have structure. But by themselves they do not have any information context. The information context essentially consists of the recognizable concepts within a domain: entities, relationships and their attributes. For example, pixels (entities), their colors and intensities (attributes), and their relative positions in a grid (the relationships) form the information context within the image domain. A data scientist should be able to determine the relevant information context and assign it to the data.
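
One way to make such an information context explicit is to write it down as code. Here is a minimal sketch for the pixel example; the `Pixel` class and `neighbours` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Pixel:                 # entity
    row: int                 # position in the grid ...
    col: int                 # ... i.e. the relationship to other pixels
    intensity: float         # attribute
    color: tuple             # attribute, e.g. (r, g, b)

def neighbours(p, pixels):
    """Relationship: pixels that sit directly next to `p` on the regular grid."""
    return [q for q in pixels
            if abs(q.row - p.row) + abs(q.col - p.col) == 1]

grid = [Pixel(r, c, intensity=0.5, color=(0, 0, 0)) for r in range(3) for c in range(3)]
print(len(neighbours(grid[4], grid)))  # the centre pixel of a 3x3 grid has 4 neighbours
```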

One of the reasons for emphasizing the information context is interpretability. Remember, interpretability is always domain specific. If one does not understand the concepts of the domain, one has very little chance of building interpretable models. But it goes beyond model interpretability and into model improvability. Although models work with data, modeling happens at the conceptual level. For a “data scientist” to live up to the second word in that title, (s)he should be able to design sensible strategies for incrementally improving a model, beyond trial-and-error (hyper)parameter tweaking. It is simply not possible to devise such strategies without a conceptual understanding of the domain.

A data scientist would have understood the information context if (s)he could

  • formalize the relevant concepts of the domain in terms of entities, relationships and their attributes
  • map the training data set for the algorithm, and the inputs/outputs of the resulting model, to the aforementioned conceptual formalism.

Data collection mechanism

Most, if not all, data-driven modeling methods make some assumptions about the representativeness of the available dataset with respect to the overall population that one is interested in. In practice, this assumption is rarely satisfied, which limits the trust that one can place in the resulting model. The degree of non-representativeness of the dataset depends acutely on the data collection mechanism, which is domain dependent. Elsewhere, I have discussed at length how this representativeness assumption can be violated in different scenarios. I will not repeat that discussion here. Suffice it to say that a data scientist needs a clear understanding of the data collection mechanism to understand the robustness of the model’s outputs.
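
One simple way to probe representativeness, assuming one has a reference sample of the population of interest, is to compare feature distributions between the collected data and that reference, for example with a two-sample Kolmogorov-Smirnov test from SciPy. The synthetic data below is fabricated purely to illustrate the check.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical example: the population of interest vs. the sample we actually
# collected, which is skewed because of how the data was gathered.
population_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
collected_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)

stat, p_value = ks_2samp(collected_feature, population_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
# A tiny p-value flags that the collected data is not representative of the
# population along this feature: a prompt to investigate the collection mechanism.
```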

Understanding the data collection mechanism is also important for another reason. In general, there are two ways of improving model performance: (i) more powerful algorithms, and (ii) better quality and/or larger quantities of data. A data scientist needs to have enough of an understanding of the data collection mechanism to determine if there are any levers available for getting better or more data.

Consider a scenario where we are asked to classify images that are all generated in one room (including the labelled data), and imagine we have built a moderately accurate model. Any further improvement based on more powerful algorithms will require a big investment in algorithmic development. However, suppose we could control the lighting in the room. In that case we could tweak the lighting to get better quality images, and hence a better model, without the huge investment in building more powerful algorithms.

A data scientist would have understood the data collection mechanism if (s)he could

  • determine the limitations of the model resulting from the non-representativeness of the dataset
  • devise techniques to mitigate the non-representativeness problem, e.g. by having an exploration budget for showing random predictions (see the sketch after this list)
  • suggest tweaks to the data collection mechanism that might result in better model performance.
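
As a sketch of the exploration-budget idea from the second bullet, here is an epsilon-greedy variant of the earlier recommender helper; the `epsilon` parameter, scores and product names are again purely illustrative, and a real system would also need to handle duplicates and log which recommendations were exploratory.

```python
import random

def recommend_with_exploration(user_affinity, product_ids, epsilon=0.1, k=5):
    """Spend an exploration budget: with probability `epsilon` show a random
    product instead of a top-scored one, so that future training data is not
    limited to what the current model already likes. Illustrative only."""
    ranked = [p for _, p in sorted(zip(user_affinity, product_ids), reverse=True)]
    recommendations = []
    for slot in range(k):
        if random.random() < epsilon:
            recommendations.append(random.choice(product_ids))  # explore
        else:
            recommendations.append(ranked[slot])                # exploit
    return recommendations

print(recommend_with_exploration([0.9, 0.1, 0.5, 0.7], ["a", "b", "c", "d"], k=3))
```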

I will end this piece by reiterating what I said in the beginning. The modeling skills that one learns in quantitative fields are highly transferable to other domains. But you can only build the model right if you know which model is the right one to build. And all the math or all the hacking in the world will not be enough to fix that. Taking the time to really understand what you are trying to model will. Context is king!
