
It’s just semantics | How data speaks (not) for itself

Coherent insights require coherent data

How many active clients have you had over the last three months? If, by the end of this article, you answer "it depends", I have achieved my goal. Without context, this question should be impossible to answer.

Photo by David Travis on Unsplash

Yes, I am one of those "It depends" guys. Those irritating people who, when you ask them a question, answer back with another question. As illustrated in the comedy sketch ‘The Expert’ (painful to watch as it really is quite true), the expert knows the question is unreasonable and tries his utmost to explain why he needs more context in order to "draw 7 perpendicular red lines with transparent ink".

If you want to answer my opening question on the number of active clients, you could provide an answer quickly. It will, however, be wrong, or at best incomplete, because you did not ask for my definition of "client" or "active", or which dates bound the three months in question.

What if, in this particular case, someone’s monthly subscription debit order was due (he selected the 30th) within the three-month period you have in mind, but because of a public holiday it only actually went off on the 1st, which falls outside the period you want to look at? What if he didn’t have enough money in his account on the 1st and the debit was rejected? Is he active? Is it within three months? He has also now cancelled his subscription, but only effective next month…

I am overcomplicating the example to prove the point … context matters. There is no way that raw, un-conformed data will EVER make sense to an algorithm on its own. Some interpretation is required as to which of these pedantic questions are unnecessary.
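To make that concrete, here is a minimal sketch (the records, field names and dates are invented purely for illustration) of how three equally defensible definitions of "active" give three different answers on the same raw data:

```python
from datetime import date

# Hypothetical raw debit-order records: (client_id, due_date, collected_date, status).
debits = [
    ("C1", date(2021, 6, 30), date(2021, 7, 1), "collected"),  # public holiday: collected a day late
    ("C2", date(2021, 5, 30), date(2021, 5, 30), "collected"),
    ("C3", date(2021, 6, 30), date(2021, 7, 1), "rejected"),   # insufficient funds on the 1st
]

period_start, period_end = date(2021, 4, 1), date(2021, 6, 30)

def in_period(d):
    return period_start <= d <= period_end

# Three equally defensible readings of "active in the last three months":
billed_in_period    = {c for c, due, _, _ in debits if in_period(due)}
collected_in_period = {c for c, _, coll, s in debits if s == "collected" and in_period(coll)}
paid_for_period     = {c for c, due, _, s in debits if s == "collected" and in_period(due)}

print(len(billed_in_period), len(collected_in_period), len(paid_for_period))  # 3 1 2
```

Same data, three different client counts, and none of them is "the" right answer until someone decides what the question actually means.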

Business & Technical validity

When I started my career 12 years ago, the first lesson my boss taught me was the difference between technical effective and business effective dates. Since then I had always thought this distinction was common knowledge, until a certain trend appeared in my day-to-day dialogue. At first I thought it was just me, but as more and more meetings popped up where effective dates were completely confused, I started entertaining the idea that maybe it wasn’t me, maybe it was you.

I had to repeat this dialogue several times… for years, in different contexts, and sometimes to the same people. It was then that I realised that common knowledge isn’t all that common (to borrow the phrase from Voltaire).

I feel silly needing to spell this out, but I am going to.

Technical effective dates are not the same as business effective dates. Data can be technically valid, but not business valid. Data can be within business validity dates, but not technically valid. Different sources and different systems deal with technical validity and business validity in completely different ways.

At the start of my career I worked on a module in SAP that handled this extremely well, something I still haven’t seen matched elsewhere. It is a massive overhead of fields, but there isn’t a shadow of a doubt about what the data is saying. It was only when I started seeing the lack of this in other systems that I understood why it is often so confusing. In short, each object had seven metadata fields describing this: Technical From, Technical To, Business From, Business To, Version, Invalid Object flag and Invalid Version flag. With these seven fields you knew exactly what was going on.

I will use an example to explain. There is a commission rule on a sales agent’s contract that says to apply 5% commission on the sale of a product. This rule was technically created (a user typing it into the front-end) on the 6th of July (technical validity). The rule, however, was supposed to be in effect from the 1st to the 31st of July (business validity). Later it was deleted (technically deleted on the 15th) and a new rule stating 10% commission, with the same effectivity, was created. The seven metadata fields listed above describe this situation quite easily.
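As a rough sketch (the values and the Python representation are my own illustration, not actual SAP structures), those seven fields could capture the whole story like this:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # a common "open-ended" convention

# The commission-rule example, described by the seven metadata fields.
commission_rules = [
    {   # 5% rule: captured on 6 July, meant to apply for the whole of July,
        # then technically ended on 15 July when the 10% rule replaced it.
        "rate": 0.05,
        "technical_from": date(2021, 7, 6),  "technical_to": date(2021, 7, 15),
        "business_from":  date(2021, 7, 1),  "business_to":  date(2021, 7, 31),
        "version": 1, "invalid_object": False, "invalid_version": True,
    },
    {   # 10% rule: captured on 15 July, same business effectivity, still current.
        "rate": 0.10,
        "technical_from": date(2021, 7, 15), "technical_to": HIGH_DATE,
        "business_from":  date(2021, 7, 1),  "business_to":  date(2021, 7, 31),
        "version": 2, "invalid_object": False, "invalid_version": False,
    },
]

def rate_as_of(rules, business_date, as_known_on):
    """Which rate applied on `business_date`, as the system knew it on `as_known_on`?"""
    for r in rules:
        if (r["business_from"] <= business_date <= r["business_to"]
                and r["technical_from"] <= as_known_on < r["technical_to"]
                and not r["invalid_object"]):
            return r["rate"]
    return None

print(rate_as_of(commission_rules, date(2021, 7, 10), as_known_on=date(2021, 7, 10)))  # 0.05
print(rate_as_of(commission_rules, date(2021, 7, 10), as_known_on=date(2021, 7, 20)))  # 0.10
```

The same business date gives two different answers depending on when you ask, and the metadata tells you exactly why.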

diagram by author

I am not going to debate whether the seven fields are worth it in every situation, but having that much information on the data cancelled out any possible confusion. It showed me what is needed to truly understand the context behind every data point.

It would benefit you (from a mental health perspective) if you could make peace with the fact that source systems will handle this differently. Even internally within one source system it could be handled differently.

I have also seen source systems start with the "Keep It Simple, Stupid" approach and then systematically add more technical dates, bespoke to each table. You gradually see this easy-to-read KISS model morph into an incoherent WTF model. Some source systems keep an active table with only the current records and store history in separate tables. Others have a mix.

So don’t try to fix the upstream systems; you will lose. Rather accept that different sources will present this differently, and that reading the data as is, or trying to join it as is, is rarely a simple task.

Grain

The grain of the data being reported on refers to the level of detail that is stored, and which associated objects are exposed in the data set. Using an insurance example: are premium-paid values stored at the level of client and benefit, or rolled up into a client-and-month view? (I.e. are you seeing the rolled-up premium amount for all your products, or your life policy as well as your cat’s pet insurance separately?)
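As a quick sketch (the clients, benefits and amounts are made up), here is the same premium data at those two grains:

```python
import pandas as pd

# Hypothetical premium transactions at the finest grain: client + benefit + month.
premiums = pd.DataFrame({
    "client":  ["Anna", "Anna", "Anna", "Ben"],
    "benefit": ["life policy", "pet insurance", "life policy", "life policy"],
    "month":   ["2021-06", "2021-06", "2021-07", "2021-06"],
    "premium": [350.0, 120.0, 350.0, 275.0],
})

# Grain 1: client + benefit + month (the cat's pet insurance is still visible on its own).
detail = premiums.groupby(["client", "benefit", "month"], as_index=False)["premium"].sum()

# Grain 2: rolled up to client + month (the benefit dimension is gone for good).
rolled_up = premiums.groupby(["client", "month"], as_index=False)["premium"].sum()

print(detail)
print(rolled_up)
```

Once data has been rolled up to the coarser grain, there is no way to recover the finer one, which is exactly why the grain has to be agreed before anything is built on top of it.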

Some of the common confusions when talking about grain are around products, sub-products and the groupings within them. The labels and hierarchical structures that exist for products, especially in the insurance and investment industry, can make this complicated. It becomes even more complicated as corporates acquire other companies with different ways of labelling and structuring the products in their stable, and the data sets have to be merged or housed together.

Semantics

You have to define the meaning of certain terms upfront. This sounds obvious and like common knowledge, right? Remember what I said earlier about common knowledge? This is often not done (and I include myself unashamedly). I think it is because people don’t want to feel or appear stupid by asking too many questions in order to establish this meaning.

Some time ago I was part of a team working on a Russian-language project, and I didn’t ask upfront what the differences were between territory, region, area and branch (perhaps because all the Russian versions of them sounded similar to a native English speaker; apologies, my Russian colleagues!). For a long time I kind of assumed they were just synonyms for the same thing. In fact, they weren’t. It turns out it was quite important in many of the business processes to understand the difference.

So, simple rule… don’t feel stupid. Ask often and early what the meaning of certain business terms is and have a central glossary. People will get over any initial assumptions that your questions are stupid (probably in like 10 seconds), OR they will feel less stupid because they also didn’t know.

Really, it is a win-win situation. I guarantee you there is another ‘stupid’ person in the room, immensely grateful that you asked the question he didn’t have the guts to put forth. (Now of course, don’t be silly and ask this question whilst in Russia, as it may be the last question you’ll ever ask.)

Here is a list of other groups of terms I see where the distinction is often not clarified:

  • Policy, plan, agreement, benefit, product
  • Party, client, customer, lead, prospect
  • Intermediary, broker, consultant, sales agent
  • Security, instrument, investment vehicle, investment portfolio

These are all terms you think you know the meaning of, but I guarantee you another organization or business unit uses the same terms in completely different contexts. Even people in your own team will have different opinions on what they mean.

Again, just as with validity dates, different sources will label things differently. Different sources will also handle the hierarchies and groupings differently. Therefore it is key to identify the business entities that need to be modeled and make sure there is agreement from the target consumers on what they mean.

This is where I am a big proponent of being able to work in an agile and fail-fast environment. The thing is, even though I said you should get agreement from the business, they won’t know until you show them the wrong thing.

So, perhaps I can change my advice slightly: Show them what you think it means upfront and let them tell you how wrong it is (and they will).


I started by saying context matters. The raw data needs to be presented in a context that makes sense to the consumer or consuming algorithm. An algorithm will assume the associations in its training data are correct. Someone still needs to decide how raw data from different sources should be joined, so that it carries the same semantic meaning, sits at the same grain and is associated with the same time validity.
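As a sketch of the kind of decisions that "someone" has to make (the two source extracts and their column names are hypothetical), conforming labels, grain and dates before joining might look like this:

```python
import pandas as pd

# Two hypothetical source extracts that disagree on labels, grain and date handling.
source_a = pd.DataFrame({          # daily grain, calls the entity "customer"
    "customer_id": [101, 101, 102],
    "txn_date":    ["2021-06-29", "2021-06-30", "2021-06-30"],
    "premium":     [10.0, 10.0, 25.0],
})
source_b = pd.DataFrame({          # monthly grain, calls the entity "party", uses business months
    "party_no":       ["101", "102"],
    "business_month": ["2021-06", "2021-06"],
    "commission":     [2.0, 5.0],
})

# Conform: agree on one entity name, one grain (client + month) and one date convention.
a = (source_a
     .assign(client_id=source_a["customer_id"].astype(str),
             month=pd.to_datetime(source_a["txn_date"]).dt.strftime("%Y-%m"))
     .groupby(["client_id", "month"], as_index=False)["premium"].sum())
b = source_b.rename(columns={"party_no": "client_id", "business_month": "month"})

# Only after conforming does the join carry the meaning we intend.
conformed = a.merge(b, on=["client_id", "month"], how="left")
print(conformed)
```

None of these steps is technically difficult; the difficulty is that every rename, roll-up and date convention is a semantic decision someone has to take and defend.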

To the trained eye it will sound a lot like I am advocating for what has historically been called a data warehouse. In the data lake era it has gone by many labels lately, such as the "curated" or "consumer" zone, alongside the emergence of the data lakehouse. I am not going to elaborate on that for the moment, as I am trying to reach both the trained and untrained eyes and more context would be required to tackle the topic holistically.

Organizations are collecting more and more data, either through the acquisition of other companies (hence little to no control over how upstream sources structure data) or by exploring different methods to collect data from their clients. This inevitably means that the BI environment often gets handed a hospital pass and instructed to make sense of the madness – to "create insights based on all of this rich information".

Some applications might not be as time dependent as others, or rather, the consequences of incorrect time integrity are not severe. If, however, you get the grain and associations wrong in, for example, an actuarial application, you might still produce a beautiful predictive model of where the next claims will come from. It will, though, be the ornamental plant of algorithms: pretty, but pretty useless.

We should be careful not to get lost in the hype of all these insights we are trying to create and remember that only coherent data can provide coherent insights.

So, let’s try this again: How many active clients did you have in the last three months?

