Knowledge Data Science with Semantics Technologies.

An introduction to the (possible) future of data science.

Favio Vázquez
Towards Data Science

Illustration by Héizel Vázquez

Welcome to a new series on data science. Here I’ll introduce some concepts and definitions that will guide our study. To understand this article, I recommend that you read these other articles I’ve written in the past:

I will try to define a new beginning for our field that I’m calling:

Knowledge Data Science

I’m taking the name from the field of Knowledge Engineering, but only the name; the definition will be a little different. There’s another similar approach to this that you can read here. Let me start by recalling my definition of data science:

Data Science is the resolution to Business / Organizations problems through mathematics, programming and the scientific method that involves the creation of hypotheses, experiments and tests through the analysis of data and the generation of predictive models. It is responsible for transforming these problems into well-posed questions that can also respond to the initial hypothesis in a creative way. It must also include the effective communication of the results obtained and how the solution adds value to the Business / Organization.

And here’s the addition we need to talk about Knowledge Data Science:

Knowledge Data Science is the resolution to Business / Organizations problems through mathematics, programming, the scientific method and semantic technologies that involves the creation of hypotheses, experiments and tests through the analysis of data and the generation of predictive models inside of a knowledge representation system. It is responsible for transforming these problems into well-posed questions that can also respond to the initial hypothesis in a creative way. It must also include the effective communication of the results obtained and how the solution adds value to the Business / Organization.

So the main difference here is the addition of Semantic Technologies and the concept of a Knowledge Representation System. And in this article, I’ll describe those.

Some of you, while reading this, may think: I’m just getting started in this new field, and now there’s another one? Or: Why do we need even more definitions of data science? Why not just focus on solving business problems, and that’s it?

And I hear you. But let me tell you something. Some people in the field of data science need to talk about the theoretical parts of our domain; it’s the only way we can systematize its study and make it easy for others to get into it. And why am I adding a new part to it (the semantics)? Because it’s time we start understanding the wonders it can bring to our field and the way we work. This may not seem evident in the beginning, but hopefully, by the end of this series, you’ll see it.

Semantic Technologies

Read this for more context.

The word semantic itself implies meaning or understanding. As such, the semantic technologies we are going to discuss here deal with the meaning of data rather than its structure.

When we understand something, we decode the parts that form a complicated thing and transform the raw data we started with into something useful and straightforward to see. We do this by modeling. And as you can imagine, we need such models to understand the meaning of data.

We usually do this by creating something called a knowledge graph, which depends on us linking data. The key here is that instead of looking for possible answers, under this new model, we’re seeking an answer. We want the facts; where those facts come from is less important.

The data here can represent concepts, objects, things, people, and actually whatever you have in mind. The graph fills in the relationships, the connections between the concepts.
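To make the idea of linked data a bit more concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-made graph stored as (subject, predicate, object) triples. The entity names, the predicates, and the `match` helper are all hypothetical and invented for illustration; a real semantic stack would typically use a triple store and a graph query language instead.

```python
from typing import List, Optional, Tuple

# A toy knowledge graph: each fact is a (subject, predicate, object) triple.
# Every name below is made up for illustration.
Triple = Tuple[str, str, str]

TRIPLES: List[Triple] = [
    ("Alice", "works_at", "Acme"),
    ("Bob", "works_at", "Acme"),
    ("Acme", "located_in", "Mexico City"),
    ("Alice", "knows", "Bob"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return every triple that agrees with the given pattern.

    A field left as None acts as a wildcard, similar in spirit to a
    variable in a graph query pattern.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Ask a factual question: who works at Acme?
employees = [s for (s, _, _) in match(TRIPLES, predicate="works_at", obj="Acme")]
print(employees)
```

The concepts ("Alice", "Acme") are the nodes, and the predicates ("works_at", "located_in") are the relationships that the graph fills in between them.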

So we can say that semantic technologies are those that use the concepts of semantics, ontologies, linked data, and knowledge graphs to help us understand the meaning of the data we possess. There are great examples out there of such technologies, like:

And more. Some of them are complete solutions to the problem, while others are focused on transactional (OLTP) graph databases. Something like AnzoGraph by Cambridge Semantics, in contrast, is an analytical (OLAP) graph database. I spoke about the difference in an article a while ago:

And why should all of this be important to you? Because people don’t think in tables (like in a traditional RDBMS), but they do immediately understand graphs. When you draw the structure of a knowledge graph on a whiteboard, it is obvious what it means to most people.

Also, repeating what I wrote before: I’m tired of writing extremely long SQL queries, or even NoSQL queries, to get simple data out of a database. This may be the solution for that and much more.
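As a rough illustration of why graphs can shorten this kind of question, consider "in which city does Alice work?". In a normalized relational schema that would typically take one join per hop; in a graph it amounts to following two relationships in a row. The sketch below uses plain Python, and the edge data and the `follow` helper are hypothetical, not the API of any particular graph database.

```python
from collections import defaultdict

# Hypothetical edges: (source, relationship, target).
EDGES = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "located_in", "Mexico City"),
    ("Bob", "works_at", "Globex"),
    ("Globex", "located_in", "Berlin"),
]

# Index edges by (source, relationship) so each hop is a dictionary lookup.
index = defaultdict(list)
for source, relationship, target in EDGES:
    index[(source, relationship)].append(target)


def follow(start, path):
    """Walk a sequence of relationships starting from one node.

    Returns every node reachable by following the relationships in order.
    """
    frontier = [start]
    for relationship in path:
        frontier = [
            target
            for node in frontier
            for target in index.get((node, relationship), [])
        ]
    return frontier


# Two hops: person -> company -> city.
print(follow("Alice", ["works_at", "located_in"]))
```

The question reads almost the way you would draw it on a whiteboard: start at a node and follow the arrows.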

Knowledge Representation Systems

The way we represent things in our mathematical world is fundamental. Most of the theoretical advances in AI, and particularly in machine and deep learning, come from better ways to represent systems and data, and from finding new and useful techniques to analyze them. Almost all the algorithms we have for such tasks rely on algebra, calculus, and statistics.

But in recent years, representing information as graphs has become more important. You can find out why in the other articles I listed at the beginning and in the rest of this article. The important part here is that we need a representation of our data that includes not only the data itself but also the interactions within it as first-class citizens.

That’s what knowledge representation systems give us: a way to represent our data and its relationships effortlessly. As I mentioned before:

Whereas relational databases store highly-structured data in tables with pre-determined columns and rows, graph databases can map multiple types of relational and complex data. Thus, graph databases are not rigid in their organization and structure, as relational databases are. All relationships are natively stored within the vertices of the edges, meaning that the vertices and edges can each have properties associated with them. This structure allows for a database that can depict complex relationships between unrelated data sets.
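A minimal sketch of that idea, assuming a simple in-memory property graph: both vertices and edges carry their own properties, so the relationship itself can hold data. The class names, labels, and sample values are hypothetical and only meant to illustrate the structure; they do not mirror any specific graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.id] = vertex

    def add_edge(self, edge: Edge) -> None:
        # Both endpoints must exist before they can be connected.
        if edge.source not in self.vertices or edge.target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def outgoing(self, vertex_id: str, label: str) -> List[Edge]:
        """Return the edges with the given label that leave a vertex."""
        return [e for e in self.edges if e.source == vertex_id and e.label == label]


graph = PropertyGraph()
graph.add_vertex(Vertex("c1", "Customer", {"name": "Alice"}))
graph.add_vertex(Vertex("p1", "Product", {"name": "Laptop", "price": 1200}))
# The relationship itself carries data: when it happened and how many units.
graph.add_edge(Edge("c1", "p1", "BOUGHT", {"date": "2019-11-02", "quantity": 1}))

for edge in graph.outgoing("c1", "BOUGHT"):
    product = graph.vertices[edge.target]
    print(product.properties["name"], edge.properties["quantity"])
```

Note that nothing forces every vertex to share the same set of properties, which is the flexibility the passage above contrasts with pre-determined table columns.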

We data scientists can find a lot of help in structuring the data we have in our organizations this way. We need to focus on solving problems, not on dealing with data. And as you may have heard before, dealing with and cleaning data is almost all we do. That has to end. We need to start thinking about building a data platform that allows us to solve problems and that can deal with the data in better ways.

An excellent example of a knowledge representation system is a Data Fabric, which I defined before as:

[…] the platform that supports all the data in the company. How it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create a uniform and unified data environment.

And with tools like Anzo by Cambridge Semantics, you have automatic query generation (yep, that’s a thing); running those queries against the complex graph makes extracting features easy and, eventually, fully automated. Here’s an example of how it would look with Anzo:

https://www.cambridgesemantics.com/product/

The start of this new field requires us to know more about semantics, ontologies, the semantic web, knowledge graphs, and all of that. I’ve written about that in the past; in the next part, I’ll go in-depth on those items and their connection to data science.

Data scientist, physicist and computer engineer. Love sharing ideas, thoughts and contributing to Open Source in Machine Learning and Deep Learning ;).