
Fuzzy data structures – curse or blessing?

Challenges and Solutions from an industrial application point of view


Photo by Joshua Sortino on Unsplash

Introduction

Some time ago, I came across an interesting article[1] that compared SQL and NoSQL data processing in Python and pointed out some drawbacks of SQL. One prominently discussed weakness of SQL was its inability to store complex object structures (e.g., Python dictionaries) in a single attribute, as doing so would violate the first normal form (1NF)[2]. In this context, the author briefly discussed the case of fuzzy data formats, which are technically possible with MongoDB. Since this issue was not explored further in that article, I wondered whether this feature of MongoDB is really a benefit or rather a weakness, especially in the context of industrial applications, which is where I work. As I could not find an article that comprehensively addressed this question, I decided to analyze the issue and share my findings!

After this short introduction, we will briefly discuss the characteristics of different database technologies. Then, we will assess the challenges of working with fuzzy data structures based on two simple collections in MongoDB. Next, we will transfer the findings from these simplified examples to a real-world industrial application in the semiconductor industry. Finally, we will take a closer look at tools and methods to overcome these challenges.

Technologies

First, the principles of data storage with MongoDB must be understood. MongoDB is typically classified as a NoSQL database[3]. However, NoSQL is an umbrella term for multiple technologies that not only differ from RDBMS but also differ significantly from each other. Well-known types of NoSQL database technologies are key-value stores (e.g., Amazon Dynamo), wide-column stores (e.g., Cassandra), graph databases (e.g., Neo4j) and document stores, to which MongoDB belongs. A big advantage of document-oriented databases over SQL is that data structures can be persisted and queried in the same format as in the software code that works with this data. The data format is JSON-like, semi-structured and hierarchical. According to AWS, it fits the requirements of catalogs, user profiles and content management systems where each document can have a unique format. Document-oriented databases provide flexible indexing as well as high-performance ad-hoc queries and analyses over collections[4]. A performance comparison between MongoDB and MySQL concluded that the write performance of MongoDB is significantly higher, especially for larger datasets, whereas SQL performs faster queries[5]. Hence, document stores are useful for scenarios that require frequent data inserts or updates and/or deal with larger data volumes. Such use cases also exist in industrial environments: for instance, collecting telemetry data from production equipment requires highly frequent inserts and, depending on the machine's complexity, may involve multiple variables that refer to particular sensors.
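To illustrate this point, here is a minimal sketch of persisting and querying a Python dictionary without any mapping layer. The connection string, database name and field names are my own assumptions, not from a specific system:

import pymongo

# Connection string and names are illustrative assumptions.
client = pymongo.MongoClient("mongodb://localhost:27017")
db = client.mytestDB

# A telemetry reading as a plain Python dict, including a nested structure.
reading = {
    "equipment": "E1",
    "timestamp": "2021-10-01T12:00:00Z",
    "sensors": {"pressure": 1.013, "temperature": 21.5},
}

# The dict is persisted as-is; no ETL or object-relational mapping needed.
db.telemetry.insert_one(reading)

# ... and queried in the same shape, e.g. by a nested sensor value.
doc = db.telemetry.find_one({"sensors.pressure": {"$gt": 1.0}})
print(doc["sensors"]["temperature"])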

Challenges

In order to assess the challenges, let's create a sample MongoDB database. The first scenario uses a collection consisting of three documents, as shown in Figure 1:

Figure 1: Sample Collection with Documents

The documents refer to persons and store some of their attributes, namely name, age, city and hobbies; the "_id" attribute is the default identifier of a document. At first glance, this looks like a straightforward data structure. But let's focus on the attribute "hobbies". As marked in the figure, this attribute exists in each document. However, its data type obviously differs between the documents: (1) an array, (2) an object and (3) a string. This kind of fuzziness is a real difference from relational databases, where the data type of a table attribute is strictly defined for all records.
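For reproducibility, the sample collection can be recreated as follows. The concrete values are hypothetical stand-ins, since only the structural difference matters:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
db = client.mytestDB

# Three documents whose "hobbies" field uses three different data types:
# (1) an array, (2) a nested object and (3) a plain string.
db.persons.insert_many([
    {"name": "Adam", "age": 30, "city": "Dresden",
     "hobbies": ["soccer", "reading"]},
    {"name": "Brian", "age": 25, "city": "Berlin",
     "hobbies": {"hobby1": "soccer", "hobby2": "chess"}},
    {"name": "Eve", "age": 28, "city": "Munich",
     "hobbies": "swimming"},
])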

So, we see that we can create flexible – or, put more negatively, indeterminable – data structures. Let's focus directly on an associated major challenge: MongoDB does not provide a search feature that is as flexible as the data structures it allows us to construct. What does this mean in practice? Let's take a closer look at the "hobbies" attribute values of the three documents.

Figure 2: More detailed view on the hobbies

If we wish to find all persons who have "soccer" as a hobby, we as human beings can easily see that there are two relevant persons in our collection. However, imagine we had thousands or millions of such documents, as on social media platforms; we would rather run a database query that searches for something like "all persons who have soccer as a hobby". This is technically not possible with this kind of fuzzy data structure, as MongoDB requires a unique path on which to apply the filter. Although wildcard search is possible and also applies to nested fields in an object, it cannot deal with different data structures[6]. In practice, this means we could search fields named "hobbies" at any level of a document's hierarchy – but the search does not consider our nested array or object, since the inner field name differs from "hobbies".

In addition, one could argue that searching like this is also semantically questionable, because it amounts to searching for the famous "needle in a haystack". So we can state that MongoDB provides extensive flexibility when we want to store data that is similar in content but different in structure, but as soon as we want to query or analyze this data, the fuzzy data structure limits us. Of course, we could move the query logic into external code, for instance in Python, and build a function that successively searches for hobbies as a string, hobbies in an array and hobbies in an object, and then returns the merged result. But this is a static solution, and as long as we do not "close the door" on other structural variants, we cannot ensure that the query will not miss relevant results.
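A minimal sketch of such a merging function could look like this, assuming the "persons" collection from above. Note that a single equality match already covers both the string and the array-element variant in MongoDB, while the object variant needs its own handling:

def find_by_hobby(db, hobby):
    # A static sketch: it covers the three structural variants seen so far,
    # but any new variant would silently be missed.
    results = []

    # Variants 1 and 2: a plain string or an array element. A simple
    # equality match covers both cases in MongoDB.
    results.extend(db.persons.find({"hobbies": hobby}))

    # Variant 3: a nested object with arbitrary keys. $objectToArray lets
    # us inspect the values without knowing the key names in advance.
    pipeline = [
        {"$match": {"hobbies": {"$type": "object"}}},
        {"$addFields": {"_vals": {"$objectToArray": "$hobbies"}}},
        {"$match": {"_vals.v": hobby}},
        {"$project": {"_vals": 0}},
    ]
    results.extend(db.persons.aggregate(pipeline))

    return results

Every additional structural variant needs its own branch in this function, which is exactly the maintenance problem described above.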

The second scenario related to fuzzy data structures concerns a collection that contains documents with different attributes. In our example, this is a set of persons described by different pieces of information. We now face the situation that Adam is described by his hair color, whereas Brian and Eve are described by their eye color. Even worse, the attribute "eye color" is spelled differently in the two documents.

Figure 3: Collection with documents that have different attributes

While we are now able to filter consistently by hobbies, we cannot straightforwardly search for persons with a particular hair or eye color. Again, our queries would potentially ignore documents that semantically fit the filter criteria. In addition, our analysis capabilities are limited, as we cannot be sure that the same information is available for all documents. Granted, this issue can also exist in SQL databases if fields in a table are not configured to be mandatory, but there we can at least rely on a fixed set of attributes. With MongoDB, however, the variety of attributes used across documents in the same collection is potentially endless as new documents are continuously inserted. Hence, a data analyst cannot even determine which attributes are available to work with.
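Continuing the sketch with assumed values (the concrete colors are hypothetical), a filter on one spelling silently misses the other:

# The second sample collection: consistent "hobbies", inconsistent
# person attributes with two spellings of the same information.
db.diffAttributes.insert_many([
    {"name": "Adam", "hobbies": "soccer", "hair color": "brown"},
    {"name": "Brian", "hobbies": "soccer", "eye color": "blue"},
    {"name": "Eve", "hobbies": "dancing", "eye colour": "blue"},
])

# This query returns only Brian; Eve is silently missed although she
# semantically matches the filter criterion.
for doc in db.diffAttributes.find({"eye color": "blue"}):
    print(doc["name"])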

Transferring the findings to an industrial application

We discussed several challenges in the previous section based on very simple examples. But are these merely theoretical constructs, or can they be real issues in industrial applications? To answer this question, let's take a closer look at the previously mentioned use case in which telemetry data is collected from machines. Figure 4 shows three machines E1 to E3 that belong to two different types T1 and T2. In addition, we see two types of interface definitions, ID1 and ID2. Via these interfaces, the equipment telemetry data is transferred to a central CIM database.

Figure 4: Schematic of an equipment data integration

In the semiconductor industry, we typically work with an interface standard called "SECS/GEM". An event message based on this standard has an object-like character similar to JSON. However, a benefit of SECS/GEM over JSON or XML is the reduced data overhead, which results in less network traffic and faster data processing[7]. Still, a raw event message is not compatible with a relational table, as it contains non-atomic information that violates 1NF. So, storing an equipment event in a SQL database always requires an ETL process, whereas an event message could be stored directly in MongoDB as a document. As the structure of a SECS/GEM message body is defined by the standard, the risk discussed for scenario 1 is low in this application – as long as all equipment interfaces follow this standard (e.g., E3 and E1). But semiconductor facilities may also suffer from legacy equipment that does not natively support any data integration. Upgrades to SECS/GEM may require the maker's support and tend to be costly. However, there are cost-efficient alternatives for enabling data integration on legacy equipment, such as smart sensors, retrofit kits and edge gateways[8]. With different interface definitions, we run the potential risk of different data structures wherever the SECS/GEM message format is not followed or the data is not transformed before storage. So, from an industrial application perspective, the challenges of scenario 1 are theoretically possible in our use case but can be prevented if experienced engineers foresee them. The ability to store the same information in different formats is not a benefit in this practical use case.

Let's move on to scenario 2: in the semiconductor industry, we typically work with different machine types provided by different equipment makers (see E1 and E2 vs. E3). Most of these machines have internal software and built-in sensors. Most likely, the set of available sensors, but also the internal technical names of similar sensors, differ between machines and makers, even though they refer to the same type of information, e.g., pressure or temperature. Without data harmonization between the equipment interfaces, this situation leads to the problem sketched in scenario 2: a) we cannot consistently filter by or analyze sensor values across the machine park if individual machines do not provide that information, and b) we have to deal with sensor names that refer to the same information but with different spellings. We can conclude that these risks are valid for our industrial use case. However, we can also infer that the capabilities of MongoDB fit the requirements of storing different message contents and flexibly adding data from newly implemented sensors.
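One common preventive pattern is to harmonize sensor names at the interface layer before storage. The following is a minimal sketch; the sensor names and mapping are purely illustrative assumptions:

# Hypothetical mapping from maker-specific sensor names to canonical ones.
SENSOR_NAME_MAP = {
    "ChbrPress": "chamber_pressure",
    "CH_PRESSURE": "chamber_pressure",
    "TempPV": "process_temperature",
}

def harmonize(raw_message: dict) -> dict:
    # Rename maker-specific sensor keys to canonical names before storage;
    # unknown keys are kept as-is.
    return {SENSOR_NAME_MAP.get(key, key): value
            for key, value in raw_message.items()}

# Two machines report the same physical quantity under different names ...
print(harmonize({"ChbrPress": 1.2}))    # {'chamber_pressure': 1.2}
print(harmonize({"CH_PRESSURE": 1.3}))  # {'chamber_pressure': 1.3}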

How to address these challenges

Fortunately, there are multiple techniques to overcome the challenges discussed above. There are probably more approaches, and I do not claim that the following list is complete, but I want to provide an overview of the range of techniques, from reactive consistency validation to truly preventive measures.

1) Check Field Existence

As a data analyst, you may want to check whether particular fields exist in a collection as you expect. This protects you from invalid filters or aggregations. MongoDB provides the "$exists" operator, which can be used in a query to return only documents where a specific field is present, or only those where it is missing[9]. Getting results for both filters indicates that the field must be treated with care.
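In PyMongo, such a check could look like this, reusing the "diffAttributes" collection from scenario 2:

import pymongo

db = pymongo.MongoClient("mongodb://localhost:27017").mytestDB

# Count documents where the field is present ...
with_field = db.diffAttributes.count_documents({"eye color": {"$exists": True}})
# ... and where it is missing.
without_field = db.diffAttributes.count_documents({"eye color": {"$exists": False}})

# Non-zero counts on both sides mean the field is not consistently available.
print(with_field, without_field)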

2) Retrieve all Fields from a Collection

If you want to apply a systematic or even regular comparison to identify missing fields, checking field existence as previously discussed is probably not the right approach. You rather want to extract a list of all fields that exist in your collection. This can be achieved using the map-reduce approach. The following Python code shows a sample implementation using the PyMongo API (note that the map_reduce helper was removed in PyMongo 4.0, so this requires an older driver version):

import pymongo
from bson.code import Code

# Connect to the local MongoDB instance used in the examples.
client = pymongo.MongoClient("mongodb://localhost:27017")
db = client.mytestDB

# Map step: emit every field name of every document as a key.
map_fn = Code("function () {"
              "  for (var key in this) { emit(key, null); }"
              "}")

# Reduce step: the values do not matter, only the unique keys do.
reduce_fn = Code("function (key, values) { return null; }")

# Run map-reduce over the collection; the unique field names end up
# as the _id values of the output collection "r".
result = db.diffAttributes.map_reduce(map_fn, reduce_fn, "r")

for doc in result.find():
    print(doc["_id"])

By executing this code, we receive the following printed list of unique fields from the assessed collection, which is called "diffAttributes" in this example. Of course, we could instead persist the list in a document or a database, or implement further checks directly in the Python code.

name
_id
eye colour
eye color
hobbies
hair color
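
Note that map-reduce has been deprecated since MongoDB 5.0, so on newer setups the same field list can be obtained with an aggregation pipeline. A minimal sketch, reusing the db handle from above:

# $objectToArray converts each document into an array of {k, v} pairs,
# so the distinct field names can be collected by a simple group stage.
pipeline = [
    {"$project": {"fields": {"$objectToArray": "$$ROOT"}}},
    {"$unwind": "$fields"},
    {"$group": {"_id": "$fields.k"}},
]
for doc in db.diffAttributes.aggregate(pipeline):
    print(doc["_id"])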

3) Check the Schema

If you use MongoDB Compass for database management, you can simply analyze your database schema across all documents in a collection. With a result like the one shown in Figure 5, you can easily assess the shape of the fields in a collection. For instance, we see that three data types are used for the field "hobbies".

Figure 5: Result of a Schema analysis with MongoDB Compass

4) Force a schema standard

From our discussion in this article, one could get the impression that MongoDB is built on anarchy. This is surely not the case. Although there are multiple ways to store complex data structures flexibly, we can reduce this flexibility to make our collection contents determinable. Just like with SQL databases, you can define validation rules that state, for instance, which fields are mandatory, which data types must be adhered to or which values are allowed[10]. For instance, as shown in Figure 6, we can state that our "hobbies" attribute only allows string values. When this validation rule is executed, MongoDB Compass shows which documents are valid and which are not. In addition, we can configure MongoDB to reject newly inserted documents that violate this rule. Hence, we can keep our collection consistent from scratch. The only limitation is that this feature is not available if MongoDB Compass is connected to a data lake.

Figure 6: Validation Rules
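
The same kind of rule can also be set programmatically via the collMod command. A minimal sketch, assuming the db handle from the earlier snippets and the "persons" collection from scenario 1:

# Restrict "hobbies" to plain string values.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "properties": {
            "hobbies": {
                "bsonType": "string",
                "description": "hobbies must be a plain string",
            }
        },
    }
}

# validationAction="error" makes MongoDB reject violating inserts.
db.command("collMod", "persons", validator=validator, validationAction="error")

# This insert would now fail with a WriteError (document failed validation):
# db.persons.insert_one({"name": "Carl", "hobbies": ["soccer"]})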

5) Provide a User Interface (UI)

If the data you want to store in MongoDB comes directly from user entries, you can reduce data structure fuzziness by building a mature UI. This is similar to the approach of having a standardized equipment interface where the data is generated automatically. You may configure dropdown boxes to enforce allowed values, make text boxes mandatory or even provide workflows to collect data in a certain order. This ensures that only data adhering to the allowed formats is submitted to the database.

6) Apply Data Governance (DG)

It sounds trivial, but knowing whether your data is consistent first of all requires that rules for a specific data object are defined and aligned between the stakeholders. If there has never been a decision on whether "hobbies" should be an array, or which spelling must be used for "eye color", you cannot technically resolve the issue in your collection. This is where DG comes into play. With clear responsibilities and decision processes for data, you gain multiple benefits such as faster IT project implementation, improved data quality and increased value of the data[11]. If all IT implementation activities follow the decisions from DG, we ensure data consistency by design.

Conclusions

In this article, we assessed the challenges and risks that may occur when allowing fuzzy data structures in a document store. We analyzed and evaluated them based on two simplified scenarios implemented in MongoDB and transferred the findings to a selected industrial use case. We showed that there are situations in industrial applications where the flexibility of MongoDB can be useful, but also which drawbacks may exist. Finally, we discussed techniques to overcome these challenges, ranging from reactive approaches (e.g., field existence checks) to preventive measures (e.g., data governance).

References

[1] M. Sosna, A Hands-On Demo of SQL vs. NoSQL Databases in Python – Impress your friends with SQLAlchemy and PyMongo (2021), https://towardsdatascience.com/a-hands-on-demo-of-sql-vs-nosql-databases-in-python-eeb955bba4aa

[2] Wikipedia, First normal form (n.d.), https://en.wikipedia.org/wiki/First_normal_form

[3] L. Schaefer, What is NoSQL? (n.d.), https://www.mongodb.com/nosql-explained

[4] Amazon, What Is a Document Database? (n.d.), https://aws.amazon.com/de/nosql/document/

[5] M. Shah, MongoDB vs MySQL: A Comparative Study on Databases (2017), https://www.simform.com/mongodb-vs-mysql-databases/#:~:text=MongoDB%20vs%20MySQL%3A%20Performance%20%26%20Speed,is%20more%20sensitive%20to%20workload

[6] MongoDB, Path Construction (n.d.), https://docs.atlas.mongodb.com/reference/atlas-search/path-construction/#std-label-ref-path

[7] B. Grey, SECS/GEM series: Protocol Layer (2018), https://www.cimetrix.com/blog/secs-gem-series-protocol-layer

[8] I. Lamont, 4 ways to bring legacy manufacturing equipment into the IoT age (2019), https://www.hpe.com/us/en/insights/articles/4-ways-to-bring-legacy-manufacturing-equipment-into-the-iot-age-1903.html

[9] MongoDB, Query for Null or Missing Fields (n.d.), https://docs.mongodb.com/manual/tutorial/query-for-null-fields/

[10] MongoDB, Set Validation Rules for Your Schema (n.d.), https://docs.mongodb.com/compass/current/validation/

[11] CDQ, Data Governance (n.d.), https://www.cc-cdq.ch/data-governance

