Big Data Management: Data Repository Strategies and Data Warehouses

Published in Towards Data Science, May 7, 2021 (9 min read)

Introduction

Managing large volumes of structured and unstructured data is crucial to the success of any company that needs systematic organization and governance to keep its data high quality and suitable for analytics and business intelligence applications. Although the key aspects of big data are often summarized as the popular 3 Vs of Volume, Velocity, and Variety, there are other key questions that every company needs to ask when choosing how to store and transform its data.

Big Data Aspects

Volume: How big is the incoming data stream and how much storage is needed?
Velocity: How fast is the data generated, and how quickly does it need to be accessed?
Variety: In what format does the data need to be stored? Structured, such as tables, or unstructured, such as text and images?
Value: What value is derived from storing all the data?
Veracity: How trustworthy are the data source, the data type, and its processing?
Viscosity: How does the data flow through the stream, how much resistance does it meet, and how easily can it be processed?
Virality: How readily can the data be distributed over networks, and what is its dispersion rate across users?

Big Data Management

Due to the exponential growth of enterprise data stores, managing big data has become an increasingly challenging task. Cross-industry research has shown that, on average, organizations use less than half of their structured data and less than one percent of their unstructured data in decision-making analyses. Many organizations tend to keep as much data as possible, since there is no way to predict which data sources will be valuable in the future. [1]

Organizations often find outdated data or copies that conflict across systems, and with so many data sources available, implementing an efficient data management technique can be cumbersome. Data spread can leave records maintained in multiple locations, which introduces the risk of duplication and leads to increased management costs and inconsistent security policies.

Not having a data management strategy can lead to a lack of trust, missed opportunities, decreased customer satisfaction, non-compliance, and regulatory penalties. For these reasons, organizations adopt data management disciplines by investing in policies and big data tools that fit their needs.

Big data management is a broad term that includes data cleansing, integration, migration, preparation, enrichment, analytics, quality, management, reporting, governance, and planning. Depending on an enterprise’s needs, the focus and resources allocated to each of these processes can differ vastly.

Figure 1. From Data collection to Predictive Analytics

Data Repository Strategy

Data repositories can address data management challenges by centralizing data in one system with shared metadata and a single logical namespace. They keep a specific population of data isolated in one or more data storage entities so that it can be mined for business insights, reporting needs, or machine learning. The term is often used alongside data warehouse or data mart, and its main benefit is that isolating the data makes reporting and analytics easier.

An effective data repository strategy requires a coherent plan to unify, regulate, evaluate, and deploy large amounts of data resources. This enables better data management capabilities, which ultimately improve analytics and query performance.

The first step in defining a data repository strategy is to clarify the primary purpose of the organization’s data, which will guide its data management approach. A robust data strategy encompasses several elements. These include creating a data architecture that covers the entire enterprise, defines business needs, and prioritizes data quality and integration. It also enables accountability by defining standards for data retention and by reducing risk and complexity. [2]

When implementing a data strategy, companies face multiple approaches, and their decision often depends on available resources or the organization’s previous experience. Some strategies help organizations enforce guidelines governing data privacy and the integrity of data distributed through internal sources, while others focus more on supporting business decisions by creating rapid frameworks that provide quick real-time insights, predictive modeling, and interactive dashboards. Most companies require a balance of these two approaches and choose flexibility to succeed, but some put more emphasis on one, with appropriate trade-offs.

Figure 2. Key Factors When Deciding a Data Strategy

Data Repositories

Enterprise Data Warehouse

An Enterprise Data Warehouse (EDW) can be summarized as a subject-oriented database, or a collection of databases, that gathers data from multiple sources and applications into a centralized source ready for analytics and reporting. It stores and manages all the historical business data of an enterprise. [3] Organizing, transforming, and aggregating inputs from various data sources can save valuable time and management costs when building a data structure that is ready for Artificial Intelligence (AI).

This is where Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) approaches are often used, and distributed big data frameworks such as Hadoop or Apache Spark help organizations with heavy data cleansing and transformation.
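To make the ETL idea concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical sales export with the columns order_id, amount, and country, and uses an in-memory SQLite database as a stand-in for the warehouse. The table name, column names, and sample rows are illustrative assumptions, not part of any real system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (illustrative data only).
RAW_CSV = """order_id,amount,country
1, 19.99 ,us
2,,de
3,5.00,US
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, fix types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(amount), row["country"].strip().upper())
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three steps in order and return the warehouse connection."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn
```

In an ELT variant, the raw rows would be loaded first and the cleaning would run inside the warehouse. At larger scale the same three steps would typically be expressed in a distributed framework rather than in a single process.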

The key difference between data warehouses and standard operational databases is that the latter are optimized to keep data precisely accurate at any given moment and to keep track of rapid data updates, while data warehouses provide a broad view of the data over time. Although data warehouses are a popular tool for managing big data, they can become expensive when an organization needs to scale them, and they do not perform well when handling unstructured or complex data formats.

The architectural complexity of EDWs offers many benefits to an organization:

Integrating multiple data sources in a single database for single queries
Maintaining data history, improving data quality, and keeping data consistency
Providing a central view of multiple source systems across the enterprise
Restructuring data for fast performance on complex queries

Data Marts

While a data warehouse (DW) can deal effectively with large data sets, real-time artificial intelligence and data analysis for different subsets of business operations require the use of data marts (DMs). A DM can be considered a scaled-down version of a DW with a more limited scope, or a logical subset of one, that aims to meet the information needs of a specific group of end users in a business unit or department. DMs usually provide aggregated data for focused content or customized decision support. They come in dependent and independent forms: the former is populated from an EDW, and the latter takes its data directly from an Operational Data Store (ODS).

DMs reduce the load of queries, transformations, and heavy network usage on other data sources in the organization and give end users a customized store with more access and control. DMs can also introduce inherent problems such as information siloing and limited user access.
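As a rough illustration of a dependent data mart, the sketch below derives an aggregated, department-specific table from a toy warehouse table. It uses an in-memory SQLite database; the orders table, the marketing_mart table, and the sample rows are hypothetical names and data chosen only for this example.

```python
import sqlite3


def build_warehouse():
    """Create a toy warehouse table holding orders for every department."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (department TEXT, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("marketing", "EU", 100.0),
            ("marketing", "EU", 50.0),
            ("marketing", "US", 25.0),
            ("finance", "EU", 400.0),
        ],
    )
    conn.commit()
    return conn


def build_marketing_mart(conn):
    """Dependent data mart: an aggregated subset of the warehouse for one team."""
    conn.execute(
        """
        CREATE TABLE marketing_mart AS
        SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
        FROM orders
        WHERE department = 'marketing'
        GROUP BY region
        """
    )
    return conn.execute(
        "SELECT region, total_amount, order_count "
        "FROM marketing_mart ORDER BY region"
    ).fetchall()
```

Queries from the marketing team then hit the small pre-aggregated table instead of the full warehouse. The trade-off is the siloing risk mentioned above: the mart no longer sees the finance rows at all.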

Data Lakes

Data Lakes (DLs) are another type of data repository, with the key difference that data is stored in its raw, native format without any transformation. The data can be structured or unstructured, which makes DLs a good fit for bulk data such as server logs, clickstreams, social media, or sensor data.

The data is simply stored in the repository without knowing what type of analysis will be done or whether it will ever be used in an analysis. In turn, this requires a lot of preprocessing when the data is needed for business insights.
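This "store now, interpret later" behavior is often described as schema-on-read. The sketch below is a simplified illustration of that idea, with a plain Python list standing in for the lake. The event fields (type, user, page) and the function names are assumptions made for the example.

```python
import json


def ingest(lake, raw_line):
    """Store the raw record untouched; no schema is enforced on write."""
    lake.append(raw_line)


def read_clicks(lake):
    """Schema-on-read: parse and filter only when an analysis needs the data."""
    clicks = []
    for line in lake:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable records stay in the lake but are skipped here
        if isinstance(event, dict) and event.get("type") == "click":
            clicks.append({"user": event.get("user"), "page": event.get("page")})
    return clicks
```

Ingestion is trivial because nothing is validated, while every reader has to carry its own parsing and filtering logic. That is the preprocessing cost described above.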

Figure 4. Simple Representation of Data Lake

DLs have lower storage costs due to their more open-source nature and undefined structure. They can be set up in an organization’s data center with in-house management or in the cloud services of vendors such as Amazon, Microsoft, or Google.

While DWs are targeted at decision makers, with transformations already in place, DLs require specialized data scientists to preprocess and analyze the data. In return, they can improve customer interactions, drive R&D innovation, and increase operational efficiency.

Transactional Stores

Transactional data stores (TSs) are optimized for row-based operations such as reading and writing individual records while maintaining data integrity. Although they are not specifically built for analytics, their long-standing place in production environments means they can also be used for analytic queries and low-latency information monitoring.

TSs are ACID (atomicity, consistency, isolation, durability) compliant, meaning they guarantee data validity despite errors and ensure that data does not become corrupted by a failure. This is crucial to business use cases that require a high level of data integrity, such as banking transactions.
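The banking case can be sketched with SQLite, which supports transactions out of the box. In this minimal example, a transfer consists of two updates that either both commit or both roll back. The accounts table, the account names, and the non-negative balance rule are hypothetical choices for illustration.

```python
import sqlite3


def open_bank():
    """Create a toy accounts table whose balances may never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts "
        "(name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit together or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back along with the debit


def balances(conn):
    """Return the current balances as a name-to-amount mapping."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

When a transfer would overdraw the source account, the constraint rejects the debit and the earlier credit is undone as well, so the table never shows money that appeared from nowhere.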

TSs are designed to run in production systems, and due to their row-based, low-latency nature they can run operations or queries that need to be nearly in sync with the master database. While DWs, due to their column-based nature, are optimized for reading data, TSs perform better at writing. This might not matter much for companies with small volumes of data, but as data volume grows, the difference can influence the choice of data strategy.
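The read versus write trade-off between row-based and column-based layouts can be illustrated with plain Python data structures. This is a deliberately simplified sketch: real storage engines add compression, indexing, and on-disk layout concerns, and the record fields used here are made up for the example.

```python
def make_row_store():
    """Row layout: each record is kept together."""
    return [
        {"id": 1, "region": "EU", "amount": 100.0},
        {"id": 2, "region": "US", "amount": 25.0},
        {"id": 3, "region": "EU", "amount": 50.0},
    ]


def make_column_store():
    """Column layout: each attribute is kept together."""
    return {
        "id": [1, 2, 3],
        "region": ["EU", "US", "EU"],
        "amount": [100.0, 25.0, 50.0],
    }


def append_to_rows(rows, record):
    """A write touches a single record."""
    rows.append(record)


def append_to_columns(columns, record):
    """The same write has to touch every column."""
    for name, values in columns.items():
        values.append(record[name])


def total_from_rows(rows):
    """An aggregate has to visit every full record."""
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    """An aggregate reads only the one column it needs."""
    return sum(columns["amount"])
```

Both layouts return the same total, but the columnar version reads one list while the row version walks every record. Appending a record is the reverse: one operation for rows, one per column for the columnar layout.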

Operational Data Stores

An operational data store (ODS) is another way to mitigate the challenge of querying up-to-date data from DWs and can be considered a staging area that provides query capabilities. An ODS can provide fine-grained, non-aggregated data that is closer to real time, because the data is received before heavy transformation and loading operations, which takes the burden off transactional systems. ODSs are used for operational reporting and as a complementary element to EDWs.

Their general purpose is to integrate data from different sources into a single structure by cleaning data, resolving redundancies, and establishing business rules. An ODS can be a key component of an EDW and, due to its multi-purpose structure, enables both transactional and decision support processing. Data stored in an ODS is transaction oriented and smaller in size than in a DW. [5]
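The integration step can be sketched as follows: records from several source systems are cleaned, matched on a common key, and merged under a business rule. The field names (email, name, updated) and the rule that the most recent record wins are assumptions made for this example, not something prescribed by any particular ODS product.

```python
def normalize(record):
    """Cleaning step: trim whitespace and lowercase the e-mail used as match key."""
    return {
        "email": record["email"].strip().lower(),
        "name": record["name"].strip(),
        "updated": record["updated"],  # ISO dates compare correctly as strings
    }


def integrate(*sources):
    """Merge sources into one structure, one record per e-mail address."""
    merged = {}
    for source in sources:
        for raw in source:
            rec = normalize(raw)
            current = merged.get(rec["email"])
            # Business rule (assumed): the most recently updated record wins.
            if current is None or rec["updated"] > current["updated"]:
                merged[rec["email"]] = rec
    return merged
```

For example, a customer who appears in both a CRM export and a billing export under differently formatted e-mail addresses ends up as a single record, taken from whichever system updated it last.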

Conclusion

Big data management is a necessity for every company. It improves customer understanding and innovation in developing new products, while enabling major financial and business decisions based on the analysis of large amounts of data for every department. Establishing a data strategy requires defining the problem and understanding the business needs of each company in order to improve its data systems and source management.

Although not all companies need to worry about big data management from the start, it becomes a requirement when traditional databases no longer perform well enough and cannot provide the benefits of big data repositories. This usually becomes apparent when competitive advantage, innovation, revenue growth, and client acquisition all reach a plateau.

It is worth noting that each data repository comes with its own disadvantages. Some companies use data lakes to store all their data without effectively extracting information for each department, which undermines their business strategy. Dumping data into a data warehouse without any goals leads to high management costs, losing track of what is stored, and failing to take advantage of the newly established resources.

In most cases a data strategy will not provide business value overnight; it is a gradual improvement that needs small steps at every stage, guided by feedback and evaluation. A data repository does not guarantee the success of a company’s data strategy; however, it does reduce the likelihood of common failure scenarios and the excessive cost and time spent extracting value from data, and it positions a company for future innovation.

References

[1] Leandro DalleMule and Thomas H. Davenport, “What’s Your Data Strategy? The key is to balance offense and defense,” May–June 2017.
[2] McKinsey and Company, “Getting your data house in order,” 2018.
[3] “Enterprise Data Warehouse: Concepts, Architecture, and Components,” 2019.
[4] “The growing importance of big data quality,” The Data Roundtable. Retrieved 1 June 2020.
[5] William Inmon, Building the Operational Data Store (2nd ed.), 1999.