Big Data Ingestion

Introduction
Do you use navigation software to get from one place to another? Did you buy a book on Amazon? Did you watch "Stranger Things" on Netflix? Did you look for a funny video on YouTube?
If you answered yes to any of these questions, congratulations! You are a Big Data producer. In fact, even if you did not answer "yes" to any of my questions, you’re probably still contributing to big data – in today’s world, where each of us has at least one smartphone, laptop, smartwatch, smart car system, robotic vacuum cleaner, or similar device, we produce a lot of data in daily activities that seem trivial to us.
What is Big Data?
When we say Big Data, we usually mean data that is particularly large, arrives quickly, and comes in many different structures, so that it is difficult or impossible to analyze with traditional tools.
It is common to define the concept of Big Data with the "three Vs":
- Volume – the size of the data
- Velocity – the speed of data gathering
- Variety – the different types of data
To address the complex problems that the world of big data raises, a typical architecture is divided into five layers:
- Data Sources
- Data Ingestion
- Data Storage
- Data Processing – preparation and training
- Serve

Data Ingestion
Data Ingestion is the first layer in the Big Data Architecture – the layer responsible for collecting data from various data sources—IoT devices, data lakes, databases, and SaaS applications—into a target data warehouse. This is a critical point in the process, because at this stage the size and complexity of the data can be understood, which will affect the architecture and every decision we make down the road.
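As a minimal sketch of what this layer does, the snippet below parses raw JSON events from a hypothetical source and loads them into an in-memory SQLite database, which stands in here for the target data warehouse. The event fields, table name, and function names are all illustrative, not from any real system.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from an IoT source.
RAW_EVENTS = [
    '{"device_id": "sensor-1", "temperature": 21.5}',
    '{"device_id": "sensor-2", "temperature": 19.8}',
]

def ingest(raw_events, connection):
    """Parse raw JSON events and load them into a warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temperature REAL)"
    )
    for raw in raw_events:
        event = json.loads(raw)  # extract + light parsing
        connection.execute(
            "INSERT INTO readings VALUES (?, ?)",
            (event["device_id"], event["temperature"]),
        )
    connection.commit()

# SQLite stands in for the target data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
ingest(RAW_EVENTS, conn)
rows = conn.execute("SELECT device_id, temperature FROM readings").fetchall()
print(rows)  # → [('sensor-1', 21.5), ('sensor-2', 19.8)]
```

A real pipeline would replace the hard-coded list with a connector to the source system, but the shape – extract, parse, load – stays the same.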

Why do we need a data ingestion layer?
- Availability – The data is available to everyone who needs it: BI analysts, developers, salespeople, and anyone else in the company can access it.
- Uniformity – a quality data ingestion process can turn different types of data into a unified format that is easy to read and to run statistics and manipulations on.
- Save money and time – a data ingestion process saves engineers the time they would otherwise spend hunting down the data they need, letting them focus on development instead.
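The uniformity point above can be sketched in a few lines: two hypothetical sources report the same kind of fact in different shapes, and per-source normalizers map both into one unified schema. All field names here are invented for illustration.

```python
# Two hypothetical sources describing the same fact in different shapes.
crm_record = {"customer": "Dana", "purchase_total": "49.90"}   # SaaS app export
pos_record = {"name": "Omri", "total_cents": 12990}            # point-of-sale DB

def normalize_crm(record):
    """Map a CRM-style record into the unified schema."""
    return {"customer_name": record["customer"],
            "total": float(record["purchase_total"])}

def normalize_pos(record):
    """Map a point-of-sale record into the same unified schema."""
    return {"customer_name": record["name"],
            "total": record["total_cents"] / 100}

unified = [normalize_crm(crm_record), normalize_pos(pos_record)]
print(unified)
```

Once every source passes through such a normalizer, analysts can query one consistent shape instead of learning each source's quirks.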
What are the challenges with data ingestion?
- Complexity – Writing data ingestion processes can be complex due to data velocity and variety, and development can therefore be costly in time and resources.
- Data security – When transferring data from one place to another, there is a risk of exposing sensitive data.
- Unreliability – During the process, the reliability of the data may be compromised, rendering the data worthless or, in the worst case, leading to incorrect decisions based on inaccurate data.
Data ingestion types
There are two common ways to do data ingestion, and the choice depends on the product’s needs – is it important to collect data in real time, or can it be collected periodically on a schedule?
Real-time data ingestion
This is the process of collecting and processing data from the various data sources in real time – also known as streaming. We use this approach when the collected data is time-sensitive.
For example – the data coming from oil tanks’ sensors will be critical in cases of leakage.
In real-time cases, the velocity of the data is high, so the solution will include a queue to avoid losing events. The data is extracted, processed, and saved as fast as possible.
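The queue-based approach described above can be sketched with Python's standard library: a producer thread pushes events onto a queue so bursts are buffered rather than lost, while a consumer thread processes each event as soon as it arrives. The sensor readings and the alert threshold are invented for illustration.

```python
import queue
import threading

events = queue.Queue()   # buffer so bursts of events are not lost
processed = []

def producer():
    # Hypothetical oil-tank sensor readings arriving as a fast stream.
    for level in [0.1, 0.2, 5.0]:
        events.put({"tank": "A", "leak_rate": level})
    events.put(None)     # sentinel: no more events

def consumer():
    while True:
        event = events.get()
        if event is None:
            break
        # Process and store immediately; flag time-sensitive anomalies.
        event["alert"] = event["leak_rate"] > 1.0
        processed.append(event)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(processed)
```

In production, the in-process queue would typically be replaced by a dedicated message broker, but the decoupling of producers from consumers is the same idea.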
Batch data ingestion
Batch data ingestion means that data is moved from the data sources to the data target at scheduled intervals.
Batch ingestion is useful when companies need to collect data on a daily basis.
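A scheduler (such as a nightly cron job) would normally trigger a batch run; the sketch below shows only the batching logic itself – accumulate records and load them to the target in fixed-size chunks. The function names and batch size are illustrative, and the list of loaded batches stands in for the target system.

```python
def batch_ingest(source_records, load_batch, batch_size=100):
    """Accumulate records and load them to the target in batches."""
    batch = []
    for record in source_records:
        batch.append(record)
        if len(batch) >= batch_size:
            load_batch(batch)
            batch = []
    if batch:                      # flush the final partial batch
        load_batch(batch)

# A plain list stands in for the target warehouse's bulk-load call.
loaded_batches = []
batch_ingest(range(250), loaded_batches.append, batch_size=100)
print([len(b) for b in loaded_batches])  # → [100, 100, 50]
```

Loading in chunks rather than row by row is what makes batch ingestion cheap: the target system sees a few large writes instead of many small ones.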
Conclusion
Although data ingestion is not a simple process to implement, and building and maintaining its infrastructure can be costly over time, a well-written data ingestion process can help a company make decisions and improve business processes.
In addition, this process makes it easier to work with a large number of data sources and allows easy access for engineers and analysts alike.