Intro
Time series use cases in general, and in the IoT domain in particular, are growing fast, so it's vital to select the right storage for each particular use case.
Nowadays, every other database engine or platform is marketed as time-series oriented, so let's dig deeper and find out which one best suits each particular need.
Problem Statement
In order to formalize the engine selection, let's clearly define the inputs and the criteria of success. As an input, let's consider a telemetry dataset.
As criteria of success we have:
- Coverage of functional requirements related to data querying/analytics, at different levels of advancement
- Coverage of non-functional requirements
Let’s define them more specifically.
Telemetry Dataset
As a continuation of the series of articles about IoT Data Analytics, let's use the Fitness Tracker use case, which represents a typical IoT use case well. The dataset (as also described [here](https://towardsdatascience.com/iot-data-analytics-part-2-data-model-3da9676cb449) and here) consists of a set of observations, and each observation contains:
- A metric name generated by a sensor/edge device, e.g.: heart rate, elevation, steps
- A metric value generated by the sensor, bound to a point in time, e.g.: (2020–11–12 17:14:07, 71bpm), (2020–11–12 17:14:32, 93bpm), etc
- Tags or Context describing the circumstances in which a given sensor is generating data, e.g.: device model, geographic location, user, activity type, etc.
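To make the observation structure concrete, here is a minimal Python sketch; the class and field names are illustrative, not part of the original dataset specification:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Observation:
    metric: str                               # metric name, e.g. "heart_rate"
    ts: datetime                              # point in time the value is bound to
    value: float                              # metric value, e.g. 71 (bpm)
    tags: dict = field(default_factory=dict)  # context: device model, region, user, activity

# one heart-rate data point with its context tags
obs = Observation("heart_rate", datetime(2020, 11, 12, 17, 14, 7), 71.0,
                  {"region": "EU", "activity": "running"})
```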
Functional Requirements
There are so many different possibilities around data analytics, so let's split them by level of advancement into basic, middle, and advanced.
Basic Level: Simple Data Retrieval
- Random data access: for a particular point in time, return the corresponding metric value
- Small range scans: for a particular time range (reasonably small, within minutes or hours depending on the frequency of data generation), return the sequential metric values (e.g.: to draw a standard chart)
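Both operations map naturally onto data kept sorted by timestamp. A minimal Python sketch, using an in-memory sorted list as a stand-in for the storage engine (names are illustrative):

```python
from bisect import bisect_left, bisect_right

# (timestamp, value) pairs kept sorted by timestamp,
# mimicking how a sorted key-value store lays out a series
series = [(1, 70.0), (5, 71.0), (9, 93.0), (12, 88.0)]

def point_lookup(series, ts):
    """Random access: return the value stored at exactly `ts`, or None."""
    i = bisect_left(series, (ts,))
    if i < len(series) and series[i][0] == ts:
        return series[i][1]
    return None

def range_scan(series, start, end):
    """Small range scan: sequential values for start <= ts <= end."""
    lo = bisect_left(series, (start,))
    hi = bisect_right(series, (end, float("inf")))
    return series[lo:hi]
```

Because the data is sorted, both operations are logarithmic seeks followed by a sequential read, which is exactly what sorted NoSQL stores are good at.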
Middle Level: Time Window Normalization
Measurement events are usually supposed to fire on a predefined recurrence basis, but there are always deviations in data-point timing. That is why it is highly desirable to have capabilities for building predefined time windows to normalize the time series data.
The required capabilities are:
- Building the time buckets to normalize data properly
- Aggregations on normalized time buckets
- Gap-filling: using interpolation to construct new data points within the range of a discrete set of known data points
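A minimal Python sketch of these three capabilities, assuming integer timestamps and fixed-width buckets (the function names are illustrative):

```python
from collections import defaultdict

def bucketize(points, width):
    """Assign each (ts, value) point to a fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % width].append(v)
    return buckets

def avg_per_bucket(points, width):
    """Aggregate (here: average) the values inside each normalized bucket."""
    return {b: sum(vs) / len(vs) for b, vs in bucketize(points, width).items()}

def gap_fill(agg, start, end, width):
    """Linearly interpolate buckets that received no data points."""
    out = {}
    for b in range(start, end + 1, width):
        if b in agg:
            out[b] = agg[b]
            continue
        prev = max((k for k in agg if k < b), default=None)
        nxt = min((k for k in agg if k > b), default=None)
        if prev is not None and nxt is not None:
            frac = (b - prev) / (nxt - prev)
            out[b] = agg[prev] + frac * (agg[nxt] - agg[prev])
    return out

# points at ts 0, 1, and 20; bucket width 10 leaves bucket 10 empty
agg = avg_per_bucket([(0, 70.0), (1, 72.0), (20, 80.0)], 10)
filled = gap_fill(agg, 0, 20, 10)
```

This is roughly what `time_bucket`-style SQL functions and gap-filling clauses do inside the engines discussed below, just pushed down to the storage layer.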
To the mid-level capabilities it is worth adding more sophisticated diagnostic analytics/ad-hoc queries:
- Flexible Filtering: filter data points based on predicates on tags/context attributes, e.g.: filtering data points by some region, user, or activity type
- Flexible Aggregations: grouping and aggregations on tags/context attributes or their combinations, e.g.: max heart rate by region by activity type.
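These two capabilities can be sketched together in Python; the `aggregate` helper and its signature are illustrative, not any engine's API:

```python
def aggregate(observations, predicate, group_keys, agg=max):
    """Filter observations by a tag predicate, then aggregate the
    values grouped by the given tag keys."""
    groups = {}
    for tags, value in observations:
        if not predicate(tags):
            continue
        key = tuple(tags.get(k) for k in group_keys)
        groups.setdefault(key, []).append(value)
    return {k: agg(vs) for k, vs in groups.items()}

obs = [({"region": "EU", "activity": "run"},  160),
       ({"region": "EU", "activity": "walk"}, 110),
       ({"region": "US", "activity": "run"},  155)]

# max heart rate by region and activity type, EU only
result = aggregate(obs, lambda t: t["region"] == "EU", ("region", "activity"))
```

In SQL terms this is just a `WHERE` clause plus `GROUP BY` over tag columns; the point is that the engine must support it efficiently over tag combinations that were not known at ingestion time.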
Advanced Level: Sequential Row Pattern Matching
The most advanced level would include checking whether a sequence of events matches a particular pattern, to perform introspection and advanced diagnosis:
- Did similar patterns of measurements precede specific events?
- What measurements might indicate the cause of some event, such as a failure?
There are very few databases that support such a feature as of now, but I believe they will come.
Here we could differentiate the following capabilities:
- Finding a series of consecutive events, e.g.: session definition
- Pattern matching: trend reversal, periodic events
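Both capabilities can be emulated in application code when the engine lacks them; a minimal Python sketch, where the gap threshold and helper names are illustrative:

```python
def sessions(timestamps, gap):
    """Session definition: group consecutive events, splitting the
    series wherever two events are more than `gap` apart."""
    out, cur = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - cur[-1] <= gap:
            cur.append(ts)
        else:
            out.append(cur)
            cur = [ts]
    out.append(cur)
    return out

def trend_reversals(values):
    """Pattern matching: indices where a rising run flips to a
    falling run (local peaks), a simple trend-reversal pattern."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]
```

Engines with row pattern matching express the same idea declaratively (a regular expression over classified rows) and evaluate it close to the data, which is what makes the feature so valuable.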
Non Functional Requirements
Besides the functional requirements, it's crucial to consider the non-functional requirements, which are often the main drivers for the selection:
- Scalable storage: ability to handle big data volumes
- Scalable writes: the ability to handle a large number of simultaneous writes. This is closely related to real-time data access: the ability to minimize the lag between when a data point is generated and when it's available for reading.
- Scalable reads: the ability to handle a large number of simultaneous reads
- High Maturity: presence on the market and community support.
Time Series Databases and Platforms on the Market
Let’s review what we have on the market to satisfy our exacting needs. There are a variety of options. It’s really hard to cover all of them, so I’ll try to describe families of engines.
NoSQL with Built-In Sorting
BigTable, HBase, Cassandra, DynamoDB, and Accumulo are often used to store time-series data. There are tons of articles on how to implement time series use cases on these storages: how to avoid hot-spotting using salting, etc.
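Salting can be sketched roughly as follows: prefix the row key with a deterministic bucket number so that different series land on different nodes instead of all monotonically increasing keys hitting one region. The key layout and bucket count below are illustrative, not any specific engine's scheme:

```python
import hashlib

N_SALT_BUCKETS = 8  # illustrative; tuned to the cluster size in practice

def salted_row_key(metric, ts):
    """Build a row key of the form <salt>|<metric>|<zero-padded ts>.
    The salt is derived deterministically from the metric name, so
    each series stays range-scannable while series spread out."""
    salt = int(hashlib.md5(metric.encode()).hexdigest(), 16) % N_SALT_BUCKETS
    return f"{salt:02d}|{metric}|{ts:013d}"

key = salted_row_key("heart_rate", 1605200047)
```

The trade-off is that a query spanning many series must fan out across all salt buckets and merge the results.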
Strong Sides: Scale extremely well for writes. Perform the basic level of analytics extremely efficiently.
Weak Sides: All other kinds of analytics are either unsupported or inefficient.
NoSQL Purpose-Built Time Series DB
There are engines that have been designed from the ground up as Time Series databases. In the majority of cases, they are NoSQL. The most prominent example is InfluxDB, which tops all related Google searches.
MPP SQL Engines on Time-Series Steroids
Some of the mature MPP analytics engines, like Vertica, are constantly adding new analytics capabilities, including ones related to time-series data handling. Their weak point, though, is that by design they are not built for efficient streaming data ingestion. But if micro-batching is acceptable, they can be the best match for many use cases.
Strong Sides: Provide the richest analytics capabilities.
Weak Sides: Real-time ingestion might be challenging and not really efficient.
_Note: Oracle is not a classical MPP engine, but from the functional perspective it supports the MATCH_RECOGNIZE clause, which covers a sophisticated pattern matching feature._
NewSQL In-Memory Databases
The in-memory nature of NewSQL databases increases their ability to handle fast data ingestion. A SQL interface enriched by time-bucket normalization support, as in the SingleStore (formerly MemSQL) database, looks very attractive for time series use cases.
Strong Sides: Provide rich analytics capabilities.
Weak Sides: The scalability for writes and reads is usually limited or very expensive.
Cloud Time-Series Platforms
Azure and AWS have recently released their time series data services/platforms:
- Azure Time Series Insights
- Amazon Timestream
The platforms cover many aspects of time series data storing and visualizing, and provide really rich querying capabilities. They have built-in separation of data between hot, warm, and cold storage to keep data storing and retrieval well balanced from a cost-of-ownership perspective.
Strong Sides: Well integrated into their respective cloud infrastructure, providing rich querying capabilities.
Weak Sides: The tight cloud integration might be a limitation for some use cases; their presence on the market is still too short to treat them as mature.
Others
There are also other options and niche players. I had a chance to work with one such example, the GeoMesa platform/framework. It is a framework on top of NoSQL storages like Accumulo/HBase/Cassandra with the ability to build special indexes for temporal and geo-temporal data; it also provides extra flexibility via secondary indexes.
Strong Sides: Efficient basic time-series analytics as well as a good level of scaling. Geo-temporal querying support is a key feature and a big bonus for IoT data.
Weak Sides: Despite the very friendly and responsive dev community, it has a low level of adoption and a rather low level of maturity.
Summary Comparison
Please find below the high-level comparison of different time series storages, using the criteria we defined at the beginning:


Conclusion
The topic is really broad and deserves a book to cover it in good detail, but I hope this article helps you navigate the world of Time Series engines and select the right direction for your particular use case.