There are important use cases, requirements and architecture choices that alter the way a data lake is built. Here we will focus on the benefits of consuming data directly from the data lake, while addressing near real-time AI inference requirements.
We will succinctly cover some of the architecture choices and their implications, while looking at data engineering using Spark over S3.
Target audience
The reader should have some familiarity with Apache Spark and with AWS.
AI inference latency requirement
We will concentrate on the use case where the time-to-inference latency requirement is in the minutes to high-seconds range. In this case you can opt for an Apache Spark-based architecture to serve as a common basis for your streaming & batch AI processing. You can also share code between streaming and batch processing, while increasing downstream read performance. If this is not your use case, you can still implement most of the following for the data lake portion of your solution.
Architecture

Note that you don’t have to have all the options in place. For instance, one may opt to drop Kinesis/Kafka altogether and rely solely on Firehose. As we are focusing on AI-based use cases, the business-level aggregates layer of the data lake is out of scope for this discussion. In addition, we will be looking at data that can be structured.
Visual Agenda

Basics
Do you really have Big Data and do you need a data lake?
This article deals with Big Data. Before going there, please decide whether you have, or expect to have, Big Data. If not, don’t go there.
300GBs is not "Big Data".
General data lake structure
Data should be partitioned into a reasonable number of partitions. Data is kept in big files, usually ~128MB-1GB in size. It is generally too costly to maintain secondary indexes over big data.
In addition, common solutions integrate a Hive Metastore (e.g., the AWS Glue Data Catalog) for EDA/BI purposes. Data is usually ingested from streaming sources such as Kinesis/Kafka or from an S3 landing zone, which in itself is usually the target of Kinesis Firehose.
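As an illustration, here is a minimal PySpark sketch of writing curated data into such a partitioned, compressed-Parquet layout on S3 (bucket names and column names are hypothetical):

```python
# Minimal sketch: land raw data into a partitioned, compressed-Parquet layout on S3.
# Bucket names and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curated-writer").getOrCreate()

raw = spark.read.json("s3://my-landing-zone/events/")                        # landing zone input
curated = raw.withColumn("arrival_month", F.date_format("event_ts", "yyyy-MM"))

(curated.write
    .mode("append")
    .partitionBy("arrival_month")                 # low-cardinality partition key
    .option("compression", "snappy")              # big, compressed Parquet files
    .parquet("s3://my-data-lake/curated/events/"))
```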
How to keep track of input data that was already processed
Spark Structured Streaming already supports .option("checkpointLocation", checkpoint_path). It keeps track of all the offsets, regardless of your input: Kafka/MSK, a landing zone in S3, Kinesis, etc.
This is the basic option, please read about more advanced ones below.
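For reference, a minimal sketch of a streaming write that relies on the checkpoint location to track processed input (paths and schema are hypothetical; the source could equally be Kafka/MSK or Kinesis):

```python
# Minimal sketch: Structured Streaming ingestion with a checkpoint location.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema of the landing-zone files.
curated_schema = StructType([
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

stream = (spark.readStream
    .schema(curated_schema)
    .json("s3://my-landing-zone/events/"))

(stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/events/")   # offsets tracked here
    .option("path", "s3://my-data-lake/curated/events/")
    .start())
```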
Why should I use partitions?
For most cases the data lake consumer, e.g., your AI application, will only need to read a subset of the data. When the data is partitioned, your query tools such as Spark can utilize partition elimination, so only part of the data needs to be scanned.
Note that this works well for low-cardinality partition keys such as data arrival month. For other use cases one has to resort either to some kind of hashing/bin-packing heuristic on the original partition key or utilize solutions such as Z-ordering.
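As a quick illustration, a filter on the partition key is all Spark needs in order to prune partitions (paths and values are hypothetical):

```python
# Partition elimination sketch: only the matching partition directories are scanned.
df = spark.read.parquet("s3://my-data-lake/curated/events/")
recent = df.where("arrival_month = '2021-03'")   # hypothetical partition value
recent.explain()                                  # the plan's PartitionFilters show the pruned scan
```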
How to define the number of files in partition
As the data lake consumer reads the data, all the files in the partition will be scanned, at least to read their metadata. The number of files in a partition should be a small multiple of the number of active Spark tasks scanning it. I.e., if your partition has 30 files but your number-of-executors times number-of-cores-per-executor is 130, in the general case there are 100 cores doing nothing, and you are paying for it, both in cash and in query performance.
The same is true for other provisioned services, such as Snowflake cluster size or Redshift slots.
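A rough back-of-the-envelope sketch of this rule, assuming the cluster’s default parallelism reflects executors times cores (numbers and paths are illustrative):

```python
# Rough sketch: derive a target file count per partition from the cluster's parallelism.
cores = spark.sparkContext.defaultParallelism     # roughly executors * cores-per-executor
target_files = 2 * cores                          # a small multiple of the available cores

df = spark.read.parquet("s3://my-data-lake/curated/events/arrival_month=2021-03/")
compacted = df.repartition(target_files)          # yields target_files files when written out
```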
How to define the data format
As all the files will be scanned for their metadata at the very least, it’s best to utilize S3 Select, which supports compressed Parquet and some non-columnar formats. Thus you want your data lake to hold compressed Parquet files, which S3 Select supports.
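If you want to be explicit about it, the Parquet compression codec can be pinned at the session level (snappy is already Spark’s default; shown here purely for illustration):

```python
# Make sure every Parquet write comes out compressed (snappy is the Spark default).
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
```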
We will cover recommended data lake frameworks that utilize Parquet later in this article.
Implications of near real-time AI inference
Data ingestion in micro-batches
If near real-time AI inference is required, it’s important to stream the data into the data lake with the right Spark .trigger() for your use case.
This simple requirement may have substantial consequences, described below.
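Continuing the stream reader sketched earlier, the trigger is where this requirement shows up (the interval is illustrative):

```python
# Illustrative trigger choices for the streaming write.
(stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/events/")
    .option("path", "s3://my-data-lake/curated/events/")
    .trigger(processingTime="1 minute")   # micro-batch roughly every minute for near real-time
    # .trigger(once=True)                 # alternative for batch-style, scheduled runs
    .start())
```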
Micro-batch ingestion impact – Data compaction
Data ingestion in micro-batches might result in many smaller files. In order to provide predictable query times per partition we ought to:
- Create partitions with similar data sizes
- Make sure the data in the small files is moved into bigger ~128MB files. This process is called compaction.
The compaction process improves downstream read throughput, especially over object storage such as S3.
Note that some data lakes don’t implement compaction at all, as it adds complexity and may introduce additional latency into the system. In this case one has to be extra careful about their file sizes and the ETL flows.

Data compaction implementation Strategies
- Rely on commercial implementations such as Databricks
- Use .option("maxRecordsPerFile", ...) for files with a predictable record size. Note that for nested data only top-level rows are taken into account.
- Assess the total partition size using aws s3 ls or its AWS API counterpart, then use repartition*()/partitionBy() with the required number of files to set the correct number for Spark & Hive partitions (see the sketch after this list).
- Do the above only for partitions where files have recently changed. Compacting all partitions each and every time takes too much time and overhead – see details below.
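A minimal compaction sketch along these lines, for a single recently-changed partition (paths, the partition value and the target file count are all hypothetical):

```python
# Compact one partition: read its many small files, rewrite them as a few big ones.
partition_path = "s3://my-data-lake/curated/events/arrival_month=2021-03/"
staging_path   = "s3://my-data-lake/compacted/events/arrival_month=2021-03/"
target_files   = 8                                   # ~ total partition size / 128MB

(spark.read.parquet(partition_path)
      .repartition(target_files)
      .write
      .option("maxRecordsPerFile", 1_000_000)        # optional cap when record size is predictable
      .mode("overwrite")
      .parquet(staging_path))
# Pointing readers at the new files is exactly what the manifest files / ACID
# frameworks discussed below take care of.
```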
Manifest files
When data is being compacted, new Parquet files appear in the partitions. Which files should the consuming Spark/Redshift/Snowflake query read? We absolutely have to add metadata. There are two ways of going about it:
- Keep a partitioned manifest file whose contents list the up-to-date files in each partition
- Use a framework that automatically keeps these contents up to date. More on this later.
Otherwise our jobs will be reading stale data or both stale & up-to-date data together!
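As one concrete example (the frameworks themselves are covered in the next section), Delta Lake can generate such a manifest for external engines like Redshift Spectrum; the path here is hypothetical, and other frameworks have their own mechanisms:

```python
# Delta Lake sketch: generate a symlink-format manifest that external SQL engines can follow.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/delta/events/")
delta_table.generate("symlink_format_manifest")   # writes the manifest alongside the data
```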

ACID frameworks
ACID frameworks enable the following:
- Automatic data compaction (for some, only in commercial versions)
- Improved query times via additional metadata
- Auto-generation of the partitioned manifest files
- ACID guarantees for the data lake consumers (some restrictions apply, see below)
Options include:
- Delta Lake
- Commercial Databricks version – has caching and Z-order performance improvements that are unavailable in the open source version
- Apache Hudi – two modes of operation
- Apache Iceberg – as of late 2020, Iceberg did not support streaming from the curated data.
Note that some features, such as the Delta Catalog, require Spark 3.0.0+ and thus are only usable in EMR and not in Glue. AWS Glue does not support Spark 3.0.0+ at the time of this writing.
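For illustration, a minimal streaming write into a Delta Lake table on S3, assuming the delta-core package is on the classpath and reusing the stream reader from the earlier sketch (paths are hypothetical):

```python
# Minimal sketch: stream into a Delta table so compaction metadata, manifests
# and ACID guarantees are handled by the framework.
from pyspark.sql import functions as F

(stream.withColumn("arrival_month", F.date_format("event_ts", "yyyy-MM"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/events_delta/")
    .partitionBy("arrival_month")
    .start("s3://my-data-lake/delta/events/"))
```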
S3 limitations on the number of Spark drivers
The fact that S3 does not support atomic renames has deep implications on data ingestion for:
- Delta Lake
- Hudi
- Iceberg
In effect, only a single Spark driver can be writing to the data lake at any given time.
This also means that using these frameworks outside of Spark is only ACID-reliable for read-only access.
Single Spark driver implications
A corollary of the S3 limitation on the number of Spark drivers: consuming from several different sources, such as Kinesis & an S3 landing zone, in parallel must happen within a single Spark driver! This is true even if you are not joining the data between these Spark streams. After all, they are updating the same metadata locations on S3.
Ways of accomplishing this would be either:
- Using Apache Livy with a single session (have not tried myself), or
- Running EMR streaming steps with .trigger(once=True) that .union() the different Spark streams and write them at once (see the sketch after this list)
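A sketch of the second approach, with two file-based sources standing in for whatever combination of Kinesis/Kafka/landing zone you have (paths are hypothetical, curated_schema is the schema from the earlier ingestion sketch, and both streams must share a compatible schema):

```python
# One Spark driver, two sources, one writer: only a single process touches the table metadata.
stream_a = (spark.readStream.schema(curated_schema)   # curated_schema as defined earlier
            .json("s3://my-landing-zone/source-a/"))
stream_b = (spark.readStream.schema(curated_schema)
            .json("s3://my-landing-zone/source-b/"))

(stream_a.union(stream_b)                 # no join required, just a single writer
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/union/")
    .trigger(once=True)                   # e.g., as a scheduled EMR step
    .start("s3://my-data-lake/delta/events/"))
```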
Data compaction impact – File vacuuming
The compaction process, and generally any data updates, create obsolete files. They will have to be removed so that they do not take up space or add S3 listing overhead. This process needs to be scheduled and invoked, unless you are using the commercial Databricks Delta Lake.
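With open-source Delta Lake, for example, this would be a scheduled job along these lines (the retention value is illustrative):

```python
# Scheduled vacuum sketch: drop files no longer referenced by the table and older than 7 days.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/delta/events/")
delta_table.vacuum(retentionHours=168)
```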
S3 Disaster Recovery
It’s best to keep the resulting S3 bucket versioned, as it will allow for cross-region bucket replication for disaster recovery purposes.
S3 cleanup
An S3 Lifecycle rule for the older versions of the files must be set up to clean up the object store. Vacuuming is not enough, as on a versioned bucket it will only result in an S3 delete marker.
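A hedged sketch of such a rule via boto3, expiring non-current versions and cleaning up expired delete markers (the bucket name and retention period are hypothetical):

```python
# Lifecycle rule sketch: expire old object versions and remove expired delete markers.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},                               # whole bucket
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},  # drop old versions after 30 days
            "Expiration": {"ExpiredObjectDeleteMarker": True},      # clean up leftover delete markers
        }]
    },
)
```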
Schema
Schema merging
Data schemas evolve in real life. In order to support the introduction of new columns to the schema, utilize .option("mergeSchema", "true") in Spark.
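For example, when reading (the path is hypothetical):

```python
# Schema-merging read sketch: newly introduced columns show up in the resulting DataFrame.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://my-data-lake/curated/events/"))
```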
Event sourcing
Say your data can be updated retroactively. What is the right solution?
- Update the data in place. Hold a single replica of the same data.
- Introduce another version for the data.
Most AI-based applications issue inferences that are acted upon: stocks are bought, emails are sent to the customer, etc. It’s a business-related criterion that guides towards utilizing event sourcing, so that AI inferences can be traced back to the data lineage available at the time of inference.
JDBC – impact on schema design for nested data
Sometimes we want to enable BI over our fine grained data, i.e., not over the business-level aggregates.
At times real-world data is nested. Please note that many JDBC tools, such as Redshift Spectrum, won’t allow returning nested data in a straightforward fashion. In this case it’s more performant to use .explode() prior to persisting the data in the data lake. Depending on your data, it can be the only option.
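A small flattening sketch using Spark’s explode function before persisting (column names and paths are hypothetical):

```python
# Flatten a nested array column so JDBC consumers get plain rows.
from pyspark.sql.functions import col, explode

nested_df = spark.read.parquet("s3://my-data-lake/curated/orders/")

flat = (nested_df
        .select(col("order_id"), explode(col("items")).alias("item"))
        .select("order_id", "item.sku", "item.quantity"))

flat.write.mode("append").parquet("s3://my-data-lake/curated/order_items/")
```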
S3 access – additional options
Fast new file listing – s3-SQS
Even when bigger files are used, Spark Structured Streaming will have to list all the files, which can be a lot in a big data lake. Spark’s S3-SQS connector can be used to improve this by keeping a queue listening for new-object notifications from S3.
S3 access from within a VPC
In order to access the data lake over the AWS network and not over the internet, use S3 VPC endpoints. This is mainly beneficial as a security feature. At times VPC endpoints can improve performance, but this has to be tested per region/use case.
Landing Zone considerations for initial ETL
If you were using an S3 landing zone with millions of small files, they need to be imported into the curated data lake so that all the historical data is suitable for querying. S3 file listing has significant overhead – listing one million files takes more than 5 minutes. Moreover, this should not be done with Spark Structured Streaming, as the driver will run out of memory; Spark Structured Streaming is not meant for listing millions of files. For this use case it is better to list the files into a local file and then distribute the import work with Spark batch processing, so that each task gets a small subset of file names it will read & transform.
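A sketch of this "list locally, read distributed" pattern (bucket, prefix and batch size are hypothetical):

```python
# Step 1: list the landing zone once, on the driver/an edge node, outside of streaming.
import boto3

s3 = boto3.client("s3")
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-landing-zone", Prefix="events/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Step 2: hand the listing to Spark batch jobs in manageable chunks,
# so each task reads and transforms a small subset of the files.
paths = [f"s3://my-landing-zone/{key}" for key in keys]
batch_size = 10_000
for i in range(0, len(paths), batch_size):
    batch = spark.read.json(paths[i:i + batch_size])
    (batch.write
          .mode("append")
          .parquet("s3://my-data-lake/curated/events/"))
```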
Summary
We’ve covered the data lake basics in the near real-time AI inference scenario. Advanced ACID frameworks were introduced, together with their limitations over S3. Basic schema considerations were also brought up, as well as S3 listing speedup options.
I hope this can serve as a blueprint for most data lake solutions in AWS.
Thank you for making it to the end of the article – please let me know your thoughts! You can also reach out to me via LinkedIn.