Here it goes again, one of my favorite free data conferences! And it has been rebranded once more, for the better: it started as the "Spark Summit," became the "Spark+AI Summit," and is now the "Data+AI Summit." Databricks now offers a couple of products and is no longer seen only as "the Spark company." A lot of what I'm talking about below follows the same trends as my last year's highlights here. But Databricks released a lot of interesting products this year!
Delta Sharing
Databricks pitches this new product as the industry's first open protocol for secure data sharing.
With companies going multi-cloud and the rise of cloud data warehouses, there are big challenges in terms of data availability. Data engineers spend a lot of time just moving/copying data to make it accessible and queryable in a cost-efficient and secure manner across different places.
Delta Sharing aims to solve this by storing the data "once" and reading it anywhere. It uses a middleware (the Delta Sharing Server) to broker between the reader and the data provider.
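To make this concrete, here is a minimal sketch of what the consumer side looks like with the open-source delta-sharing Python connector. The profile file path and the share/schema/table names below are made up for illustration:

```python
import delta_sharing  # pip install delta-sharing

# A "profile" file (handed to you by the data provider) points to the
# sharing server endpoint and contains the bearer token for authentication.
profile = "/path/to/provider.share"  # hypothetical path

# Discover what the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Read a shared table directly, without copying it into our own storage.
# The URL format is "<profile>#<share>.<schema>.<table>".
table_url = f"{profile}#sales_share.default.orders"  # hypothetical names
orders_df = delta_sharing.load_as_pandas(table_url)  # or load_as_spark(table_url)
print(orders_df.head())
```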
Delta Sharing, on paper, is, in my opinion, the biggest product release since the Delta format itself. However, there are some concerns worth mentioning:
- Even though Databricks claims queries are optimized and cheap, I think we need to take egress/ingress cloud costs into account. What happens if the data recipient runs an ugly, unoptimized query?
- Any open standard sounds promising as long as there's global adoption. Still, Delta clearly has a bit more traction than its ACID-format siblings (like Iceberg and Hudi), so it is definitely the best horse to bet on.
Delta Live Tables
Another Delta Databricks product! You can see it as a super-powered "view" on top of a Delta table, where you can use either pure SQL or Python for the processing. You can create an entire data flow, producing multiple tables from a single Delta Live Table. The Delta Live engine is smart enough to use caching and checkpoints to reprocess only what's needed.
This is pretty interesting: instead of maintaining multiple copies of the data as in classic data lake pipelines (raw/bronze/silver), you have one source of truth. This enables clear lineage and, therefore, good documentation of transformations.
Besides, because data quality is hot (see below), Databricks added its own data quality tooling with declarative quality expectations.
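Here is a minimal Python sketch of what a Delta Live Tables pipeline with an expectation looks like. Table names and the source path are made up, and the code only runs inside a DLT pipeline, where the `spark` session is provided for you:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: raw events ingested as-is (hypothetical landing path).
@dlt.table(comment="Raw events loaded from cloud storage")
def events_raw():
    return spark.read.json("/mnt/landing/events/")  # `spark` is injected by DLT

# Silver: cleaned events, with a declarative quality expectation
# that drops any row missing a user_id.
@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")
def events_clean():
    return dlt.read("events_raw").withColumn("ingested_at", F.current_timestamp())
```

The engine builds the dependency graph between these tables for you, which is where the lineage and "one source of truth" story comes from.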
Unity Catalog
Databricks is launching its own data catalog. Data catalogs are another trend in the data industry: development on the major open-source projects (like Amundsen, DataHub, etc.) has kept up the pace over the last year, and the big cloud providers, like Google, have also released their own data catalogs. On top of that, as companies move more and more towards a multi-cloud strategy, data discovery and governance become an even bigger concern.
An interesting point that Databricks tackles with its own catalog is simplifying data access management through a higher-level API. This is something other solutions don't really focus on, and it's a major advantage, as managing access at a low level (file-based permissions on S3 or GCS, for example) can be really tricky: fine-grained permissions are difficult, and the data layout is not very flexible since it's often tied to a metastore like Hive/Glue.
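As a rough illustration of what "higher-level" means here, think table-level grants instead of bucket ACLs. A hypothetical sketch (the object names and the exact privilege syntax are assumptions on my side, based on the announcement):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Access is expressed on logical objects (catalog.schema.table), not on the
# files sitting underneath them in S3/GCS. Names below are purely illustrative.
spark.sql("GRANT SELECT ON TABLE main.analytics.orders TO `data-analysts`")
spark.sql("SHOW GRANTS ON TABLE main.analytics.orders").show()
```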
More Python
The most common denominator between data profiles (Data Scientist, Data Engineer, Data Analyst, ML Engineer, etc.) is probably SQL, and the second is probably Python. According to Databricks, most Spark API calls today are made through Python (45%), followed by SQL (43%).
It’s clear that Databricks wants to reduce the gap between "laptop data science" and distributed computing. Lowering the entry barrier will enable more users to do AI… and bring more money to data SaaS companies 😉 As Python has wide adoption and is beginner-friendly, it makes sense to invest in it.
Most of the improvements fall under the so-called "Project Zen", among them:
- Better readability of PySpark logs
- Type hints improvements (a small example of what this buys you follows this list)
- Smarter autocompletion
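The type hints now shipped with PySpark mean IDEs can autocomplete DataFrame methods and flag type mistakes before the job ever runs. A tiny, made-up example:

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def add_revenue(df: DataFrame) -> DataFrame:
    """Annotating with DataFrame lets the IDE autocomplete .withColumn, .select, etc."""
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [(10.0, 3), (4.5, 2)], ["price", "quantity"]  # toy data for illustration
)
add_revenue(orders).show()
```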
Pandas in Spark natively
If you are not familiar with the Koalas project, it’s the pandas API on top of Apache Spark. The Koalas project will be merged into Spark: everywhere you have a Spark DataFrame, you will get a pandas DataFrame without needing an explicit conversion.
Of course, for small data use cases, Spark will still add overhead on a single-node cluster, but the fact that the code can scale without any change to the codebase is pretty convenient.
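In practice it looks something like the sketch below, assuming Spark 3.2+, where Koalas lands as the pyspark.pandas module (the data is made up):

```python
import pyspark.pandas as ps  # available from Spark 3.2, where Koalas was merged

# Familiar pandas syntax, but backed by Spark under the hood.
psdf = ps.DataFrame({"price": [10.0, 4.5, 7.0], "quantity": [3, 2, 5]})
psdf["revenue"] = psdf["price"] * psdf["quantity"]
print(psdf.sort_values("revenue", ascending=False).head())

# When you need the classic Spark API, the conversion is one call away.
sdf = psdf.to_spark()
sdf.printSchema()
```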
Building low-code tools to democratize Data Engineering/Data Science
It’s incredible how many talks there were this year about ETL pipelines and data quality frameworks. This again doubles down on the trends I talked about last year. A lot of companies want to lower the entry barrier for Data Engineering, with motivations such as:
- Reducing complexity and increasing reusability through ETL frameworks
- Metadata- and configuration-driven ETL, where configurations double up as documentation for your data flows (see the sketch after this list)
- Making it easier for SQL devs to write production-ready pipelines, increasing the range of contributors.
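To illustrate the configuration-driven idea, here is a tiny, hypothetical sketch: a pipeline described as pure metadata (source, SQL transformation, target) plus a small generic runner. This isn’t any specific vendor product, just the general pattern many of the talks described; paths and table names are made up:

```python
from pyspark.sql import SparkSession

# Hypothetical pipeline definition: pure metadata that doubles as documentation.
PIPELINE = {
    "name": "daily_orders",
    "source": {"format": "json", "path": "/mnt/landing/orders/"},  # made-up path
    "transform_sql": """
        SELECT order_id, customer_id, price * quantity AS revenue
        FROM source
        WHERE order_id IS NOT NULL
    """,
    "target": {"format": "delta", "path": "/mnt/silver/orders/"},  # made-up path
}

def run(config: dict) -> None:
    """Generic runner: SQL-savvy contributors only ever touch the config above."""
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.format(config["source"]["format"]).load(config["source"]["path"])
    df.createOrReplaceTempView("source")
    result = spark.sql(config["transform_sql"])
    result.write.format(config["target"]["format"]).mode("overwrite") \
          .save(config["target"]["path"])

if __name__ == "__main__":
    run(PIPELINE)
```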
Conclusion
It was great to see Databricks’ product catalog growing. It feels like they are going more towards an integration strategy rather than trying to be the next platform where you run everything (even if that is also what they are selling).
But again, the major products around Delta also depend on vendor adoption, so let’s see how fast the data community adopts them!
Resources:
DATA+AI Keynotes, https://www.youtube.com/playlist?list=PLTPXxbhUt-YWquxdhuhXGU8ehj3bjbHFE
Thanks for reading! 🤗 🙌 If you enjoyed this, follow me on 🎥 Youtube, ✍️ Medium, or 🔗 LinkedIn for more data/code content!