For anyone in the Data & Analytics space busy building a unified data platform on the cloud within their organization who is not yet aware of what Databricks is, it's time you checked it out before doomsday.
With almost 15 years in the data industry, I have seen everything from the days of traditional RDBMS systems alongside low-code/no-code ETL tools to Hadoop and now cloud-based data platforms. Databricks caught my attention and has excited me ever since I started using it three years ago.
With more than 5,000 customers, Databricks has its origins in the open-source community. Founded back in 2013 by the original creators of Apache Spark, Delta Lake, and MLflow, it brings together data engineering, data science, and analytics on an open and unified platform that caters to everyone from a Data Engineer to a Data Scientist, taking the Business users and Data Analysts along as well.
But wait, that's all good and awesome!! But why am I so excited and bullish about the future of data platforms within organizations with Databricks in my arsenal? Let me tell you my top 5 reasons, out of the many thousand others.
Delta Lake
I have never been a fan of databases for building Data & Analytics platforms. I believe that databases should be used only for reading the transformed data available in them, and the compute of your data platform should be kept away from them. Well, that was way back in the early 2000s. With the cloud and the decoupling of compute and storage, things have changed a lot. But still, the day I learned about Delta Lake and started using it was the day the game changed for me when building data platforms. Delta Lake is an open-source storage layer that provides ACID transactions on a platform where data is stored as files in object storage. Within Azure, for instance, the data sits in storage accounts as files, but with the ability to update, delete, and upsert records. Delta Lake is available as part of Databricks and gives you virtually unlimited compute power to process your data even at petabyte scale. I will elaborate on my affinity for Delta Lake in a separate article. For now, Delta Lake is my favorite part of Databricks.
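To make that update/delete/upsert point concrete, here is a minimal PySpark sketch of a Delta Lake merge (an upsert). The table path and column names are illustrative assumptions; on a Databricks cluster the `spark` session and the Delta libraries are already available.

```python
from delta.tables import DeltaTable  # bundled with the Databricks runtime

# Hypothetical path in object storage (e.g. a mounted Azure storage account)
delta_path = "/mnt/datalake/silver/customers"

# Initial load: write a DataFrame as a Delta table (Parquet files + transaction log)
customers = spark.createDataFrame(
    [(1, "Alice", "Berlin"), (2, "Bob", "Madrid")],
    ["customer_id", "name", "city"],
)
customers.write.format("delta").mode("overwrite").save(delta_path)

# Incremental batch with one changed record (id 2) and one new record (id 3)
updates = spark.createDataFrame(
    [(2, "Bob", "Lisbon"), (3, "Carol", "Oslo")],
    ["customer_id", "name", "city"],
)

# ACID upsert: update matching rows, insert the rest, all in one transaction
target = DeltaTable.forPath(spark, delta_path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Under the hood this is still just files in your storage account, yet the transaction log gives you the update and delete semantics you would otherwise reach for a database to get.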
SQL Analytics
Okay!!! So you can write a SQL query but not complex Python and R code? And has that been stopping you from exploring the data in Delta Lake? Well, not anymore. At the Data & AI Summit 2020, Databricks announced the release of SQL Analytics. In addition to supporting your existing BI tools, SQL Analytics offers a full-featured SQL-native query editor that allows data analysts to write queries in a familiar syntax and easily explore the data stored in Delta Lake. On top of this, SQL Analytics provides granular data lake administration capabilities and works in tandem with Delta Lake, providing reliability and governance for data lakes. I have yet to lay my hands on this amazing addition to the Databricks stack, but the release event at the Data & AI Summit got me excited enough.
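To show the kind of query a data analyst would run, here is a small sketch that registers the Delta table from the earlier example and queries it with plain SQL. In SQL Analytics the SELECT statement would go straight into the query editor; the sketch below runs it from a notebook via `spark.sql`, and the table and path names are the same illustrative assumptions as before.

```python
# Register the Delta table from the earlier sketch so it is queryable by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers
    USING DELTA
    LOCATION '/mnt/datalake/silver/customers'
""")

# Plain SQL over Delta Lake: the familiar syntax analysts already know
top_cities = spark.sql("""
    SELECT city, COUNT(*) AS customer_count
    FROM customers
    GROUP BY city
    ORDER BY customer_count DESC
""")
top_cities.show()
```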
Cloud Agnostic
With Databricks available on both AWS and Azure, you are not locked into a relationship with a single cloud vendor. You can move your big data workloads between cloud platforms without having to worry much about migration strategy, cost, and effort.
Unified Data Platform
Now, this is one of the marketing aspects of Databricks that is quite prominent on their website and all the marketing collateral. But what makes me believe in it? Well, you build one data platform using Databricks which can be used by everyone. Whether it is a Data Engineer who writes ETL workloads in Python or Scala, a Data Scientist who loves R, a Data Analyst who knows only SQL queries, or a Business user who can only play with the data using Power BI, it's for everyone. With SQL Analytics, the Databricks platform is no longer a tool for technical users only.
Apache Spark
With support for both batch and real-time analytics and data processing workloads, Apache Spark is an open-source analytics engine built for big data, and it is one of the reasons I prefer Databricks in my data platform architectures and designs. With in-memory computing, Apache Spark delivers speed which, coupled with the flexibility to write applications in Python, Scala, or Java along with support for SQL queries and stream data processing, makes it a powerhouse you cannot avoid.
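As a small illustration of the batch-versus-streaming point, here is a minimal PySpark sketch that processes the same hypothetical Delta table first as a one-off batch DataFrame and then as a structured stream; the path and column names are the same illustrative assumptions used above.

```python
from pyspark.sql import functions as F

delta_path = "/mnt/datalake/silver/customers"  # illustrative path

# Batch: read the whole table once and aggregate it
batch_df = spark.read.format("delta").load(delta_path)
batch_df.groupBy("city").agg(F.count("*").alias("customer_count")).show()

# Streaming: treat the same table as an unbounded source; new data committed
# to the Delta log is picked up incrementally as micro-batches
stream_df = (
    spark.readStream.format("delta")
    .load(delta_path)
    .groupBy("city")
    .agg(F.count("*").alias("customer_count"))
)

query = (
    stream_df.writeStream
    .outputMode("complete")   # emit the full aggregate on each trigger
    .format("memory")         # in-memory sink, for demonstration only
    .queryName("customer_counts")
    .start()
)
```

The same engine, the same APIs, and largely the same code serve both workloads, which is exactly why it sits at the core of the platform.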
If you are planning to modernize your data platform, Databricks is one of the options you should strongly consider as part of your technology stack: a scalable, unified data platform that can support data exploration, advanced analytics, and BI reporting for varied sets of users in your organization.