A love-hate relationship with Databricks Notebooks

In this post I share my opinion about this tool and the various ways you can use it to deliver solutions depending on the skillset and maturity of your team.

David Suarez
Towards Data Science

--

Since I started working with Databricks around two years ago, I have had this strange love-hate relationship with Notebooks. Some of you might relate to this as well.

Photo by Kelly Sikkema on Unsplash

Notebooks are democratising “Big Data”

Notebooks are here to stay in the data landscape, from the classic Jupyter Notebooks to the “run it in production” kind of approach pushed by Databricks and other SaaS solutions like Azure Synapse Analytics.

I used to see notebooks as a super powerful tool that empowers developers with little or no programming knowledge to build complex datasets and ML models.

It’s awesome to see how well integrated they are with some cloud providers like Azure through tools like Data Factory, making ETL orchestration super easy, visual, and straightforward. Special mention goes to notebook parameters and outputs, which can really help to modularize data pipelines.
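As a minimal sketch (the widget name and storage paths below are made up for illustration), a parameterised notebook driven by something like a Data Factory Notebook activity could receive parameters through widgets and return an output value:

# Read a parameter passed in by the orchestrator (e.g. a Data Factory Notebook activity).
dbutils.widgets.text("processing_date", "2021-01-01")
processing_date = dbutils.widgets.get("processing_date")

# Illustrative read/write using the parameter.
df = spark.read.parquet(f"/mnt/raw/sales/{processing_date}")
df.write.mode("overwrite").parquet(f"/mnt/curated/sales/{processing_date}")

# The exit value becomes the notebook output that the orchestrator can pick up.
dbutils.notebook.exit(f"rows_written={df.count()}")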

Databricks Notebooks can easily become the de facto way of running data processing code on the cloud for most non-advanced data users. The truth is that Databricks eliminates most of the friction and complexity of getting code running on the cloud, because a user working in Databricks is already working on the cloud. Thus, no more “but it works on my laptop” kind of excuses.

In summary, thanks to notebooks, users can deliver value very quickly without the engineering bottleneck. Unfortunately, all these good features come with a price to pay.

Ease of use is a double-edged sword

In order to run the code in your notebooks, you’ll need a cluster to compute on. So even when users only want to write some simple data processing code using Pandas, you’ll have to pay for cluster VMs and DBUs. If anyone in the team has the right to create clusters, it’s also very easy to end up with oversized clusters that increase the costs for no reason.

“This dataset is huge, like 4GB or so… But don’t worry, I’ve created this cluster with 3 workers so I think it’s enough to start, but we can add more later if needed. By the way, I am using Pandas because I don’t feel like learning PySpark :D”

We have to remember that Databricks was primarily built as a managed platform for running Spark on the cloud, and not for running code on a single node. Up until about a year ago there was not even an option for Single Node clusters available (only the workaround of setting the number of workers to 0).

On the other hand, when you need to do more than just SQL queries or some simple scripts, you start to feel the pain of using notebooks.

Trying to apply good programming practices in notebooks is quite hard and frustrating. There is no way to import your own classes with the basic “import” statement, and you are forced to use official but hacky workarounds like “%run” to load code from other notebooks. Debugging encapsulated code is just a nightmare since there is no debugger, and the only way of doing so is with print statements (welcome to the 80’s). And as for unit testing, well… you will need to get very creative here!

At this point, you will start considering jumping into a proper IDE like PyCharm or VS Code (in the case of Python) and writing robust software again. Probably a good decision. Unfortunately, once you take this step, the setup complexity grows, and as a result, you might lose some people along the way. Not everyone has the Software Engineering skills needed to cope with this complexity.

Finding the sweet spot

The chosen way of writing code will depend on the maturity and skills of the development team: the more mature and advanced the team, the more complexity it will be able to deal with. Start with an assessment of the team’s current skills. Are most team members mainly SQL developers with years of experience working with relational databases but little or no programming skills at all? Or are most team members very well versed in Software Engineering and CI/CD practices? Or maybe somewhere in between?

Image by Author

Finding a middle-ground way of working that fits the competences of all team members is a must if you want to build a future-proof project. Otherwise, things can go wrong. It wouldn’t be the first time that a project gets discontinued because the person who built it left the company and nobody else in the team has the skills to keep it alive.

Choosing the approach that fits your team

After being involved in different projects with people of different skillsets and analyzing different possibilities, I came up with the following set of approaches that you can apply depending on how much complexity your team can handle.

Option 1: Only Notebooks

The out-of-the-box code development experience in Databricks: just write your code in cells, combining languages if necessary.

Providing a set of Notebook templates can be very helpful to accelerate the development of solutions. Moreover, it helps set a standard code organization that keeps the code tidy and easy to read.
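As an illustration only (the section names and paths are hypothetical), such a template could simply enforce a fixed cell order:

# 1. Parameters
dbutils.widgets.text("input_path", "/mnt/raw/my_dataset")
dbutils.widgets.text("output_path", "/mnt/curated/my_dataset")

# 2. Read
df = spark.read.format("delta").load(dbutils.widgets.get("input_path"))

# 3. Transform
from pyspark.sql import functions as F
df_clean = df.dropDuplicates().withColumn("loaded_at", F.current_timestamp())

# 4. Write
df_clean.write.format("delta").mode("overwrite").save(dbutils.widgets.get("output_path"))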

  • Pros: Straightforward and easy for all data developers.
  • Cons: No IDE development, no code reuse, trial and error debugging only, no unit testing, code can get messy quickly.

Option 2: Notebooks + Set of Utility Functions as Notebooks

Same as the previous option, with the addition of a set of Utility Functions bundled in Notebooks that can be imported (%run) from other notebooks. These functions can include functionality to read or write data from SQL databases or Delta tables… whatever can be easily reused.

The complexity stays almost the same as before: everything remains within the Databricks Workspace, and the Utility Functions are totally transparent to all developers and can be easily changed on the spot if needed.

In this case, it’s very helpful to provide some notebooks with examples of how to use these Utility Functions, or even better, incorporate them into the notebook templates.
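A minimal sketch of what this could look like (the notebook path and function names are purely illustrative):

# Utility notebook, e.g. /Shared/utils/storage_utils:

def read_delta(path):
    # Read a Delta table from a storage path.
    return spark.read.format("delta").load(path)

def write_delta(df, path, mode="overwrite"):
    # Write a DataFrame as a Delta table.
    df.write.format("delta").mode(mode).save(path)

# Consumer notebook: a cell containing only the magic command
#   %run /Shared/utils/storage_utils
# makes the functions above available in later cells, e.g.:
#   df = read_delta("/mnt/curated/sales")
#   write_delta(df.dropDuplicates(), "/mnt/curated/sales_clean")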

  • Pros: Straightforward and easy for all data developers, code reuse for Utility Functions, Utility Functions are transparent and easy to change on the spot.
  • Cons: No IDE development, trial and error debugging only, no unit testing, running notebooks for importing code doesn’t feel right.

Option 3: Notebooks + Set of Utility Functions as Code Package

The main difference compared to the previous approach is the way the Utility Functions are bundled: in this case, as a custom Code Package that can be imported using “import” statements.

In this case the complexity grows considerably, since the code of the solution no longer lives only in the Databricks Workspace. The Code Package needs to be developed on a local machine, opening up the possibility of using an IDE, a proper debugger, and even Unit Tests. This also means committing code to a Git repository and including a CI/CD pipeline to release the Package so it can be used from the Databricks Notebooks.

The rest of the developers would just have to trust that those Utility Functions do their job, and focus on the rest of the logic. Again, providing some usage examples will be very helpful for Code Package adoption.
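For instance, assuming a hypothetical package called my_datautils that bundles the same helpers as before, built into a wheel by the CI/CD pipeline and installed on the cluster, a notebook would use it like any other library:

# Illustrative only: my_datautils is a made-up package name, released as a
# wheel by the CI/CD pipeline and installed on the cluster.
from my_datautils.storage import read_delta, write_delta

df = read_delta("/mnt/curated/sales")
write_delta(df.dropDuplicates(), "/mnt/curated/sales_clean")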

  • Pros: Code reuse for Utility Functions with IDE development, proper debugging, and Unit Testing, import code with “import” statement.
  • Cons: Code package needs to be built and released through CI/CD, code becomes less transparent and users just have to trust in its functionality.

Option 4: Notebooks as “main” function + Transformation & Utility Functions as Code Package

In this case we go a step further in complexity. The data transformation logic can now be part of the Code Package as well, and Notebooks act as the “main” code that orchestrates the functions and injects parameters.
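A minimal sketch of such a “main” notebook, reusing the hypothetical my_datautils package from before and an equally hypothetical clean_sales transformation:

# The notebook is only a thin "main": parameters come in through widgets,
# the actual transformation logic lives in the Code Package.
from my_datautils.transformations import clean_sales  # hypothetical package/function

dbutils.widgets.text("input_path", "/mnt/raw/sales")
dbutils.widgets.text("output_path", "/mnt/curated/sales")

df = spark.read.format("delta").load(dbutils.widgets.get("input_path"))
clean_sales(df).write.format("delta").mode("overwrite").save(dbutils.widgets.get("output_path"))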

This approach makes everything fully testable, since all the code can be developed on a local machine. On the other hand, data developers who lack Software Engineering skills are excluded from development. However, they can still provide SQL queries to pass as parameters to spark.sql().

Another drawback of this setup is losing the connection to the Hive Metastore while working on a local machine. So if your organisation uses it heavily, this might be a big issue. The workarounds are either using Databricks Connect to submit your code to a Databricks Cluster (if you can live with its known limitations), or storing samples of the datasets on your local machine and maybe mocking the Hive Metastore with those samples as well.
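As a sketch of the local testing side (still assuming the hypothetical clean_sales function, here expected to drop duplicate rows), a pytest test with a local SparkSession and an in-memory sample could look like this:

import pytest
from pyspark.sql import SparkSession

from my_datautils.transformations import clean_sales  # hypothetical package/function


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession; no Databricks cluster or Hive Metastore needed.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_clean_sales_removes_duplicates(spark):
    df = spark.createDataFrame(
        [(1, "2021-01-01"), (1, "2021-01-01"), (2, "2021-01-02")],
        ["order_id", "order_date"],
    )
    assert clean_sales(df).count() == 2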

  • Pros: Code reusability, IDE development, proper debugging, and Unit Testing of all your code, use of Notebooks as “main” that can be used for orchestration.
  • Cons: Code package needs to be built and released through CI/CD, not suitable for people without Software Engineering skills, no access to Databricks Hive Metastore from local machine.

Option 5: Code Package only

This is the hardcore way: write all the logic in a Code Package from your local machine. Do you remember the spark-submit command? It always worked during the Hadoop ages, so why should you change it?

In this case all the drawbacks of the previous approach also apply. The same code should work when submitted to any Spark Cluster, no matter whether it is hosted in Databricks, Kubernetes, or anywhere else.
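A rough sketch of what such an entry point could look like (module, package, and path names are hypothetical; the spark-submit invocation in the comment is only indicative):

# Illustrative entry point, e.g. sales_job.py, submitted with something like:
#   spark-submit --py-files my_datautils.zip sales_job.py --input /raw/sales --output /curated/sales
import argparse

from pyspark.sql import SparkSession
from my_datautils.transformations import clean_sales  # hypothetical package/function


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("sales_job").getOrCreate()
    df = spark.read.parquet(args.input)
    clean_sales(df).write.mode("overwrite").parquet(args.output)


if __name__ == "__main__":
    main()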

  • Pros: Code reusability, IDE development, proper debugging, and Unit Testing of all your code, same code can run in any Spark Cluster (no matter where it’s hosted).
  • Cons: Code package needs to be built and released through CI/CD, not suitable for people without Software Engineering skills, no access to Databricks Hive Metastore from local machine.

Moving forward

I hope you take the previous approaches as a source of inspiration rather than the only options. I’m sure you can come up with other options that fit your team’s needs as well.

Of course, choosing one of the options doesn’t mean you have to stick with it forever. As the team grows, its skillset will expand, and it will become more likely that the team can move to more complex and robust setups.

Personally, I would recommend starting with Option 2, since it’s a quite straightforward way of writing code with some reusability, while still involving SQL developers. The reduced complexity can help deliver an MVP sooner. This will satisfy the stakeholders, who will then give feedback and ask for new requirements. Later on, you can focus on the robustness of the code during the “industrialization” phase, moving to Options 3 and 4, while the team keeps expanding and developing the skills required to deal with more complexity.

Conclusions

The notebook fashion is here to stay: not only Databricks, but also other SaaS alternatives like Azure Synapse Analytics are pushing to deploy notebooks into production. They expand the landscape of data developers, allowing data stewards without programming skills to leverage the possibilities of “Big Data” technologies.

If you have a good Software Engineering background, you will soon run into the limitations of Notebooks and might want to run away from them sooner or later. Fortunately, there are some middle-ground solutions that can help you take the best of both worlds without leaving anyone behind, and you can always fine-tune depending on your team’s skillset.

Thanks for reading, and I hope you find this helpful!
