Firsthand Experience with Polynote — A Better Notebook for Data Engineers

Lim Yow Cheng
Towards Data Science
3 min read · Oct 27, 2019


Netflix recently announced Polynote, which looks promising for data engineers working on Spark and Scala. Among its many great features are:

  • Interactive autocomplete and parameter hints for Scala
  • Configure dependencies from a Maven repository as a notebook-level setting
  • Scala and Python code in the same notebook, with variables shared between them (a.k.a. polyglot)

Let’s try Polynote with a classic word count example in Spark. In this example, we read the text of Netflix’s Polynote Medium post and use it to plot a word count bar chart with Python’s matplotlib. The code can be found here.

Installation

The installation process was pretty straightforward, following its guide. In addition, I also installed matplotlib:

pip3 install matplotlib

If you intend to try the polyglot feature (i.e., variables shared between Scala and Python), you need to set one more environment variable:

export PYSPARK_ALLOW_INSECURE_GATEWAY=1

If not, you will be greeted with an error.

Editing experience

Pulling dependencies from a Maven repository can be easily configured using the notebook-level “Configuration & dependencies” settings. Let’s include Requests-Scala to fetch our text from Netflix’s blog with an HTTP GET.
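As a sketch, the library is added in the “Configuration & dependencies” panel as a Maven coordinate (the exact version here is an assumption):

```
com.lihaoyi:requests_2.12:0.2.0
```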

Data engineers finally have an easy way to share code written in Scala or Java with notebook users through a centralized Maven repository!

Autocomplete works for libraries pulled from the Maven repository.

However, autocomplete for lambda functions does not seem to work yet.

Spark Integration

In this word count example, we fetch the text over HTTP, tokenize it, and keep all tokens longer than four characters.
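The pipeline can be sketched on plain Scala collections (the sample text below is hypothetical; in the notebook, the same flatMap/filter/reduceByKey steps run on a Spark RDD):

```scala
// Hypothetical stand-in for the text fetched from the blog post
val text = "Polynote is a polyglot notebook a notebook for Scala and Python and Spark notebook"

// Tokenize on whitespace and keep only tokens longer than four characters
val tokens = text.toLowerCase.split("\\s+").toSeq.filter(_.length > 4)

// Count occurrences per word (the local equivalent of reduceByKey)
val counts = tokens.groupBy(identity).map { case (word, ws) => (word, ws.size) }

// Most frequent word first
val sorted = counts.toSeq.sortBy(-_._2)
```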

Spark can also be configured easily with the “Configuration & dependencies” settings.

One glitch: you need at least one item in the Spark config settings for Spark to work, which is not obvious to figure out.
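For example (assuming a standalone local run; the specific entry is an assumption, and any valid Spark property should do), a single line in the Spark config panel such as the following should satisfy the requirement:

```
spark.master: local[*]
```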

Switching to Python

Now we switch gears to Python and use pandas and matplotlib to plot a bar chart of the top ten words.
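A minimal sketch of this step, with hypothetical word counts standing in for the Spark result shared across the language boundary:

```python
import pandas as pd

# Hypothetical word counts standing in for the result shared
# from the Scala cell via Polynote's polyglot variable sharing
counts = {"polynote": 12, "notebook": 9, "scala": 7, "python": 6,
          "spark": 5, "kernel": 4, "variable": 3, "editor": 3,
          "maven": 2, "config": 2, "plot": 1}

# Take the ten most frequent words with pandas
df = (pd.Series(counts, name="count")
        .sort_values(ascending=False)
        .head(10))

# Plot a bar chart with matplotlib (Agg backend, so no display is needed)
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

ax = df.plot.bar()
ax.set_ylabel("count")
plt.tight_layout()
```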

It works magically. However, an exception sometimes pops up while running a cell with Python code.

When this happens, the interface stops working, and the only workaround is to kill the Polynote process and restart it.

That’s it

Polynote is the best notebook for Spark and Scala I have tried so far. There are some minor glitches, but I believe they will be ironed out in no time.

I’m excited to see how it works in a real-life setting, e.g., integrating with a Kerberized Spark cluster and running Spark in cluster mode instead of standalone. Maybe more on that in the next blog post.

Thanks for reading, happy coding!
