
Bayesnote: Redefining the Notebook


Data scientists and engineers deserve a better notebook

Data scientists and data engineers are not happy with their platforms, infrastructure, and tools, even with easy access to all the managed services provided by the cloud (e.g. AWS). Their questions (and complaints) range from the lowest to the highest levels of the technology stack: "How can I spin up a Spark cluster that has the same external dependencies as my laptop?" Development and operation of data applications give them equal headaches: "How can I automatically rerun many interdependent notebooks when one of them fails?"

The challenges of modern data science and engineering are rooted in the field's diverse nature, in the sense that

  1. the deliverables and their forms are diverse: scripts/code, datasets, reports, slides, charts, dashboards, machine learning models, APIs, web apps, etc.
  2. the tools for processing and storing data are diverse: there are ancient Oracle databases that have been running for more than 20 years; there are open-source tools like Hadoop and Spark; and often multiple languages and their development environments, Python, SQL, R, Scala, and even MATLAB, are used in a single project.
  3. the skills of data team members are diverse: at one end of the engineering skill spectrum, some engineers contribute to Spark Core and can manage a Kubernetes cluster; at the other end, there are business analysts who are SQL and Excel masters with deep domain knowledge.
  4. the engineering bandwidth and team structure of data teams are diverse: there are "full-fledged" data teams that consist of software engineers, data platform engineers, data engineers, machine learning engineers, data scientists, and business analysts; and there are startups that cannot afford to hire a single platform engineer in San Francisco.

This diversity, entangled with external issues from engineering and business teams, such as broken upstream data, missing operational support, and ambiguous business requests, has made data science and engineering ever more complex, difficult, and challenging.

Advances in computer hardware, software, and services have alleviated some of these issues over the past decade. All cloud providers offer high-memory instances with open-source tools pre-installed, removing the burden of purchasing, configuring, and maintaining hardware. The adoption of Apache Spark significantly improved the productivity of data teams by providing one unified tool to read data, process it, and build machine learning models in memory. A San Francisco-based startup offers a service that builds and maintains SQL-ready tables.

In light of these changes, the bottleneck of data science and engineering is no longer the 24-TB ultrahigh-memory instance or the in-memory processing framework that has already improved processing speed by 10x. The bottleneck has gradually shifted away from hardware and system-level software toward the tools built for data scientists and data engineers:

  1. Operations are ignored. For example, when a data scientist builds a report in a notebook that must refresh at 6 am every day, s/he has to at least set up a cron job that runs at 6 am, reruns on failure, and sends the report via email (roughly the do-it-yourself route sketched after this list). Alternatively, s/he can learn a workflow tool such as Airflow, designed primarily for data engineers, whose own operation requires engineers. Or s/he can come to the office early every morning and run everything manually.
  2. Switching back and forth between multiple development environments, laptops, and servers slows down development. Data processing often requires multiple languages (SQL, Python, R, Scala, etc.), and the iterative nature of data science and engineering means constantly switching between development environments: IDEs, notebooks, SQL consoles, and visualization tools. Unexpected issues arise when moving prototype notebooks from a laptop to clusters for production.
  3. Skill gaps persist in practice. Ideally, a data scientist building a machine learning model without exceptionally high scalability requirements would deploy and monitor it her/himself to speed up the iteration cycle. In reality, however, data scientists whose expertise is in statistics or business find the learning cost of such engineering skills too high. The dilemma: if no engineering bandwidth is allocated, the model on the data scientist's laptop has little business impact, while extra engineering bandwidth comes at a high cost, especially in the San Francisco Bay Area.
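
To make the first pain point concrete, here is roughly what the do-it-yourself route looks like today: a small retry wrapper around papermill (a real library for executing notebooks programmatically) that a cron entry invokes every morning. The notebook name, retry policy, and file paths are invented for illustration.

```python
# rerun_report.py -- an illustrative sketch only: the notebook name and
# retry policy are made up for this example.
# A crontab entry would invoke it every morning at 6 am:
#   0 6 * * * /usr/bin/python3 /home/user/rerun_report.py
import papermill as pm  # pip install papermill

MAX_RETRIES = 3

for attempt in range(1, MAX_RETRIES + 1):
    try:
        # Execute the notebook and write the rendered result next to it.
        pm.execute_notebook("daily_report.ipynb", "daily_report.out.ipynb")
        break  # success: stop retrying
    except Exception as exc:
        print(f"Run {attempt} failed: {exc}")
else:
    raise SystemExit("Report failed after all retries")

# ...and emailing the rendered notebook is yet another script on top.
```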

Jupyter Notebook, Zeppelin Notebook, and other notebooks have been highly popular among data scientists and data engineers. However, these notebooks were built with a limited goal, providing an interactive computing environment, which solves only a fraction of the problems of modern data teams; in particular, the operation of notebooks is missing. The development and operation of data applications, including dashboards and machine learning models, remain slow, costly, and painful.

Imagine, with the help of an integrated notebook environment, a data scientist could deliver a dashboard without round trips between IDEs, SQL consoles, dashboard tools, cloud web consoles, and terminals; could set up a pipeline that refreshes a report built from multiple interdependent notebooks without learning yet another Python framework or library and struggling to get help from engineering; and could develop and deploy a machine learning model and iterate on it whenever s/he wants, without handing the model off to the engineering team.

In addition to an interactive computing environment, we aim to build a frictionless integrated notebook environment (INE) that expands on the goal of its predecessors:

  • Truly end-to-end (e.g. from spinning up clusters to delivering analytics to the business)
  • Combines development and operation of notebooks (e.g. running interdependent notebooks on a fixed schedule)
  • Deliverable-oriented: build dashboards and deploy machine learning models right from notebooks

To achieve this goal, we are building Bayesnote, which introduces:

  • Support for multiple languages (Python, SQL, R, Scala) in one notebook, with in-memory variable sharing across cells (a sketch follows the figure below)
  • Built-in Docker containers as the development and operation environment
  • A workflow component built around notebooks (think Airflow for notebooks)
  • A dashboard component and a machine learning component that integrate well with notebooks
Before & After Bayesnote. Created by Teng Peng
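
For a taste of what in-memory variable sharing across languages means, here is how one can approximate it in today's Jupyter with the ipython-sql and rpy2 extensions (both real libraries); our goal is to make this first-class rather than extension-dependent. The sales.db database and orders table are made up for the example.

```python
# Separate Jupyter cells, shown together. Assumes:
#   pip install ipython-sql rpy2 pandas
%load_ext sql
%load_ext rpy2.ipython
%sql sqlite:///sales.db

# SQL runs in one cell, and its result lands in a Python variable...
result = %sql SELECT user_id, revenue FROM orders
df = result.DataFrame()  # now an in-memory pandas DataFrame

# ...which an R cell can pull in directly (-i df), no disk round trip:
%%R -i df
summary(df$revenue)
```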

Bayesnote extends the goal of its predecessors in three main ways:

  1. Development only -> Development & Operation
  2. Notebook only -> Notebook + workflow tool + dashboard tool + machine learning tool + environment tool
  3. Single language -> Multiple languages

At its core, Bayesnote is an interactive computing environment similar to its predecessors, Jupyter Notebook, Zeppelin Notebook, etc. The unified notebook layer of Bayesnote is built on top of other notebooks as backends, so that (1) the other components of Bayesnote, e.g. the workflow component, interact with this unified notebook layer rather than with Jupyter Notebook, Zeppelin Notebook, etc., and (2) other data platform tools, e.g. Airflow, can also interact with the unified notebook layer, which is difficult, if not impossible, with the underlying notebooks themselves.
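
One way to picture this layering (the class and method names below are invented for illustration, not Bayesnote's actual API): a small backend interface that the workflow and dashboard components program against, with one adapter per underlying notebook system.

```python
# Hypothetical sketch of a "unified notebook layer"; NotebookBackend,
# JupyterBackend, and run_notebook are illustrative names, not real API.
from abc import ABC, abstractmethod

class NotebookBackend(ABC):
    """What the workflow and dashboard components see."""

    @abstractmethod
    def run_notebook(self, path: str, parameters: dict) -> str:
        """Execute a notebook and return the path of the rendered output."""

class JupyterBackend(NotebookBackend):
    """One possible adapter, delegating to papermill (a real library)."""

    def run_notebook(self, path: str, parameters: dict) -> str:
        import papermill as pm
        out = path.replace(".ipynb", ".out.ipynb")
        pm.execute_notebook(path, out, parameters=parameters)
        return out

# A scheduler -- or an Airflow task -- only ever touches NotebookBackend,
# so swapping Jupyter for Zeppelin is a one-adapter change.
```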

The workflow component of Bayesnote is where operations meet data science. It shares the philosophy of "workflow-definition-as-code", following best engineering practice, with workflow systems built for engineers. However, it also acknowledges that an intuitive user interface is one of the most important design considerations for tools built for data scientists. The workflow system is built around notebooks rather than functions, the scheduling unit of workflow systems designed for data engineering, e.g. Airflow, reducing the learning cost to almost zero.
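
As a hedged sketch of what "workflow-definition-as-code with the notebook as the scheduling unit" can look like: the Node class and the >> operator below are invented here to mirror Airflow's style, the notebook names are placeholders, and papermill stands in for the execution engine.

```python
# Illustrative only: three notebooks chained into a pipeline, in the
# spirit of Airflow's DAG-as-code but with notebooks, not functions.
import papermill as pm

class Node:
    def __init__(self, path: str):
        self.path, self.downstream = path, []

    def __rshift__(self, other: "Node") -> "Node":
        self.downstream.append(other)  # a >> b: run b after a
        return other

    def run(self) -> None:
        pm.execute_notebook(self.path, self.path.replace(".ipynb", ".out.ipynb"))
        for node in self.downstream:
            node.run()

extract, clean, report = Node("extract.ipynb"), Node("clean.ipynb"), Node("report.ipynb")
extract >> clean >> report  # the entire pipeline definition
extract.run()               # a scheduler would trigger this at 6 am
```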

Workflow example. Created by Teng Peng

Bayesnote offers dashboard and machine learning components that allow data scientists to build dashboards and iterate on machine learning models right from notebooks. A traditional dashboard tool only offers a SQL console, so data scientists are forced to switch back and forth between the SQL console and other data processing tools for scripting languages, reading and writing data to disk to share it between tools. This is unproductive and tedious, and it slows down the iteration cycle of dashboard building. Data scientists face a similar situation when iterating on machine learning models.

With Bayesnote, data scientists and data engineers could, at best, become full-stack engineers at little learning cost; at the very least, their work would be 10x less painful and 10x more productive.

0.1-alpha has been released. Check out the screenshots, and star our repo to support us: https://github.com/Bayesnote/Bayesnote

