
Working With Object Storage And Jupyter Notebooks

Understanding the concepts of object storage and the Jupyter Notebook, and exploring LakeFS as a platform for integrating them.

Object storage and Jupyter Notebooks are both on the rise, driven by the continuously growing demand for big data. Object storage holds the ever-larger volumes of data being produced, while Jupyter Notebooks provide the environment to compute over and analyze those datasets.

In this article, we will cover the concepts of object storage and its benefits. Then, we will look into Jupyter Notebooks and the scope they provide in modern-day computing. Finally, we will learn about LakeFS, one of the best platforms for integrating object storage with Jupyter Notebooks.

Understanding Object Storage:


Data is a valuable resource, and vast amounts of it (usually unstructured) are available on the internet. The best way to store these data elements in a highly scalable manner is object storage. It offers high durability: the stored information is kept secure, and data is not lost if a particular disk fails.

Object storage manages data as objects. This differs from other storage architectures such as file systems and block storage: the former organizes data as files in a directory hierarchy, while the latter stores data as blocks within sectors and tracks.
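
To make the object model concrete, here is a minimal sketch using boto3, the AWS SDK for Python (a library not covered elsewhere in this article; the bucket name, key, and data are hypothetical placeholders). Each object is addressed by a flat key and bundles its data with metadata, rather than living in a directory tree or in fixed-size blocks:

import boto3

s3 = boto3.client('s3')

# Store an object: a flat key identifies the data plus its metadata.
s3.put_object(Bucket='example-bucket',
              Key='datasets/customers.csv',
              Body=b'id,name\n1,Alice\n',
              Metadata={'source': 'crm-export'})

# Retrieve it by the same key; there is no directory hierarchy to traverse.
response = s3.get_object(Bucket='example-bucket', Key='datasets/customers.csv')
print(response['Body'].read())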

Thus, it is secure, flexible, and scalable, and it is one of the best options for backup and data-recovery systems. Object storage lets its users keep large amounts of data, which is especially useful in artificial intelligence, analytics, and cloud computing. When combined with Jupyter Notebooks, it can be quite valuable.

Large companies and tech giants rely on object storage to retain huge amounts of data, each with its own use case: photos and images on Facebook, the movie catalog on Netflix, the song library on Spotify. Dropbox, an online collaboration service, likewise uses object storage to store its users' files.

Working With Jupyter Notebooks:

Photo by Ashley West Edwards on Unsplash

Project Jupyter is a powerful open-source environment for programmers: it is highly interactive and supports a wide range of programming languages, the main ones being Python, R, and Julia.

One of the most essential pieces of Project Jupyter is the Jupyter Notebook. Notebooks allow you to develop programming projects, analyze data, build machine learning and deep learning models, and much more.

Jupyter Book lets users build books, content, and documents from computational material. The best part about Jupyter Notebooks is the versatility they offer. You can mix code blocks with Markdown to explain your code and make it more presentable, and you can also add mathematical equations as well as reStructuredText. Finally, you can export these notebooks in various output formats, including PDF files, HTML web pages, Python scripts, and notebook files.
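
For example, assuming a notebook file named analysis.ipynb, the nbconvert tool that ships with Jupyter performs these exports from the command line (the PDF target additionally requires a LaTeX installation):

jupyter nbconvert --to html analysis.ipynb
jupyter nbconvert --to pdf analysis.ipynb
jupyter nbconvert --to script analysis.ipynb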

If you don’t want to install Jupyter Notebook on your system, you can use the free hosted notebooks available in Google Colaboratory. Colaboratory (also known as Colab) lets users develop Python, data science, and visualization projects in free Jupyter Notebooks that run in the cloud, with free resources such as GPUs and TPUs. You can save these notebooks to your Google Drive.

Integrating Jupyter and Object Storage with LakeFS:

Photo by Claudia Chiavazza on Unsplash

Object storage combined with Jupyter Notebooks produces excellent results: connecting the two makes both more productive.

Since object storage manages data as objects, it can hold large amounts of information, which can then be employed in Jupyter Notebooks to create constructive, high-quality projects. But what is the best platform for integrating Jupyter Notebooks with object storage?

One of the best platforms for combining the qualities of object storage with the Jupyter Notebook is LakeFS.

LakeFS is an open-source platform for managing your data elements effectively and efficiently. It builds on object storage and grants you access to various resources for performing complex tasks, including data science and analytics.


LakeFS aims to solve a problem that GitHub cannot: working with data at a scale far beyond what a website like GitHub would allow you to store. Two cloud platforms are mainly supported for object storage, namely Amazon Web Services (AWS) and Google Cloud Platform, and these in turn integrate with Spark, MLflow, Jupyter Notebooks, and more.

With the help of this architecture, you can use the properties of object storage alongside your Jupyter Notebooks. The benefits include storing larger datasets and documents for computational problems in big data, machine learning, deep learning, computer vision, natural language processing, and much more.
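
As a hedged sketch of what this looks like in practice: LakeFS exposes an S3-compatible endpoint, so a dataset versioned in a repository can be read straight into a notebook with pandas and s3fs (both assumed installed; the repository, branch, file path, endpoint, and credentials below are placeholders):

import pandas as pd

# Read a CSV from a lakeFS repository ('example-repo') and branch ('main')
# through the S3-compatible gateway, assumed here to run on localhost:8000.
df = pd.read_csv(
    's3://example-repo/main/datasets/train.csv',
    storage_options={
        'key': 'AKIAIOSFODNN7EXAMPLE',
        'secret': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        'client_kwargs': {'endpoint_url': 'http://localhost:8000'},
    },
)
print(df.head())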

Since Jupyter Notebooks support Python, we will look at some Python code for LakeFS. To get started, you can install the necessary dependency by typing the following line in your command prompt:

pip install bravado==10.6.2

Once you have installed the required dependency, you can start working on a variety of tasks. With the help of the Bravado package, you can generate a dynamic client at runtime: the lakeFS server exposes its OpenAPI definition as JSON at a dedicated URL, and Bravado builds the client from that definition. If you run the platform locally on your system, you can find it at http://localhost:8000/swagger.json.

Let us look at a couple of task-specific code blocks to understand their use. First, let us generate a Python client with Bravado. Besides the Bravado package installed above, the only other requirement is the URL of the OpenAPI definition. With these two requirements met, you can use the following code block to generate a client:

from bravado.requests_client import RequestsClient
from bravado.client import SwaggerClient

# Authenticate against the lakeFS server with an access-key pair.
http_client = RequestsClient()
http_client.set_basic_auth('localhost:8000',
                           'AKIAIOSFODNN7EXAMPLE',
                           'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY')

# Build the client dynamically from the server's OpenAPI definition.
client = SwaggerClient.from_url('http://localhost:8000/swagger.json',
                                http_client=http_client,
                                config={"validate_swagger_spec": False})

Once you have the client object ready, you can use it to interact with the API. As a simple example, the following code block creates a repository:

client.repositories.createRepository(repository={
    'id': 'example-repo',
    'storage_namespace': 's3://storage-bucket/repos/example-repo',
    'default_branch': 'main'
}).result()
# output:
# repository(creation_date=1599560048, default_branch='main', id='example-repo', storage_namespace='s3://storage-bucket/repos/example-repo')
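
Once the repository exists, the same dynamic client exposes every other operation described in the server's OpenAPI definition. As a brief hedged sketch, the operation and field names below follow the definition bundled with the lakeFS version used here and may differ in other releases:

# List the repositories visible to this client.
print(client.repositories.listRepositories().result())

# Create a branch off 'main' to experiment on in isolation.
client.branches.createBranch(repository='example-repo',
                             branch={'name': 'experiment',
                                     'source': 'main'}).result()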

We have covered a couple of generic code examples in this article. For a more detailed insight into using the Python API with Jupyter Notebooks in LakeFS, I would highly recommend checking out the official LakeFS documentation.

Conclusion:

In this article, we covered the concepts of object storage and the Jupyter Notebook. We then looked at one of the best ways to integrate the two for better functioning and faster computation with the help of LakeFS.

I would highly recommend the LakeFS documentation for understanding more about the theoretical and technical aspects of the platform. It is a great starting point for exploring the objectives that can be accomplished with LakeFS.

If you have any queries related to the various points stated in this article, then feel free to let me know in the comments below. I will try to get back to you with a response as soon as possible.

Check out some of my other articles that you might enjoy reading!

How To Read And Understand Python Code Faster

5 Reasons Why You Should Code Daily As A Data Scientist

Answering 10 Most Commonly Asked Questions About Artificial Intelligence

10 Best Free Websites To Learn More About Data Science And Machine Learning!

Thank you all for sticking with me till the end. I hope you enjoyed reading this article, and I wish you all a wonderful day ahead!
