
Cogram.ai: A Coding Assistant for Data Science and Machine Learning

Codex-powered autocompletions for data science and machine learning that run in Jupyter notebooks

Photo by Markus Winkler on Unsplash

Since the publication and dissemination of GPT-3, coding assistants like GitHub Copilot, powered by OpenAI's Codex API, have been on the radar of the machine learning community. Recently, I came across a tool called Cogram, which feels like an evolution of autocompletion specialized for data science and machine learning, running directly in Jupyter notebooks.

In this article, I will show you how this tool works and share a bit of my experience with it so far, generating machine learning code in Jupyter notebooks.


Getting Started with Cogram

First things first: to get set up with Cogram, head over to their website, sign up for a free account, and get an API token. After that, all you have to do is install Cogram with:

pip install -U jupyter-cogram

Enable it as a Jupyter notebook extension:

jupyter nbextension enable jupyter-cogram/main

Finally, you set up your API token with:

python -m jupyter_cogram --token YOUR_API_TOKEN

Now that you are all set up, you can start getting completions directly in your Jupyter notebook.

With the most recent version, Cogram is enabled by default. You can turn Cogram on and off in the menu via this icon,

Image by author

and you can also customize Cogram to choose how many suggestions it shows and how creative it is.

There are two ways to use Cogram: you can use the autosuggest mode, where Cogram makes suggestions whenever you stop typing,

Autosuggest example from Cogram; Image by cogram.ai (obtained with permission)

or when you go to a new line. You can also use the manual completion option, triggered with the Tab key.

Manual completion option triggered with the Tab key; Image by cogram.ai (obtained with permission)

The user can switch between these options via the Autosuggest tick-box in the Cogram menu.


Autocompletions in Jupyter Notebooks

I generated a set of completions with Cogram to test its usability; here are some of my favorites.

From writing this:

# plot sin(x) from 0 to pi

It generated this:

# plot sin(x) from 0 to pi

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, np.pi, 100)
y = np.sin(x)

plt.plot(x, y)
plt.show()
Image by author

I was happy that the first example I tried actually generated code that ran without any editing or debugging! Now, another simple example, from this:

# plot a histogram of points from a poisson distribution

It generated this:

# plot a histogram of points from a poisson distribution

import numpy as np
import matplotlib.pyplot as plt

x = np.random.poisson(5, 1000)

plt.hist(x)
plt.show()
Image by author

Again, it ran without issue! Very cool so far. Let’s step it up a notch and create a fake dataset to fit a simple linear regression model:

# create a fake dataset and run a simple linear regression model

The output:

# create a fake dataset and run a simple linear regression model

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100)
y = 2 * x + np.random.randn(100)

plt.scatter(x, y)
plt.show()
Image by author

Ok, almost. The code actually ran, and it makes sense in the context of what I was trying to achieve, but the model did not generate the code for the actual linear regression. Let's see if we can change that by adding a bit more description to our initial comment:

# write a linear regression model with sklearn

The output:

# write a linear regression model with sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.random.randn(100)
y = 2 * x + np.random.randn(100)

model = LinearRegression()
model.fit(x.reshape(-1, 1), y.reshape(-1, 1))

plt.scatter(x, y)
plt.plot(x, model.predict(x.reshape(-1, 1)))
plt.show()
Image by author

Freaking cool! It actually created everything, including the imports! How about plotting from a dataframe?

I wrote:

# load a csv file and plot the numerical data using the date column as the x axis

The output:

# load data.csv and plot the "data" column with the "date" column as index
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data.csv", index_col="date", parse_dates=True)
data.plot()
plt.show()
Image by author

It did not plot the "data" column specifically, but it did set the "date" column as the index, which is really cool.
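If the file really does contain a numeric column named "data" (hypothetical here, since I never specified the file's contents), selecting it before plotting would be a one-line tweak:

# plot only the "data" column against the date index
data["data"].plot()
plt.show()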

Now, let's try a slightly more advanced example: writing a simple perceptron from scratch with the NumPy package.

# write a perceptron from scratch in numpy

The output:

# write a perceptron from scratch in numpy

import numpy as np

class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    errors_ : list
        Number of misclassifications in every epoch.

    """
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
         """
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
                self.errors_.append(errors)
                return self

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(X >= 0.0, 1, -1)    

X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
y = np.array([0, 1, 1, 0])
model = Perceptron(eta=0.1, n_iter=10)
model.fit(X,y)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)

<ipython-input-31-38e27395dd1c> in <module>
     60 y = np.array([0, 1, 1, 0])
     61 model = Perceptron(eta=0.1, n_iter=10)
---> 62 model.fit(X,y)

<ipython-input-31-38e27395dd1c> in fit(self, X, y)
     48                 update = self.eta * (target - self.predict(xi))
     49                 self.w_[1:] += update * xi
---> 50                 self.w_[0] += update
     51                 errors += int(update != 0.0)
     52                 self.errors_.append(errors)

ValueError: setting an array element with a sequence.

Now, there is a lot to unpack here. Although the code came with a few bugs and did not run right out of the box, it produced extremely compelling code that, after a few edits, would be ready to run.
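For the curious, here is a minimal sketch of the edits I have in mind (this is my own fix, not Cogram's output): compute the net input before applying the unit step in predict, record errors_ once per epoch, return self only after all epochs, and use -1/1 labels so they match what predict returns.

import numpy as np

class Perceptron(object):
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def net_input(self, X):
        # weighted sum of the inputs plus the bias term stored in w_[0]
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        # unit step applied to the net input, not to the raw features
        return np.where(self.net_input(X) >= 0.0, 1, -1)

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            # record misclassifications once per epoch
            self.errors_.append(errors)
        # return only after all epochs have run
        return self

X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
y = np.array([-1, 1, 1, -1])  # labels in {-1, 1} to match predict's output
model = Perceptron(eta=0.1, n_iter=10)
model.fit(X, y)

With those changes, the generated snippet fits the toy dataset without raising the ValueError shown above.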

One of the coolest things I noticed is that the model also writes the docstrings for the functions, which is impressive given the contextual complexity of writing documentation. Besides that, Cogram is also context-aware (like GitHub Copilot in VS Code), so if you write a function, variable, or class, it remembers it.


Concluding thoughts on Coding Assistants for Data Science and Machine Learning

The point I would like to make is this: ever since the discussion around Software 2.0 started (probably even before that), and with the advancement of extremely powerful language models like GPT-3, which has now evolved into the Codex engine, this style of writing software has become more and more ubiquitous, and for good reason. What we ultimately care about is writing solutions to problems, not writing every line of code ourselves.

That does not mean we should blindly trust language models and go wild with autocompletions, but it seems clear to me that a smart, well-thought-out symbiosis between human and machine is emerging in code writing across platforms and programming languages, and it may be worth reflecting on how you can integrate that into your own workflow.


If you liked this post, join Medium, follow me, and subscribe to my newsletter. Also, connect with me on Twitter, LinkedIn, and Instagram! Thanks and see you next time! 🙂


Disclaimer: This article was not sponsored, nor did I receive any compensation for writing it.

