Creating a Data Science Python Package Using Jupyter Notebook

Learn how object-oriented programming can help you build your first Python package using Jupyter notebook.

Abid Ali Awan
Towards Data Science

Image by Author | Elements by katemangostar

Introduction

Have you wondered how Python packages like Scikit-learn, pandas, and NumPy are built? They all rely on object-oriented programming (OOP) to create complex yet easy-to-use packages. For a data scientist, learning OOP is a necessity for developing production-ready products.

We are going to use a cloud Jupyter notebook to simplify setting up the environment so we can focus entirely on creating the package. The project covers OOP fundamentals such as inheritance, objects, classes, and magic functions. It is heavily influenced by the AWS Machine Learning Foundations course, and it took me ten minutes to recreate the package once I knew how to build it.

Distributions

Parent Class

Let’s dive into the code and discuss our parent class, Distribution, which will be used by both the Gaussian and Binomial classes. We will use the Jupyter notebook magic function %%writefile to create Python files.

%%writefile distributions/general.py

The code above creates a Python file in the distributions folder. To keep things simple, you need to create a test folder that contains your test files, a distributions folder that contains all your package’s files, and a data folder that contains a .txt data file.
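The three folders can also be created from a notebook cell before running %%writefile. The snippet below is a minimal sketch: the helper name make_layout is hypothetical, and the folder names are the ones described above.

```python
from pathlib import Path


def make_layout(root="."):
    """Create the three project folders if they do not exist yet."""
    root = Path(root)
    created = []
    # Folder names as described in the text: package code, tests, data.
    for name in ("distributions", "test", "data"):
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling make_layout() with no argument creates the folders in the current working directory.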

The Distribution class takes two arguments, mean and standard deviation. It also contains the read_data_file() function, which is used to read the data file.

The __init__ function initializes the variables.
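A parent class along these lines might look like the sketch below. The default values, the attribute names, and the one-number-per-line file format are assumptions made for illustration, not necessarily what the original file contains.

```python
class Distribution:
    """Generic parent class for probability distributions (a sketch)."""

    def __init__(self, mu=0, sigma=1):
        self.mean = mu      # mean of the distribution
        self.stdev = sigma  # standard deviation of the distribution
        self.data = []      # filled in by read_data_file()

    def read_data_file(self, file_name):
        """Read one number per line from a text file into self.data."""
        with open(file_name) as file:
            # Assumption: one numeric value per line; blank lines are skipped.
            self.data = [float(line) for line in file if line.strip()]
        return self.data
```

Keeping the file reading in the parent class means the Gaussian and Binomial classes get it for free through inheritance.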

Testing Distribution Class

Everything in this class is working smoothly. We have set the mean and standard deviation and loaded the random.txt file to test our Distribution class.

Gaussian Distribution

Gaussian distributions are important in statistics and are often used in the social sciences to represent real-valued random variables whose distributions are unknown. (Wikipedia)

Figure 1 | Wikipedia

Mean

The mean of a list of numbers is the sum of all the numbers divided by the number of samples. Mean — Wikipedia

Standard Deviation

This is a measure of variation within the data. The BMJ
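In symbols, for N data values x_1, ..., x_N, the mean and the (population) standard deviation described above can be written as follows. This is the standard textbook form; whether the package divides by N or by N - 1 (the sample version) is not stated here, so treat the choice of denominator as an assumption.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2}
```

For the sample standard deviation, the factor 1/N under the square root is replaced by 1/(N - 1).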

Probability density function

The parameter mu is the mean, and the parameter sigma is the standard deviation. The x is a value from the list of data.
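Written out with those symbols, the standard Gaussian probability density function is:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right)
```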

Gaussian Class

We are going to inherit values and functions from the parent class Distribution and use Python’s magic functions.

  1. Initialize the parent class Distribution
  2. Create plot_histogram_pdf function → plot the normalized histogram of the data and visualize the probability density function.
  3. Create magic function __add__ → add together two Gaussian distribution objects.
  4. Create magic function __repr__ → output the characteristics of the Gaussian instance.
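The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the original file: the parent class is reduced to a small stand-in so the block runs on its own, the method name pdf is assumed, and the plotting step is left out because it depends on matplotlib. The __repr__ format follows the output printed at the end of this article.

```python
import math


class Distribution:
    """Minimal stand-in for the parent class so this sketch runs on its own."""

    def __init__(self, mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma
        self.data = []


class Gaussian(Distribution):
    def __init__(self, mu=0, sigma=1):
        # Step 1: let the parent class store mean, stdev and data.
        super().__init__(mu, sigma)

    def pdf(self, x):
        """Gaussian probability density at x for the current mean and stdev."""
        coefficient = 1.0 / (self.stdev * math.sqrt(2 * math.pi))
        exponent = -0.5 * ((x - self.mean) / self.stdev) ** 2
        return coefficient * math.exp(exponent)

    def __add__(self, other):
        # Step 3: for independent Gaussians, means add and variances add.
        return Gaussian(self.mean + other.mean,
                        math.sqrt(self.stdev ** 2 + other.stdev ** 2))

    def __repr__(self):
        # Step 4: human-readable summary of the instance.
        return "mean {}, standard deviation {}".format(self.mean, self.stdev)


total = Gaussian(25, 3) + Gaussian(30, 4)
print(total)  # mean 55, standard deviation 5.0
```

Because __add__ returns a new Gaussian, the + operator can be chained, and print() picks up __repr__ automatically.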

Experimenting

Testing the __repr__ magic function.

Initializing the gaussian1 object with a mean of 25 and a standard deviation of 2, and then reading the random.txt file from the data folder.

Calculating the probability density function with a mean of 25 and a standard deviation of 2. Then, calculating the mean and standard deviation from random.txt. The new mean is 125.1 and the new standard deviation is 210.77, which changes our probability density function value from 0.19947 to 0.00169.
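The two values can be checked by hand with the Gaussian formula. The sketch below assumes the density was evaluated at x = 25 in both cases; the text does not state the point explicitly, but this choice reproduces both quoted numbers. The function name gaussian_pdf is hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


before = gaussian_pdf(25, 25, 2)         # parameters set by hand
after = gaussian_pdf(25, 125.1, 210.77)  # parameters estimated from random.txt
print(round(before, 5), round(after, 5))  # 0.19947 0.00169
```

The density at the same point drops sharply because the data-driven distribution is far wider and centered well away from 25.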

Plotting the histogram and a line plot of the probability density function.

The unittest

unittest is a testing framework that was originally inspired by JUnit and has a similar flavor to the major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework; see the documentation.

Creating Gaussian Class Test File

Test-First is a great tool. It creates better understanding and productivity in the team. The result is high-quality code — both in terms of early success in finding bugs and implementing features correctly. — Gil Zilberfeld

We are going to use the unittest library to test all our functions so that if we make any changes in the future, we can detect errors within a few seconds.

We create TestGaussianClass, which contains all the functions needed to test the functions in the Gaussian class. We use the assertEqual method to check the validity of the functions.

I have tested these values myself and then added them individually to test every possibility.
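A test file in this style might look like the sketch below. The Gaussian class here is a small stand-in with an assumed interface so the block runs without the package installed, and the test values are illustrative rather than the ones from the original test file. The run_tests helper is hypothetical; it runs the tests in-process, which is convenient inside a notebook cell.

```python
import math
import unittest


class Gaussian:
    """Small stand-in for the package's Gaussian class (assumed interface)."""

    def __init__(self, mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma

    def pdf(self, x):
        z = (x - self.mean) / self.stdev
        return math.exp(-0.5 * z * z) / (self.stdev * math.sqrt(2 * math.pi))

    def __add__(self, other):
        return Gaussian(self.mean + other.mean,
                        math.sqrt(self.stdev ** 2 + other.stdev ** 2))


class TestGaussianClass(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, so each test gets a fresh object.
        self.gaussian = Gaussian(25, 2)

    def test_initialization(self):
        self.assertEqual(self.gaussian.mean, 25)
        self.assertEqual(self.gaussian.stdev, 2)

    def test_pdf(self):
        self.assertEqual(round(self.gaussian.pdf(25), 5), 0.19947)

    def test_add(self):
        total = Gaussian(25, 3) + Gaussian(30, 4)
        self.assertEqual(total.mean, 55)
        self.assertEqual(total.stdev, 5)


def run_tests():
    """Load and run the test case without exiting the interpreter."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGaussianClass)
    return unittest.TextTestRunner(verbosity=1).run(suite)


result = run_tests()
```

Rounding the pdf value before comparing keeps assertEqual usable with floating-point results.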

Running Test

Let’s run our test file from the test folder using !python.

As you can see, all tests have passed. In the beginning I got multiple errors, and debugging those issues helped me better understand how the Gaussian class works at each level.

Binomial Distribution

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes or no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). Binomial distribution — Wikipedia

Binomial Distribution figure | Wikipedia

Mean

Variance

Standard Deviation

Probability Density Function
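For reference, the four quantities named in the headings above have the following standard forms for a binomial distribution with n trials and success probability p, where k is the number of successes. Strictly speaking the last one is a probability mass function, since the distribution is discrete.

```latex
\text{mean} = n p
\qquad
\text{variance} = n p (1 - p)
\qquad
\text{standard deviation} = \sqrt{n p (1 - p)}

P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}
```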

Binomial Class

We will use the mathematical functions mentioned above to create the mean, standard deviation, and probability density functions. We did the challenging work in the previous class, so now we will use a similar pattern to code the Binomial class.

  1. Initialize the probability and size variable → p,n
  2. Initialize the parent class Distribution → calculating mean and Stdev and adding it to the parent class.
  3. Create replace_stats_with_data function → calculate probability and size from the imported data, then update the mean and standard deviation.
  4. Create plot_bar function → display bar chart using the matplotlib library.
  5. Create pdf function → calculate the probability density function for a given number of successes using p and n.
  6. Create plot_bar_pdf function → plot the pdf of the Binomial distribution.
  7. Create magic function __add__ → add together two Binomial distribution objects.
  8. Create magic function __repr__ → output the characteristics of the Binomial instance.
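The list above can be sketched roughly as follows. As before, this is an illustration under assumptions rather than the original file: the parent class is a small stand-in, the default arguments are invented, the data is assumed to hold one 0/1 outcome per trial, and the two plotting functions are left out because they depend on matplotlib. The __repr__ format follows the output printed at the end of this article.

```python
import math


class Distribution:
    """Minimal stand-in for the parent class so this sketch runs on its own."""

    def __init__(self, mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma
        self.data = []


class Binomial(Distribution):
    def __init__(self, prob=0.5, size=20):
        # Steps 1-2: store p and n, then hand mean and stdev to the parent.
        self.p = prob
        self.n = size
        super().__init__(self.n * self.p,
                         math.sqrt(self.n * self.p * (1 - self.p)))

    def replace_stats_with_data(self):
        # Step 3. Assumption: self.data holds one 0/1 outcome per trial.
        self.n = len(self.data)
        self.p = sum(self.data) / len(self.data)
        self.mean = self.n * self.p
        self.stdev = math.sqrt(self.n * self.p * (1 - self.p))
        return self.p, self.n

    def pdf(self, k):
        # Step 5: probability of exactly k successes in n trials.
        return math.comb(self.n, k) * self.p ** k * (1 - self.p) ** (self.n - k)

    def __add__(self, other):
        # Step 7: the sum is binomial only when both share the same p.
        if self.p != other.p:
            raise ValueError("p values are not equal")
        return Binomial(self.p, self.n + other.n)

    def __repr__(self):
        # Step 8: human-readable summary of the instance.
        return "mean {}, standard deviation {}, p {}, n {}".format(
            self.mean, self.stdev, self.p, self.n)


print(Binomial(0.4, 50))
```

Note that math.comb requires Python 3.8 or newer; on older versions the binomial coefficient can be built from math.factorial instead.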

Experimenting

Testing __repr__ magic function

Testing Binomial object and read_data_file function.

Testing the pdf with the initial values p = 0.4 and n = 20. We then use replace_stats_with_data to calculate p and n from the data and recalculate the pdf.

Testing bar plot

Testing Probability Density Function bar plot.

Binomial Class Test Function

We are going to use the unittest library to test all our functions so that if we make any changes in the future, we can detect errors within a few seconds.

Creating TestBinomialClass which has all the functions to test the Binomial class.

Running Test

Running test_binomial.py shows that no errors were found during testing.

Creating the __init__.py File

We need to create an __init__.py file in the distributions folder to import the classes from the package’s Python files. This lets us import specific classes directly from the package.

We have imported both the Binomial and Gaussian classes.
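To see what __init__.py buys us, the sketch below builds a throwaway package in a temporary folder and imports a class straight from the package name. Every file and module name in it (distributions_demo, gaussian_demo.py) is made up for the demonstration; the real package's module names may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; every file and module name is illustrative.
root = Path(tempfile.mkdtemp())
package = root / "distributions_demo"
package.mkdir()

# A module holding one class, standing in for the real Gaussian module.
(package / "gaussian_demo.py").write_text(
    "class Gaussian:\n"
    "    def __init__(self, mu=0, sigma=1):\n"
    "        self.mean, self.stdev = mu, sigma\n"
)

# __init__.py re-exports the class at package level.
(package / "__init__.py").write_text("from .gaussian_demo import Gaussian\n")

# Make the temporary folder importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Thanks to __init__.py, the class is importable from the package itself.
from distributions_demo import Gaussian

print(Gaussian(20, 6).mean)  # 20
```

Without the line in __init__.py, the import would have to name the module as well, for example from distributions_demo.gaussian_demo import Gaussian.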

Creating the setup.py File

The setuptools library is required for building a Python package. The setup function requires package information such as the version, description, author name, and email.
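In a notebook, this file would typically be created with %%writefile setup.py. The sketch below writes comparable content from plain Python so it can run anywhere. The package name and version match the pip output later in this article; the description, author name, and email are placeholders, and the helper write_setup_py is hypothetical.

```python
from pathlib import Path

# Contents of a minimal setup.py; placeholder values are marked as such.
SETUP_PY = '''\
from setuptools import setup

setup(
    name="distributions",
    version="0.2",
    description="Gaussian and Binomial distributions",  # placeholder text
    packages=["distributions"],
    author="Your Name",              # placeholder
    author_email="you@example.com",  # placeholder
    zip_safe=False,
)
'''


def write_setup_py(project_root="."):
    """Write setup.py into the project root and return its path."""
    path = Path(project_root) / "setup.py"
    path.write_text(SETUP_PY)
    return path
```

The packages list tells setuptools which folder to ship, so it must match the name of the folder that holds __init__.py.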

Directory

The image below shows the package directory containing all the required files.

Installing distributions Package

(venv) ~/work # pip install -U .

Use pip install . or pip install -U . to install the Python package, which we can then use in any project. As we can see, our distributions package is successfully installed.

Processing /work
Building wheels for collected packages: distributions
Building wheel for distributions (setup.py) ... done
Created wheel for distributions: filename=distributions-0.2-py3-none-any.whl size=4800 sha256=39bc76cbf407b2870caea42b684b05efc15641c0583f195f36a315b3bc4476da
Stored in directory: /tmp/pip-ephem-wheel-cache-ef8q6wh9/wheels/95/55/fb/4ee852231f420991169c6c5d3eb5b02c36aea6b6f444965b4b
Successfully built distributions
Installing collected packages: distributions
Attempting uninstall: distributions
Found existing installation: distributions 0.2
Uninstalling distributions-0.2:
Successfully uninstalled distributions-0.2
Successfully installed distributions-0.2

Testing our Package

We will run the Python interpreter in a Linux terminal and then test both classes.

Well done, you have created your first Python package.

>>> from distributions import Gaussian
>>> from distributions import Binomial
>>>
>>> print(Gaussian(20,6))
mean 20, standard deviation 6
>>> print(Binomial(0.4,50))
mean 20.0, standard deviation 3.4641016151377544, p 0.4, n 50
>>>

If you are still facing problems, check out my GitHub repo or Deepnote project.

You can follow me on LinkedIn and Polywork where I publish articles every week.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Originally published at https://www.analyticsvidhya.com on July 30, 2021.
