What are the 10 most popular standard libraries in Python?

Exploring popular Python standard libraries based on a sample dataset of GitHub repositories.

Arghavan Moradi
Towards Data Science


Photo by Jessica Ruscello on Unsplash

Nowadays, Python is one of the most popular programming languages in AI and machine learning. It is known for its useful libraries and packages, which make programming possible even for people without a software engineering background. Python also ships with a set of standard libraries that are distributed with the language itself, such as datetime, math, and random. In this article, we aim to find the 10 most used standard libraries in Python repositories on GitHub. To fulfill this goal, we study several Python repositories on GitHub and collect the libraries used in their commits. First, we collect the last year of commits from 5 large and well-known Python repositories on GitHub. Then, we parse the Python source files in these repositories and extract the libraries used in their commits. Finally, we visualize the 10 most popular Python standard libraries used in the commits of these GitHub repos.

The structure of this article is as follows:

1. How to collect data?

2. How to parse Python source code?

3. What are the 10 most popular standard libraries based on GitHub commits in Python repositories?

4. Conclusion

5. References

1. How to collect data?

There are different ways to access data in GitHub repositories, such as GHTorrent, the GitHub API, or Google BigQuery. But in this article, we want to try a new and very useful Python package called Pydriller to collect our required data. Pydriller is fast and easy to use, and I got familiar with this interesting package during my Ph.D. studies. You can check the documentation for Pydriller here. To start with Pydriller, first, we install the package:

pip install pydriller

In GitHub, each commit can change one or more source files. In version control systems such as Git, each commit has an associated “diff” that stores the changes the commit applies to the source files. One way to find the libraries used in the commits of GitHub repositories is to search the “diff” files with a regular expression. But again, we want to try something different in this article: we compare the two versions of a source file, “before” and “after” a commit is applied, and collect the differences in library names between them. With this method, we can measure how frequently libraries are used across different commits. The good news is that Pydriller gives us access to both the version of a source file before a commit is applied and the version after it. Here is the required code to collect the data:
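What follows is a minimal sketch of that script. It assumes the Pydriller 1.x RepositoryMining API; the repository URLs, the exact date range, and the column names Commit_hash, Commit_before, and Commit_after are assumptions chosen to match the snippets later in this article.

from datetime import datetime

import pandas as pd
from pydriller import RepositoryMining

# Repositories to mine; the URLs are assumptions based on the project names below.
repos = [
    'https://github.com/django/django',
    'https://github.com/pandas-dev/pandas',
    'https://github.com/numpy/numpy',
    'https://github.com/home-assistant/core',
    'https://github.com/donnemartin/system-design-primer',
]

rows = []
for repo in repos:
    # `since` and `to` bound the commit dates; these values are placeholders.
    for commit in RepositoryMining(
            repo,
            since=datetime(2019, 10, 1),
            to=datetime(2020, 10, 1)).traverse_commits():
        for mod in commit.modifications:
            if mod.filename.endswith('.py'):  # keep only Python source files
                rows.append({'Commit_hash': commit.hash,
                             'Commit_before': mod.source_code_before,
                             'Commit_after': mod.source_code})

tf_source = pd.DataFrame(rows)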

Using Pydriller to collect commits from 5 famous Python repositories on GitHub

We collect the last year of commits from 5 big Python projects on GitHub: Django, pandas, NumPy, Home Assistant, and system-design-primer. RepositoryMining is one of the main API calls in Pydriller. We can define a time period for collecting commits from different repositories with two arguments of RepositoryMining: since and to. Also, we only consider commits on source files whose names end with “.py”, since these repositories contain source files in other programming languages as well, but we focus on Python libraries. We collect three features: commit.hash, source_code_before, and source_code. In Pydriller, commit.hash returns the commit id, source_code_before is the version of a source file before the commit is applied, and source_code shows the content of the source file after the commit is submitted. Here is the header of our collected data:

Image by Author: tf_source.head()

So far, we have collected the data we need to start the journey. In the next section, we learn how to extract the libraries used in these source files.

2. How to parse Python source code?

One way to extract information from source code is to convert it into an Abstract Syntax Tree (AST). Then, we can walk through the tree and collect the target nodes. An important point, though, is that we only want to collect the Python standard libraries, not all the packages used in the repositories, such as locally defined libraries that only have meaning inside a repository. Python standard libraries are the ones distributed with the Python language. Thus, to separate the standard packages from the others, we need a pool of all valid standard libraries in Python. Then, we can write a function to collect the library names in the source code. We can divide this section into two steps:

2.1. Collect a list of all available standard libraries in Python

2.2. Build a function to collect library names based on the AST

2.1. Collect a list of all available standard libraries in Python

On the Python website, there is a list of all the standard libraries in Python, each with a short description. This page sorts the Python standard libraries alphabetically and helps us build a pool of all standard libraries in Python. I put a list of all Python standard libraries here in .csv format.
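For reference, loading that list into the api_name list used in the rest of the article might look like the following; the file name and column name here are hypothetical, so adjust them to your copy of the .csv:

import pandas as pd

# Hypothetical file and column names; match them to your saved .csv
api_name = pd.read_csv('python_standard_libraries.csv')['library'].tolist()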

2.2. Build a function to collect library names based on the AST

Now that we have a list of all the standard Python libraries, we need to collect the library names used in our sample dataset of Python GitHub repositories. As mentioned, one way to do this is to walk through the AST. In this article, our target nodes are “Import” and “ImportFrom”. We want a function that walks through the parse tree, finds the target nodes, and returns the library names. Below is a class that does so.
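Here is a minimal sketch of such a class. It assumes, following the snippets in section 3, that matches are appended to a module-level file_contents list and checked against the api_name list built in section 2.1.

import ast

class FuncParser(ast.NodeVisitor):
    """Walk an AST and record the standard-library names it imports."""

    def visit_Import(self, node):
        # Handles statements like `import tokenize as tz`
        for item in node.names:
            if item.name in api_name:
                file_contents.append(item.name)
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        # Handles statements like `from os import path`
        if node.module is not None and node.module in api_name:
            file_contents.append(node.module)
        self.generic_visit(node)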

A class to collect library names in Python code

To better understand how this class works, here is a simple code sample. It has only two lines, each importing a different library. One of them is the Python standard library tokenize, and the other is a local library, assistant.

import tokenize as tz
import assistant as ass

Below is a dump of the parse tree of this sample code. As you can see, the names we need to collect appear as the “name” argument of the alias nodes. We also need to check whether the library names are in the list of all standard libraries collected from the official Python website; we load that .csv file into a list named “api_name”. If we apply this class, FuncParser, to this sample code, it returns only “tokenize”, because the other library, assistant, is not in the list of Python standard libraries.

Module(body=[Import(names=[alias(name='tokenize', asname='tz')]), Import(names=[alias(name='assistant', asname='ass')])])
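To reproduce this dump and the class's output, a quick check along these lines should work (the exact formatting of ast.dump varies slightly across Python versions):

import ast

sample = "import tokenize as tz\nimport assistant as ass"
tree = ast.parse(sample)
print(ast.dump(tree))

# Assuming 'tokenize' is present in api_name, as in section 2.1
file_contents = []
FuncParser().visit(tree)
print(file_contents)  # ['tokenize']; 'assistant' is filtered out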

3. What are the 10 most popular standard libraries based on GitHub commits in Python repositories?

So far, we have collected a sample dataset from 5 famous Python repositories on GitHub and built a class to collect the library names in Python code. Now, we need to apply this class to our sample data from GitHub and find the top 10 libraries used in the commits of these repositories. As discussed earlier, we compare the AST of a source file before a commit is submitted with the AST of the same source file after the commit, and then collect the differing library nodes. First, I will show you a step-by-step example of how to compare these two ASTs, and at the end, I put all the code together to loop through the whole dataset and calculate the number of occurrences of each library.

3.1. Collect a list of library names before submitting the commit

As you can see in section 1, we store our sample dataset in tf_source. I use the first row of this dataset to explain the whole process. tf_source['Commit_before'][0] returns the content of the source file before the first commit in our sample dataset is applied. Then, we apply FuncParser() to collect all the library names in this source file and return the result in the file_contents list. Finally, we create a data frame named tokens_before and store this list in it.


import ast
import pandas as pd

# Content of the source file before the first commit is applied
text_before = str(tf_source['Commit_before'][0])

bf_obj = FuncParser()
bf_tree = ast.parse(text_before)
file_contents = []
bf_obj.visit(bf_tree)

# Count how many times each library is imported in this version
dtobj_before = pd.DataFrame(file_contents, columns=['token'])
tokens_before = pd.DataFrame(dtobj_before['token'].value_counts())

3.2. Collect a list of library names after submitting the commit

We repeat the same process as in 3.1, but this time on the content of the source file after the commit is submitted, tf_source['Commit_after'][0]. Again, we store the result in a data frame, called tokens_after.

# Content of the same source file after the commit is submitted
text_after = str(tf_source['Commit_after'][0])

aft_obj = FuncParser()
aft_tree = ast.parse(text_after)
file_contents = []
aft_obj.visit(aft_tree)

dtobj_after = pd.DataFrame(file_contents, columns=['token'])
tokens_after = pd.DataFrame(dtobj_after['token'].value_counts())

3.3. Extract the differences between the two lists

In this step, we subtract tokens_before from tokens_after to calculate the differences between them.

# Keep only the libraries whose counts changed in this commit
diff = tokens_after.subtract(tokens_before)
diff_token = diff[(diff.select_dtypes(include=['number']) != 0).any(axis=1)]
diff_token = diff_token.fillna(0)
diff_token = diff_token.abs()
diff_token = diff_token.reset_index()

3.4. Calculate the number of occurrences of each library

Finally, we count the number of times each library occurs in the diff_token data frame. To do so, we create a dictionary named py_lib and count the occurrences of the libraries.

# Accumulate the change counts per library
py_lib = {}
for j in range(len(diff_token)):
    word = diff_token['index'][j].lower()
    if word in py_lib:
        py_lib[word] += diff_token['token'][j]
    else:
        py_lib[word] = diff_token['token'][j]

To apply the above steps to the whole sample dataset that we collected in section 1, I add a loop at the beginning of these steps. Here is the code:
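Here is a minimal sketch of that loop, assuming everything defined above. count_tokens is a hypothetical helper that wraps steps 3.1 and 3.2, the 'index' and 'token' column names follow the pandas behavior shown in the earlier snippets, and commits whose snapshots fail to parse are simply skipped.

import ast
import pandas as pd

def count_tokens(source):
    # Hypothetical helper wrapping steps 3.1/3.2: parse one file snapshot
    # and count its standard-library imports.
    global file_contents
    file_contents = []
    FuncParser().visit(ast.parse(str(source)))
    dtobj = pd.DataFrame(file_contents, columns=['token'])
    return pd.DataFrame(dtobj['token'].value_counts())

py_lib = {}
for i in range(len(tf_source)):
    try:
        tokens_before = count_tokens(tf_source['Commit_before'][i])
        tokens_after = count_tokens(tf_source['Commit_after'][i])
    except SyntaxError:
        continue  # skip snapshots that fail to parse

    # Step 3.3: keep only the libraries whose counts changed
    diff = tokens_after.subtract(tokens_before)
    diff_token = diff[(diff.select_dtypes(include=['number']) != 0).any(axis=1)]
    diff_token = diff_token.fillna(0).abs().reset_index()

    # Step 3.4: accumulate the per-commit changes into the global counter
    for j in range(len(diff_token)):
        word = diff_token['index'][j].lower()
        py_lib[word] = py_lib.get(word, 0) + diff_token['token'][j]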

Collect libraries in the whole sample dataset

Now that we have collected all the libraries and their frequencies in the commits of these Python repositories, we want to find the top 10 libraries in the py_lib dictionary. We can collect the top 10 values of a dictionary with the code below. We can see that libraries such as warnings, sys, and datetime are in the list of the top 10 Python standard libraries based on our sample dataset.

from operator import itemgetter

d = sorted(py_lib.items(), key=itemgetter(1), reverse=True)[:10]

[('warnings', 96.0),
 ('sys', 73.0),
 ('datetime', 28.0),
 ('test', 27.0),
 ('os', 22.0),
 ('collections', 18.0),
 ('io', 16.0),
 ('gc', 10.0),
 ('functools', 9.0),
 ('threading', 7.0)]
Image by Author: Top 10 standard libraries in Python based on a sample dataset from GitHub

Also, we can plot a word cloud map of the Python libraries and their frequencies.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Scale font sizes by library frequency and render on a black background
wordcloud = WordCloud(background_color='black', max_font_size=50)
wordcloud.generate_from_frequencies(frequencies=py_lib)

plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Image by Author: A word cloud map of popular Python libraries based on a sample dataset from GitHub

4. Conclusion

In this article, we collected the 10 most popular Python standard libraries based on a sample dataset. This dataset contains the last year of commits from 5 famous Python repositories on GitHub. We used Pydriller to collect the data from GitHub, compared the AST of each source file before and after a commit was submitted, and collected a list of the libraries used in these commits. Finally, we plotted the most popular Python libraries in a word cloud map.

All the code to replicate this article is available here on GitHub.

5. References

[1] Mou, L., Li, G., Zhang, L., Wang, T., & Jin, Z. (2014). Convolutional neural networks over tree structures for programming language processing. arXiv preprint arXiv:1409.5718.

[2] Spadini, D., Aniche, M., & Bacchelli, A. (2018, October). PyDriller: Python framework for mining software repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 908–911).

[3] Schuler, D., & Zimmermann, T. (2008, May). Mining usage expertise from version archives. In Proceedings of the 2008 international working conference on Mining software repositories (pp. 121–124).

[4] https://github.com/ishepard/pydriller

[5] https://docs.python.org/3/py-modindex.html#cap-p
