Popular NumPy Functions and Where to Find Them

Muntasir Wahed
Towards Data Science
6 min read · Oct 3, 2020


Most Popular NumPy Functions (Image by Author)

Explore, or Exploit?

If it works, it's enough. If you can get things done, why look for other ways to solve the same problem? That's one way to look at things. The argument against it is that, this way, you miss out on alternatives that are much more efficient or readable.

Even after working with NumPy, Pandas, and related libraries for nearly three years, I still regularly find alternative ways to solve problems that significantly reduce runtime or, at the very least, are more readable.

So, should we diligently explore every other function before starting to work? Absolutely not! Going through the entire documentation will take a lot of time.

Then what should we do?

I decided to look at the most commonly used functions and check how many of them I already knew. The hypothesis is that the most useful functions are probably the ones most people use.

Let’s find out what these functions are! We will do so in three steps.

  1. Use the GitHub Search API to find the repositories that use NumPy
  2. From these repositories, download the relevant files
  3. Go through the codebases to find the most commonly used functions

Use the GitHub Search API to find the repositories that use NumPy

To use the GitHub API, you will first need to create an API token. We will put this token into the header of our requests.

import requests

# I put the API token in a txt file, which I read here
with open('../../api_keys/github.txt', "r") as f:
    API_KEY = f.read().strip()  # strip the trailing newline

headers = {'Authorization': 'token %s' % API_KEY}

Let’s declare some variables now.

# We will look for Python codebases that use the NumPy library
LIBRARY = 'numpy'
LANGUAGE = 'python'
# This is what the base search URL looks like. We append the page number
# to it in order to get the paginated search results
URL = 'https://api.github.com/search/repositories?q=%s+language:%s&sort=stars&order=desc&page=' % (LIBRARY, LANGUAGE)

Now, we will use the requests library to send a GET request and then inspect the response.

r = requests.get(URL + '1', headers=headers)
json_response = r.json()
print(json_response.keys())
print('Total Repositories:', json_response['total_count'])
print('Total number of items in a page:', len(json_response['items']))
print('Keys in an item:', json_response['items'][0].keys())

Output:

dict_keys(['total_count', 'incomplete_results', 'items'])
Total Repositories: 10853
Total number of items in a page: 30
Keys in an item: dict_keys(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url', 'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size', 'stargazers_count', 'watchers_count', 'language', 'has_issues', 'has_projects', 'has_downloads', 'has_wiki', 'has_pages', 'forks_count', 'mirror_url', 'archived', 'disabled', 'open_issues_count', 'license', 'forks', 'open_issues', 'watchers', 'default_branch', 'permissions', 'score'])

We see that the response is a dictionary containing three keys: total_count, incomplete_results, and items.

We observe that 10,853 repositories match our query. We will not dig into that many repositories, though! Let's say we look into only the N most popular ones. How do we find the most popular ones? Well, we have already told the URL to sort the results by stars, in descending order.

URL = 'https://api.github.com/search/repositories?q=%s+language:%s&sort=stars&order=desc&page=' % (LIBRARY, LANGUAGE)

Now we just need the URL of each of these repositories so that we can clone them. You can see that each item has a "clone_url" key, which serves exactly this purpose. We will keep a few additional keys for each repository, just in case we need them later. And, for now, we will iterate over the first 35 pages.

keys = ['name', 'full_name', 'html_url', 'clone_url', 'size', 'stargazers_count']
NUMBER_OF_PAGES_TO_ITERATE = 35
# We will declare a dictionary to store the items
repo_dict = dict([(key, []) for key in keys])

We need to send a request for each page and save the results! Don't forget to pause between batches of requests, so as not to overwhelm the API. Note, too, that the Search API only returns the first 1,000 results, so there is a hard ceiling on how far pagination can go.

from tqdm import tqdm
import time

# GitHub search pagination starts at page 1
for page_num in tqdm(range(1, NUMBER_OF_PAGES_TO_ITERATE + 1)):
    r = requests.get(URL + str(page_num), headers=headers)
    contents = r.json()

    # The Search API only exposes the first 1,000 results; past that,
    # the response has no 'items' key
    if 'items' not in contents:
        break

    for item in contents['items']:
        for key in keys:
            repo_dict[key].append(item[key])

    # Pause periodically to respect the Search API's rate limit
    if page_num % 5 == 0:
        time.sleep(60)

Now that we have the repository information, let’s save it in a DataFrame, which we will use later.

import pandas as pd

repo_df = pd.DataFrame(repo_dict)
repo_df.to_csv('../../data/package_popularity/numpy/repo_info.csv')
repo_df.head()

In the following gist, you can have a look at the repositories returned by the search queries.

Downloading the Relevant Files from These Repositories

If you run the following command, you will see that some of the repositories appear more than once. One likely reason is that the search index can reshuffle results between paginated requests, so the same repository can show up on two different pages. Please let me know if you know more about this.

repo_df['full_name'].value_counts()

For now, we shall only keep the first occurrence of each of these repositories.
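
One minimal way to do that with pandas (my own snippet; the original cleanup may have been done differently):

# Keep only the first occurrence of each repository
repo_df = repo_df.drop_duplicates(subset='full_name', keep='first').reset_index(drop=True)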

You can either write a bash script to clone these repositories or use the PyGithub library. To use PyGithub, you will have to provide your GitHub API token.

from github import Github

with open('../../api_keys/github.txt', "r") as f:
    API_KEY = f.read().strip()

git_client = Github(API_KEY)

Among the files, we will only download those with .py or .ipynb extensions.

ext_set = set(['ipynb', 'py'])
# The directory where we will store the repositories
REPO_DIR_PARENT = '../../data/package_popularity/numpy/clones/'

Next, we will only consider the repositories that have at least 100 stars.

repo_df = repo_df[repo_df['stargazers_count'] >= 100]

To get the files of a repo, we can use the following code snippet. It returns all the files in the directory passed as the argument to the get_contents function. For example, the following snippet returns the files in the root directory.

repo = git_client.get_repo(full_name)
contents = repo.get_contents("")

We will need to recursively collect all the files from the directory structure. Let’s write a helper function to handle this.

def get_relevant_files(full_name, git_client, ext_set):
    repo = git_client.get_repo(full_name)
    contents = repo.get_contents("")
    files = []

    while contents:
        file_content = contents.pop(0)
        if file_content.type == "dir":
            # Descend into subdirectories
            contents.extend(repo.get_contents(file_content.path))
        elif file_content.name.split('.')[-1] in ext_set:
            files.append((file_content.name, file_content.download_url))

    return files

Let’s call this on a repository and see what happens.

files = get_relevant_files('ddbourgin/numpy-ml', git_client, ext_set)
print(len(files))
print(files[0])

Output:

89
('setup.py', 'https://raw.githubusercontent.com/ddbourgin/numpy-ml/master/setup.py')

The repository has 89 files with either a .py or .ipynb extension. The list contains the download URL of each file, which we can easily fetch using the requests library.

for name, download_url in files:
    r = requests.get(download_url, allow_redirects=True)

Finally, we need to save the contents of the files locally. We will simply use the full name of the repository to create a directory, then put all of that repository's files in it.

os.path.join(REPO_DIR_PARENT, '_'.join(full_name.split('/')))
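
Putting the pieces together, a minimal sketch of the download-and-save step could look like this (the helper name save_repo_files is mine, and it ignores name collisions between files from different subdirectories):

import os
import requests

def save_repo_files(full_name, files, parent_dir=REPO_DIR_PARENT):
    # e.g. 'ddbourgin/numpy-ml' -> 'ddbourgin_numpy-ml'
    repo_dir = os.path.join(parent_dir, '_'.join(full_name.split('/')))
    os.makedirs(repo_dir, exist_ok=True)

    for name, download_url in files:
        r = requests.get(download_url, allow_redirects=True)
        with open(os.path.join(repo_dir, name), 'wb') as f:
            f.write(r.content)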

You can find the full code in the following gist.

Exploring the Repositories

Now we will dig deeper into the downloaded files. First, let’s look at the import statements to find out how the NumPy library is usually imported. We already know about the popular “import numpy as np” statement. But is there someone who imports it as pd, or even pandas? 🤔

Looking at the import statements, I observed four frequently used forms.

  1. import numpy
    Handling this is straightforward. We just look for statements of the form numpy.*
  2. import numpy.abc
    This one is straightforward too.
  3. from numpy import abc
    We will handle this by treating every instance of abc.* as numpy.abc.*
  4. from numpy.abc import xyz
    We will treat xyz.* as numpy.abc.xyz.*

All these statements can be modified with an "as" alias, like "import numpy as np" or "from numpy import abc as foo." We need to handle that too!

We will keep a dictionary that tracks these shorthands and what they stand for. Then, when we see foo, we will replace it with numpy.abc, and so on.
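
Here is a minimal sketch of such an alias map, assuming we handle one import line at a time (the helper function and regexes are mine, not necessarily the exact ones used for this analysis):

import re

def update_aliases(line, aliases):
    # 'import numpy as np' or 'import numpy.abc as xyz'
    m = re.match(r'import\s+(numpy[\w.]*)\s+as\s+(\w+)', line.strip())
    if m:
        aliases[m.group(2)] = m.group(1)
        return
    # 'from numpy.abc import xyz', with an optional 'as foo'
    m = re.match(r'from\s+(numpy[\w.]*)\s+import\s+(\w+)(?:\s+as\s+(\w+))?', line.strip())
    if m:
        local_name = m.group(3) or m.group(2)
        aliases[local_name] = m.group(1) + '.' + m.group(2)

aliases = {}
update_aliases('import numpy as np', aliases)
update_aliases('from numpy import random as rnd', aliases)
update_aliases('from numpy.linalg import norm', aliases)
print(aliases)  # {'np': 'numpy', 'rnd': 'numpy.random', 'norm': 'numpy.linalg.norm'}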

The imported instances can be of two types (see the short example after this list).

  1. They can be functions, which we can identify by looking for an opening parenthesis.
  2. They can be classes or modules, which we can identify by checking whether their attributes are accessed.
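
For instance (an illustrative snippet of my own):

import numpy
from numpy import random

a = numpy.zeros(3)      # a function: identified by the opening parenthesis
b = random.randint(10)  # a module: identified because its attribute (randint) is accessed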

We are now pretty close to the final solution.

For each file, we will first build the set of imported instances.

Then we will iterate through each line and check for "numpy.abc.xyz.*(", a simple regex-like pattern in which the asterisk can be replaced by any number of characters.

If we find a line containing "numpy.abc.xyz.*(", we will know that this line calls the "numpy.abc.xyz.*()" function.
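
Concretely, a hedged sketch of this counting step (Counter and the exact regex are my own choices):

import re
from collections import Counter

counts = Counter()

def count_numpy_calls(line, aliases, counts):
    # Expand each alias into its fully qualified numpy name first
    for local_name, full_name in aliases.items():
        line = re.sub(r'\b%s\.' % re.escape(local_name), full_name + '.', line)
    # Match dotted names followed by an opening parenthesis, e.g. 'numpy.random.randint('
    for match in re.finditer(r'\b(numpy(?:\.\w+)+)\s*\(', line):
        counts[match.group(1)] += 1

count_numpy_calls('x = np.arange(10)', {'np': 'numpy'}, counts)
print(counts)  # Counter({'numpy.arange': 1})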

Limitation: We are only looking at single-line statements. If a function call or import statement spans multiple lines, this code will not count it. I haven't handled some edge cases either. Feel free to modify the code if you want to!

Now we have the 20 most frequently used functions!

  1. numpy.array()
  2. numpy.arange()
  3. numpy.zeros()
  4. numpy.ones()
  5. numpy.testing.assert_array_equal()
  6. numpy.dtype()
  7. numpy.random.uniform()
  8. numpy.asarray()
  9. numpy.empty()
  10. numpy.testing.assert_equal()
  11. numpy.linspace()
  12. numpy.all()
  13. numpy.sum()
  14. numpy.random.randint()
  15. numpy.random.rand()
  16. numpy.allclose()
  17. numpy.random.random()
  18. numpy.testing.assert_almost_equal()
  19. numpy.dot()
  20. numpy.testing.assert_allclose()
Figure 1. Most Frequently Used NumPy Functions (Image by author)

Turns out, I have used 18 of them. I didn't know about the numpy.allclose() function and its assert variant. Looks pretty helpful!

How many did you know about?

Thanks for reading!
