
Originally proposed by J.E. Hirsch, the h-index of an author is believed to be a reliable indicator of their scholarly achievement. In common scholarly parlance, the h-index is the highest number ‘h’ such that the author has h publications with at least h citations each. Members of the Medium community can interpret it this way: it is the maximum number ‘h’ such that a writer has written at least h blogs that have each received at least h claps.
Let us understand the concept with an example. An author has published five blogs (or research articles) and received the following claps (citations): [10, 1, 5, 3, 15]. The h-index is 3 because at least three blogs have at least 3 claps each, but there are not four blogs with at least 4 claps. Another writer may have received [100, 2, 1, 1, 1] claps, giving an h-index of only 2. While not a perfect measure of someone’s writing and scholarly abilities, the h-index is claimed to capture the overall writing quality of an author fairly well in most circumstances.
Hypothetical application of h-index
Suppose there are many prolific writers, each of whom has published hundreds of blog posts on a variety of topics on Medium.com, including all of its sister publications. Naturally, for every writer, some articles have been more popular (measured in terms of claps received) than others. Now, the editor wants to measure their performance in terms of their h-indices and reward the most productive and high-quality writers accordingly: the higher a writer’s h-index, the greater the reward.
The editor has asked us to write the fastest program to compute the h-index. After a brainstorming session and a quick internet search, we identified four different ways of deriving the h-index, listed below:
- For-loop (our idea)
- Numpy broadcasting (our idea)
- Python package (someone else’s idea)
- A good algorithm (someone else’s idea)
The next sections explain each of these four methods in more detail and measure their performance on our test case.
For-loop
Calculations involving one or more for loops are computationally inefficient in many cases, yet they remain the most intuitive and easy way of doing things. Before looking at the for-loop calculation, let us introduce a new term, the k-index, which refers to any k such that at least k articles have at least k citations each. The maximum value of k is the h-index, implying that an author can have multiple k-indices but only one h-index. The default value of the h-index is zero, and it cannot be greater than the number of articles.
The h-index code based on for loops is shown below. It finds all possible k-indices by iterating sequentially over a range from one to the total number of articles; the highest value of k found is the h-index of the given array of citations.
In the code, I have ensured that only the necessary calculations are performed by ‘breaking’ out of both loops. The inner loop stops as soon as k articles with at least k citations have been found. The outer loop stops as soon as a candidate k can no longer be reached, indicating that we have already found our maximum k, i.e., the h-index.
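Below is a sketch of the for-loop approach described above; the function and variable names are my own rather than those of the original code.

```python
def hindex_forloop(citations):
    """h-index via nested for loops with early exits ('breaks')."""
    h = 0
    for k in range(1, len(citations) + 1):   # outer loop: candidate k-index
        count = 0
        for c in citations:                  # inner loop: count articles with >= k citations
            if c >= k:
                count += 1
            if count == k:                   # k articles found; k is a valid k-index
                break
        if count == k:
            h = k                            # remember the largest valid k so far
        else:
            break                            # a larger k cannot succeed; h is final
    return h
```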
Note: I tried non-optimized for loops (no breaks), and they were extremely inefficient, so I excluded them from the results.
Array operations
Aware of the inherent slowness of for loops, we decided to perform array operations instead. We can replace the inner and outer for loops and the associated if-statements with Numpy broadcasting. Instead of calculating the k-indices one by one, arrays allow us to determine all k’s simultaneously. The code block is shared below.
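Here is a minimal broadcasting sketch of that idea (my reconstruction, following the same logic as the for-loop version): every article’s citation count is compared with every candidate k in a single vectorised step.

```python
import numpy as np

def hindex_numpy(citations):
    """h-index via Numpy broadcasting: evaluate all candidate k's at once."""
    citations = np.asarray(citations)
    k = np.arange(1, citations.size + 1)              # candidate k values
    counts = (citations[:, None] >= k).sum(axis=0)    # articles with >= k citations, for every k
    valid = k[counts >= k]                            # all valid k-indices
    return valid.max() if valid.size else 0
```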
Scholarmetrics package
We searched online to find whether there is a standard Python package that calculates the h-index, and, as we know, there is always a Python library available: the scholarmetrics package written by Michael Rose. Its code is shared below. Please note that the slightly modified version of the code (Method 2) uses Numpy’s sum rather than Python’s built-in sum function.
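The sketch below paraphrases the scholarmetrics-style computation (the package’s actual source may differ in details); Method 2 is identical except that the built-in sum is swapped for np.sum.

```python
import numpy as np

def hindex_package(citations):
    """Method 1: descending sort compared against ranks, reduced with built-in sum()."""
    c = np.sort(np.asarray(citations))[::-1]   # citations in descending order
    ranks = np.arange(1, c.size + 1)           # 1, 2, ..., n
    return sum(c >= ranks)                     # Python's built-in sum over a boolean array

def hindex_package_npsum(citations):
    """Method 2: same logic, but with np.sum for the reduction."""
    c = np.sort(np.asarray(citations))[::-1]
    ranks = np.arange(1, c.size + 1)
    return np.sum(c >= ranks)
```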
Expert algorithm
We also looked elsewhere and found that experts have already proposed a simple algorithm to derive the h-index. This algorithm sorts the citations in reverse (descending) order and compares them with the corresponding ascending integer sequence 1, 2, 3, …; the integer value at the intersection of the two curves is our answer, as coded below:
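A sketch of that algorithm (names are my own):

```python
def hindex_expert(citations):
    """Sort citations in descending order and walk down until a citation count
    drops below its 1-based rank; the last qualifying rank is the h-index,
    i.e., the point where the two curves intersect."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank      # citation curve still above the rank line
        else:
            break         # the curves have crossed; h is final
    return h
```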
Validation
Before we compare the solutions, we need to ensure that they produce the same results. So, we can randomly generate a sample of citation counts and validate all the methods:
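A possible validation run is sketched below; it reuses the functions defined in the earlier sketches, and the sample size and seed are my own choices rather than those of the original experiment.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.integers(0, 100, size=1_000)        # random claps/citations

results = {
    "for-loop": hindex_forloop(sample),
    "numpy broadcasting": hindex_numpy(sample),
    "package (sum)": hindex_package(sample),
    "package (np.sum)": hindex_package_npsum(sample),
    "expert algorithm": hindex_expert(sample),
}
print(results)
assert len(set(map(int, results.values()))) == 1, "methods disagree"
```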
The above code shows that all techniques give the same output, which we know is correct; this validates our code. We can now move ahead with the comparison test.
Comparison
All experimental runs were repeated three times with 100 runs per repeat, and the timings in seconds were noted. As recommended by the Python timeit documentation, we used only the minimum running time. The timing setup is sketched below, followed by the results:
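This harness is a reconstruction of that setup (3 repeats of 100 runs each, minimum time kept); the input sizes are my own assumption, and it reuses the sketch functions defined earlier.

```python
import timeit
import numpy as np

methods = {
    "for-loop": hindex_forloop,
    "numpy broadcasting": hindex_numpy,
    "package (sum)": hindex_package,
    "package (np.sum)": hindex_package_npsum,
    "expert algorithm": hindex_expert,
}

rng = np.random.default_rng(0)
for size in (100, 1_000, 10_000):                 # assumed input sizes
    sample = rng.integers(0, 500, size=size)
    for name, fn in methods.items():
        best = min(timeit.repeat(lambda: fn(sample), repeat=3, number=100))
        print(f"n={size:>6}  {name:<20} {best:.4f} s")
```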

As expected, the expert algorithm showed the best performance, followed by the Python package code. Interestingly, the tweak of using np.sum() instead of sum() in the scholarmetrics code improved its speed, making it perform nearly identically to the best technique.
Surprisingly, the for loop did better than the array operations as the input size increased. This happened because even though we eliminated the for loops, we did not eliminate the calculations: without if-statements and breaks to cut redundant steps short, the broadcasting code had to process the entire input array every time.
Final thoughts
In this blog, we have examined the computational efficiency of different methods of calculating the h-index. There are three points to remember. First, there is no substitute for an efficient algorithm. Second, carefully written for loops can also perform at acceptable speed. Third, it is faster to use Numpy functions on Numpy arrays than Python’s built-in functions.