
Demystify Iterators and Generators in Python

Learn an efficient way to work with large datasets

Image by congerdesign on Pixabay

When you have a large dataset, such as a big CSV file or a big SQL table, loading all the data into memory at once is inefficient and sometimes impossible. Your program may slow to a crawl or crash when memory runs out, and such failures are time-consuming and frustrating to debug. Fortunately, iterators and generators help with exactly this type of problem because they produce data one item at a time. Besides, understanding generators is helpful for learning more advanced features like asyncio, which have become increasingly popular.


range in Python

Before getting started, let’s take a look at the special range function, which returns an iterable that produces a sequence of integers. An iterable, as the name suggests, is something that can be iterated over. In practical terms, it is something that can be used in a for loop. Let’s check it out with some simple code:
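The snippet below is a minimal sketch of such a check, not necessarily the exact code the points that follow were drawn from. The variable names r and r_iter are illustrative choices.

```python
from collections.abc import Iterable, Iterator

r = range(3)

print(type(r))                   # <class 'range'>
print(isinstance(r, Iterable))   # True: it can be used in a for loop
print(isinstance(r, Iterator))   # False: next(r) would raise a TypeError

# iter() gives us an iterator over the iterable, which next() can consume.
r_iter = iter(r)
first = next(r_iter)    # 0
second = next(r_iter)   # 1
third = next(r_iter)    # 2

# One more next() call raises StopIteration because the iterator is exhausted.
try:
    next(r_iter)
    exhausted = False
except StopIteration:
    exhausted = True
```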

From this simple code snippet, we can see that:

  • Technically, range is a class, even though its name starts with a lowercase letter rather than following the CapWords convention for classes. The object returned by range is of type range.
  • A range object is iterable, so it can be used in a for loop.
  • However, a range object is not an iterator, so it cannot be passed to next directly. We need to use the iter function to get an iterator from an iterable.

Iterators

An iterator is an object that implements the magic __next__ method and can therefore be passed to the next function to produce the next element of the data stream. To understand how an iterator works, let’s create a class that mimics the behavior of the range function.

We need a bit of code to mimic the behavior of positional arguments of range. Importantly, we need to have a state variable counter to keep a record of which state the custom iterator is in and which value to generate next time.

Let’s try the custom iterator with the next function:
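Here is a minimal sketch of what such a class could look like, followed by a few next calls. It is a simplification under stated assumptions: only positive steps are supported, and everything apart from the MyRangeIter, counter, and r_iter names used in this post is an illustrative choice.

```python
class MyRangeIter:
    """A simplified stand-in for range that supports positive steps only."""

    def __init__(self, *args):
        # Mimic range(stop), range(start, stop) and range(start, stop, step).
        if len(args) == 1:
            self.start, self.stop, self.step = 0, args[0], 1
        elif len(args) == 2:
            self.start, self.stop, self.step = args[0], args[1], 1
        elif len(args) == 3:
            self.start, self.stop, self.step = args
        else:
            raise TypeError(f"expected 1 to 3 arguments, got {len(args)}")
        # State variable: the value to produce on the next call.
        self.counter = self.start

    def __next__(self):
        if self.counter >= self.stop:
            # Signal that there is nothing left to produce.
            raise StopIteration
        value = self.counter
        self.counter += self.step
        return value


r_iter = MyRangeIter(1, 6, 2)
first = next(r_iter)    # 1
second = next(r_iter)   # 3
third = next(r_iter)    # 5

try:
    next(r_iter)
    exhausted = False
except StopIteration:
    exhausted = True
```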

Yes, it works as expected. Now, let’s try to use it in a for loop and see what will happen.
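The sketch below shows what happens in a for loop. To keep it self-contained, it uses a trimmed-down MyRangeIter that only accepts a stop value, which is an assumption made for brevity.

```python
class MyRangeIter:
    """Trimmed-down version: counts from 0 up to (but excluding) stop."""

    def __init__(self, stop):
        self.stop = stop
        self.counter = 0

    def __next__(self):
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += 1
        return value


r_iter = MyRangeIter(3)

error_message = None
try:
    for value in r_iter:
        print(value)
except TypeError as exc:
    # The for loop calls iter() on the object first, and that call fails.
    error_message = str(exc)
    print(error_message)
```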

Hmm, a bit weird, isn’t it? MyRangeIter works with the next function, but it is NOT iterable. An iterator is not very useful if it can only be used with next and not in a for loop. To make it iterable, we need to implement the __iter__ magic method, which is what the iter function and the for loop rely on. In fact, Python’s iterator protocol expects every iterator to define both __iter__ and __next__. If you use iter on r_iter now, you will see the same error saying it’s not iterable. Let’s add the __iter__ method now:
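A minimal sketch of the change, again using a trimmed-down MyRangeIter that only accepts a stop value (an assumption made to keep the example short):

```python
class MyRangeIter:
    """Trimmed-down version: counts from 0 up to (but excluding) stop."""

    def __init__(self, stop):
        self.stop = stop
        self.counter = 0

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += 1
        return value


r_iter = MyRangeIter(3)
same_object = iter(r_iter) is r_iter   # True

# The for loop now works and handles StopIteration for us.
collected = []
for value in r_iter:
    collected.append(value)
```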

As we already know, the iter function calls the __iter__ method of the underlying class and returns an iterator. In this example, the iterator returned is the object itself. It may look odd, but it is the standard pattern for iterators. Put simply, the next function needs __next__ and the iter function needs __iter__, so once both magic methods are implemented, the class works with both functions.

Now the variable can be used in a for loop. You can try it out yourself.


Generators

As you see above, quite a bit of boilerplate code is needed to create an iterator. We need to create a class and implement the magic __next__ and __iter__ methods. There is a better way to do it in Python, and this is where generators shine.

To create a generator, we don’t need to create a class and implement the magic __next__ and __iter__ methods. A generator is simply defined by a generator function:
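As a minimal sketch, a generator function equivalent to the iterator class could look like this. The name my_range_gen is an illustrative choice, and only positive steps are supported.

```python
def my_range_gen(*args):
    # Mimic range(stop), range(start, stop) and range(start, stop, step).
    if len(args) == 1:
        start, stop, step = 0, args[0], 1
    elif len(args) == 2:
        start, stop, step = args[0], args[1], 1
    elif len(args) == 3:
        start, stop, step = args
    else:
        raise TypeError(f"expected 1 to 3 arguments, got {len(args)}")

    counter = start
    while counter < stop:
        # Hand the current value to the caller and pause here.
        yield counter
        # Execution resumes on this line when the next value is requested.
        counter += step
```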

All the magic of generator functions lies in the yield keyword, which hands a value and control back to the caller while keeping the state of the function. When the generator is advanced again, the function resumes where it left off and yields a new value based on the latest state. In this example, the state is kept in a simple counter. If you change yield to return, it becomes a regular function that returns only one value. Without the yield keyword, it’s not a generator at all and cannot be iterated.

Let’s try out our generator:
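A short sketch of such a trial, using a single-argument my_range_gen (a simplification assumed here so the snippet runs on its own):

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


gen = my_range_gen(3)
print(type(gen))        # <class 'generator'>

first = next(gen)       # 0
rest = list(gen)        # [1, 2], the values that were not consumed yet

# The generator is now exhausted, so next() raises StopIteration.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True
```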

Like any other iterator, a generator raises the StopIteration exception when it is exhausted. This exception is handled automatically by the for loop.

Besides, it should be noted that a return statement inside a generator function raises the StopIteration exception, and the returned value is stored in the value attribute of that exception. This is important for understanding the type annotation of a generator, as will be introduced soon.
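To see where the returned value ends up, here is a small sketch with a hypothetical generator function gen_with_return:

```python
def gen_with_return():
    yield 1
    return "done"


gen = gen_with_return()
first = next(gen)   # 1

try:
    next(gen)
except StopIteration as exc:
    # The value given to return travels on the exception object.
    returned = exc.value   # "done"
```

A for loop swallows the StopIteration exception, so the returned value is only visible when you catch the exception yourself as above.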


Generator comprehension

Before we get to the more advanced send method of a generator, let’s learn something simple but handy first. Similar to list comprehension, we can use generator comprehension (officially called a generator expression) to create a generator with one line of code. The only difference is that we need to change brackets to parentheses:

As we see, generator comprehension works very similarly to list comprehension. However, a list comprehension builds the whole list in memory, which is reflected in the size of the resulting object. A generator comprehension produces its values lazily, one at a time, which makes it much more memory-efficient.
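A minimal sketch of the comparison, using sys.getsizeof to inspect the two objects. The exact numbers depend on your Python version and platform, so none are shown here.

```python
import sys

# List comprehension: all 100,000 results are created and stored right away.
squares_list = [i * i for i in range(100_000)]

# Generator comprehension: nothing is computed until values are requested.
squares_gen = (i * i for i in range(100_000))

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# The generator still produces the same values when it is consumed.
total = sum(squares_gen)
```

Note that sys.getsizeof reports the size of the list object itself, not of the integers it references, so the real gap is even larger.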


Understand the send method of generators

In most cases, you don’t need to use the send method of a generator. I have not had the chance to use it in years of coding with Python. However, understanding how it works is helpful for adding type annotations to generators. Besides, it’s also useful if you want to understand how the asyncio library works in Python, because coroutines build on the same suspend-and-resume mechanism as generators under the hood.

Let’s update our generator to let it accept a value that is sent by the user. The value will be used to change the stop variable inside the generator function so we can yield more values:
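One possible way to write this, as a sketch that keeps only the stop argument for brevity. The variable name sent is an illustrative choice.

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        # The result of the yield expression is whatever the caller sends,
        # or None when the generator is advanced with next() or a for loop.
        sent = yield counter
        if sent is not None:
            stop = sent
        counter += 1


# Without send(), the generator behaves exactly as before.
plain_values = list(my_range_gen(2))   # [0, 1]
```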

Note that the value sent to the generator is received as the result of the yield expression. You can assign that result to a variable and use it accordingly. To send a value to the generator, just call the send() method on the generator, which also resumes it and returns the next value it yields:

Note that you can only send a value other than None to a generator after it has been started and is paused at a yield, for example after a first call to next. Otherwise, you will see a TypeError:
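A short sketch of sending a new stop value, with the generator function repeated so the snippet runs on its own (the single stop argument is a simplification assumed here):

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        sent = yield counter
        if sent is not None:
            stop = sent
        counter += 1


gen = my_range_gen(2)
first = next(gen)        # 0, the generator is now paused at the yield

# send() delivers 4 to the paused yield, so stop becomes 4.
# The generator then runs on to the next yield and send() returns that value.
second = gen.send(4)     # 1

remaining = list(gen)    # [2, 3] instead of nothing
```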
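The sketch below triggers the error on a fresh generator and then shows the working order of calls. The echo generator is a hypothetical example used only for illustration.

```python
def echo():
    received = yield "ready"
    yield received


# Sending to a generator that has not started yet fails.
gen = echo()
error_message = None
try:
    gen.send("too early")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Advance a new generator to its first yield, then send.
gen = echo()
first = next(gen)             # "ready"
echoed = gen.send("hello")    # "hello"
```

Calling gen.send(None) on a fresh generator is allowed and is equivalent to calling next(gen).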


Type annotation for generators

Finally, let’s add type annotations to our generator functions created above. Adding type annotations to your functions can make your code more robust and much easier to understand. You can get to know the return type without reading the function body.

If a generator function contains both the yield and return keywords and can also accept a value that’s sent from outside, then we need to use the generic type Generator[YieldType, SendType, ReturnType]:
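A sketch of a fully annotated generator function. The send type is written as Optional[int] here, which is a choice made for this example: the yield expression evaluates to None whenever the generator is advanced with next instead of send.

```python
from typing import Generator, Optional


def my_range_gen(*args: int) -> Generator[int, Optional[int], str]:
    # Mimic range(stop), range(start, stop) and range(start, stop, step).
    if len(args) == 1:
        start, stop, step = 0, args[0], 1
    elif len(args) == 2:
        start, stop, step = args[0], args[1], 1
    elif len(args) == 3:
        start, stop, step = args
    else:
        raise TypeError(f"expected 1 to 3 arguments, got {len(args)}")

    counter = start
    while counter < stop:
        sent = yield counter     # YieldType: int
        if sent is not None:     # SendType: Optional[int]
            stop = sent
        counter += step
    return "Done"                # ReturnType: str


gen = my_range_gen(2)
values = [next(gen), gen.send(3), next(gen)]   # [0, 1, 2]

try:
    next(gen)
except StopIteration as exc:
    result = exc.value   # "Done"
```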

Note that for *args, we only need to annotate the type of a single argument. For example, *args: int means that every positional argument is an int. For more details, please refer to this discussion.

In most common cases, our generator would only yield values. In this case, we can set the SendType and ReturnType to None:
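For a yield-only generator, the annotation could look like this sketch (a single stop argument is assumed for brevity):

```python
from typing import Generator


def my_range_gen(stop: int) -> Generator[int, None, None]:
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


values = list(my_range_gen(3))   # [0, 1, 2]
```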

Alternatively, when a generator only yields values, we can annotate the return type as either Iterable[YieldType] or Iterator[YieldType], which is more concise and less confusing for those who don’t understand the send method of generators:
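The same yield-only generator annotated with Iterator, as a sketch. Iterable[int] would work in the same position.

```python
from typing import Iterator


def my_range_gen(stop: int) -> Iterator[int]:
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


values = list(my_range_gen(3))   # [0, 1, 2]
```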


In this post, we have introduced the technical details of iterables, iterators, and generators, which let you work efficiently with large datasets that would otherwise require a lot of resources, especially memory. The simple code snippets are meant to help you understand the magic methods of iterators and generators, which you normally use as a black box. More focus is put on generators because they are simpler and more widely used. We have also demystified the send method and the type annotations for generators. With this knowledge, you are prepared to understand more advanced features like asyncio in Python.


