
When you have a large dataset, such as a big CSV file or SQL table, loading all of it into memory is inefficient and sometimes impossible. The machine may become unresponsive or the program may crash, and such failures are time-consuming and frustrating to debug. Fortunately, iterators and generators are well suited to this kind of problem. Understanding generators also helps with more advanced features such as asyncio, which is gaining popularity these days.
range in Python
Before getting started, let’s take a look at the special range function, which returns an iterable that yields a sequence of integers. An iterable, as the name suggests, is something that can be iterated over. You can also think of it as anything that can be used in a for loop. Let’s check it out with some simple code:
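A minimal sketch of such a check, assuming Python 3. The variable names r and range_iterator are illustrative:

```python
r = range(3)

print(type(range))  # <class 'type'>, so range itself is a class
print(type(r))      # <class 'range'>

# A range object is iterable, so it works in a for loop.
for i in r:
    print(i)        # 0, 1, 2

# But it is not an iterator: next() rejects it.
try:
    next(r)
except TypeError as exc:
    print(exc)      # 'range' object is not an iterator

# iter() turns the iterable into an iterator.
range_iterator = iter(r)
print(type(range_iterator))  # <class 'range_iterator'>
print(next(range_iterator))  # 0
print(next(range_iterator))  # 1
```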
From this simple code snippet, we can see that:
- Technically, range is a class even though it starts with a non-Pythonic lower case letter. The object returned by range is of type range.
- A range object is iterable and can be iterated.
- However, a range object is not an iterator. We need to use the iter function to turn an iterable into an iterator.
Iterators
An iterator is an object that implements the magic __next__ method and thus can be passed to the next function to produce the next element of the data stream, as shown above. To understand how an iterator works, let’s create a class that mimics the behavior of the range function.
We need a bit of code to mimic the behavior of the positional arguments of range. Importantly, we need a state variable counter to keep a record of which state the custom iterator is in and which value to generate next time.
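A sketch of what such a class might look like. The class name MyRangeIter and the counter attribute come from the text; the argument handling and the restriction to positive steps are simplifying assumptions:

```python
class MyRangeIter:
    """Mimics range(), but implements only __next__ for now."""

    def __init__(self, *args):
        # Mimic the positional arguments of range:
        # (stop), (start, stop) or (start, stop, step).
        if len(args) == 1:
            self.start, self.stop, self.step = 0, args[0], 1
        elif len(args) == 2:
            self.start, self.stop = args
            self.step = 1
        elif len(args) == 3:
            self.start, self.stop, self.step = args
        else:
            raise TypeError(f"expected 1 to 3 arguments, got {len(args)}")
        # State variable: the value to produce on the next call.
        self.counter = self.start

    def __next__(self):
        # Assumption: only positive steps are handled in this sketch.
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += self.step
        return value
```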
Let’s try the custom iterator with the next function:
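A possible way to try it. The class is restated in a compact, single-argument form (an assumption made for brevity) so that the snippet runs on its own:

```python
class MyRangeIter:
    # Compact restatement: supports only MyRangeIter(stop).
    def __init__(self, stop):
        self.stop = stop
        self.counter = 0

    def __next__(self):
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += 1
        return value


r_iter = MyRangeIter(3)
print(next(r_iter))  # 0
print(next(r_iter))  # 1
print(next(r_iter))  # 2

# The data stream is exhausted, so a further call raises StopIteration.
try:
    next(r_iter)
except StopIteration:
    print("exhausted")
```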
Yes, it works as expected. Now, let’s try to use it in a for loop and see what will happen.
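The experiment can be sketched as follows, again with a compact single-argument restatement of the class (an assumption for brevity). The error is caught so the snippet runs to completion:

```python
class MyRangeIter:
    # Compact restatement: implements __next__ but not __iter__.
    def __init__(self, stop):
        self.stop = stop
        self.counter = 0

    def __next__(self):
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += 1
        return value


r_iter = MyRangeIter(3)
message = None
try:
    for i in r_iter:
        print(i)
except TypeError as exc:
    message = str(exc)
    print(message)  # 'MyRangeIter' object is not iterable
```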
Hmm, a bit weird, isn’t it? MyRangeIter is an iterator, but NOT iterable. An iterator is not very useful if it can only be used with the next function and not in a for loop. To make an iterator iterable, we need to implement the __iter__ magic method, which lets it be used in a for loop and with the iter function demonstrated above. If you call iter on r_iter now, you will also see an error saying it’s not iterable. Let’s add the __iter__ method now:
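A sketch of the updated class, assuming the same compact single-argument form as a simplification:

```python
class MyRangeIter:
    def __init__(self, stop):
        self.stop = stop
        self.counter = 0

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.counter >= self.stop:
            raise StopIteration
        value = self.counter
        self.counter += 1
        return value


r_iter = MyRangeIter(3)
print(iter(r_iter) is r_iter)  # True

for i in r_iter:
    print(i)  # 0, 1, 2
```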
As we already know, the iter function calls the __iter__ method of the underlying class and returns an iterator. In this example, the iterator returned is the object itself. It may look odd, but that is how it works: the next function requires the magic __next__ method and the iter function requires the magic __iter__ method, so a class that should work with both functions must implement both methods.
Now the variable can be used in a for loop. You can try it out yourself.
Generators
As you can see above, quite a lot of boilerplate code is needed to create an iterator: we need to create a class and implement the magic __next__ and __iter__ methods. There is a better way to do it in Python, and this is where generators shine.
A generator needs neither the class nor the magic methods. It is simply defined by a generator function:
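A minimal sketch of such a generator function, supporting the same range-like positional arguments. The name my_range_gen is hypothetical, and only positive steps are handled:

```python
def my_range_gen(*args):
    # Mimic the positional arguments of range.
    if len(args) == 1:
        start, stop, step = 0, args[0], 1
    elif len(args) == 2:
        start, stop, step = args[0], args[1], 1
    else:
        start, stop, step = args

    # The state is just a local variable; it survives between yields.
    counter = start
    while counter < stop:
        yield counter
        counter += step


print(list(my_range_gen(3)))        # [0, 1, 2]
print(list(my_range_gen(1, 6, 2)))  # [1, 3, 5]
```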
All the magic of generator functions lies in the yield keyword, which hands a value, along with control, back to the caller while keeping the state of the function. When the generator is advanced again, the function resumes where it left off and yields a new value based on the latest state. The state is held in a simple counter in this example. If you change yield to return, it becomes a regular function that returns only one value. In fact, without the yield keyword it is not a generator at all and cannot be iterated.
Let’s try out our generator:
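One way to try it, using a compact single-argument generator under the hypothetical name my_range_gen so the snippet is self-contained:

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


gen = my_range_gen(3)
print(type(gen))  # <class 'generator'>
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2

# The generator is exhausted now.
try:
    next(gen)
except StopIteration:
    print("exhausted")

# A for loop handles StopIteration for us.
for i in my_range_gen(3):
    print(i)  # 0, 1, 2
```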
Similar to an iterator, the StopIteration exception will be raised when the generator is exhausted. This exception is handled automatically by the for loop.
Besides, it should be noted that a return statement inside a generator function raises the StopIteration exception, and the returned value is attached to that exception (as its value attribute). This is important for understanding the type annotation of a generator, as will be introduced soon.
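A small illustration of this behavior. The function name gen_with_return and the returned string are made up for the example:

```python
def gen_with_return():
    yield 1
    return "done"


gen = gen_with_return()
print(next(gen))  # 1

try:
    next(gen)
except StopIteration as exc:
    # The returned value travels on the exception.
    result = exc.value
    print(result)  # done
```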
Generator comprehension
Before we get to the more advanced send method of a generator, let’s learn something simple but handy first. Similarly to list comprehension, we can use generator comprehension to create a generator with one line of code. The only difference is that we need to change brackets to parentheses:
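A sketch of the comparison, with sys.getsizeof used as a rough indicator of memory use. The variable names and the number of items are arbitrary choices, and exact sizes vary by Python version, so none are shown:

```python
import sys

squares_list = [i * i for i in range(100_000)]
squares_gen = (i * i for i in range(100_000))

# The list holds references to all of its items, so it grows with the data.
print(sys.getsizeof(squares_list))

# The generator object stays small no matter how long the range is.
print(sys.getsizeof(squares_gen))

# Items are produced one at a time as the generator is consumed.
total = sum(squares_gen)
print(total)
```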
As we can see, a generator comprehension (formally called a generator expression) works very similarly to a list comprehension. However, a list comprehension loads all the data into memory, as reflected by the size of the variable created. A generator comprehension produces its items one at a time, which makes it much more memory-efficient.
Understand the send method of generators
In most cases, you don’t need the send method of a generator; I have not yet had occasion to use it in years of coding with Python. However, understanding how it works helps when adding type annotations to generators. It is also important if you want to understand how the asyncio library works, because coroutines in Python grew out of generators and still rely on the same suspend-and-resume mechanism under the hood.
Let’s update our generator to let it accept a value that is sent by the user. The value will be used to change the stop variable inside the generator function so we can yield more values:
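A sketch of such a generator, in single-argument form for brevity. The names my_range_gen and new_stop are hypothetical:

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        # yield hands counter to the caller. When the caller uses send(),
        # the sent value becomes the result of the yield expression;
        # with a plain next(), the result is None.
        new_stop = yield counter
        if new_stop is not None:
            stop = new_stop
        counter += 1
```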
Note that the value sent to the generator is received by the yield expression. You can assign the result of the yield expression to a variable and use it accordingly. To send a value to the generator, just call the send() method on the generator:
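For example, assuming a generator like the one described above (restated here under the hypothetical name my_range_gen so the snippet runs on its own):

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        new_stop = yield counter
        if new_stop is not None:
            stop = new_stop
        counter += 1


gen = my_range_gen(2)
first = next(gen)     # runs up to the first yield
print(first)          # 0

# send() resumes the generator with 5 as the result of the yield,
# then returns the next yielded value.
second = gen.send(5)
print(second)         # 1

rest = list(gen)
print(rest)           # [2, 3, 4]
```

Without the send call, the generator would have stopped after yielding 0 and 1.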
Note that you can only send a value to a generator after it has started and is paused at a yield; otherwise, you will see a TypeError:
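A sketch that triggers the error, with the exception caught so the snippet completes. The generator is the same hypothetical my_range_gen, restated for self-containment:

```python
def my_range_gen(stop):
    counter = 0
    while counter < stop:
        new_stop = yield counter
        if new_stop is not None:
            stop = new_stop
        counter += 1


gen = my_range_gen(2)
raised = False
try:
    gen.send(5)  # no yield is waiting to receive the value yet
except TypeError as exc:
    raised = True
    print(exc)   # can't send non-None value to a just-started generator

# Sending None to a fresh generator is allowed and behaves like next().
first = my_range_gen(2).send(None)
print(first)     # 0
```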
Type annotation for generators
Finally, let’s add type annotations to our generator functions created above. Adding type annotations to your functions can make your code more robust and much easier to understand. You can get to know the return type without reading the function body.
If a generator function contains both the yield and return keywords and can also accept a value that’s sent from outside, then we need to use the generic type Generator[YieldType, SendType, ReturnType]:
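A sketch of a fully annotated generator function. The name my_range_gen and the returned string are assumptions, and Optional[int] is used as the send type here because a plain next() delivers None to the yield expression:

```python
from typing import Generator, Optional


def my_range_gen(*args: int) -> Generator[int, Optional[int], str]:
    if len(args) == 1:
        start, stop, step = 0, args[0], 1
    elif len(args) == 2:
        start, stop, step = args[0], args[1], 1
    else:
        start, stop, step = args

    counter = start
    while counter < stop:
        new_stop = yield counter  # YieldType: int, SendType: Optional[int]
        if new_stop is not None:
            stop = new_stop
        counter += step
    return "done"                 # ReturnType: str


gen = my_range_gen(2)
print(next(gen))    # 0
print(gen.send(4))  # 1
print(next(gen))    # 2
print(next(gen))    # 3
try:
    next(gen)
except StopIteration as exc:
    final = exc.value
    print(final)    # done
```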
Note that for *args, we only need to add the type annotation for a single argument. For more details, please refer to this discussion.
In most common cases, our generator would only yield values. In this case, we can set the SendType and ReturnType to None:
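For instance, with a yield-only generator (the hypothetical my_range_gen in single-argument form):

```python
from typing import Generator


def my_range_gen(stop: int) -> Generator[int, None, None]:
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


print(list(my_range_gen(3)))  # [0, 1, 2]
```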
Alternatively, when a generator only yields values, we can annotate the return type as either Iterable[YieldType] or Iterator[YieldType], which is more concise and less confusing for those who don’t understand the send method of generators:
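A sketch using Iterator; Iterable[int] could be substituted in the same position. The function name is again hypothetical:

```python
from typing import Iterator


def my_range_gen(stop: int) -> Iterator[int]:
    counter = 0
    while counter < stop:
        yield counter
        counter += 1


print(list(my_range_gen(3)))  # [0, 1, 2]
```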
In this post, we have introduced the technical details of iterables, iterators, and generators, which let you work efficiently with large data sets that would otherwise require a lot of resources, especially memory. The simple code snippets are meant to help you understand the magic methods of iterators and generators, which are normally used as a black box. More focus is put on generators because they are simpler and more widely used. We have also demystified the send method and the type annotations for generators. With this knowledge, you are prepared to understand more advanced features like asyncio in Python.