Python’s Generator Expressions: Fitting Large Datasets into Memory

Luciano Strika
Towards Data Science
4 min read · Aug 20, 2018


Generator Expressions are an interesting Python feature that lets us create lazily generated iterable objects. If your data doesn’t fit in memory, they may be the solution.

This article is a follow-up to the one I wrote introducing List Comprehensions, and I recommend you read that one first if you’ve never tackled the subject before.

What are Generator Expressions?

To create an Iterable with a Generator Expression, all you have to do is write a List Comprehension, but replace the enclosing square brackets with parentheses. All the syntactic rules of List Comprehensions apply here: you can filter a Generator with an if clause at the end, and build a Generator from a matrix with two nested for clauses.
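As a minimal sketch of that syntax (the data and variable names here are made up for illustration), the same expression can be written as a List Comprehension or as a Generator Expression, with a filter or with two nested for clauses:

```python
# Illustrative data; these names are assumptions made for this sketch.
numbers = [1, 2, 3, 4, 5, 6]
matrix = [[1, 2, 3], [4, 5, 6]]

# List Comprehension: square brackets, builds the whole list immediately.
squares_list = [n * n for n in numbers]

# Generator Expression: same syntax, parentheses instead of brackets.
squares_gen = (n * n for n in numbers)

# Filtering with an if clause at the end.
even_squares_gen = (n * n for n in numbers if n % 2 == 0)

# Flattening a matrix with two nested for clauses.
flattened_gen = (cell for row in matrix for cell in row)

# Consume each generator once to look at what it produces.
squares = list(squares_gen)
even_squares = list(even_squares_gen)
flattened = list(flattened_gen)

print(squares_list)
print(squares)
print(even_squares)
print(flattened)
```

Each generator is wrapped in list only so its contents can be printed; in real code you would usually iterate over it directly.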

The interesting thing about generators, though, is that they produce their elements lazily: the i-th element of your Iterable won’t be created (and thus won’t occupy precious memory) until it is needed. The catch is that you cannot index or slice a Generator as you would a List. Instead of retrieving arbitrary elements from the Iterable, you can only iterate over it in order, and only once. That’s also why you can’t call the len function on a Generator: the elements don’t exist yet, so there is nothing to count.
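The laziness and the restrictions can both be observed directly. In this small sketch, noisy_square and the calls list are hypothetical helpers introduced only to record when each element is actually computed:

```python
calls = []  # records every argument noisy_square is called with


def noisy_square(n):
    """Square n, recording the call so we can see when work happens."""
    calls.append(n)
    return n * n


lazy = (noisy_square(n) for n in range(5))

# Creating the generator has not computed any element yet.
calls_before = list(calls)

# Asking for one element computes exactly one element.
first = next(lazy)
calls_after_first = list(calls)

# Indexing and len() are not supported on generators: both raise TypeError.
try:
    lazy[2]
except TypeError as exc:
    index_error = str(exc)

try:
    len(lazy)
except TypeError as exc:
    len_error = str(exc)

# The remaining elements are produced, in order, only when we iterate.
rest = list(lazy)

print(calls_before, first, calls_after_first, rest)
print(index_error)
print(len_error)
```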

The advantage of using Generators: a simple experiment

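To show why generators can be useful, I ran a simple experiment: build the same sequence once as a list and once as a generator, then compare how much memory each object takes and how long each takes to create.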

As you can see, the Generator stores ‘the same’ information using only 80 bytes, whereas the list takes over 80 MB. The Generator was also created a lot faster, though we’re talking about a couple of seconds here. It becomes clear then that, in any problem where memory could be scarce, replacing Lists with Generators may be a smart choice, as long as we keep the aforementioned caveats in mind (no arbitrary retrieval, no len checking).
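A comparison along these lines can be sketched as follows. This is a minimal sketch, not the exact code behind the numbers quoted above: the size N and the variable names are assumptions, and the figures it prints depend on the Python version and platform. Note also that sys.getsizeof reports only the size of the container object itself, not of the integers it refers to.

```python
import sys
import time

N = 1000000  # illustrative size (an assumption); raise it to stress memory

# Build the full list up front and time it.
start = time.perf_counter()
as_list = [i for i in range(N)]
list_seconds = time.perf_counter() - start

# Build the equivalent generator; no element is computed at this point.
start = time.perf_counter()
as_gen = (i for i in range(N))
gen_seconds = time.perf_counter() - start

# Size of the container objects themselves (not the ints they refer to).
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print("list:      %d bytes, created in %.6f s" % (list_bytes, list_seconds))
print("generator: %d bytes, created in %.6f s" % (gen_bytes, gen_seconds))

# Both describe the same sequence; the generator does its work only now.
list_total = sum(as_list)
gen_total = sum(as_gen)
```

The generator’s size stays constant no matter how large N is, while the list grows with it.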

Generators as Iterators

To those coming from a Java/C++ background, it may be interesting to know that a Generator can be used through an interface similar to an Iterator. This is done with the built-in next function, available since Python 2.6; under the hood it calls the generator’s next method in Python 2, which was renamed __next__ in Python 3. Here’s an example of how we would iterate a Generator in Python 2.7:
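As a minimal sketch (the generator name and its contents are just illustrative), manual iteration with the built-in next function looks like this. Written this way, it should behave the same on Python 2.7 and Python 3:

```python
squares = (n * n for n in range(3))

# Each call to next() produces exactly one more element.
first = next(squares)
second = next(squares)
third = next(squares)

# The generator is now exhausted. Passing a default as the second argument
# returns that default instead of raising StopIteration.
after_end = next(squares, None)

print("%d %d %d %s" % (first, second, third, after_end))
```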

We will normally just iterate it like any other Iterable: using a for-loop. However, given non-trivial conditions for ending or continuing the loop, we may want to iterate it manually. To do this, we just call next on the Generator (or its next method in Python 2) until it raises a StopIteration exception. Note that the time spent generating each element on retrieval adds up: consuming the whole Generator takes about as long as initializing the whole list in a non-lazy manner. Finally, given a Generator, we can always convert it into a plain old non-lazy list by calling list(our_generator), paying the whole initialization cost and memory at once.
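Here is a sketch of such a manual loop, assuming a made-up stopping rule (stop before a running total would exceed 20). The names readings, total and consumed are hypothetical:

```python
readings = (n * 3 for n in range(10))  # lazily yields 0, 3, 6, ..., 27

total = 0
consumed = []
while True:
    try:
        value = next(readings)
    except StopIteration:
        break  # the generator ran out of elements
    if total + value > 20:
        break  # custom stop condition; note this value was already pulled
    total += value
    consumed.append(value)

# Whatever is left can still be materialized into a plain list.
# The value that triggered the break is gone: generators cannot rewind.
remaining = list(readings)

print(consumed, total, remaining)
```

The element that triggered the early exit does not appear in remaining, which is the "no going back" caveat in practice.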

A common use of Generators you may have missed

One of my awesome readers submitted another way Generators are used. You are likely familiar with the way we open a file and iterate its lines in Python:
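The familiar pattern looks roughly like this sketch. The temporary file and its contents are made up so that the example is self-contained:

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

line_lengths = []
with open(path) as lines:
    # The file object hands out one line per iteration,
    # rather than reading the whole file into memory first.
    for line in lines:
        line_lengths.append(len(line.rstrip("\n")))

os.remove(path)  # clean up the throwaway file

print(line_lengths)
```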

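That snippet actually reads the file lazily, line by line. Strictly speaking, a file object is a lazy iterator rather than a Generator, but it follows the same protocol and behaves the same way in a for-loop. There you have it, we’d been using lazy iteration all along! How about that for the next Shyamalan movie.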

So to sum up, we can use Generators in any case where we only need to iterate over their result, and don’t care about slicing, indexing or going back. It is generally good to use them in those cases, as we can work through very big datasets without holding them in memory all at once, and without losing expressive power or computational time, as long as we only need to iterate over them one object or row at a time.
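As a closing sketch of that one-row-at-a-time style, Generator Expressions can be chained into a small pipeline that never materializes an intermediate list. The comma-separated rows here are synthetic stand-ins for a real data source:

```python
# Synthetic "rows" standing in for a large data source (an assumption).
rows = ("%d,%d" % (i, i % 7) for i in range(100000))

# Each stage is lazy: no stage runs until sum() starts pulling values.
parsed = (row.split(",") for row in rows)
amounts = (int(fields[1]) for fields in parsed)
large_amounts = (amount for amount in amounts if amount > 3)

# Only one row is alive at a time while the total is accumulated.
total = sum(large_amounts)

print(total)
```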

That was my introduction to Generator Expressions; I hope you’ve found it useful. If there’s any use case you feel I should have covered, any important feature you think I should’ve mentioned, or anything you found just plain wrong, please let me know! I’d also be glad to know if you’ve applied Generators anywhere in your code after reading this.

Finally, there is an O’Reilly book that I love and found very useful when I started my Data Science journey. It’s called Data Science from Scratch with Python, and it’s probably half the reason I got my job. If you read this far, you may enjoy it!

You can see what I’m working on, along with my most recent articles and notes, on my personal site.

As always, keep coding!

B. Sc.+M. Sc. Computer Science, Buenos Aires University. Software Engineer at Microsoft