Randomizing Very Large Datasets

Consider the problem of randomizing a dataset that is so large it doesn’t even fit into memory. This article describes how you can do it easily and (relatively) quickly in Python.

Douglas Blank, PhD
Towards Data Science


These days it is not at all uncommon to find datasets measured in Gigabytes, or even Terabytes. That much data can help tremendously in training robust Machine Learning models. But how can you randomize such large datasets?


Imagine that you have a very large dataset with one item per line in a file. The details of the data are irrelevant for our goals here. The dataset could be lines in a CSV (comma-separated values) or TSV (tab-separated values) file, or each line could be a JSON object, or the X, Y, Z values of a point in a large point cloud. All we need is for the dataset to be formatted with one item per line.
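As a concrete (and purely hypothetical) illustration, each line might be one small JSON object. The field names and the filename points.jsonl below are made up; all that matters for everything that follows is the one-item-per-line layout:

import json

# Hypothetical example: write a tiny dataset with one JSON object per line.
# The fields (x, y, z) and the filename are placeholders, not part of the
# article's test data.
with open("points.jsonl", "w") as fp:
    for x, y, z in [(0.0, 1.0, 2.0), (3.5, 4.5, 5.5)]:
        fp.write(json.dumps({"x": x, "y": y, "z": z}) + "\n")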

For files containing smaller datasets, one could randomize the file (called “shuffling”) in memory using a simple Python function like this:

import random

def shuffle_in_memory(filename_in, filename_out):
    # Shuffle a file, line-by-line
    with open(filename_in) as fp:
        lines = fp.readlines()
    # Randomize them in place:
    random.shuffle(lines)
    # Write the new order out:
    with open(filename_out, "w") as fp:
        fp.writelines(lines)

The shuffle_in_memory() function takes an input filename and an output filename, shuffles the lines in memory using the built-in random.shuffle(), and writes the randomized data out. As its name implies, this function requires that all of the lines of the file be loaded into memory at once.
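Because readlines() pulls the entire file into RAM, it is worth checking the file’s size on disk before calling it. Here is a minimal sketch (the helper name and the 1 GB budget are my own assumptions, not part of the article’s code):

import os

def fits_in_memory(filename, memory_limit=1_000_000_000):
    # Rough heuristic: compare the on-disk size (in bytes) to a memory
    # budget. The Python list of strings will take noticeably more RAM
    # than the raw file size, so leave yourself plenty of headroom.
    return os.path.getsize(filename) < memory_limit

This is essentially the same check that the recursive shuffle() function below uses to decide when it can fall back to shuffle_in_memory().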

To test this function out, let’s make some test files. The function make_file() takes the number of lines that you would like in the test file. The function will create the file and return the filename.

import os

def make_file(lines):
    filename = "test-%s.txt" % lines
    print("Making test file '%s'..." % filename)

    with open(filename, "w") as fp:
        for i in range(lines):
            fp.write(f"Line {i}\n")

    print("Done!")
    return filename

For example, making a file named “test-1000.txt” with 1,000 lines in it looks like this:

filename_in = make_file(1000)

After running this function, you should find a file named “test-1000.txt” in your current directory, with 1,000 lines of text, like:

Line 0
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
...

To test out our shuffle_in_memory() function, we’ll name an output file, save the string in the variable filename_out, and call the function:

filename_out = "test-randomized-1000.txt"
shuffle_in_memory(filename_in, filename_out)

Now, there should be a second file in your directory named “test-randomized-1000.txt”. It should be exactly the same size as “test-1000.txt” with exactly the same lines, but in a randomized order:

Line 110
Line 592
Line 887
Line 366
Line 52
Line 22
Line 891
Line 83
Line 931
Line 408
...
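If you want to convince yourself that the shuffled file contains exactly the same lines as the original, just in a different order, a quick sanity check is to compare the sorted contents. (This helper is my own addition, and it reads both files into memory, so it is only suitable for the smaller test files.)

def same_lines(filename1, filename2):
    # True if both files contain exactly the same lines, ignoring order.
    with open(filename1) as fp1, open(filename2) as fp2:
        return sorted(fp1.readlines()) == sorted(fp2.readlines())

print(same_lines(filename_in, filename_out))  # should print True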

Okay, now for the big question: what do we do if we have a very large file? Let’s make a medium-sized one with, say, 10 million lines. (For most computers this is still small enough to randomize in memory, but it is large enough to practice with.) As before, we make the input file with a call to make_file():

filename_in_big = make_file(10_000_000)

This will take a few seconds. Afterwards you should have a file named “test-10000000.txt” in your directory. It should begin as before, but will have 10 million lines. The file is about 128 MB.
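You can check the exact size with os.path.getsize(); this is the same number that the shuffle function below will compare against its memory_limit parameter:

print(os.path.getsize(filename_in_big))  # 128888890 bytes, about 128 MB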

How do we randomize it? If we don’t want to use all of our RAM, or if we don’t have enough RAM, we can use the hard disk instead. Here is a recursive algorithm inspired by a related problem, sorting: the following shuffle() function is modeled on merge sort.

First, it checks whether the file is small enough to be shuffled in memory (the base case, in recursive-function parlance). The parameter memory_limit is given in bytes. If the file size is smaller than memory_limit, the file will be shuffled in memory. If it is too big, it is randomly split into a number of smaller files, and each of those is recursively shuffled. Finally, the contents of the smaller shuffled files are merged back together.

Here is the function:

import tempfile

def shuffle(filename_in, filename_out, memory_limit, file_split_count,
            depth=0, debug=False):
    if os.path.getsize(filename_in) < memory_limit:
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Shuffle in memory...")
        shuffle_in_memory(filename_in, filename_out)
    else:
        if debug: print(
            " " * depth, f"Level {depth + 1}",
            f"{os.path.getsize(filename_in)} is too big;",
            f"Split into {file_split_count} files..."
        )
        # Split the big file into smaller files
        temp_files = [tempfile.NamedTemporaryFile('w+', delete=False)
                      for i in range(file_split_count)]
        for line in open(filename_in):
            random_index = random.randint(0, len(temp_files) - 1)
            temp_files[random_index].write(line)

        # Now we shuffle each smaller file
        for temp_file in temp_files:
            temp_file.close()
            shuffle(temp_file.name, temp_file.name, memory_limit,
                    file_split_count, depth + 1, debug)

        # And merge back in place of the original
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Merge files...")
        merge_files(temp_files, filename_out)

If this were a sorting algorithm, we would merge the files back together carefully to create a sorted ordering. For shuffling, however, we don’t care about merging them in any particular order, because we want them randomized. The merge_files() function therefore looks like:

def merge_files(temp_files, filename_out):
    with open(filename_out, "w") as fp_out:
        for temp_file in temp_files:
            with open(temp_file.name) as fp:
                line = fp.readline()
                while line:
                    fp_out.write(line)
                    line = fp.readline()

Note that we are careful not to read all of the lines of the files into memory at once. Let’s test this out by setting the memory limit to exactly the size of the file. Since the file size (128,888,890 bytes) is not smaller than that limit, the file will be broken into a number of smaller files. For this example, let’s break the big file into 2, each of which will be small enough to shuffle in memory:

filename_out_big = "test-randomized-10000000.txt"
shuffle(filename_in_big, filename_out_big, 128_888_890, 2, debug=True)

This call results in the following output:

 Level 1 128888890 is too big; Split into 2 files...
  Level 2 Shuffle in memory...
  Level 2 Shuffle in memory...
 Level 1 Merge files...

And the resulting “test-randomized-10000000.txt” file should have 10 million lines, all randomized. A better test would be to make the memory limit much smaller than the file to randomize, and to split too-big files into more than 2 pieces. Let’s say we only want to use about 1 MB of RAM and break the files into 20 smaller files:

shuffle(filename_in_big, filename_out_big, 1_000_000, 20, debug=True)

This example will use no more than 1 MB of RAM, and recursively decompose the subfiles that are bigger than that, 20 at a time.
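One caveat: because the temporary files are created with delete=False, they are left behind on disk after the merge. A minimal cleanup variant of merge_files() (my own sketch, not part of the original code) removes each piece once its lines have been copied out:

def merge_files_and_cleanup(temp_files, filename_out):
    # Same merging behavior as merge_files(), but deletes each temporary
    # file after its lines have been written to the output file.
    with open(filename_out, "w") as fp_out:
        for temp_file in temp_files:
            with open(temp_file.name) as fp:
                for line in fp:
                    fp_out.write(line)
            os.remove(temp_file.name)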

This algorithm will work on files of any size (well, you need enough disk space!). The more memory you can allocate to shuffle_in_memory(), the faster it will run. If the number of smaller files is too large, then you’ll spend too much time opening and closing files. You can try different numbers for file_split_count, but I’ve had good luck with values between 20 and 200. The larger the initial file, the more subfiles you’ll probably want.
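If you would rather not guess at file_split_count by hand, one rough heuristic (my own assumption, not something the algorithm requires) is to derive it from the file size and your memory budget, clamped to that 20 to 200 range:

import math

def guess_split_count(filename, memory_limit, low=20, high=200):
    # Enough pieces that each one is likely to fall under memory_limit,
    # clamped to a range that avoids opening too many files at once.
    # (os was imported earlier in the article.)
    pieces = math.ceil(os.path.getsize(filename) / memory_limit)
    return max(low, min(high, pieces))

shuffle(filename_in_big, filename_out_big, 1_000_000,
        guess_split_count(filename_in_big, 1_000_000), debug=True)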

There are other algorithms that you can also use. I had high hopes for writing all of the lines into a SQLite database and SELECTing them in a random order, but it was no faster than the above code.

import sqlite3

def shuffle_sql(filename_in, filename_out, memory_limit, depth=0, debug=False):
    if os.path.getsize(filename_in) < memory_limit:
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Shuffle in memory...")
        shuffle_in_memory(filename_in, filename_out)
    else:
        if debug: print(
            " " * depth, f"Level {depth + 1}",
            f"{os.path.getsize(filename_in)} is too big;",
            f"Writing to SQLite database..."
        )
        temp_db = tempfile.NamedTemporaryFile(delete=False)
        connection = sqlite3.connect(temp_db.name)
        cursor = connection.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS lines (
                line TEXT
            );
        """)
        with open(filename_in) as fp:
            line = fp.readline()
            while line:
                cursor.execute("INSERT INTO lines (line) VALUES (?);", [line])
                line = fp.readline()
        connection.commit()
        with open(filename_out, "w") as fp:
            for line in cursor.execute("""
                SELECT line FROM lines ORDER BY random();
            """):
                fp.write(line[0])

shuffle_sql(filename_in_big, filename_out_big, 1_000_000, debug=True)

Can you beat the recursive shuffle algorithm, staying purely in Python? If so, I’d love to hear about it!

Interested in Artificial Intelligence, Machine Learning, and Data Science? Consider a clap and a follow. Let me know what you are interested in!
