The Python Standard Library — modules you should know as a data scientist

with usage examples

Amanda Iglesias Moreno
Towards Data Science

--

The Python Standard Library contains a wide range of modules to deal with everyday programming and is included with the standard version of Python, meaning no additional installation is required. It provides modules for such tasks as interacting with the operating system, reading and writing CSV files, generating random numbers, and working with dates and time. This article describes 8 modules of the Python Standard Library that I am sure you will come across when programming in Python. Let’s get started! 🙌

Photo by Chris Ried on Unsplash

1. Zipfile

The zipfile library provides tools to easily work with zip files. It allows you to create, read, and write zip files directly in Python, without needing an external program.

Read a zip file

To open a zip file in Python, we use the zipfile.ZipFile class, providing as input the file path and the opening mode. Since we want to open the file in reading mode, we provide mode='r'.

The ZipFile constructor returns a ZipFile object which we assign to the variable myzip. Then, we use the .extractall(path=None, members=None, pwd=None) method to extract all members from the zip file to the current working directory. To extract only some members of the zip file, we pass their names to the members argument.

After running the code above, we can open the current directory to check that the files were correctly extracted. Alternatively, we can use the os.listdir() function (defined in the os module) to obtain a list with the entries contained in the working directory.

Finally, we close the zip file using the .close() method. It is important to remember to close all files when we no longer need them to avoid running out of file handles.

Since it is easy to forget to close a file when working in Python, we can use the with statement which automatically closes the zip file after the nested block of code is executed.

If we want to obtain a list of members contained in the zip file without extracting them, we can use the .namelist() method as follows.
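The reading workflow described above can be sketched as follows. The original archive is not available here, so the snippet first builds a small sample archive in a temporary directory; the names sample.zip, file_1.txt, and file_2.txt are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small sample archive so the example is self-contained
    # (the file names are hypothetical).
    zip_path = os.path.join(tmp_dir, "sample.zip")
    with zipfile.ZipFile(zip_path, mode="w") as myzip:
        myzip.writestr("file_1.txt", "first file")
        myzip.writestr("file_2.txt", "second file")

    # The with statement closes the archive automatically.
    with zipfile.ZipFile(zip_path, mode="r") as myzip:
        # List the members without extracting them.
        members = myzip.namelist()
        # Extract every member into a target directory.
        extract_dir = os.path.join(tmp_dir, "extracted")
        myzip.extractall(path=extract_dir)

    extracted = sorted(os.listdir(extract_dir))

print(members)
print(extracted)
```

Passing path to .extractall() keeps the extracted files out of the current working directory; omitting it extracts them there, as described in the text.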

Write a zip file

We create and open a new zip file in Python using the zipfile.ZipFile class in writing mode (mode='w'). After creating the zip file, we call the .write() method once for each file we want to add.

The code above creates a new zip file (new_file.zip) in the current working directory, containing two .txt files (file_1.txt and file_2.txt).
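A minimal sketch of the writing step, assuming the two text files already exist. To stay self-contained, the snippet creates them in a temporary directory rather than in the current working directory.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small text files to add to the archive.
    for name in ("file_1.txt", "file_2.txt"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write(f"contents of {name}\n")

    zip_path = os.path.join(tmp_dir, "new_file.zip")
    with zipfile.ZipFile(zip_path, mode="w") as new_zip:
        for name in ("file_1.txt", "file_2.txt"):
            # arcname stores the file under its bare name instead of its full path.
            new_zip.write(os.path.join(tmp_dir, name), arcname=name)

    # Reopen the archive to confirm what was written.
    with zipfile.ZipFile(zip_path, mode="r") as new_zip:
        written_members = new_zip.namelist()

print(written_members)
```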

2. Random

Random numbers play an important role in artificial intelligence and data science. We use random numbers to shuffle the training data prior to each epoch, to set initial weights in a neural network, to separate training and testing data, or to conduct A/B testing.

The Python Standard Library provides a wide range of functions for generating random numbers. Let’s see some of them!

random.randint

The random.randint(a,b) function returns a random integer between a and b (both included).
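As a small illustration, the snippet below simulates ten rolls of a die. The seed value is an arbitrary choice that only makes the run reproducible.

```python
import random

random.seed(42)  # optional: makes the sequence reproducible

# Both bounds are included, so 1 and 6 can both appear.
rolls = [random.randint(1, 6) for _ in range(10)]
print(rolls)
```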

Numpy also provides a function for generating random integers (numpy.random.randint). However, unlike random.randint, the upper bound is not included.

random.choices

The random.choices(population, weights=None, *, cum_weights=None, k=1) function returns a list of k elements randomly selected from the population with replacement. We can weight the probability of each value being selected using the weights parameter.
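A short sketch of weighted selection; the color list and the weights are made-up values for illustration.

```python
import random

colors = ["red", "green", "blue"]

# "red" is three times as likely to be drawn as each of the other colors.
# Selection is with replacement, so the same color can appear more than once.
picks = random.choices(colors, weights=[3, 1, 1], k=5)
print(picks)
```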

random.shuffle

The random.shuffle(x) function shuffles the sequence x in place, returning None (the optional second argument random was deprecated in Python 3.9 and removed in Python 3.11). Because the shuffle happens in place, the input must be a mutable sequence. If we provide an immutable sequence as input (e.g. a string), an exception (TypeError) is raised.

To shuffle immutable objects, we can use the random.sample(population, k) function. This function returns a list of k elements randomly selected from the population without replacement. By setting the second argument of the function k=len(population), we obtain a new list with all the elements randomly shuffled.

As shown above, we can convert the shuffled list to a tuple using the built-in function tuple().
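The difference between the two functions can be sketched as follows, using a small list and a small tuple chosen for illustration.

```python
import random

numbers = [1, 2, 3, 4, 5]
result = random.shuffle(numbers)  # shuffles the list in place
print(result)   # None
print(numbers)  # same elements, new order

letters = ("a", "b", "c", "d")
try:
    random.shuffle(letters)  # tuples do not support item assignment
except TypeError as error:
    print(f"TypeError: {error}")

# random.sample returns a new list, so the tuple itself is left untouched.
shuffled_letters = tuple(random.sample(letters, k=len(letters)))
print(shuffled_letters)
```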

3. Os

The os module provides functions for interacting with the operating system. It contains numerous tools for working with directories, paths, and files. In this article, we will only cover some of the functionalities provided by os.

Get the current working directory

The os.getcwd() function returns the path of the current working directory.

As shown above, the function takes no arguments and returns a string data type.

Change the current working directory

The os.chdir() function modifies the current working directory to the given path, returning None.

After changing the directory, we can verify the modification using the os.getcwd() function.
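A minimal sketch of both functions. It switches to a temporary directory (an assumption made so the example runs anywhere) and restores the original directory afterwards.

```python
import os
import tempfile

original_dir = os.getcwd()
print(type(original_dir))  # <class 'str'>

with tempfile.TemporaryDirectory() as tmp_dir:
    try:
        result = os.chdir(tmp_dir)  # returns None
        new_dir = os.getcwd()
        # Compare resolved paths, since the temp directory may sit behind a symlink.
        changed = os.path.realpath(new_dir) == os.path.realpath(tmp_dir)
    finally:
        # Always go back, so the rest of the program is unaffected.
        os.chdir(original_dir)

print(changed)
```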

List of files and directories

The os.listdir(path) function returns a list of all files and directories in the specified path. If path is omitted, the os.listdir() function returns a list of entries in the current working directory.

We can employ the os.listdir() function in combination with other functions to filter the returned list. As shown below, we use the string method .endswith() to obtain a list of text files present in the current working directory.
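The filtering step can be sketched as follows. The file names are hypothetical and are created in a temporary directory so the example does not depend on the contents of your working directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty sample files.
    for name in ("file_1.txt", "file_2.txt", "students.csv"):
        open(os.path.join(tmp_dir, name), "w").close()

    entries = os.listdir(tmp_dir)
    # Keep only the entries whose name ends with ".txt".
    text_files = sorted(name for name in entries if name.endswith(".txt"))

print(text_files)
```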

Rename a file or directory

Python allows you to rename a file or directory programmatically using the os.rename(src,dst) function. This function renames the file or directory src to dst and returns None. If you try to rename a nonexistent file or directory, Python raises an OSError exception (more precisely, its subclass FileNotFoundError).

The following block of code renames the file file_1.txt to file_new_name.txt.

As shown above, we use the os.listdir() function to check that the modification was carried out correctly.
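A minimal sketch of the renaming step, run inside a temporary directory so no real files are touched. It also shows the exception raised when the source no longer exists.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    old_path = os.path.join(tmp_dir, "file_1.txt")
    new_path = os.path.join(tmp_dir, "file_new_name.txt")
    open(old_path, "w").close()  # create an empty sample file

    os.rename(old_path, new_path)
    after_rename = os.listdir(tmp_dir)

    rename_error = None
    try:
        os.rename(old_path, new_path)  # old_path no longer exists
    except OSError as error:
        rename_error = type(error).__name__

print(after_rename)
print(rename_error)
```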

Make a new directory

The function os.mkdir(path[,mode]) allows you to create a directory named path with a numeric mode. The mode represents the directory's permissions as an octal number (who can read, write, or execute the directory). If the parameter mode is omitted, the default value 0o777 is used, which requests read, write, and execute permissions for the owner, the group, and all other users. Where the mode is used, the current umask value is first masked out, and on some systems the mode is ignored.

In the following block of code, we create a new directory called new_dir in the current working directory.

The function os.mkdir() creates a directory in an existing directory. If you try to create a directory in a place that does not exist, an exception is raised. Alternatively, you can use the function os.makedirs(), which creates any missing intermediate folders.
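The difference between the two functions can be sketched as follows. The directory names parent and child are hypothetical, and everything is created inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # The parent (tmp_dir) exists, so os.mkdir succeeds.
    os.mkdir(os.path.join(tmp_dir, "new_dir"))

    nested = os.path.join(tmp_dir, "parent", "child")
    mkdir_failed = False
    try:
        os.mkdir(nested)  # "parent" does not exist yet
    except FileNotFoundError:
        mkdir_failed = True

    # os.makedirs creates "parent" and then "child".
    os.makedirs(nested, exist_ok=True)

    created = sorted(os.listdir(tmp_dir))
    nested_exists = os.path.isdir(nested)

print(mkdir_failed, created, nested_exists)
```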

Start a file

The os.startfile(path[,operation]) function, which is available only on Windows, starts a file with its associated application. The default operation is 'open', which behaves like double-clicking the file on your computer.

The following code will open the .txt file (file_2) on your machine with its associated program (in my case Windows Editor).

These are only a few of the functions available in the os module. Read the documentation to discover more of them!

4. Time

Measure execution time

We can measure the execution time of a block of code using the time.time() function. This function returns the number of seconds passed since the epoch. The epoch is the point where time begins (time.time() would return 0). It is platform dependent, but on Windows and most Unix systems it is January 1, 1970, 00:00:00 (UTC).

We calculate the execution time (wall-clock time) by subtracting the time recorded before the block of code from the time recorded after it.

If we want to calculate the CPU time instead of the elapsed time, we use the time.process_time() function (the older time.clock() function was deprecated in Python 3.3 and removed in Python 3.8). The elapsed time is generally longer than the CPU time, since the CPU may also execute other instructions while the block of code is running.
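A minimal sketch of this pattern. The summation is only a placeholder workload standing in for whatever block of code you want to time.

```python
import time

start = time.time()

# Placeholder workload: any block of code could go here.
total = sum(i * i for i in range(1_000_000))

end = time.time()
elapsed = end - start
print(f"Elapsed time: {elapsed:.4f} seconds")
```

For short measurements, time.perf_counter() is usually a better fit than time.time(), because it uses the highest-resolution clock available and is not affected by system clock adjustments.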
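The gap between CPU time and elapsed time can be sketched with time.process_time(), the function that current Python versions provide for CPU time (time.clock() was removed in Python 3.8). The sleep and the summation are placeholder workloads.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)  # waiting: wall-clock time passes, but almost no CPU is used
total = sum(i * i for i in range(200_000))  # actual CPU work

cpu_time = time.process_time() - cpu_start
wall_time = time.perf_counter() - wall_start

print(f"CPU time:     {cpu_time:.4f} seconds")
print(f"Elapsed time: {wall_time:.4f} seconds")
```

The elapsed time includes the sleep, while the CPU time mostly reflects the summation.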

Pause the execution

The time.sleep(secs) function pauses the execution of the program for the given number of seconds secs. This function comes in handy when scraping a web page in Python.

Most of the time, we want to execute the code as quickly as possible. However, in web scraping, it is recommended to pause the execution of the program between requests so that the server is not overwhelmed.

As shown above, the execution is suspended 2 seconds between requests.
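A minimal sketch of pausing between requests. The fetch function and the example.com URLs are hypothetical stand-ins; in a real scraper, fetch would perform an HTTP request.

```python
import time


def fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    return f"<html>content of {url}</html>"


def scrape(urls, delay_seconds=2.0):
    """Fetch each URL, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # give the server a break
        pages.append(fetch(url))
    return pages


urls = ["https://example.com/page/1", "https://example.com/page/2"]
# The default delay is 2 seconds; it is shortened here so the demo finishes quickly.
pages = scrape(urls, delay_seconds=0.1)
print(len(pages))
```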

5. Datetime

The datetime library provides many tools for working with dates and times in Python. We can easily get the current time, subtract two dates, or convert dates into custom-formatted strings using the datetime module.

After importing the library, we create a datetime object with the datetime.datetime() constructor. It requires three arguments: (1) year, (2) month, and (3) day. The hour, minute, second, microsecond, and time zone arguments are optional.

As shown above, the function returns a datetime object. This object has the following attributes: year, month, day, hour, minute, second, microsecond, and tzinfo. We can access them using dot notation, or alternatively using the getattr() function.
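A short sketch of creating a datetime object and reading its attributes. The date and time used are arbitrary values chosen for illustration.

```python
import datetime

# year, month, day are required; hour, minute, second are optional.
moment = datetime.datetime(2020, 5, 17, 14, 30, 15)
print(moment)  # 2020-05-17 14:30:15

# Access attributes with dot notation...
print(moment.year, moment.month, moment.day)

# ...or with getattr().
print(getattr(moment, "hour"))

# No time zone was provided, so tzinfo is None.
print(moment.tzinfo)
```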

Next, we explain in detail how to get the current date, subtract two dates, and convert a datetime object into a string, and vice versa.

Current local date and time

We can easily obtain the current local date and time using the datetime.datetime.now(tz=None) function.

As you can observe, the function returns a datetime object, which is displayed in the format YYYY-MM-DD HH:MM:SS.ffffff (the last field being microseconds).
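A minimal sketch of both ways of calling the function. Passing datetime.timezone.utc as tz is just one possible choice of time zone.

```python
import datetime

# Local date and time, without time zone information (a "naive" datetime).
now = datetime.datetime.now()
print(now)

# Current date and time in UTC, with time zone information attached.
now_utc = datetime.datetime.now(tz=datetime.timezone.utc)
print(now_utc)
```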

Subtract two dates

We can subtract two datetime objects in Python, obtaining as a result a timedelta object. This object represents the time span between both dates.

As shown below, we calculate how long George Orwell lived by subtracting his birth date from his death date.
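The subtraction can be sketched as follows, using Orwell's birth date (25 June 1903) and death date (21 January 1950). The age calculation at the end is an addition for illustration.

```python
import datetime

birth = datetime.datetime(1903, 6, 25)
death = datetime.datetime(1950, 1, 21)

lifespan = death - birth
print(type(lifespan))  # <class 'datetime.timedelta'>
print(lifespan.days)   # time span expressed in days

# Age in whole years: subtract one if the last birthday had not been reached.
age = death.year - birth.year - ((death.month, death.day) < (birth.month, birth.day))
print(age)
```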

Convert a datetime object to a string

The .strftime(format) method converts a datetime object to a string, accepting a single string argument (format) as input. This argument specifies which part of the datetime object we want to return and in what format.

The following table contains some of the directives available in Python.

Next, we create different formatted strings to represent George Orwell’s birth date, using the directives listed above.
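A sketch of a few formats, using a handful of common directives: %Y (four-digit year), %m (zero-padded month), %d (zero-padded day), %B (full month name), and %A (full weekday name). The particular formats are chosen for illustration.

```python
import datetime

birth = datetime.datetime(1903, 6, 25)

iso_style = birth.strftime("%Y-%m-%d")
european = birth.strftime("%d/%m/%Y")
# Month and weekday names depend on the active locale.
long_form = birth.strftime("%A, %B %d, %Y")

print(iso_style)
print(european)
print(long_form)
```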

Convert a string to a datetime object

The datetime.datetime.strptime(date_string, format) class method creates a datetime object from a string and is the inverse of the .strftime() method. To work correctly, the date_string passed as input needs to match the specified format. If not, an exception (ValueError) is raised.

In the example below, we obtain a datetime object from a string representing George Orwell’s birth date.
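A minimal sketch of parsing, including the error raised when the string does not match the format. The day/month/year format is an assumption chosen for illustration.

```python
import datetime

# The string matches the format, so parsing succeeds.
birth = datetime.datetime.strptime("25/06/1903", "%d/%m/%Y")
print(birth)

# The string does not match the format, so a ValueError is raised.
parse_failed = False
try:
    datetime.datetime.strptime("1903-06-25", "%d/%m/%Y")
except ValueError as error:
    parse_failed = True
    print(f"ValueError: {error}")
```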

6. Csv

A comma-separated values (CSV) file is a common format used to transfer information. The information is structured as a table where each row contains a record and each column a field, with the fields separated by commas. Although commas are the most common separator, we can use other delimiters such as spaces or tabs.

The following figure shows a CSV file that contains information about students. As shown below, the first line of the file contains the field names (column headers).

Since a CSV file is a plain text file, we can create and open it using a text editor like Microsoft Notepad.

The Python Standard Library provides a built-in module that contains classes to read, process, and write CSV files. Although this module is quite helpful for simple manipulations, it is recommended to use Pandas for more complex numerical analysis.

Read a CSV file — reader function

After importing the library, we open the CSV file with the built-in function open. Next, we pass the file object to the csv.reader() function, storing the output of the function in a variable called reader. Then, we can access each line of the file by iterating over the reader object using a for loop.

As shown above, each row returned by the reader object is a list of strings.

Alternatively, we can use the next() function every time we want to access the next line. This function returns the next element from the iterator reader by calling its __next__() method.

By default, the separator is a comma. However, if another delimiter is used, we have to specify it with the delimiter argument as follows.
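The reading steps above can be sketched as follows. The student records are hypothetical sample data, written to a temporary directory so the example is self-contained.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Sample file created only for this example (hypothetical student data).
    csv_path = os.path.join(tmp_dir, "students.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("name,age,city\nAna,21,Madrid\nLuis,23,Valencia\n")

    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)           # first line: the column headers
        rows = [row for row in reader]  # remaining lines, each a list of strings

    # Same idea with semicolons: pass the delimiter explicitly.
    semicolon_path = os.path.join(tmp_dir, "students_semicolon.csv")
    with open(semicolon_path, "w", newline="") as f:
        f.write("name;age;city\nAna;21;Madrid\n")
    with open(semicolon_path, newline="") as f:
        semicolon_rows = list(csv.reader(f, delimiter=";"))

print(header)
print(rows)
print(semicolon_rows)
```

Note that every value is returned as a string, including the ages.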

Read a CSV file — DictReader function

The csv.DictReader class returns the rows of the CSV file as dictionaries rather than as lists (ordered dictionaries in Python 3.6 and 3.7, regular dictionaries from Python 3.8 onward). The keys of the dictionaries are specified in the fieldnames parameter. If fieldnames is omitted, the values contained in the first row are used as keys.

An ordered dictionary is a dictionary that remembers the order in which its items were added. Since Python 3.7, regular dictionaries are also guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail).
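A minimal sketch of both cases, with and without fieldnames. For brevity, io.StringIO stands in for an open file, and the student records are hypothetical sample data.

```python
import csv
import io

data = "name,age,city\nAna,21,Madrid\nLuis,23,Valencia\n"

# Without fieldnames, the first row supplies the keys.
reader = csv.DictReader(io.StringIO(data))
records = [dict(row) for row in reader]
print(records)

# With fieldnames, every row (including the first) is treated as data.
reader = csv.DictReader(io.StringIO("Ana,21,Madrid\n"),
                        fieldnames=["name", "age", "city"])
explicit = [dict(row) for row in reader]
print(explicit)
```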

Write a CSV file — writer function

To write data to a CSV file, first, we open the CSV file in writing mode with the built-in function open. Then, we provide the open file object returned by the open function as input to the csv.writer() function, obtaining a writer object.

The writer object supports two methods for writing data to a CSV file.

  • csvwriter.writerow(row) → This method writes a row of data to the CSV file.
  • csvwriter.writerows(rows) → This method writes all given rows (an iterable of row objects) to the CSV file.

Next, we write a CSV file containing information about works of art using the writerows method. As you can observe, this method takes as input a nested list.

After running the code above, a CSV file (works_of_art.csv) is created in the current working directory.
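A sketch of the writing step with writerows. The rows are illustrative sample data rather than the data set used in the original example, and the file is written to a temporary directory instead of the current working directory.

```python
import csv
import os
import tempfile

# Nested list: the first inner list is the header row.
rows = [
    ["title", "artist", "year"],
    ["The Starry Night", "Vincent van Gogh", 1889],
    ["Guernica", "Pablo Picasso", 1937],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "works_of_art.csv")

    # newline="" lets the csv module control the line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(rows)

    # Read the file back to check what was written.
    with open(csv_path, newline="", encoding="utf-8") as f:
        written_lines = f.read().splitlines()

print(written_lines)
```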

Write a CSV file — DictWriter function

The csv.DictWriter() function returns an object that supports the aforementioned methods (writerow and writerows) for writing data to a CSV file. This function requires as input the header information (argument fieldnames) as well as the open file object.

In this case, the writerows method takes as input a list of dictionaries instead of a nested list.

After running the code above, a CSV file (works_of_art_2.csv) is created in the current working directory.
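A sketch of the same task with DictWriter. The rows are again illustrative sample data, and the call to writeheader(), which writes the field names as the first line, is an addition not mentioned in the text.

```python
import csv
import os
import tempfile

fieldnames = ["title", "artist", "year"]
works = [
    {"title": "The Starry Night", "artist": "Vincent van Gogh", "year": 1889},
    {"title": "Guernica", "artist": "Pablo Picasso", "year": 1937},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "works_of_art_2.csv")

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()     # writes the header row from fieldnames
        writer.writerows(works)  # one line per dictionary

    # Read the file back to check what was written.
    with open(csv_path, newline="", encoding="utf-8") as f:
        dict_written_lines = f.read().splitlines()

print(dict_written_lines)
```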

7. Glob

The glob module allows you to find a list of files and paths matching a given pattern by using wildcard notation. A wildcard is a special character we can use to select multiple similar names.

The following table shows the most common wildcards you can use with glob.

Next, we employ this notation with the glob.glob() function to obtain a list of matching files and paths from the current working directory (shown below).

  • List of all CSV files.
  • Files and directories that begin with the word students.
  • Txt files: file_2.txt, file_3.txt, and file_4.txt.
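The three searches listed above can be sketched as follows, using the wildcards * (any number of characters) and [2-4] (one character from the given range). The directory contents are hypothetical and are created in a temporary directory; the helper function match is also an assumption made for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory contents.
    for name in ("file_1.txt", "file_2.txt", "file_3.txt", "file_4.txt",
                 "students.csv", "students_2.csv", "works_of_art.csv"):
        open(os.path.join(tmp_dir, name), "w").close()
    os.mkdir(os.path.join(tmp_dir, "students_data"))

    def match(pattern):
        """Return the sorted base names of the entries matching the pattern."""
        full_pattern = os.path.join(glob.escape(tmp_dir), pattern)
        return sorted(os.path.basename(p) for p in glob.glob(full_pattern))

    csv_files = match("*.csv")               # all CSV files
    students_entries = match("students*")    # names beginning with "students"
    selected_txt = match("file_[2-4].txt")   # file_2.txt to file_4.txt

print(csv_files)
print(students_entries)
print(selected_txt)
```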

As you can observe, the glob function provides more flexibility than the listdir function (defined in the os module) for listing files in a directory.

8. Difflib

The difflib module contains a variety of functions and classes for comparing sequences, and is especially helpful for computing differences between texts and strings.

The difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6) function comes in handy for detecting misspellings and typos, matching a word against a set of possibilities. This function takes two optional arguments, n and cutoff. The argument n is the maximum number of matches to return (3 by default). The argument cutoff ranges from 0 to 1 and is the minimum similarity ratio an element of possibilities must reach to be returned, meaning the element is similar enough to the word.

Next, we use this function to obtain the population of a country entered by the user. First, we obtain the data from the following webpage using the libraries Requests and Pandas.

Then, we ask the user to enter a country to access its population. As shown below, if you make a typo, the program suggests a similar country name.
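The suggestion step can be sketched as follows. This is a simplified version under stated assumptions: a short hard-coded list of country names replaces the scraped population table, and the helper suggest_country is hypothetical.

```python
import difflib

# Hypothetical stand-in for the country names obtained from the webpage.
countries = ["Germany", "France", "Spain", "Italy", "Portugal"]


def suggest_country(name, known=countries):
    """Return name if it is known, else the closest known name (or None)."""
    if name in known:
        return name
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    return matches[0] if matches else None


suggestion = suggest_country("Germny")  # typo for "Germany"
no_match = suggest_country("Xyz")       # nothing similar enough

print(suggestion)
print(no_match)
```

The comparison is case-sensitive, so normalizing the case of both the input and the known names first may improve the suggestions.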

This module provides many more functions for identifying differences and similarities between texts. As always, refer to the documentation for more details on how to use this module! :)

Interesting websites

Besides the official Python documentation, there are multiple web pages where you can find detailed explanations of how to use the modules available in the Python Standard Library.

In this article, we have explained a limited number of modules, since the Python Standard Library contains more than 200! That means we will cover the Python Standard Library again in future articles. 💪

Amanda 💜
