Python Pandas Reading a CSV

Learn how to read a CSV file and create a Pandas DataFrame

Photo by Mika Baumeister on Unsplash

Introduction

As a Data Analyst or Data Scientist, you will frequently have to combine and analyse data from various data sources. One of the formats I am most commonly asked to analyse is the CSV file. CSV files are popular within the corporate world because they are simple, portable, easy to use and are often the output format for corporate systems. Today we will demonstrate how to use Python and Pandas to open and read a CSV file on your local machine.

Getting Started

You can install Pandas via pip from PyPI. If this is your first time installing Python packages, please refer to Pandas Series & DataFrame Explained or Python Pandas Iterating a DataFrame. Both of these articles will provide you with the installation instructions and background knowledge for today’s article.

Syntax

The most challenging part for me when learning Pandas was the number of tutorials there were for Pandas functions such as .read_csv(). However, the tutorials tended to miss the intricacies you need when dealing with real-world data. In the beginning, I often found myself posting questions on StackOverflow to learn how to apply specific parameters. Below we have included all the parameters, along with examples for the more conceptually complex ones.
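As a minimal sketch of what a typical call looks like, the example below reads a small in-memory CSV (the data and column names are hypothetical) while setting only a handful of the parameters discussed in this article:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk (hypothetical data).
csv_data = io.StringIO(
    "id,name,score\n"
    "1,Alice,90\n"
    "2,Bob,85\n"
)

# Only a handful of parameters are usually needed; the rest keep their defaults.
df = pd.read_csv(
    csv_data,
    sep=",",         # the field delimiter (this is already the default)
    header=0,        # the header names are on the first line
    index_col="id",  # use the id column as the row index
)
print(df)
```

In practice you would pass a file path or URL as the first argument instead of the StringIO buffer.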

The full signature might seem complicated at first; however, you would typically set only a handful of parameters, as the majority are assigned default values. Nevertheless, the number of parameters speaks to the powerful and flexible nature of Pandas .read_csv().

Parameters

  • filepath_or_buffer: You can pass in a string or path object that references the CSV file you would like to read. The parameter also accepts URLs that point to a location on a remote server.

  • sep & delimiter: The delimiter parameter is an alias for sep. You can use sep to tell Pandas what to use as a delimiter; by default this is ,. However, you can also pass \t for tab-separated data, or a regular expression for more complex delimiters.
  • header: This parameter allows you to pass an integer indicating which line the CSV’s header names are on. By default, header is set to infer, which means Pandas will take the headers from row 0. If you intend to rename the default headers, set header to 0.
  • names: Here you have the opportunity to override the default column headers. To do this, first set header=0, then pass in a list containing the new column names you would like to use.

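As a quick sketch of header and names working together (the data and replacement labels below are hypothetical), this reads a file that has its own header row and replaces those headers on the way in:

```python
import io

import pandas as pd

# A CSV whose built-in headers ("a" and "b") we want to replace.
csv_data = io.StringIO("a,b\n1,2\n3,4\n")

# header=0 tells Pandas the file's own headers sit on row 0;
# names= then overrides them with our labels.
df = pd.read_csv(csv_data, header=0, names=["first", "second"])
print(df.columns.tolist())
```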
  • index_col: For those of you that are new to the DataFrame object, DataFrame rows each have a label known as an index. You can pass a column name or integer if the CSV file contains a column representing an index. Alternatively, you can pass False to tell Pandas not to use an index from your file. If False is passed, Pandas will create an index using a sequence of incrementing integers.
  • usecols: You can use this parameter to return a subset of all the columns in the file. By default, usecols is set to None which will result in Pandas returning all columns in the DataFrame. This comes in handy when you are only interested in processing certain columns.

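The two parameters above can be combined to trim a wide file down on read; a small sketch with hypothetical city data:

```python
import io

import pandas as pd

# A wider file than we need (hypothetical data).
csv_data = io.StringIO(
    "id,city,population,area\n"
    "1,Oslo,700000,454\n"
    "2,Bergen,285000,465\n"
)

# Keep only a subset of the columns, and use one of them as the index.
df = pd.read_csv(csv_data, usecols=["id", "city"], index_col="id")
print(df)
```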
  • squeeze: When dealing with a single column CSV file, you can set this parameter to True which will tell Pandas to return a Series as opposed to a DataFrame. If you are unfamiliar with Pandas Series, you can refer to Pandas Series & DataFrame Explained for an overview.
  • prefix: Here you can set column label prefixes if you haven’t specified any to use. The default behaviour, when columns aren’t specified, is to use an integer sequence to label them. Using this parameter, you could set columns 0, 1, and 2 to column_0, column_1 and column_2.

  • mangle_dupe_cols: If the CSV file you are reading contains columns with identical names Pandas will add an integer suffix to each duplicate. In the future mangle_dupe_cols will accept False which will cause the duplicate columns to overwrite each other.
  • dtype: You can use this parameter to pass a dictionary that will have column names as the keys and data types as their values. I find this handy when you have a CSV with leading zero-padded integers. Setting the correct data type for each column will also improve the overall efficiency when manipulating a DataFrame.

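The dtype parameter is worth a quick sketch, because the leading-zero case mentioned above is a common source of silent data corruption (the account numbers below are hypothetical):

```python
import io

import pandas as pd

# Zero-padded account numbers would lose their leading zeros
# if Pandas inferred them as integers.
csv_data = io.StringIO("account,balance\n00042,10.5\n00117,3.2\n")

# Keys are column names, values are the data types to apply.
df = pd.read_csv(csv_data, dtype={"account": str, "balance": float})
print(df["account"].tolist())
```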
  • engine: Currently, Pandas accepts c or python as the parsing engine.
  • converters: This follows similar logic to dtype, however, instead of passing data types, you can pass functions that will manipulate values within particular columns on read.

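A small sketch of converters in action, assuming a hypothetical file with untidy whitespace and a currency symbol to strip; each converter receives the raw string for every cell in its column:

```python
import io

import pandas as pd

csv_data = io.StringIO("name,price\n widget ,$10\n gadget ,$25\n")

df = pd.read_csv(
    csv_data,
    converters={
        "name": str.strip,                      # trim stray whitespace
        "price": lambda v: int(v.lstrip("$")),  # drop the currency symbol
    },
)
print(df)
```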
  • true_values & false_values: This parameter is quite nifty. Say for example within your CSV you had a column that contained yes and no, you could map these values to True and False. Doing this will allow you to clean some of your data when reading the file into Pandas.

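A quick sketch of the yes/no mapping described above, using hypothetical user data; when every value in a column matches one of the supplied markers, the column comes back as booleans:

```python
import io

import pandas as pd

csv_data = io.StringIO("user,active\nalice,yes\nbob,no\n")

# Map the yes/no column to True/False while reading.
df = pd.read_csv(csv_data, true_values=["yes"], false_values=["no"])
print(df)
```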
  • skipinitialspace: You can set this parameter to True, to tell Pandas that there may be rows with leading spaces after the delimiter. Pandas will then drop any leading spaces after a delimiter and before any non-delimiter character.
  • skiprows: When dealing with system generated CSV files, sometimes the file can contain parameter lines at the beginning of the file. Often we will not want to process these lines, instead, skip them. You can set skiprows to an integer which will indicate the number of lines to skip before beginning reading. Alternatively, you can supply a callable which will cause Pandas to skip a row when the function evaluates to True.
  • skipfooter: Similar to skiprows, this parameter allows you to tell Pandas how many rows to skip at the end of the file. Again, this is handy if report parameters sit at the end of the CSV file. Note that skipfooter is only supported by the python engine.
  • nrows: You can use this to set a limit to the number of rows collected from the CSV file. I find this handy during the exploratory phase when trying to get a feel for the data. It means that you can test your logic without having to load large files into memory.
  • na_values: By default, Pandas has an extensive collection of values that get mapped to NaN (Not a Number). If you have application-specific values that should be treated as missing, you can pass them to this parameter. Pandas will then map those values to NaN as well, so all missing values can be handled by the same preprocessing step.
  • keep_default_na: This parameter can either be set to True or False. If True (the default), Pandas will apply its built-in set of NaN markers when parsing, in addition to anything you pass via na_values. If False, only the values you supply via na_values (if any) will be treated as NaN.
  • na_filter: When this is True (the default), Pandas will interpret your data for missing values. As a tip, set this parameter to False when reading large files that you know don’t have any missing values, as it can speed up parsing.
  • verbose: By default, this is set to False. Setting verbose to True will output additional data to the console, such as the number of NaN values or how long specific processes took.
  • skip_blank_lines: Sometimes, the data we receive may contain blank lines. By setting skip_blank_lines to True, Pandas will skip these rows as opposed to counting them as NaN values.
  • parse_dates: Use this parameter to tell Pandas how you would like dates within the CSV file to be interpreted. You can pass True, which will cause Pandas to parse the index as a date. Alternatively, you can pass a column name or a list of columns which Pandas will use to create a date.

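The skiprows and na_values bullets above can be sketched together; the example assumes a hypothetical system-generated file with one parameter line before the header and "missing" as an application-specific marker:

```python
import io

import pandas as pd

# A system-generated file: one report-parameter line, then the real header.
csv_data = io.StringIO(
    "report generated 2021-01-01\n"
    "city,temp\n"
    "Oslo,missing\n"
    "Bergen,7\n"
)

# Skip the parameter line and map the custom marker to NaN.
df = pd.read_csv(csv_data, skiprows=1, na_values=["missing"])
print(df)
```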
  • infer_datetime_format: You can set this parameter to True which will tell Pandas to infer the date-time format. Doing this will lead to greater processing speed when combined with parse_dates.
  • keep_date_col: If you have set a value for parse_dates, you can use this parameter to retain the columns that created the data. The default behaviour is to drop these columns in place. If you don’t wish for this behaviour to occur set keep_date_col to True.

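A minimal sketch of parse_dates on a hypothetical sales file; the named column comes back as proper datetime values rather than strings:

```python
import io

import pandas as pd

csv_data = io.StringIO("date,sales\n2021-01-01,100\n2021-01-02,150\n")

# Parse the date column into datetime64 values on read.
df = pd.read_csv(csv_data, parse_dates=["date"])
print(df.dtypes)
```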
  • date_parser: If you already know the format for the date within your CSV you can pass a function to date_parser to format the date-time efficiently instead of inferring the format.
  • dayfirst: Pass True if your date-time format is DD/MM.
  • cache_dates: By default, this is set to True. Pandas will create a unique set of date-time string conversion to speed up the transformation of duplicate strings.
  • iterator: Setting this parameter to True will cause Pandas to return a TextFileReader instead of a DataFrame, on which you can call .get_chunk(n) to read the next n rows of the file.
  • chunksize: This will allow you to set the size of the chunks read from the file. Doing this comes in handy as you can loop over a portion of the DataFrame at a time instead of loading the entire file into memory.
  • compression: If the data that you are reading is compressed on disk, then you can set the type of compression for on the fly decompression.
  • thousands: This is the separator character for the thousands unit. In CSV files you can sometimes see one million represented as 1_000_000 as , is used as the delimiter. Setting thousands to _ will result in 1_000_000 reflecting as 1000000.
  • decimal: You can provide the character that represents decimals within the CSV file if it deviates from the default (.).
  • lineterminator: If you have set engine to c you can use this parameter to tell Pandas what character you expect the lines to end using.
  • quotechar: This is the character used throughout your CSV file that signifies the start and end of a quoted element.
  • quoting: Here you can set the level of quoting you would like applied to your elements if any. By default, this is 0 which set quoting to minimal; you can also set this to 1 – quote all, 2 – quote non-numeric or 3 – quote none.
  • doublequote: You can use this parameter to tell Pandas what to do when two quote characters appear consecutively within a quoted element. When True is passed, the pair is interpreted as a single literal quote character.
  • escapechar: String of length one, which Pandas will use to escape other characters.
  • comment: You can use this parameter to indicate that you don’t want the remainder of the line processed. For example, if comment was set to # and # appeared within the current line, Pandas would move to the next line after reaching #.
  • encoding: If you are consuming data that uses a character encoding other than the default, set this value to the specific encoding so the data can be correctly read.
  • dialect: A CSV dialect is a set of parameters that tell a CSV parser how to read a CSV file. Common dialects include excel, excel-tab and unix; additionally, you can create your own and pass it to Pandas.
  • error_bad_lines: If Pandas encounters a line with too many fields, typically an exception is raised and Python halts the execution. If you pass False to error_bad_lines, then any lines that would normally raise this type of exception will instead be dropped from the DataFrame.
  • warn_bad_lines: If you have set error_bad_lines to False, you can set warn_bad_lines to True which will output each line that would have raised an exception.
  • delim_whitespace: This parameter is similar to delimiter; however, it is whitespace specific. If you would like whitespace as the delimiter, then you can either set delimiter to \s+ or delim_whitespace to True.
  • low_memory: By default, Pandas has this set to True, which results in chunked processing; however, it runs the risk of mismatched type inference across chunks. You can avoid possible type mismatches by ensuring you set the dtype parameter.
  • memory_map: If you have passed a file path to filepath_or_buffer, setting this to True will map the file object directly in memory to improve efficiency when processing larger files.
  • float_precision: Here you can set the appropriate converter for the c engine for float elements.
  • storage_options: You can use this parameter to pass specific options when reading a CSV file from a remote location.
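To close out the parameter tour, the chunksize parameter described above can be sketched as follows (the numbers are hypothetical); each iteration yields a small DataFrame, so the whole file never has to sit in memory at once:

```python
import io

import pandas as pd

csv_data = io.StringIO("n\n1\n2\n3\n4\n5\n")

# Process the file two rows at a time and accumulate a running total.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["n"].sum()

print(total)  # 15
```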

Where to Next

Now that you have wrapped your head around how to use Pandas .read_csv() our recommendation would be to learn more about the Pandas data structures through Pandas Series & DataFrame Explained or learn how to navigate a DataFrame in Python Pandas Iterating a DataFrame. If you have a grasp of those concepts already, your next step should be to read either Pivoting a Pandas DataFrame or How to Combine Python, Pandas & XlsxWriter.

Summary

Learning how to use Pandas .read_csv() is a crucial skill you should have as a Data Analyst to combine various data sources. As you have seen above .read_csv() is an extremely powerful and flexible tool that you can adapt to various real-world situations to begin your data collection and analysis.

Thank you for taking the time to read our story – we hope you have found it valuable.
