Introduction
As a Data Analyst or Data Scientist, you will frequently have to combine and analyse data from various data sources. A format I am commonly asked to analyse is the CSV file. CSV files are popular within the corporate world as they are simple, easy to use and often the output format of corporate systems. Today we will demonstrate how to use Python and Pandas to open and read a CSV file on your local machine.
Getting Started
You can install Pandas via pip from PyPI. If this is your first time installing Python packages, please refer to Pandas Series & DataFrame Explained or Python Pandas Iterating a DataFrame. Both of these articles will provide you with the installation instructions and background knowledge for today's article.
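If Pandas is not already available in your environment, the standard installation command is:

```
pip install pandas
```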
Syntax
The most challenging part for me when learning Pandas was the number of tutorials available for functions such as `.read_csv()`. However, the tutorials tended to miss the intricacies you need when dealing with real-world data. In the beginning, I often found myself posting questions on StackOverflow to learn how to apply specific parameters. Below we have included all the parameters, along with examples for the more conceptually complex ones.
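For orientation, here is a condensed sketch of a typical `.read_csv()` call. The keywords shown are a small selection of those covered below, each at its default value; `data.csv` is a placeholder file name rather than anything from a real project:

```python
import pandas as pd

# A condensed, illustrative call; every keyword below is shown at its
# default value, and "data.csv" is a placeholder file name.
df = pd.read_csv(
    "data.csv",         # filepath_or_buffer: path, URL or file-like object
    sep=",",            # column delimiter
    header="infer",     # which row holds the column names
    index_col=None,     # column to use as the row index
    usecols=None,       # subset of columns to load
    dtype=None,         # per-column data types
    parse_dates=False,  # date-parsing behaviour
    nrows=None,         # limit on the number of rows read
)
```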
The above syntax might seem complicated at first; however, you would typically set only a handful of parameters, as the majority are assigned default values. Nevertheless, the sheer number of parameters speaks to the powerful and flexible nature of Pandas `.read_csv()`.
Parameters
- `filepath_or_buffer`: You can pass in a string or path object that references the CSV file you would like to read. The parameter also accepts URLs that point to a location on a remote server.
- `sep` & `delimiter`: The `delimiter` parameter is an alias for `sep`. You can use `sep` to tell Pandas what to use as a delimiter; by default this is `,`. You can also pass a regular expression, such as `\t` for tab-separated data.
- `header`: This parameter allows you to pass an integer indicating which line the CSV's header names are on. By default, `header` is set to `infer`, which means Pandas will take the headers from row 0. If you intend on renaming the default headers, set `header` to `0`.
- `names`: Here you have the opportunity to override the default column headers. To do this, first set `header=0`, then pass in an array containing the new column names you would like to use (see the examples after this list).
- `index_col`: For those of you that are new to the DataFrame object, DataFrame rows each have a label known as an index. You can pass a column name or integer if the CSV file contains a column representing an index. Alternatively, you can pass `False` to tell Pandas not to use an index from your file; Pandas will then create an index using a sequence of incrementing integers.
- `usecols`: You can use this parameter to return a subset of all the columns in the file. By default, `usecols` is set to `None`, which will result in Pandas returning all columns in the DataFrame. This comes in handy when you are only interested in processing certain columns.
- `squeeze`: When dealing with a single-column CSV file, you can set this parameter to `True`, which will tell Pandas to return a Series as opposed to a DataFrame. If you are unfamiliar with Pandas Series, you can refer to Pandas Series & DataFrame Explained for an overview.
- `prefix`: Here you can set a prefix for column labels when none have been specified. The default behaviour, when columns aren't specified, is to label them with an integer sequence; with `prefix='column_'`, columns `0`, `1` and `2` would instead be labelled `column_0`, `column_1` and `column_2`.
- `mangle_dupe_cols`: If the CSV file you are reading contains columns with identical names, Pandas will add an integer suffix to each duplicate. In the future, `mangle_dupe_cols` will accept `False`, which will cause the duplicate columns to overwrite each other.
- `dtype`: You can use this parameter to pass a dictionary that has column names as the keys and data types as their values. I find this handy when you have a CSV with leading zero-padded integers. Setting the correct data type for each column will also improve the overall efficiency of manipulating the DataFrame.
- `engine`: Currently, Pandas accepts `c` or `python` as the parsing engine.
- `converters`: This follows similar logic to `dtype`; however, instead of passing data types, you can pass functions that will manipulate values within particular columns on read (both are sketched in the examples after this list).
- `true_values` & `false_values`: These parameters are quite nifty. Say, for example, your CSV had a column that contained `yes` and `no`; you could map these values to `True` and `False` (see the examples after this list). Doing this allows you to clean some of your data as the file is read into Pandas.
- `skipinitialspace`: You can set this parameter to `True` to tell Pandas that there may be rows with leading spaces after the delimiter. Pandas will then drop any spaces that appear after a delimiter and before any non-delimiter character.
- `skiprows`: When dealing with system-generated CSV files, the file can contain report-parameter lines at the beginning. Often we will not want to process these lines but skip them instead. You can set `skiprows` to an integer, which indicates the number of lines to skip before reading begins. Alternatively, you can supply a callable, which will cause Pandas to skip a row when the function evaluates to `True` (see the examples after this list).
- `skipfooter`: Similar to `skiprows`, this parameter allows you to tell Pandas how many rows to skip at the end of the file. Again, this is handy if report parameters are at the end of the CSV file.
- `nrows`: You can use this to set a limit on the number of rows collected from the CSV file. I find this handy during the exploratory phase when trying to get a feel for the data. It means that you can test your logic without having to load large files into memory.
- `na_values`: By default, Pandas has an extensive collection of values that get mapped to `NaN` (Not a Number). If you have application-specific values that you need to clean and map, you can pass them to this parameter, capturing every value that should be treated as `NaN` so they can all flow through the same preprocessing.
- `keep_default_na`: This parameter can be set to `True` or `False`. If `True`, Pandas will include its default `NaN` values when parsing and mask matching entries as `NaN` in the DataFrame. If `False`, the defaults are ignored, and values such as `NA` are retained as ordinary strings.
- `na_filter`: You can set this to `True` when you would like Pandas to interpret your data for missing values. As a tip, set this parameter to `False` when reading large files that you know don't have any missing values.
- `verbose`: By default, this is set to `False`. Setting `verbose` to `True` will output additional information to the console, such as the number of `NaN` values or how long specific processes took.
- `skip_blank_lines`: Sometimes, the data we receive may contain blank lines. By setting `skip_blank_lines` to `True`, Pandas will skip these rows as opposed to counting them as `NaN` values.
- `parse_dates`: Use this parameter to tell Pandas how you would like dates within the CSV file to be interpreted. You can pass `True`, which will cause Pandas to parse the index as a date. Alternatively, you can pass a column name or a list of columns which Pandas will use to create a date (see the examples after this list).
- `infer_datetime_format`: You can set this parameter to `True`, which will tell Pandas to infer the date-time format. Doing this can lead to greater processing speed when combined with `parse_dates`.
- `keep_date_col`: If you have set a value for `parse_dates`, you can use this parameter to retain the columns that created the date. The default behaviour is to drop these columns in place; if you don't wish for this to occur, set `keep_date_col` to `True`.
- `date_parser`: If you already know the format of the dates within your CSV, you can pass a function to `date_parser` to parse the date-times efficiently instead of having the format inferred.
- `dayfirst`: Pass `True` if your dates are written day-first, such as `DD/MM/YYYY`.
- `cache_dates`: By default, this is set to `True`. Pandas will create a unique set of date-time string conversions to speed up the transformation of duplicate strings.
- `iterator`: Setting this parameter to `True` will allow you to call `.get_chunk()` on the returned reader, which returns the requested number of rows as a DataFrame.
- `chunksize`: This will allow you to set the size of the chunks read from the file. Doing this comes in handy as you can loop over a portion of the data at a time instead of loading the entire DataFrame into memory at once (see the examples after this list).
- `compression`: If the data that you are reading is compressed on disk, you can set the type of compression for on-the-fly decompression.
- `thousands`: This is the separator character for the thousands unit. In CSV files you can sometimes see one million represented as `1_000_000`, since `,` is used as the delimiter. Setting `thousands` to `_` will result in `1_000_000` being read as `1000000`.
- `decimal`: You can provide the character that represents decimals within the CSV file if it deviates from `.`.
- `lineterminator`: If you have set `engine` to `c`, you can use this parameter to tell Pandas what character you expect the lines to end with.
- `quotechar`: This is the character used throughout your CSV file that signifies the start and end of a quoted element.
- `quoting`: Here you can set the level of quoting you would like applied to your elements, if any. By default, this is 0, which sets quoting to minimal; you can also set it to 1 (quote all), 2 (quote non-numeric) or 3 (quote none).
- `doublequote`: You can use this parameter to tell Pandas what to do when two quote characters appear within a quoted element. When `True` is passed, the doubled quote characters are read as a single quote character.
- `escapechar`: A string of length one, which Pandas will use to escape other characters.
- `comment`: You can use this parameter to indicate that you don't want the remainder of a line processed. For example, if `comment` is set to `#`, Pandas will ignore everything from a `#` to the end of that line.
- `encoding`: If you are consuming data in a language other than English, set this value to the specific character encoding so the data can be correctly read.
- `dialect`: A CSV dialect is a set of parameters that tell a CSV parser how to read a CSV file. Common dialects include `excel`, `excel-tab` and `unix`; additionally, you can create your own and pass it to Pandas.
- `error_bad_lines`: If Pandas encounters a line with too many fields, an exception is typically raised and Python halts the execution. If you pass `False` to `error_bad_lines`, any lines that would generally raise this type of exception will be dropped from the DataFrame.
- `warn_bad_lines`: If you have set `error_bad_lines` to `False`, you can set `warn_bad_lines` to `True`, which will output each line that would have raised an exception.
- `delim_whitespace`: This parameter is similar to `delimiter`; however, it is whitespace-specific. If you would like whitespace as the delimiter, you can either set `delimiter` to `\s+` or set `delim_whitespace` to `True`.
- `low_memory`: By default, Pandas has this set to `True`, which results in chunked processing of the file but runs the risk of mismatched type inference between chunks. You can avoid possible type mismatches by setting the `dtype` parameter.
- `memory_map`: If set to `True` and a file path was passed to `filepath_or_buffer`, Pandas maps the file directly into memory to improve efficiency when processing larger files.
- `float_precision`: Here you can choose which converter the `c` engine uses for floating-point values.
- `storage_options`: You can use this parameter to pass additional options, such as credentials, when reading a CSV file from a remote location.
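Examples
To make some of the parameters above concrete, here are a few short sketches. All file names and column names below (`sales.csv`, `order_id` and so on) are invented for illustration, so adapt them to your own data. First, `header`, `names`, `index_col` and `usecols`:

```python
import pandas as pd

# Assume "sales.csv" (a made-up file) begins with the header row:
#   order_id,region,amount

# Replace the original headers on read and use the new id column as the index.
df = pd.read_csv(
    "sales.csv",
    sep=",",
    header=0,                       # row 0 holds the headers being replaced
    names=["id", "area", "total"],  # replacement column labels
    index_col="id",
)

# Load only the columns we care about, under their original names.
subset = pd.read_csv("sales.csv", usecols=["region", "amount"])
```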
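Next, `dtype` and `converters`. `dtype` pins a column's type on read, which protects zero-padded codes, while `converters` applies a function to each value as it is parsed; the file and column names here are again hypothetical:

```python
import pandas as pd

df = pd.read_csv(
    "accounts.csv",               # hypothetical file
    dtype={"account_code": str},  # keep leading zeros such as "00042"
    converters={
        # Strip stray whitespace from each name as it is read.
        "customer_name": lambda value: value.strip(),
    },
)
```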
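Cleaning values on read with `true_values`, `false_values` and `na_values`, assuming an invented survey file with `yes`/`no` answers and a few application-specific missing-value markers:

```python
import pandas as pd

df = pd.read_csv(
    "survey.csv",                        # hypothetical file
    true_values=["yes", "Yes"],          # map these strings to True
    false_values=["no", "No"],           # ...and these to False
    na_values=["N/A", "missing", -999],  # extra values to treat as NaN
)
```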
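Skipping report furniture with `skiprows` and `skipfooter`, and sampling with `nrows`. Note that `skipfooter` is only supported by the `python` engine; the file layout assumed here is invented:

```python
import pandas as pd

# Skip three report-parameter lines at the top and a two-line footer.
df = pd.read_csv(
    "report.csv",     # hypothetical system-generated report
    skiprows=3,
    skipfooter=2,
    engine="python",  # skipfooter is unsupported by the c engine
)

# During exploration, peek at the first 100 data rows only.
sample = pd.read_csv("report.csv", skiprows=3, nrows=100)
```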
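Date handling with `parse_dates`, `dayfirst` and `infer_datetime_format`, assuming an invented orders file with an `ordered_at` column written day-first:

```python
import pandas as pd

df = pd.read_csv(
    "orders.csv",                  # hypothetical file
    parse_dates=["ordered_at"],    # parse this column as datetimes
    dayfirst=True,                 # dates are written DD/MM/YYYY
    infer_datetime_format=True,    # work out the format once, then reuse it
)

print(df["ordered_at"].dtype)      # datetime64[ns] if parsing succeeded
```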
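Finally, chunked reading with `chunksize` and `iterator`, which keeps memory usage flat on large files; `big.csv` is again a placeholder:

```python
import pandas as pd

# Process a hypothetical large file in 10,000-row chunks.
total = 0
for chunk in pd.read_csv("big.csv", chunksize=10_000):
    total += len(chunk)            # each chunk is an ordinary DataFrame
print(f"{total} rows processed")

# Or pull rows on demand with iterator=True.
reader = pd.read_csv("big.csv", iterator=True)
first_500 = reader.get_chunk(500)  # next 500 rows as a DataFrame
```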
Where to Next
Now that you have wrapped your head around how to use Pandas `.read_csv()`, our recommendation would be to learn more about the Pandas data structures through Pandas Series & DataFrame Explained, or learn how to navigate a DataFrame in Python Pandas Iterating a DataFrame. If you have a grasp of those concepts already, your next step should be to read either Pivoting a Pandas DataFrame or How to Combine Python, Pandas & XlsxWriter.
Summary
Learning how to use Pandas `.read_csv()` is a crucial skill for any Data Analyst who needs to combine various data sources. As you have seen above, `.read_csv()` is an extremely powerful and flexible tool that you can adapt to a range of real-world situations to begin your data collection and analysis.
Thank you for taking the time to read our story – we hope you have found it valuable.