
Pandas is a highly-efficient and widely used Data Analysis tool. The core data structure of Pandas is DataFrame which represents data in tabular form with labeled rows and columns. There are many powerful and flexible functions and methods of DataFrame that ease and expedite the data cleaning and analysis process.
Note: It may not be proper English to type "DataFrame" but it exists as DataFrame in Pandas syntax. Therefore, throughout this post, I will type it as DataFrame so that you will get used to it. If you type it as "dataframe" or "Dataframe", you will get an error.
Data Science projects usually require us to gather data from different sources. Hence, as part of data preparation, we may need to combine DataFrames. In this post, I will explain the different ways to combine DataFrames.
Let’s first create two DataFrames:


Concat
One way to combine or concatenate DataFrames is concat() function. It can be used to concatenate DataFrames along rows or columns by changing the axis parameter. The default value of the axis parameter is 0, which indicates combining along rows.


As you can see in the first figure above, indices of individual DataFrames are kept. In order to change it and re-index the combined DataFrame, ignore_index parameter is set as True.

join parameter of concat() function determines how to combine DataFrames. The default value is ‘outer’ returns all indices in both DataFrames. If ‘inner’ option is selected, only the rows with shared indices are returned. I will change the index of df2 so that you can see the difference between ‘inner’ and ‘outer’.


Pandas also provides ways to label DataFrames so that we know which part comes from which DataFrame. We just pass the list of combined DataFrames in order using keys parameter.

It also makes it easier to access different parts of DataFrames conveniently:

One important note about concat() function is that it makes a copy of the data. To prevent making unnecessary copies, the copy parameter needs to set as False. The default value is True.
append() function is also used to combine DataFrames. It can be seen as a particular case of concat() function (axis=0 and join=’outer’) so I will not cover it in detail but will just give an example to show the syntax.


Merge
Another widely used function to combine DataFrames is merge(). Concat() function simply adds DataFrames on top of each other or adds them side-by-side. It is more like appending DataFrames. Merge() combines DataFrames based on values in shared columns. Merge() function offers more flexibility compared to concat() function. It will be clearer when you see the examples.
Let’s first take a look at the initial DataFrames:

The on parameter selects which column or index level is used to merge.

The column names do not have to be the same. Our focus is the values in columns. Assume two DataFrames have common values in a column that you want to use to merge these DataFrames but the column names are different. In this case, instead of on parameter, you can use left_on and right_on parameters. To show the difference, I will change the column name in df2 and then use merge:


Although the returned values are the same in column_a and new_column_a, merged DataFrame includes both columns due to having different names.
You can also pass multiple values to on parameter. The returned DataFrame only includes rows that have the same values in all the columns passed to on parameter.

df1 and df2 are merged based on the common values in column_a. It’s time to introduce how parameter of merge(). As the name suggests, it indicates how you want to combine. The possible values for how are ‘inner’, ‘outer’, ‘left’, ‘right’.
- inner: only rows with same values in the column specified by on parameter (default value of how parameter)
- outer: all the rows
- left: all rows from left DataFrame
- right: all rows from right DataFrame
The concept of how is more clear in the figure below. If you are familiar with SQL, the logic is same as SQL joins.
The figures below represents the concept of how parameters more clearly.



‘outer’, ‘left’, and ‘right’ options include data that is not present in one of the DataFrames. The missing parts are automatically filled with NaN values. NaN is the standard representation of missing values in Pandas.


The default value of how is ‘inner’ so you don’t have to explicitly write in the function. ‘Inner’ only returns the rows with common values in column_a.

When ‘outer’ is chosen for how parameter, merged DataFrame includes all values of column_a in both DataFrames. However, common values (column_a = 1 and column_a = 2) are not duplicated.

When ‘left’ is chosen for how parameter, merged DataFrame includes all rows from left DataFrame. The columns that come from right DataFrame are filled with NaN values if the value in column_a (the column that is passed to on parameter) is not present in right DataFrame. ‘right’ option is rarely used because you can just change the order of DataFrames in merge function (instead of (df1,df2) use (df2,df1)).
You may have noticed that a suffix is added to column names that are same in both DataFrames. It is useful to distinguish which column comes from which DataFrame. You can specify the suffix to be added using suffixes parameter.

Thank you for reading. Please let me know if you have any feedback.