If you ask which Python library data scientists use most often, the answer is almost certainly Pandas. Pandas is used for analyzing, cleaning, exploring, and manipulating datasets, and it can also run descriptive statistical analysis. Data scientists who use Python for their projects become familiar with Pandas from day one. So why am I discussing Pandas today?
In fact, there are several Pandas functions that many users tend to neglect or never fully explore. I'll walk through these functions in today's article.
Custom Functions with the apply() Method
The apply() method applies custom functions along an axis of a DataFrame or Series. It is useful for complex computations where you need to transform data with user-defined functions, making your data transformations more versatile. For example, suppose you want to clean a dataset with messy product names and prices: strip stray whitespace from the product names, replace the inch symbol with the word "Inch" with appropriate spacing, preserve words in their correct cases, and remove dollar signs from the price column. You could manage all of these tasks with loop statements to achieve the expected result.
import pandas as pd
import numpy as np

# Create data
data = {
    'product_name': [
        'LAPTOP HP 15" ',
        ' iphone 13 pro',
        'Samsung TV 55inch',
        ' airpods PRO 2nd gen ',
        'DELL monitor 27"'
    ],
    'price_string': [
        '$899.99',
        '1099.00$',
        '$799.50',
        '$249.99 ',
        '$399'
    ]
}
df = pd.DataFrame(data)

# Show original dataset
print("Original Data:")
print(df)
print("\n" + "="*50 + "\n")

# Define rules for special cases
UPPERCASE_WORDS = ['TV', 'HP', 'PRO']
PRESERVED_CASES = ['2nd']

# Method 1:
def process_method_1(df):
    clean_names, clean_prices = [], []
    for index, row in df.iterrows():
        # Clean name: handle special cases using rules
        name = row['product_name'].strip()
        name = name.replace('"', ' Inch').replace('inch', ' Inch')
        words = name.split()
        cleaned_words = []
        for word in words:
            if word.upper() in UPPERCASE_WORDS:
                cleaned_words.append(word.upper())
            elif word.lower() in [w.lower() for w in PRESERVED_CASES]:
                cleaned_words.append(word.lower())
            else:
                cleaned_words.append(word.title())
        clean_names.append(' '.join(cleaned_words))
        # Clean price: strip whitespace and remove dollar signs
        clean_prices.append(float(row['price_string'].strip().replace('$', '')))
    return pd.DataFrame({'clean_name': clean_names, 'clean_price': clean_prices})

# Show results from Method 1
print("Cleaned Data (Method 1):")
print(process_method_1(df))
print("\n" + "="*50 + "\n")

However, we can also use the apply() method to simplify the code as follows:
# Method 2:
def clean_name(name):
    name = name.strip().replace('"', ' Inch').replace('inch', ' Inch')
    return ' '.join(word.upper() if word.upper() in UPPERCASE_WORDS
                    else word.lower() if word.lower() in [w.lower() for w in PRESERVED_CASES]
                    else word.title()
                    for word in name.split())

def clean_price(price):
    return float(price.strip().replace('$', ''))

def process_method_2(df):
    return pd.DataFrame({
        'clean_name': df['product_name'].apply(clean_name),
        'clean_price': df['price_string'].apply(clean_price)
    })

# Show results from Method 2
print("Cleaned Data (Method 2):")
print(process_method_2(df))
We can observe three main differences after using the apply() method:
- Code Length: We write fewer lines of processing code when using the apply() method.
- Readability: The apply() method produces clear, single-purpose functions with straightforward transformations, while the other method requires nested loops and multiple temporary lists.
- Maintainability: The apply() version makes it easier to modify the cleaning rules by simply updating the relevant single-purpose function, as the short sketch below shows.
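As a quick illustration of that last point, here is a minimal sketch assuming a hypothetical new requirement: prices later arrive with thousands separators such as '$1,099.00'. Only clean_price needs to change, and the rest of Method 2 stays untouched.
# Hypothetical requirement: prices may contain thousands separators, e.g. '$1,099.00'.
# With the apply() version, only this one single-purpose function changes.
def clean_price(price):
    return float(price.strip().replace('$', '').replace(',', ''))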
Pandas pivot_table()
Pandas pivot_table() is another function that seems less popular than aggregation tools like groupby(). It aggregates and summarizes data by grouping values across multiple dimensions, creating a spreadsheet-style pivot table as a DataFrame.
I created mock sales data for illustration. The dataset consists of the columns sales and units, which represent sales volume and units sold respectively, along with columns such as product_category and region for segmentation. For example, if we'd like to find the average sales volume and total units sold for each region by category, we can use the following method to achieve our goal.
import pandas as pd
import numpy as np

# Create sample sales data
np.random.seed(42)
data = {
    'date': pd.date_range('2023-01-01', '2023-12-31', freq='D').repeat(3),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], size=1095),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=1095),
    'sales': np.random.uniform(100, 1000, size=1095),
    'units': np.random.randint(1, 50, size=1095)
}
df = pd.DataFrame(data)
df['month'] = df['date'].dt.strftime('%Y-%m')
print("Original dataset:")
print(df)
print("\n" + "="*80 + "\n")
# Method 1:
def analyze_method_1():
    # Calculate average sales and total units by category and region
    result = df.groupby(['product_category', 'region']).agg({
        'sales': 'mean',
        'units': 'sum'
    }).reset_index()
    # Reshape the data to get regions as columns for sales
    sales_by_region = df.groupby(['product_category', 'region'])['sales'].mean().unstack()
    # Reshape the data to get regions as columns for units
    units_by_region = df.groupby(['product_category', 'region'])['units'].sum().unstack()
    # Combine the results
    final_result = pd.DataFrame({
        'Avg_Sales_North': sales_by_region['North'],
        'Avg_Sales_South': sales_by_region['South'],
        'Avg_Sales_East': sales_by_region['East'],
        'Avg_Sales_West': sales_by_region['West'],
        'Total_Units_North': units_by_region['North'],
        'Total_Units_South': units_by_region['South'],
        'Total_Units_East': units_by_region['East'],
        'Total_Units_West': units_by_region['West']
    }).round(2)
    return final_result

print("Results by Method 1:")
print(analyze_method_1())
print("\n" + "="*80 + "\n")
All the aggregation steps above can be replaced by a single pivot_table() call.
# Method 2:
def analyze_method_2():
    # Create a single pivot table for both metrics
    result = pd.pivot_table(
        df,
        index='product_category',
        columns='region',
        values=['sales', 'units'],
        aggfunc={
            'sales': 'mean',
            'units': 'sum'
        }
    ).round(2)
    # Flatten column names for better readability
    result.columns = [f'{"Avg_Sales" if col[0] == "sales" else "Total_Units"}_{col[1]}'
                      for col in result.columns]
    return result

print("Results by Method 2:")
print(analyze_method_2())

_The more variables we need to summarize, the more benefit we get from the pivot_table() function._ We can therefore conclude that pivot_table() offers the following advantages:
- Code Efficiency: pivot_table() significantly reduces the amount of code compared with chaining multiple groupby operations while producing the same result.
- Flexibility: pivot_table() makes it easy to change dimensions and provides built-in handling of missing values.
- Performance: pivot_table() is optimized for large datasets and is typically more memory-efficient than assembling the equivalent result from several intermediate DataFrames.
- Features: pivot_table() supports automatic subtotals and margins, multiple value columns, custom aggregation functions, and built-in data reshaping (see the sketch after this list).
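To illustrate the flexibility and feature points above, here is a minimal sketch run on the same mock sales DataFrame df; the margin label 'All_Regions' and the range-style aggregation are arbitrary choices made for this example.
# A minimal sketch of pivot_table() extras on the same mock sales data:
# margins=True adds subtotal rows/columns, fill_value handles empty cells,
# and aggfunc accepts any custom function (here, a lambda for the sales range).
summary = pd.pivot_table(
    df,
    index='product_category',
    columns='region',
    values='sales',
    aggfunc=lambda s: s.max() - s.min(),   # custom aggregation: sales range
    fill_value=0,                          # built-in missing-value handling
    margins=True,                          # automatic subtotals
    margins_name='All_Regions'             # label for the subtotal row/column
).round(2)
print(summary)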
Integration with Various Data Formats
Pandas can efficiently read and write a variety of data formats and pull in data from external sources. Besides easily consumed formats such as CSV and Excel, several other data sources are often integrated with Pandas.
1. HTML Tables
# Create sample HTML
html_data = '''
<table>
  <tr>
    <th>City</th>
    <th>Population</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>New York</td>
    <td>8400000</td>
    <td>USA</td>
  </tr>
  <tr>
    <td>London</td>
    <td>8900000</td>
    <td>UK</td>
  </tr>
</table>
'''
with open('sample.html', 'w') as f:
    f.write(html_data)

# Read HTML (returns a list of DataFrames, one per table; requires an HTML parser such as lxml)
df_html = pd.read_html('sample.html')[0]
print(df_html)

2. Google BigQuery
# Requires the pandas-gbq package and authenticated Google Cloud credentials;
# 'dataset.table' is a placeholder for your own BigQuery table.
df = pd.read_gbq('SELECT * FROM dataset.table')
3. MongoDB
from pymongo import MongoClient

# Requires the pymongo package and a running MongoDB instance;
# replace 'db' and 'collection' with your own database and collection names.
client = MongoClient('mongodb://localhost:27017/')
df = pd.DataFrame(list(client.db.collection.find()))
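MongoDB documents are often nested, in which case pd.json_normalize() can flatten them before analysis. The documents below are hypothetical and simply stand in for the result of a find() call:
# Hypothetical nested documents; pd.json_normalize flattens them into columns
# such as 'address.city' instead of storing dicts in a single column.
docs = [
    {'name': 'Alice', 'address': {'city': 'London', 'zip': 'E1 6AN'}},
    {'name': 'Bob', 'address': {'city': 'New York', 'zip': '10001'}},
]
df_nested = pd.json_normalize(docs)
print(df_nested)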
4. Parquet
import pandas as pd
import numpy as np

# Create sample dataset
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D').repeat(3)
dates = dates[:1000]
data = {
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'], size=1000),
    'category': 'Electronics',  # All products are electronics
    'sales': np.random.randint(300, 2000, size=1000)
}
df = pd.DataFrame(data)

# Write to Parquet (requires the pyarrow or fastparquet engine)
df.to_parquet('sales_large.parquet')

# Read from Parquet
df_parquet = pd.read_parquet('sales_large.parquet')
print("Parquet data sample (first 5 rows):")
print(df_parquet.head())

# Read only specific columns and show a summary
df_parquet_summary = pd.read_parquet('sales_large.parquet', columns=['product', 'sales'])
print("\nSales summary by product:")
print(df_parquet_summary.groupby('product')['sales'].agg(['count', 'mean', 'sum']))

5. SQL
from sqlalchemy import create_engine

# Create SQLite database
engine = create_engine('sqlite:///sales_large.db')
df.to_sql('sales', engine, if_exists='replace', index=False)

# Query showing monthly sales by product
query = """
SELECT
    strftime('%Y-%m', date) as month,
    product,
    COUNT(*) as num_sales,
    AVG(sales) as avg_sale,
    SUM(sales) as total_sales
FROM sales
GROUP BY month, product
ORDER BY month, total_sales DESC
LIMIT 10
"""
df_sql_query = pd.read_sql_query(query, engine)
print("Monthly sales summary (top 10 results):")
print(df_sql_query)

6. Apache Spark
from pyspark.sql import SparkSession

# Initialize Spark (requires the pyspark package)
spark = (SparkSession.builder
         .appName("PandasSparkIntegration")
         .master("local[*]")
         .getOrCreate())

# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(df)
print("Converted to Spark DataFrame:")
spark_df.show(5)

# Do some simple analysis in Spark
spark_df.groupBy("product").sum("sales").show()

# Convert back to a Pandas DataFrame
new_pandas_df = spark_df.toPandas()
print("\nConverted back to Pandas DataFrame:")
print(new_pandas_df.head())

# Clean up
spark.stop()

Conclusion
Pandas excels at processing data efficiently, handling data from a wide range of sources and formats, and supporting conclusions drawn from statistical analysis. The three features introduced today help your work stand out by improving code efficiency, readability, and maintainability, and by making more data sources accessible.
If you found this article helpful, please consider giving it some claps! Thank you for reading!