
Data Analytics – Time Saving Tips

Tips and tricks you can use to save time as a Data Analyst and increase your productivity.

Photo by Marc Sendra Martorell on Unsplash

Introduction

Today I am going to share several tools, packages and code snippets that I have used and developed during my time as a Data Analyst. As a Data Analyst, there will be times when you are required to rerun the same report, run a similar report with different parameters, or apply the same statistical analysis to different datasets. Below I will give a brief overview of some of the tools that you can incorporate into your workflow to increase your productivity.

Version Control

For version control, I am currently using Atlassian Bitbucket. Bitbucket is a Git repository management platform which you can use to track your code changes or collaborate as a team on the same code base. At the moment, I prefer Bitbucket because it has an incredibly powerful search function which allows you to input query-style searches with logical operators. Version control has also made me faster at rerunning the same reports: you can quickly locate the EOFY reports that you run each year instead of rewriting them.

Pandas

Python Pandas is a tool synonymous with Data Science and Data Analytics. The Pandas package provides a broad set of flexible tools that you can use to collect, join, manipulate and analyse a vast range of data types and sources. For today’s story, we aren’t going to learn about Pandas; instead, I will show you some snippets that I regularly use in my workflow. If you are unfamiliar with Pandas, please jump to my profile where I have written many in-depth guides and stories for Towards Data Science to get you started with Pandas.

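As a runnable illustration of the three-step collection pattern the next paragraph describes: this is only a sketch, in which sqlite3 stands in for the cx_Oracle connection to the payroll database, and the `get_cursor` helper, table and sample rows are my assumptions.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def get_cursor():
    # Hypothetical stand-in for the custom connection helper; the real
    # one returns a cx_Oracle cursor against the Ascender Payroll
    # database, but sqlite3 keeps this sketch runnable anywhere.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (employee_no INTEGER, name TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)",
                     [(1001, "Avery"), (1002, "Blake")])
    return conn.cursor()


# In practice the SQL would live in a version-controlled .sql file.
sql_file = Path(tempfile.gettempdir()) / "report.sql"
sql_file.write_text("SELECT employee_no, name FROM employees")

cursor = get_cursor()            # 1. create a connection, get a cursor
sql = sql_file.read_text()       # 2. open the SQL to execute
cursor.execute(sql)              # 3. build a DataFrame from the records
df = pd.DataFrame(cursor.fetchall(),
                  columns=[col[0] for col in cursor.description])
print(df)
```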
The above Python snippet uses a custom package that I have written to simplify my data collection. In my role, the majority of data comes from directly querying the Ascender Payroll database. The first line creates a database connection and returns a cx_Oracle cursor object. The second line opens the SQL file that I intend to execute using the cursor. The third line creates a Pandas DataFrame object from the SQL records returned by the cursor.

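A hedged reconstruction of what the structure of build_data_frame might look like, based purely on the description that follows: sqlite3's `?` placeholder stands in for cx_Oracle bind parameters, and the sample table is invented for the example.

```python
import sqlite3

import pandas as pd


def build_data_frame(cursor, sql, args=None):
    # Assumed reconstruction of data_utils.build_data_frame: execute the
    # SQL (with optional bind parameters) and wrap the records in a
    # DataFrame labelled with the cursor's column names.
    cursor.execute(sql, args or [])
    columns = [column[0] for column in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)


# Generic SQL narrowed to one staff member via a bind parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_no INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1001, "Avery"), (1002, "Blake")])
df = build_data_frame(conn.cursor(),
                      "SELECT * FROM employees WHERE employee_no = ?",
                      args=[1001])
print(df)
```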
The above snippet shows the structure of data_utils.build_data_frame. It accepts a database cursor, an SQL string to be executed and, optionally, a list of arguments. The arguments parameter allows you to pass values into the SQL using bind parameters. I use this when I need to run generic SQL for a specific condition, such as for an individual staff member.

This function also calls clean_column_names, which removes special characters from column names and maps similar column names to a consistent reference, for example emp_no and emp# both become employee_no. As a precaution, I have also included check_for_forbidden_fields, which strips highly confidential personally identifiable information. Whilst we are vigilant about the columns we select, this function ensures that we never include this type of information in our DataFrames.
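Neither helper's source is shown here, so the following is only a sketch of the behaviour described; the alias map, the forbidden-field list and the regex are hypothetical examples, not the actual implementation.

```python
import re

import pandas as pd

# Hypothetical examples of the mappings described in the text.
COLUMN_ALIASES = {"emp_no": "employee_no", "emp": "employee_no"}
FORBIDDEN_FIELDS = ["tax_file_no", "date_of_birth"]


def clean_column_names(df):
    # Strip special characters, lower-case, and map known aliases to a
    # consistent reference (e.g. emp_no and emp# both become employee_no).
    cleaned = [re.sub(r"[^0-9a-z_]", "", c.lower().strip()) for c in df.columns]
    df.columns = [COLUMN_ALIASES.get(c, c) for c in cleaned]
    return df


def check_for_forbidden_fields(df):
    # Drop any highly confidential personally identifiable columns that
    # slipped into the selection.
    return df.drop(columns=[c for c in FORBIDDEN_FIELDS if c in df.columns])


df = pd.DataFrame({"Emp#": [1001], "Date_Of_Birth": ["1990-01-01"]})
df = check_for_forbidden_fields(clean_column_names(df))
print(df.columns.tolist())
```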

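The duplicate check the next paragraph walks through can be sketched as follows; the print format and the returned count are my assumptions, and the sample concatenation simply manufactures the kind of duplicates being described.

```python
import pandas as pd


def check_for_duplicates(df, columns=None):
    # Assumed reconstruction: count duplicated rows, either across the
    # listed columns or, if none are given, across the whole DataFrame.
    duplicated = df.duplicated(subset=columns)
    count = int(duplicated.sum())
    if count:
        print(f"Found {count} duplicated row(s):")
        print(df[duplicated])
    else:
        print("No duplicates found.")
    return count


# A poorly configured concatenation is one way duplicates sneak in.
staff = pd.DataFrame({"employee_no": [1001, 1002],
                      "name": ["Avery", "Blake"]})
combined = pd.concat([staff, staff], ignore_index=True)

check_for_duplicates(combined)                           # whole-row check
check_for_duplicates(combined, columns=["employee_no"])  # key-column check
```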
One thing that has caught me out in the past is the presence of duplicates within my analysis. Whilst duplicates can sometimes be valid, in my case I had introduced them through a poorly configured concatenation of DataFrames. The above snippet, whilst verbose, ensures that I won't miss duplicates during my initial exploratory data analysis. The check_for_duplicates function can accept a list of column labels, which is passed to Pandas' built-in .duplicated() method. If no list is passed, the function checks the entire DataFrame for duplicated rows.

Pandas Profiling

The final tool we will look at today is [Pandas Profiling](https://towardsdatascience.com/learning-pandas-profiling-fc533336edc7). Pandas Profiling is an open-source tool written in Python that generates interactive HTML reports which detail the types of data within the dataset, highlight missing values, provide descriptive statistics (including mean, standard deviation and skewness), create histograms and surface any potential correlations. I have written an in-depth guide for Pandas Profiling which you can use as a reference manual for understanding the statistical analysis returned.

Summary

As Data Analysts, we have an incredibly broad range of tools that we can incorporate into our workflows to increase the speed and accuracy of our work. We can use Bitbucket to maintain our codebases collaboratively, utilise Pandas to speed up our data collection and apply Pandas Profiling to explore new datasets efficiently.

Thank you for taking the time to read my story – I hope that you have found it valuable!

