
The Perfect Python Cheatsheet for Beginners

Cheatsheets are lifesavers for aspiring and early-career Data Scientists.

Photo by Tsaiwen Hsu on Unsplash

Python is one of the easiest programming languages you can learn to become a Data Scientist (the other popular choice is R), and there are plenty of free online resources to get you started. The process of becoming a Data Scientist can be broadly split into three subjects: learning a programming language (usually Python), data analysis (data visualization, math and statistics) and machine learning (algorithms that improve automatically through experience).

However, the first step, learning how to program, can take a long time, and some aspiring Data Scientists may feel discouraged from pursuing their professional goals. Why? They want to solve problems through data analysis and insights. They don’t want to spend hundreds of hours on repetitive Python exercises to achieve perfection; they want to go straight into the action. So, how can you advance and solve real-life data problems without mastering every Python command?

Well, the most common approach is to rely on Stack Overflow. But, at the beginning of the Data Science journey, people might not always find the answer they are expecting. This is because most solutions posted on Stack Overflow are provided by experts or experienced Data Scientists who have forgotten what it is like to start from scratch. This is a cognitive bias called the curse of knowledge, and it is common among experts trying to communicate with people from different backgrounds [1]. I often see unnecessarily advanced programming answers to straightforward questions. Consequently, Data Science students spend a lot of time looking for a beginner-friendly solution rather than working on what matters the most: solving problems.

So, to help you become more productive and spend less time searching for simple answers, I have decided to share a comprehensive Python cheatsheet for Data Science students and early career data professionals. I hope the cheatsheet below will allow you to invest your time in solving problems and generating data insights.

Content:

  1. Collections (Lists, Dictionary, Range & Enumerate)
  2. Types (String, Regular Expressions, Numbers, Datetime)
  3. Syntax (Lambda, Comprehension, Map, Filter & Reduce, If-Else)
  4. Libraries (NumPy, Pandas)
  5. Where next?

1. Collections

Lists

A list is an ordered and mutable container; it is arguably the most common data structure in Python. Understanding how lists work becomes even more relevant as you work on data cleaning and create for-loops.
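As a minimal sketch of the list operations beginners reach for most often, the snippet below appends, sorts, indexes, slices and loops. The `scores` values are made up purely for illustration.

```python
# Hypothetical sample data, used only for illustration
scores = [72, 88, 95, 61]

scores.append(80)      # add one item to the end of the list
scores.sort()          # sort in place, ascending
first = scores[0]      # indexing starts at zero
top_two = scores[-2:]  # slicing: the last two items

# A simple for-loop that keeps only the values passing a check,
# a pattern that comes up constantly in data cleaning
cleaned = []
for score in scores:
    if score >= 70:
        cleaned.append(score)

print(scores)   # [61, 72, 80, 88, 95]
print(cleaned)  # [72, 80, 88, 95]
```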

Dictionary

A dictionary is a Python data structure that uses keys for indexing. Dictionaries are collections of key-value pairs (they preserve insertion order since Python 3.7) and are vital for Data Scientists, especially those interested in web scraping, for example when extracting data from YouTube channels.
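Here is a short sketch of everyday dictionary usage. The `video` record is a made-up example of the kind of data you might collect when scraping; none of the field names come from a real API.

```python
# Hypothetical record, e.g. one item collected while scraping a video page
video = {"title": "Intro to Python", "views": 1500}

video["likes"] = 120                 # add a new key-value pair
views = video["views"]               # look up a value by its key
comments = video.get("comments", 0)  # .get() returns a default if the key is missing

# Loop over key-value pairs with .items()
summary = [f"{key}: {value}" for key, value in video.items()]

print(summary)  # ['title: Intro to Python', 'views: 1500', 'likes: 120']
```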

Range & Enumerate

The range function returns a sequence of numbers that starts at 0 and increments by 1 (by default) and stops before a specified number. The enumerate function takes an iterable (for example, a list or a tuple) and returns an enumerate object that pairs each item with a counter. Both functions are useful in for-loops.
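A quick sketch of both functions in action; the `cities` tuple is just illustrative data.

```python
# range(start, stop, step): stop is exclusive
first_three = list(range(3))   # [0, 1, 2]
evens = list(range(0, 10, 2))  # [0, 2, 4, 6, 8]

# Hypothetical sample data
cities = ("London", "Paris", "Rome")

# enumerate pairs each item with a counter (starting at 0 unless told otherwise)
numbered = list(enumerate(cities, start=1))

labels = []
for i, city in enumerate(cities):
    labels.append(f"{i}-{city}")

print(numbered)  # [(1, 'London'), (2, 'Paris'), (3, 'Rome')]
print(labels)    # ['0-London', '1-Paris', '2-Rome']
```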

2. Types

String

Strings are objects containing a sequence of characters. String methods always return new values and will not change the original string.

Don’t forget to use some other basic methods such as lower(), upper(), capitalize() and title().
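The sketch below shows a handful of common string methods on a made-up `raw` value. Notice that `raw` itself is never modified; every method hands back a new string.

```python
# Hypothetical messy input, used only for illustration
raw = "  data science, python  "

stripped = raw.strip()                           # remove leading/trailing whitespace
parts = stripped.split(", ")                     # split into a list
replaced = stripped.replace("python", "Python")  # substitute a substring
joined = " & ".join(parts)                       # join a list back into one string
titled = stripped.title()                        # capitalise the first letter of each word

print(stripped)  # data science, python
print(parts)     # ['data science', 'python']
print(joined)    # data science & python
print(titled)    # Data Science, Python
```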

Regular Expressions (Regex)

A regular expression is a sequence of characters that describes a search pattern. In Pandas, regular expressions are integrated with vectorized string methods, making finding and extracting patterns of characters easier. Learning how to use Regex makes data cleaning less time-consuming for Data Scientists.

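By default, the special sequences for whitespace (\s), digits (\d) and alphanumeric characters (\w) match characters in any alphabet; pass the flags=re.ASCII argument to restrict them to ASCII. Also, use a capital letter for negation: \S, \D and \W match everything that their lowercase versions do not.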
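To make this concrete, here is a small sketch using Python’s built-in `re` module (the same patterns work with Pandas string methods). The `text` value is invented for the example.

```python
import re

# Hypothetical sample text; \u00eb is the letter e with a diaeresis
text = "Order 66 shipped on 2021-03-04 to Zo\u00eb"

digits = re.findall(r"\d+", text)                    # every run of digits
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)  # first date-like pattern
year = match.group(1)                                # first captured group
digits_only = re.sub(r"\D", "", text)                # \D (capital) = anything that is NOT a digit

# \w matches letters from any alphabet unless re.ASCII is set
words_unicode = re.findall(r"\w+", text)
words_ascii = re.findall(r"\w+", text, flags=re.ASCII)

print(digits)             # ['66', '2021', '03', '04']
print(year)               # 2021
print(words_ascii[-1])    # Zo  (the non-ASCII letter is no longer matched)
```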

Numbers

Math & Basic Statistics

Python has built-in modules, math and statistics, that Data Scientists can use for mathematical tasks and basic statistics. In Pandas, the describe() function computes a summary of statistics for a DataFrame’s columns.
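A minimal sketch using only the standard library’s `math` and `statistics` modules; the `values` list is made up for illustration.

```python
import math
import statistics

# Hypothetical sample data
values = [2, 4, 4, 5, 10]

root = math.sqrt(16)          # square root
rounded_up = math.ceil(2.1)   # round up to the nearest integer
rounded_down = math.floor(2.9)

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
spread = statistics.stdev(values)  # sample standard deviation

print(root, rounded_up, rounded_down)  # 4.0 3 2
print(mean, median, mode)              # 5 4 4
```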

Datetime

The datetime module provides the date, time, datetime and timedelta classes (often abbreviated as d, t, dt and td). These classes are immutable and hashable: an object’s value cannot change after it is created, so Python can compute a stable hash for it and dictionaries can use it as a key. The datetime module is crucial to Data Analysts who frequently encounter data sets showing "time of purchase" or "how long users spent on a specific page."
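Below is a small sketch of the operations that come up most with timestamps: creating, parsing, formatting and subtracting them. All dates and the `sales_by_day` dictionary are invented for the example.

```python
from datetime import date, datetime, timedelta

# Hypothetical "time of purchase"
purchase = datetime(2021, 3, 4, 14, 30)

# Add a duration with timedelta, then measure the gap between two datetimes
delivered = purchase + timedelta(days=2, hours=3)
elapsed = delivered - purchase  # a timedelta object

# Convert between strings and datetimes
parsed = datetime.strptime("2021-03-04 14:30", "%Y-%m-%d %H:%M")
formatted = purchase.strftime("%Y/%m/%d")

# Because date objects are hashable, they can be used as dictionary keys
sales_by_day = {date(2021, 3, 4): 12}
sold = sales_by_day[purchase.date()]

print(delivered)                # 2021-03-06 17:30:00
print(elapsed.total_seconds())  # 183600.0
print(formatted)                # 2021/03/04
```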

3. Syntax

Lambda

A Python lambda function is an anonymous function that takes arguments just like a normal one but is limited to a single expression. Lambdas are handy when Data Scientists need to use a function only once and don’t want to write a full function with def.
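As a sketch, here are lambdas in the places they usually appear: as a sort key and as the function passed to map() and filter(). The sample lists are made up.

```python
# A lambda assigned to a name (fine for a demo; prefer def for reusable functions)
square = lambda x: x ** 2

# Hypothetical sample data
pairs = [("b", 3), ("a", 1), ("c", 2)]
numbers = [1, 2, 3, 4]

# Lambdas are most useful as throwaway arguments
by_count = sorted(pairs, key=lambda pair: pair[1])      # sort by the second element
doubled = list(map(lambda n: n * 2, numbers))           # apply to every item
odds = list(filter(lambda n: n % 2 == 1, numbers))      # keep items passing the test

print(square(5))  # 25
print(by_count)   # [('a', 1), ('c', 2), ('b', 3)]
print(doubled)    # [2, 4, 6, 8]
print(odds)       # [1, 3]
```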

Comprehension

Comprehensions are one-line expressions that allow data professionals to create lists (or dictionaries) from iterable sources, such as lists. They are perfect for simplifying for-loops and map() calls while respecting one of Python’s fundamental principles: "readability counts." So, try not to overcomplicate your list/dictionary comprehensions.
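A brief sketch of list and dictionary comprehensions, using an invented `numbers` list:

```python
# Hypothetical sample data
numbers = [1, 2, 3, 4, 5, 6]

# List comprehension: [expression for item in iterable]
squares = [n ** 2 for n in numbers]

# Add a condition to filter items
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

# Dictionary comprehension: {key: value for item in iterable}
square_map = {n: n ** 2 for n in numbers[:3]}

print(squares)       # [1, 4, 9, 16, 25, 36]
print(even_squares)  # [4, 16, 36]
print(square_map)    # {1: 1, 2: 4, 3: 9}
```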

If-Else

Data Scientists use if-else statements to execute a block of code only if a certain condition is satisfied.
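The sketch below shows a full if/elif/else chain and the one-line conditional expression. The `grade` function and its thresholds are hypothetical, chosen only to illustrate the syntax.

```python
def grade(score):
    """Hypothetical helper: map a numeric score to a label."""
    if score >= 70:
        return "pass"
    elif score >= 50:
        return "resit"
    else:
        return "fail"

# One-line conditional expression: value_if_true if condition else value_if_false
temperature = 85
label = "high" if temperature > 80 else "low"

print(grade(75), grade(55), grade(20))  # pass resit fail
print(label)                            # high
```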

4. Libraries

Libraries are lifesavers for Data Scientists. Some of these libraries are massive and have been created specifically to address data professionals’ needs. Because both NumPy and Pandas offer many essential features to Data Scientists, below you will find some of the basic ones that are useful to beginners and early-career Data Scientists.

Photo by Michael D Beckwith on Unsplash

NumPy

Python is a high-level language, as there is no need to manually allocate memory or specify how the CPU performs certain operations. A low-level language, such as C, gives us this control and improves specific code performance (vital when working with Big Data). One of the reasons why NumPy makes Python efficient is vectorisation, which takes advantage of Single Instruction Multiple Data (SIMD) to process data more quickly. NumPy will become extremely useful as you apply linear algebra to your Machine Learning projects.

Bear in mind that the NumPy counterpart of a list is a 1D ndarray, whereas a 2D ndarray resembles a list of lists. NumPy ndarrays use indices along both rows and columns, and this is how you will select and slice values.
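Here is a minimal sketch of creating 1D and 2D ndarrays, a vectorised operation, and row/column selection. It assumes NumPy is installed and imported as `np`; the array values are made up.

```python
import numpy as np

# Hypothetical sample data
vector = np.array([1, 2, 3, 4])            # 1D ndarray
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2D ndarray (2 rows, 3 columns)

doubled = vector * 2       # vectorised: applies to every element, no loop needed

row = matrix[1]            # second row
cell = matrix[0, 2]        # row 0, column 2
column = matrix[:, 1]      # every row, column 1
block = matrix[:, 1:]      # every row, columns 1 onwards
large = vector[vector > 2] # boolean mask: keep values greater than 2

print(doubled)       # [2 4 6 8]
print(matrix.shape)  # (2, 3)
print(column)        # [2 5]
```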

Pandas

Although NumPy provides crucial structures and tools which make working with data a lot easier, there are some limitations:

  • Because there is no support for column names, NumPy forces you to frame every question as a multi-dimensional array operation.
  • Because there is support for only one data type per ndarray, it is more challenging to work with data containing both numeric and string values.
  • There are many low-level methods, but many common analysis patterns don’t have pre-built methods.

Luckily, the Pandas library provides solutions to every issue mentioned above. Pandas is not a replacement for NumPy, but a massive extension of NumPy. I don’t want to state the obvious but, for the sake of completeness: the main objects in Pandas are Series and DataFrames. The former is comparable to a 1D ndarray while the latter is comparable to a 2D ndarray. Pandas is vital to data cleaning and analysis.

Series

A Pandas Series is a one-dimensional labelled array that can hold any type of data (integers, floats, strings, Python objects). The axis labels are collectively called the index.
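A short sketch of building a Series with a custom index and selecting from it. It assumes Pandas is installed and imported as `pd`; the fruit prices are invented.

```python
import pandas as pd

# Hypothetical sample data: values plus an index of labels
prices = pd.Series([3.5, 2.0, 4.25], index=["apple", "banana", "cherry"])

apple = prices["apple"]         # select by label
expensive = prices[prices > 3]  # boolean filtering keeps the labels
total = prices.sum()

print(apple)                    # 3.5
print(list(expensive.index))    # ['apple', 'cherry']
print(total)                    # 9.75
```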

DataFrame

A Pandas DataFrame is a 2-dimensional labelled data structure whereby each column can have a unique name. You can think of a Pandas DataFrame as a SQL table or a simple spreadsheet.
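The sketch below builds a small DataFrame from a dictionary, then selects a column, filters rows and calls describe() for summary statistics. It assumes Pandas is imported as `pd`; the names, ages and cities are made up.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column name
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal"],
    "age": [31, 25, 40],
    "city": ["Lisbon", "Leeds", "Lisbon"],
})

ages = df["age"]                 # select one column (returns a Series)
over_30 = df[df["age"] > 30]     # filter rows with a boolean condition
first_city = df.loc[0, "city"]   # select by row label and column name
summary = df.describe()          # summary statistics for numeric columns

print(df.shape)                    # (3, 3)
print(over_30["name"].tolist())    # ['Ana', 'Cal']
print(summary.loc["mean", "age"])  # 32.0
```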

Merge, Join & Concat

These methods enable Data Scientists to expand their analysis by combining multiple datasets into a single DataFrame.
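As a sketch of the three approaches, the example below merges on a shared column, joins on the index and concatenates rows. It assumes Pandas is imported as `pd`; the `customers`, `orders` and `more_customers` tables are invented for illustration.

```python
import pandas as pd

# Hypothetical sample tables sharing a "customer_id" column
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20, 35, 50]})

# merge: SQL-style join on a column
inner = customers.merge(orders, on="customer_id", how="inner")  # only matching rows
left = customers.merge(orders, on="customer_id", how="left")    # keep every customer

# join: combine on the index instead of a column
joined = customers.set_index("customer_id").join(
    orders.set_index("customer_id"), how="inner"
)

# concat: stack DataFrames on top of each other
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Dee"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(len(inner), len(left), len(joined), len(stacked))  # 3 4 3 4
```

In the left merge, the customer with no orders is kept and gets a missing value (NaN) in the amount column.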

GroupBy

The groupby() method is useful for investigating datasets split into groups based on given criteria.
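A minimal sketch of grouping rows and aggregating each group. It assumes Pandas is imported as `pd`; the `sales` table is made up.

```python
import pandas as pd

# Hypothetical sample data
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 150, 120],
})

# Split by region, then apply one aggregation to each group
totals = sales.groupby("region")["revenue"].sum()

# Apply several aggregations at once with agg()
stats = sales.groupby("region")["revenue"].agg(["mean", "max"])

print(totals["North"], totals["South"])  # 250 200
print(stats.loc["South", "mean"])        # 100.0
```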

So, Where Next?

It is part of every Data Scientist’s learning curve to seek help with basic Python features, libraries, and syntax. Unfortunately, some of the answers you will find out there are not as straightforward as you would expect. You can find the complete ‘cheatsheet’ in my GitHub. Also, DataCamp has published a long list of cheatsheets for Python Data Scientists [2]. I hope some of your straightforward questions have now been answered quickly, and that you are ready to start the fun part of Data Science: solving problems and generating insights.

Thanks for reading. Here are other articles you might like:

Increase Productivity: Data Cleaning using Python and Pandas

Switching career to Data Science in your 30s.

Save Time Using the Command-Line: Glob Patterns and Wildcards

What Makes a Data Scientist Stand Out?


References:

[1] The Curse of Knowledge https://en.wikipedia.org/wiki/Curse_of_knowledge

[2] DataCamp website https://www.datacamp.com/community/data-science-cheatsheets?page=2

