Python — from A to Z (1)

26 concepts you should know — Part 1: from A to J

Amanda Iglesias Moreno
Towards Data Science


Image by Alex Knight on Unsplash

Python is one of the leading programming languages across the globe. It is used in many contexts, from data science, robotics, and web development to gaming and rapid prototyping. Its simple syntax makes Python programs easy to read and write, which keeps the learning curve gentle. Additionally, Python has ‘batteries included’: many libraries (standard and third-party) that will greatly facilitate your work as a programmer. Although a basic programming level can be reached faster than with other programming languages, it definitely takes some time to master Python. In this series of articles, we will explain in detail 26 Python features, helping you to discover the huge range of capabilities that Python has at its disposal.

Anaconda

Anaconda is a program to manage packages and environments with Python. It is widely used in the data science field since the most common packages used in data science are installed by default when installing Anaconda.

Anaconda also allows you to install any package, if needed, by typing the command conda install package_name in the Anaconda prompt. It is also possible to install multiple packages at once; for example, the command conda install pandas numpy will install both pandas and NumPy. Anaconda installs the most up-to-date version of the package; however, if you need a prior version for your project, you can specify it by adding the version number as follows: conda install pandas=0.22.

Anaconda can also be used to uninstall and update packages in your environment by using the commands conda remove package_name and conda update package_name respectively.

Besides a package manager, Anaconda can also be employed to create environments to isolate your projects. A virtual environment allows installing packages locally in an isolated directory for a particular project. This approach is particularly useful when we need to switch between different Python versions and install package versions.

Creating a virtual environment with Anaconda is pretty straightforward. You just need to type conda create -n environment_name [python==X.X] [list_of_packages] in the Anaconda prompt (the arguments provided in brackets are optional). For instance, you can easily create a virtual environment called myenv based on Python version 3.6 as follows: conda create -n myenv python==3.6. In this particular case, we did not specify a list of packages to be installed when creating the environment. Additional packages can also be installed later, once the environment has been activated.

After creating the new environment, you need to activate it by typing conda activate environment_name. Finally, to leave the currently active environment, type the conda deactivate command in the prompt. Bear in mind that this is just an introduction to using Anaconda as a package and environment manager. There are many more actions and commands that are not covered here for the sake of brevity.

Beautiful Soup

The Internet is one of the most important sources of information; however, in most cases it is difficult to harvest data from web pages, since the information is embedded in the HTML code. It is not as easy as downloading a CSV file. For this reason, the Python ecosystem provides a wide variety of tools that facilitate extracting data from the Internet.

Beautiful Soup is a widely used third-party library for extracting data from an HTML document. It allows you to easily explore and interact with the HTML code and get the specific information that you need (for instance, all the images on a page). The library provides different methods and attributes to identify and extract the desired information from a web page, a task that would be highly complicated and time-consuming (but possible) using only Python string methods.

Creating a Beautiful Soup object is the first step in any Beautiful Soup project. A Beautiful Soup object can be created with the BeautifulSoup constructor, passing a string (HTML code) as an argument. A typical workflow is to obtain the HTML code (as a string) of a web page, for example https://en.wikipedia.org/wiki/Madrid, using the requests library, and then parse that HTML code with BeautifulSoup to make it more accessible for picking out the information.

The extracted HTML code looks pretty messy, but BeautifulSoup converts this code to a tree structure that is easy to parse. As mentioned above, Beautiful Soup contains a couple of handy methods to extract information from the code. For instance, you can get all anchor tags in the document by using the find_all('a') method. Additionally, you can also find an HTML element by its ID or class name by adding the keyword argument id or class_ to the query.
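The steps above can be sketched as follows. To keep the example runnable offline, it parses a small hand-written HTML string (a hypothetical stand-in, not the real Wikipedia page) instead of downloading anything; with the requests library you would pass requests.get(url).text to the constructor instead. The tag contents, the id and the class name are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the code of a downloaded page.
html = """
<html>
  <body>
    <h1 id="title">Madrid</h1>
    <p class="summary">Capital of Spain.</p>
    <a href="/wiki/Spain">Spain</a>
    <a href="/wiki/Europe">Europe</a>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# All anchor tags in the document.
links = [a["href"] for a in soup.find_all("a")]

# A single element located by its id.
title = soup.find(id="title").get_text()

# Elements located by class name (note the trailing underscore in class_).
summaries = [p.get_text() for p in soup.find_all("p", class_="summary")]

print(links)
print(title)
print(summaries)
```

The keyword is spelled class_ because class is a reserved word in Python.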

There are many more functions and attributes that you can employ for extracting information from your HTML document (explained in detail in the documentation).

It is worth pointing out that, to extract a particular piece of information, you can always use the developer tools of your browser to explore the HTML code interactively and find where the information is located. You can then write a Beautiful Soup query that targets exactly that element.

Class

Object-oriented programming (OOP) allows you to group variables and functions into a data type called a class. For large programs, OOP adds organization to your code, breaking down your program into smaller parts that are easier to understand and debug.

A class in Python is a blueprint for creating objects, consisting of methods and attributes. To define a class, you use the class keyword followed by the name of the class (using the CapWords convention). The code below defines the Circle class, which is made up of two attributes and one method.

Class Circle (image created by the author)

The attributes describe the characteristics of the object (in this case, the color and the radius), and are defined inside the __init__ method. This method is conventionally placed at the beginning of the class and is automatically executed when the class is instantiated.

The methods are actions that a class can take (e.g. the calculation of the area of the circle). They are pretty similar to functions (both use the def keyword), with the difference that methods are defined inside a class rather than outside.
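Since the original image is not reproduced here, the following is a minimal sketch of what such a class might look like. The method name area and the order of the constructor arguments are assumptions made for illustration.

```python
import math


class Circle:
    """A circle described by its color and its radius."""

    def __init__(self, color, radius):
        # Attributes: set automatically when an object is created.
        self.color = color
        self.radius = radius

    def area(self):
        # Method: an action that uses the object's own attributes.
        return math.pi * self.radius ** 2
```

The first parameter of every method, self, refers to the object the method is called on, which is how area can read the radius stored by __init__.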

After defining the class, you can create an object. The process of creating an object is called instantiation. In this case, an object would be a specific circle, for instance, a red circle of radius 5.

Different Circle objects (image created by the author)

Subsequently, we could use the Circle class again to create more instances of the class. All objects would have the same number of attributes and methods, basically because they come from the same blueprint — the Circle class.

Once the red_circle object is created, you can access its attributes using dot notation. Also using dot notation, you can call a method of the class; in this case, you add a pair of parentheses containing any input arguments the method requires.
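As a sketch of instantiation and dot notation, the snippet below creates two circles and reads their attributes. The class is repeated in compact form so the snippet runs on its own; the name blue_circle and the area method are assumptions for illustration.

```python
import math


class Circle:
    def __init__(self, color, radius):
        self.color = color
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


# Instantiation: each call to Circle(...) creates an independent object.
red_circle = Circle("red", 5)
blue_circle = Circle("blue", 2)

# Attribute access with dot notation (no parentheses).
print(red_circle.color)
print(blue_circle.radius)

# Method call with dot notation (parentheses required).
red_area = round(red_circle.area(), 2)
print(red_area)  # 78.54
```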

Defaultdict

A defaultdict works pretty much like a Python dictionary, meaning both classes share methods and operations. The defaultdict class is defined in the collections module. This module is part of the Python Standard Library, meaning no additional installation is required. To use it, we must include from collections import defaultdict at the beginning of the program.

In a dictionary, when we try to access a key that is not already defined (non-existing key), a KeyError exception is raised.

However, when using a defaultdict, the missing key is created with a default value (produced by calling the argument provided to the defaultdict constructor) without raising any exception. This argument must be a callable object or None.

In the code below, the function int is called when trying to access a non-existing key. The value returned by the function, in this case 0, is going to be assigned to the missing key.
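A small sketch of the difference between the two behaviors; the dictionary contents and key names are made up for illustration.

```python
from collections import defaultdict

# A regular dictionary raises KeyError for a missing key.
plain = {}
try:
    plain["apples"]
except KeyError as error:
    print("KeyError:", error)

# A defaultdict calls int() for a missing key and stores the result.
counts = defaultdict(int)
print(counts["apples"])  # 0

# This makes counting straightforward: no need to initialize keys first.
for fruit in ["pear", "apple", "pear"]:
    counts[fruit] += 1

print(counts["pear"])  # 2
```

Note that simply reading counts["apples"] was enough to add the key apples to the dictionary.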

The argument provided to the defaultdict constructor can also be a user-defined function (in this case a lambda function), as shown below.
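For instance, a lambda that returns a fixed string could serve as the default factory. The variable names and values below are hypothetical.

```python
from collections import defaultdict

# The lambda takes no arguments and returns the default value.
stock = defaultdict(lambda: "unknown")
stock["apples"] = "in stock"

print(stock["apples"])  # in stock
print(stock["pears"])   # unknown

# The missing key was added when it was accessed.
print(dict(stock))
```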

Encoding (character)

A character encoding provides the map (translation) between the bytes in the computer (raw zeros and ones) and real characters.

Encoding (image created by the author)

For many years, programming languages used ASCII for the internal representation of characters. This standard includes 128 characters by using 7 bits of information. It covers all English characters (it was originally developed for electronic communications in the USA); however, it fails to cover characters that appear in other languages, such as accented characters. For this reason, there has been a shift to Unicode. This standard contains a far wider range of characters and symbols, and its encodings (UTF-8, UTF-16, and UTF-32) build characters out of 8-, 16-, or 32-bit units, so a character may need more space than in ASCII.

Besides ASCII and the Unicode encodings, Python supports a wide variety of other encodings. The complete list of encodings available in Python can be consulted in the Python documentation for the codecs module.
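To make the mapping between characters and bytes concrete, here is a short sketch using the str.encode and bytes.decode methods. The sample word is arbitrary; the little-endian codec variants are used so that no byte-order mark is added to the output.

```python
text = "caf\u00e9"  # the word "café", with an accented e

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
utf32_bytes = text.encode("utf-32-le")

# The accented character takes two bytes in UTF-8.
print(utf8_bytes)  # b'caf\xc3\xa9'

# The same four characters need a different number of bytes per encoding.
print(len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))  # 5 8 16

# ASCII has no code for the accented character.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII cannot encode:", error.object[error.start:error.end])

# Decoding with the matching encoding recovers the original string.
print(utf8_bytes.decode("utf-8") == text)  # True
```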

Encoding is something you always have to take into account when working with text data. Python 3 uses UTF-8 by default when converting between strings and bytes with str.encode and bytes.decode; the open function, however, falls back to a platform-dependent default (the locale's preferred encoding) when no encoding is given, unless Python's UTF-8 mode is enabled. It is a common problem when working with files to obtain incorrectly displayed characters or exceptions due to encoding inconsistencies. For this reason, when reading a file, it is safest to state the encoding explicitly by including the argument encoding in the open function.
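The following sketch reproduces an encoding mismatch on purpose. It writes a temporary file in Latin-1 and then tries to read it back as UTF-8; the file name and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small file encoded in Latin-1.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="latin-1") as file:
    file.write("caf\u00e9")

# Reading it with the wrong encoding fails.
decode_failed = False
try:
    with open(path, encoding="utf-8") as file:
        file.read()
except UnicodeDecodeError as error:
    decode_failed = True
    print("UnicodeDecodeError:", error.reason)

# Reading it with the encoding it was written in works.
with open(path, encoding="latin-1") as file:
    content = file.read()

print(content == "caf\u00e9")  # True
```

A mismatch does not always raise an exception: some combinations of encodings decode without error and silently produce the wrong characters, which is why stating the encoding explicitly is a good habit.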

Findall (re module)

Text processing can often be simplified by using regular expressions. Regular expressions are patterns used to match character combinations in strings. They are very useful in text manipulation, which is why the Python Standard Library has a module exclusively for working with regex patterns: the re module. This module provides a wide variety of functions, and one you will surely come across when working with text data is the findall function.

The re.findall(pattern, string) function extracts all non-overlapping matches of a regular expression (pattern) from a string (string). The first parameter of the function is a regular expression, while the second parameter is the string where we want to search. The function returns a list of strings where each element is a non-overlapping match.

Findall function (image created by the author)

The following code extracts all numbers (including decimal numbers) from a string. We are not going to get into the details of how regular expressions work. For that, we would need to write another article 😛. For now, you will have to trust that the pattern works for extracting numbers.
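A minimal sketch of such an extraction is shown here. Both the sample sentence and the pattern are assumptions chosen for illustration, not necessarily the ones used in the original example; the pattern matches either digits with a decimal part or plain digits.

```python
import re

text = "The bill was 12.50 euros for 3 coffees and 2 croissants."

# Try "digits.digits" first, then fall back to plain digits.
pattern = r"\d+\.\d+|\d+"

matches = re.findall(pattern, text)
print(matches)  # ['12.50', '3', '2']
```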

As shown above, the findall function returns a list of strings, containing all non-overlapping matches. After extracting the data, we can easily convert each string of the list into a float using list comprehensions.
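The conversion step might look like this; the input string and the pattern are again hypothetical.

```python
import re

matches = re.findall(r"\d+\.\d+|\d+", "Temperatures: 21.5, 19 and 23.25 degrees")

# findall returns strings, so convert each match explicitly.
numbers = [float(match) for match in matches]
print(numbers)  # [21.5, 19.0, 23.25]
```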

In this case, the pattern used is really simple; however, regular expressions can get much more sophisticated and complicated. They come in handy for extracting web addresses, email addresses, passwords, or phone numbers from documents, making them a really powerful tool for obtaining information from textual data.

Get (dictionary method)

A KeyError exception is a common exception when working with dictionaries in Python. This exception is raised when the user tries to access a key that is not present in a dictionary.

In the coding example below, a KeyError exception is raised when trying to access a non-existing key in the dictionary.
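A short sketch of the situation, using a made-up dictionary of capitals. The exception is caught here only so that the snippet runs to completion.

```python
capitals = {"Spain": "Madrid", "France": "Paris"}

# Accessing an existing key works as expected.
print(capitals["Spain"])  # Madrid

# Accessing a missing key raises KeyError.
try:
    capitals["Italy"]
except KeyError as error:
    missing_key = error.args[0]
    print("KeyError for key:", missing_key)
```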

To handle this problem, besides using a try/except block, a common solution is to employ the get method. This method returns the value found at the specified key (if the key is available). If the key is not available in the dictionary, the get method returns None or a custom default value, never raising an exception.
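The same lookup with the get method could be written as follows; the dictionary and the fallback value are hypothetical.

```python
capitals = {"Spain": "Madrid", "France": "Paris"}

found = capitals.get("Spain")
missing = capitals.get("Italy")
fallback = capitals.get("Italy", "unknown")

print(found)     # Madrid
print(missing)   # None
print(fallback)  # unknown
```

Unlike a defaultdict, get does not add the missing key to the dictionary.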

Always remember, to avoid KeyError exceptions when using dictionaries, you can switch from directly accessing the key of the dictionary to using the get method. This will prevent unexpected exceptions during the execution of your code.

Help

While working in Python, it is sometimes challenging to remember the syntax of every function, particularly those that you hardly use. The Python help function provides access to the documentation of a specific function, showing what the function does and its syntax.

The following code provides a summary of the hasattr function. As shown below, the help function prints basic info on how the function works and its definition.
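The call itself is a one-liner. The exact text printed depends on your Python version, so no output is reproduced here; the second part simply shows that the same information lives in the function's docstring.

```python
# Print the built-in documentation for hasattr.
help(hasattr)

# The same text is stored in the function's __doc__ attribute.
summary = hasattr.__doc__.splitlines()[0]
print(summary)
```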

In addition to the help function, Python also provides more detailed documentation that can be consulted online.

IndexError

Ordered containers in Python (e.g. lists or tuples) identify their elements by position (index). Python follows a convention called zero-based indexing, meaning the first element in an ordered container is located at index 0.

To access elements of a list, you use an index operator, consisting of a pair of square brackets ([]) and an index (starting at 0), specifying the element you want to retrieve.

Access elements of a list (image created by the author)

If you try to access an item of a list beyond the range of available elements, you will get an IndexError exception, as shown below.
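A small sketch with a made-up list of three elements, whose valid indexes are therefore 0, 1 and 2. The exception is caught only so that the snippet runs to completion.

```python
fruits = ["apple", "banana", "cherry"]

print(fruits[0])  # apple (first element)
print(fruits[2])  # cherry (last element)

# Index 3 is beyond the last valid index.
try:
    fruits[3]
except IndexError as error:
    message = str(error)
    print("IndexError:", message)
```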

Getting this kind of error is pretty common, especially if you have just started learning Python. If you are coming from R, wrong indexing is a common mistake you will face, because R, unlike Python, uses one-based indexing.

Join (string method)

The join method is used to concatenate the strings contained in an iterable into one string. The syntax of the method is shown below, where string represents the delimiter inserted between each element of the iterable and iterable is the sequence of strings we want to concatenate (required argument). This sequence can be e.g. a list, tuple, dictionary, set, or generator.

string.join(iterable)

The following block of code shows how we can use the join method in Python for concatenating strings. As you can observe, the join method returns a string consisting of the concatenation of the strings in the iterable (in this case a list).
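For example, with a hypothetical list of words and two different delimiters:

```python
words = ["Python", "from", "A", "to", "Z"]

# The string the method is called on is placed between the elements.
sentence = " ".join(words)
csv_line = ",".join(words)

print(sentence)  # Python from A to Z
print(csv_line)  # Python,from,A,to,Z
```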

All the elements of the iterable must be strings. If not, a TypeError exception is raised, as shown below. To concatenate an iterable containing numbers, we must first convert them into strings with the str() function, as Python does not do implicit string conversion.
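A sketch of both the error and the fix, using a made-up list of integers:

```python
numbers = [1, 2, 3]

# Joining non-string elements raises TypeError.
try:
    "-".join(numbers)
except TypeError as error:
    print("TypeError:", error)

# Converting each element to a string first solves the problem.
joined = "-".join(str(number) for number in numbers)
print(joined)  # 1-2-3
```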

It is important to bear in mind that join is a string method, not a list method, as we call it on a string rather than on an iterable (a list). The sequence of strings (iterable) is, however, the primary argument of the join method.

As you might know, we can also concatenate strings using the + operator; however, this is not an efficient way of joining a large number of strings, mainly because strings are immutable, so the + operator creates a new string object every time it is used, leading to lower performance. If you want to know more about why the join method is preferred over the + operator, read the article below 💚

There are many more methods for string objects. To learn more about them, you can consult the official Python documentation.

This brief introduction shows some of the main capabilities and problems you may encounter while programming in Python. Its readability, coherence, and extensive library make Python one of the most important programming languages worldwide, being a fundamental asset any data scientist should have.
