
Mastering Python Strings

Photo by Munro Studio on Unsplash

When solving real-world data science problems, there is no escaping strings. Many features in datasets hold their values as text, or strings, as they are called. It is therefore important to be comfortable creating, manipulating and modifying strings, even if you are not working on something like Natural Language Processing.

In this post, the objective is to take you through the basics of string objects in Python and what you can do with them using the many built-in methods available, so that you can start dealing with strings like a pro!

So let's get started!

This post is divided broadly into the following topics:

  1. Creating a String object
  2. Analyzing a String
  3. String Operations
  4. Formatting String Outputs

1. Creating String Objects

In Python we create string objects by enclosing the content in quotation marks, either single or double.
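As a minimal illustration (the variable names and values are made up for this sketch), both quote styles produce the same kind of object:

```python
# Single and double quotes create identical str objects
single = 'Data Science'
double = "Data Science"
print(single == double)   # True
print(type(single))       # <class 'str'>

# Using one quote style lets the other appear inside the string
quote = "It's a string"
print(quote)              # It's a string
```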

Image by the Author

We can also convert virtually any Python object to its string representation using the str() function. Here are a few trivial examples:
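A short sketch of str() applied to a few built-in types; the variable names are assumptions chosen for illustration:

```python
# str() returns the string representation of an object
num_str = str(42)            # '42'
float_str = str(3.14)        # '3.14'
list_str = str([1, 2, 3])    # '[1, 2, 3]'
bool_str = str(True)         # 'True'
print(num_str, float_str, list_str, bool_str)
```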

2. Analyzing Strings

Check Length: len()

The built-in len() function returns the length of a string object, counting every character, including spaces.
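For instance, with a made-up sample string:

```python
my_str = "Hello World"
print(len(my_str))   # 11: ten letters plus one space
print(len(""))       # 0: the empty string has no characters
```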

Check Case: Lower, Upper, Titlecase

We can check whether all the cased characters in a string are lower case using the islower() string method. isupper() does the same check for upper case, and istitle() checks whether each word starts with a capital letter followed by lower-case letters.
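A quick sketch of the three checks on some illustrative literals:

```python
print("hello world".islower())   # True
print("HELLO WORLD".isupper())   # True
print("Hello World".istitle())   # True
print("Hello world".istitle())   # False: 'world' starts in lower case
print("1234".islower())          # False: there are no cased characters at all
```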

Check the contents of string object

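Is the string alphanumeric, that is, does it consist only of letters and digits? The isalnum() method gives the answer. Note that it returns True for a string of letters alone or digits alone, and False if the string contains spaces or punctuation.

We can also check whether the string contains only digits using the isdigit() method, and whether it contains only decimal characters (the digits 0 to 9 and their equivalents in other scripts) using isdecimal().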
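The difference between these checks is easiest to see side by side. The literals below are illustrative only:

```python
print("abc123".isalnum())    # True: only letters and digits
print("abc".isalnum())       # True: letters alone also qualify
print("abc 123".isalnum())   # False: the space is not alphanumeric
print("2021".isdigit())      # True
print("20.21".isdigit())     # False: the dot is not a digit
print("2021".isdecimal())    # True
print("²".isdigit())         # True: superscript two counts as a digit
print("²".isdecimal())       # False: but it is not a decimal character
```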

Search the string object contents

We can search for a string (a substring) within a larger string using the find() method of the string object. This returns the index position within the string of the first occurrence of the substring and if not found returns a value of -1.

The index() method does the same thing but raises a ValueError if the substring is not found.
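A small sketch contrasting the two methods, using a made-up sample string:

```python
my_str = "Mastering Python Strings"
print(my_str.find("Python"))    # 10
print(my_str.find("Java"))      # -1
print(my_str.index("Python"))   # 10

# index() raises instead of returning -1
try:
    my_str.index("Java")
except ValueError as err:
    print("ValueError:", err)
```

find() is convenient when a missing substring is an expected outcome, while index() suits cases where its absence should be treated as an error.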

Alternatively, we can use the in keyword to search for a string within another string object and it returns a boolean value, like so:
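For example, with an illustrative string:

```python
my_str = "Mastering Python Strings"
print("Python" in my_str)       # True
print("python" in my_str)       # False: the check is case sensitive
print("Java" not in my_str)     # True
```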

‘Search and replace’ can be achieved by using the replace() method. It takes in 3 parameters – the string to be replaced, the string that will replace the found string and an optional count parameter where we can indicate how many such search and replace operations have to be done.

Notice in the example below for the long_string object the last occurrence of long is not replaced as we specified only 3 counts for replacement.
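A sketch of this behaviour, using an assumed value for long_string (the sentence is made up for illustration):

```python
long_string = "a long road, a long day, a long night, a long wait"

# Without a count, every occurrence is replaced
replaced_all = long_string.replace("long", "short")
print(replaced_all)
# a short road, a short day, a short night, a short wait

# With count=3, the fourth occurrence is left untouched
replaced_three = long_string.replace("long", "short", 3)
print(replaced_three)
# a short road, a short day, a short night, a long wait
```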

rfind() works like find(), but it returns the index of the last occurrence of the substring (the one with the highest index), and -1 if the substring is not found. Here is a comparison of find() and rfind().
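A minimal comparison on a made-up string in which the substring appears twice:

```python
text = "spam and eggs and spam"
print(text.find("spam"))    # 0: first occurrence
print(text.rfind("spam"))   # 18: last occurrence
print(text.rfind("ham"))    # -1: not found
```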

As a special case of searching for substrings, if we want to check if a string object starts with or ends with a particular substring, then we can use the startswith() and endswith() methods.
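A brief sketch with a hypothetical file name:

```python
filename = "sales_2021.csv"
print(filename.startswith("sales"))           # True
print(filename.endswith(".csv"))              # True
# A tuple lets us test several candidates in one call
print(filename.endswith((".xlsx", ".xls")))   # False
```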

As a natural extension of searching within strings, we can use the count() method of the string object to count the occurrences of a substring. Note that this returns the number of non-overlapping occurrences of the substring.
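The non-overlapping behaviour shows up in the second line of this small sketch (the literals are illustrative):

```python
print("banana".count("an"))   # 2
print("aaaa".count("aa"))     # 2, not 3: matches do not overlap
print("banana".count("x"))    # 0
```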

3. String Operations

Now, let us look at methods available to modify string objects.

Earlier we checked the case of the characters in string objects using the isupper(), islower() and istitle() methods. The corresponding methods without the 'is', namely upper(), lower() and title(), return a copy of the string converted to that case.

The swapcase() method swaps the case of every character: upper-case letters become lower case and vice versa.

capitalize() returns a copy with the first character of the string in upper case and all the remaining characters in lower case.
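A sketch of the case-changing methods applied to one deliberately mixed-case sample string (the value is made up):

```python
my_str = "mastering pYTHON strings"
print(my_str.upper())        # MASTERING PYTHON STRINGS
print(my_str.lower())        # mastering python strings
print(my_str.title())        # Mastering Python Strings
print(my_str.swapcase())     # MASTERING Python STRINGS
print(my_str.capitalize())   # Mastering python strings
```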

Splitting a string object

We can split a string object into a list of substrings using a specified separator. By default, split() separates on any run of whitespace (spaces, tabs, newlines), which results in the string being broken into a list of 'words'.
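A few illustrative calls, assuming a sample value for my_str:

```python
my_str = "Mastering Python Strings"
print(my_str.split())            # ['Mastering', 'Python', 'Strings']
print("a  b   c".split())        # ['a', 'b', 'c']: runs of whitespace collapse
print("a,b,,c".split(","))       # ['a', 'b', '', 'c']: explicit separators do not collapse
print("a b c d".split(" ", 1))   # ['a', 'b c d']: at most one split
```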

Joining a list of string objects

Now let us try joining together a list of string objects. We use the join() method for this. The join() method is called on the string that we want to place between the items being joined, and it takes the list of strings as its argument. In the example below we join the split string using the string "-".

Now let us get the original value of the my_str object by joining the list of strings together again separated by space using the join() method.
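Both joins can be sketched as follows; the value of my_str is an assumption made for illustration:

```python
my_str = "Mastering Python Strings"
words = my_str.split()

# The separator is the string on which join() is called
hyphenated = "-".join(words)
print(hyphenated)             # Mastering-Python-Strings

# Joining with a single space restores the original string
restored = " ".join(words)
print(restored == my_str)     # True
```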

Note that strings are immutable: these methods never change the original string object but return a new string. If you want the change to stick, you have to reassign the output to the same variable name.
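A minimal sketch of why the reassignment is needed (the sample value is made up):

```python
my_str = "python"
my_str.upper()            # the returned string is discarded
print(my_str)             # python

my_str = my_str.upper()   # rebind the name to the new string
print(my_str)             # PYTHON
```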

Trimming Leading and Trailing Characters in Strings

Normally this refers to trimming excess whitespace at the beginning or end of the string, though the corresponding methods can also be used to drop other characters at either end. strip() removes the characters (whitespace by default if nothing else is specified) at both ends, whereas lstrip() and rstrip() remove them only at the beginning and only at the end, respectively, of the string object on which they are called. When an argument is passed, it is treated as a set of characters, and any combination of those characters is removed.
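A short sketch of the three methods; repr() is used so that the remaining spaces are visible, and the sample values are illustrative:

```python
padded = "   hello   "
print(repr(padded.strip()))    # 'hello'
print(repr(padded.lstrip()))   # 'hello   '
print(repr(padded.rstrip()))   # '   hello'

# The argument is a set of characters, not a substring
print("xxhixx".strip("x"))                 # hi
print("www.example.com".strip("w.com"))    # example
```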

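From Python 3.9 onwards we also have removeprefix() and removesuffix(). They look similar to lstrip() and rstrip() but behave differently: they remove the given substring once, and only if the string actually starts or ends with exactly that substring, whereas lstrip() and rstrip() remove any combination of the given characters.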
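The following sketch, which assumes Python 3.9 or later and uses made-up sample values, shows how removeprefix() and removesuffix() compare with lstrip() and rstrip() on the same inputs:

```python
name = "happy.py"
print(name.removesuffix(".py"))    # happy: the exact suffix is removed once
print(name.rstrip(".py"))          # ha: any trailing '.', 'p' or 'y' is stripped

title = "Mr. Murray"
print(title.removeprefix("Mr. "))  # Murray
print(title.lstrip("Mr. "))        # urray: the 'M' of Murray is stripped too

# Nothing happens if the string does not start with the prefix
print(title.removeprefix("Dr. "))  # Mr. Murray
```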

Partitioning a String

The partition() method splits a string into 3 parts – the part before the first occurrence of the provided substring, the substring itself and the part after the first occurrence of the substring. The below examples should make it clear.
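A sketch with illustrative values; note that the three parts come back as a tuple, and that rpartition() is the counterpart that splits at the last occurrence:

```python
pair = "key=value"
print(pair.partition("="))            # ('key', '=', 'value')
print("a=b=c".partition("="))         # ('a', '=', 'b=c'): only the first '=' is used
print("no separator".partition("="))  # ('no separator', '', '')
print("a=b=c".rpartition("="))        # ('a=b', '=', 'c')
```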

Splitting Multi-Line Strings

Similar to the split() method, we have the splitlines() method, which breaks up a multi-line string object into a list of strings, one per line. The end of a line is wherever you hit the 'enter' key while creating the string, that is, wherever the string contains a line break such as \n.

The resulting list elements can then be accessed by their index. Let's look at how the output changes if we insert a few more blank lines, just to be clear on how splitlines() works.
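A sketch of both cases, with made-up text: a plain multi-line string, and one with extra blank lines, where each blank line shows up as an empty string in the list.

```python
multi = """first line
second line
third line"""
lines = multi.splitlines()
print(lines)       # ['first line', 'second line', 'third line']
print(lines[1])    # second line

with_blanks = "first line\n\nsecond line\n\n\nthird line\n"
print(with_blanks.splitlines())
# ['first line', '', 'second line', '', '', 'third line']
```

The trailing newline at the very end does not produce an extra empty element.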

4. Formatting String Outputs

Frequently, we will find ourselves having to include the string output of Python code in our reports. It can be challenging to get the alignment right so that the output is readable, and the code generating this output can also get hairy at times. Fortunately, versatile string formatting options are available to help us manage this.

First let us look at code readability. In earlier versions of Python (before version 3.6), one had to use either the % operator or the format() method. From Python 3.6 onwards we have f-strings, which make things much simpler.

In the first approach, the %s and %d placeholders mark where a string value and an integer value go into the string. %s converts its value with str() if it is not already a string, while %d expects a number and formats it as an integer. The corresponding variables, field and time, are inserted into the placeholders in the output.

In the second approach, the format method called on the string object, passes the string form of the variables into the { } placeholders in the string object before the output is displayed. Alternatively the sequence number of the variable inside format can be mentioned within the { }. Otherwise, the values are inserted in the same sequence as passed to the format() method.

In terms of code readability, the third approach scores over the other two. The name of the variable to be inserted in the string output is directly mentioned within the {}. We need to add an f before the opening quotes to indicate to python that we want the string object to be treated as an f-string (for formatted string).
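The three approaches can be sketched side by side. The variable names field and time come from the text above, while their values and the sentence itself are assumptions made for illustration:

```python
field = "Data Science"   # hypothetical value
time = 5                 # hypothetical value

# 1. The % operator with %s and %d placeholders
s1 = "I have worked in %s for %d years" % (field, time)

# 2. str.format(), first by position and then by explicit index
s2 = "I have worked in {} for {} years".format(field, time)
s3 = "I have worked in {1} for {0} years".format(time, field)

# 3. An f-string, with the variable names inside the braces
s4 = f"I have worked in {field} for {time} years"

print(s1)                      # I have worked in Data Science for 5 years
print(s1 == s2 == s3 == s4)    # True
```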

Aligning String Output

Within formatted strings, a variety of format specification options are available to help us align the output neatly. We can specify the minimum number of slots the output should take, whether it has to be left, right or center aligned, and so on. The Python documentation on the format specification mini-language describes these options in detail.

source: https://docs.python.org/3/library/string.html

The code and output below give a flavour of how this can be used. Let us create 3 variables (name, location, salary) and see how we can use some of the formatting options to change the default output and align it better. Of course, you have the option of putting these variables as columns in a DataFrame to format the output, but if that is not an option or not suitable for the given situation, then the string formatting options below are available.

The integer within the {} after the : indicates the minimum number of slots that the output value has to take. By default, values of string variables are left aligned and numeric variables are right aligned.

The default alignment can be changed by placing <, > or ^ (for left, right or center alignment) before the integer. The number formatting (integer or float), the float precision and comma separators (for thousands) can also be indicated after the integer within the {}.
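A sketch of these options using the three variables named above; their values, the column widths and the | delimiters are assumptions chosen to make the padding visible:

```python
name, location, salary = "Asha", "Mumbai", 1234567.891   # hypothetical values

# Width only: strings default to left aligned, numbers to right aligned
row1 = f"|{name:10}|{location:10}|{salary:15}|"
print(row1)
# |Asha      |Mumbai    |    1234567.891|

# Explicit alignment, thousands separator and two decimal places
row2 = f"|{name:>10}|{location:^10}|{salary:<15,.2f}|"
print(row2)
# |      Asha|  Mumbai  |1,234,567.89   |
```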

Conclusion

Thanks for reading. I hope you are now comfortable dealing with strings in Python. I would love to hear your feedback on [email protected] or we can connect on LinkedIn. You may also be interested in my other articles on https://balaji4u.medium.com/

