The world’s leading publication for data science, AI, and ML professionals.

5 Must-Know SQL Functions for String Manipulation

Working with textual data is an important part of data analysis

Photo by Aaron Burden on Unsplash
Photo by Aaron Burden on Unsplash

Sql is used for managing data in a relational database. However, it offers much more than querying a database. The SQL functions also allow for performing data analysis and manipulation operations efficiently.

Since a substantial amount of raw data comes in textual form, it is of great importance to have versatile ways to manipulate strings. Raw data is usually not in the most desired format. We manipulate them to create usable and informative features.

In this article, we will go over 5 SQL functions to work with strings. For the examples, I will be using a small part of the Melbourne housing dataset available on Kaggle.

I have created a table called melb. Let’s first take a look at the dataset.

(image by author)
(image by author)

The dataset contains some information about the houses that are for sale.

Note: There are several relational database management systems such as MySQL, SQL Server, PostgreSQL, and so on. Although they mostly adapt the same SQL syntax, there might be small differences. In this article, we will use PostgreSQL.


1. Concat

Concatenating a string is basically combining multiple strings into one. In our dataset, we can combine the address and region name columns to create a full address.

We can use the concat function to perform this task.

SELECT CONCAT(address, ', ' , regionname) AS full_address 
FROM melb 
LIMIT 5;
              full_address
-----------------------------------------
85 Turner St, Northern Metropolitan
25 Bloomburg St, Northern Metropolitan
5 Charles St, Northern Metropolitan
40 Federation La, Northern Metropolitan
55a Park St, Northern Metropolitan

We can also use the "||" operation for concatenating strings.

SELECT address || ', ' || regionname AS full_address
FROM melb 
LIMIT 5;

The limit keyword limits the number of rows to be displayed.


2. Split

A string might contain multiple pieces of information. In such cases, we may need to extract a piece from a string and use it as a separate feature.

The address column contains the house numbers. Let’s use it to create a new column called house number.

SELECT SPLIT_PART(address, ' ', 1) AS house_number 
FROM melb 
LIMIT 5;
house_number
--------------
85
25
5
40
55a

The split_part function takes 3 arguments. The first one is the column name. The second one is the character that defines the split point which is space in our case. The third argument indicates which part we need after the split. The house numbers are at the beginning of the address so we take the first part.


3. Replace

The replace function allows for replacing a part of string with a new set of characters. It comes in handy when we need to standardize some text.

As an example, we can replace the "st" characters in the address column with the word "street".

SELECT address, REPLACE(address, 'St', 'Street') AS new_address 
FROM melb 
LIMIT 5;
    address      |     new_address
-----------------+---------------------
85 Turner St     | 85 Turner Street
25 Bloomburg St  | 25 Bloomburg Street
5 Charles St     | 5 Charles Street
40 Federation La | 40 Federation La
55a Park St      | 55a Park Street

4. Left and right

The left and right functions can be used to take a part from a string. We define the part to be selected in terms of the number of characters. For instance, the left function with 5 will return the first 5 characters from the left (i.e. from the beginning).

SELECT LEFT(regionname, 5) AS region_5 
FROM melb 
LIMIT 5;
region_5
----------
North
North
North
North
North

The right function works similarly.


5. Like

The like is actually an SQL operator, not a function. However, it is a highly useful tool when working with strings. We can use it to filter strings that are "like" another string or a sequence of characters.

For instance, we can filter the addresses that contain the string "Park".

SELECT address 
FROM melb 
WHERE address LIKE '%Park%' 
LIMIT 5;
   address
----------------
55a Park St
49 Park St
159 Park St
17 Parkhill Dr
93 Parkmore Rd

The ‘%’ represents any character. Thus, we are filtering addresses that start and end with any character. It just has to contain "Park" somewhere.

The like operator is case-sensitive. We are likely to get different addresses with "park". Let’s try.

SELECT address 
FROM melb 
WHERE address LIKE '%park%' 
LIMIT 5;
    address
-----------------
18 Spark St
14 Riverpark Dr
1 Riverpark Dr
23 Oakpark Dr

Conclusion

Textual data is an essential part of data analysis. Thus, we need efficient methods and techniques to handle strings. What we have covered in this article help you manipulate or update strings. You can also extract a piece of information from a string using these functions.

Thank you for reading. Please let me know if you have any feedback.


Related Articles