Introduction
After reading this article we should have more confidence to extract the data insights contained within character variables. A character variable or string can be represented by text values such as a person’s; name, job, or address. Many of the ML models that are used require features to be in a numeric format to extract meaning. With string manipulation techniques data can be cleaned and converted into the necessary format needed.
The Sql code shown in this article was created using an instance of Microsoft SQL Server Management Studio 18. Each of the examples used aims to highlight the basic concepts of the different SQL string functions that are available.
Input data
For this piece of analysis, a sample of the IBM HR Analytics dataset has been taken from Kaggle¹. Within this dataset a number of character variables are available.
CHARINDEX
The character index method can be used to review the presence of a String value within a character column.
In the SQL code shown above, we are searching for the value "sales" within the first application of the method. The string search shows that the value is not case-sensitive, as the actual value in the column searched shows a camel case format. As the filter clause is applied to the dataset then only values returning a match to the filter expression will be shown in the query results.

As the output above shows, the two string values that have been searched for will return the index position that represents the first character which matches the overall string value being searched for. Should the string value not be present then a value of zero will be returned. We can see within the second row that the column value "Representative" will not match the "Exec" value that was searched for and so a zero is returned.
PATINDEX
When reviewing for a pattern contained within a column, the pattern index method is appropriate.
By making use of the percentage symbol we are providing a wildcard to the method. It is this wildcard that allows for any number of other values to be present between the pattern that is being searched for.

The results highlight that two values were present before the first letter "b" was found and then six values were present before the second letter "y".
Within this review of the job role column, we are looking for a list of values that are contained anywhere within the string column.

As can be seen from the SQL output only the letter "x" was identified within this search. However, should any or all of the letters from the list be present within the row, then these results would be shown.
LEFT & RIGHT
These two methods can be used to return a section of a string beginning from the start or the end of the string.
By choosing the left method, the first three character values from the start of the string are returned.

Whereas with the right method the final three character values have been returned. Each of these methods can help to return the results required if the beginning or end of the string is where your analysis is starting. However, if another starting position is required then the use of the substring method is much more practical.
SUBSTRING
Being able to identify the start and length positions of the string that is required is the core of the substring method. By having these two positional parameters a greater level of flexibility is available when reviewing string variables.
With the sentence variable that has been declared, there are a number of items that can be extracted. Using the substring method with start and length values from the index allows the user to find the two fruit items. As the substring index begins at one and the length position parameter is inclusive, "Apples" can be returned with the position parameter values of one and six.

The second fruit has been identified by extending the start position value to the appropriate position from the index. With this example, the length position refers to the length of the fruit that is required. If the length positional parameter was not included then all string values after the start position would be returned in the query result.
REPLACE
After identifying the string values that we are interested in, there may be times when this string value needs to be adjusted. For this task the replace method is available.
From the department column, the ampersand has been found but it is required to be changed to "and" instead of the "&" symbol. The parameters in the method are the old item for the second parameter and the new item for the third parameter.

As the results highlight the new department name has been created as expected. This is a small example of what can be achieved when identifying text that could be converted into another format.
Conclusion
A lot of additional information can be gained from data that is contained within text columns or features. Within this article, a number of different methods have been shown to help with this extraction process. Each method highlighted how understanding the index of the character string can provide opportunities to create new features.
In the next article, we will continue to explore the different SQL string methods that can be used.
Thanks very much for reading
[1] : Kaggle dataset IBM HR analytics attrition dataset from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset, with a license agreement of https://opendatacommons.org/licenses/dbcl/1-0/