These functions make the process of data preprocessing simpler.
This article assumes the reader has a basic knowledge of Excel functions.

For non-programmers, Microsoft Excel is a great tool for preprocessing and handling structured data. Excel has functions and techniques which makes it easier to clean structured data. We’ll discuss a few of the many functions along with a few examples. Before proceeding further, we’ll discuss a few basic functions which will be a part of a larger formulae later in this article.
Basic Functions
- IF
This function checks a condition and returns a specified value accordingly. In the example below, the function checks a condition "is 2 greater than 3". If the condition is TRUE (which it isn’t) then the function returns "2 > 3" else it return "2 < 3".
=IF(2>3,"2 > 3","2 < 3")
Since, 2 isn’t greater than 3, so the condition is FALSE and the value returned will be "2 < 3".
- IFERROR
This function helps in error handling. In Excel, the most common errors are #N/A, #VALUE!, #DIV/0!, #REF!, #NAME?, #NUM!, to name a few. Let’s look at an example of error handling where we try to divide 1 by 0.
=IFERROR(1/0,-1000)
In the above example, we have divided 1 by 0. This calculation returns an error (#DIV/0!). The IFERROR function takes a calculation/formula as an input and returns the result of the calculation if it isnt’ an error. If the result is an error, the value we specified (-1000 in the above example) is returned.
If not handled, the formulae involving an erroneous calculation would also return an error. For example, if we add 10 numbers (using the SUM function) where #DIV/0! is one of the 10 numbers, then the SUM function would also return #DIV/0!.
- LEFT
This function returns first ‘n’ characters of a string. Let’s look at an example to extract the first three characters of the string "Excel". The output of the function will be "Exc".
=LEFT("Excel",3)
4. RIGHT
This function returns the last ‘n’ characters of a string. Let’s look at an example to extract the last three characters of the string "Excel". The output of the function will be "cel".
=RIGHT("Excel",3)
5. MID
This function returns a substring of a string if we input the starting number and number of characters to return. Let’s look at an example to extract the substring "xce" from the string "Excel". In this example, the starting number is 2 and the number of characters to return is 3.
=MID("Excel",2,3)
6. FIND
This function returns the position of the first occurrence of a substring/character in a string. Let’s look at an example to find the position of the substring "ce" in the string "Excel". The function returns 3, since, it is the position of the first and only occurrence of "ce".
=FIND("ce","Excel")
The FIND function also enables a user to control the starting point of the search. Let’s look at an example to find the position of the second occurrence of "a" in "abca". We’ll add a starting point argument, 2 (since we want to exclude the first ‘a’ from the search). The function returns 4.
=FIND("a","abca",2)
The search in FIND is case-sensitive, there is another function named SEARCH which is same as FIND but it’s search is not case-sensitive. FIND and SEARCH functions return a #VALUE! error if the specified substring/character is not found in a string.
7. ISNUMBER
This function returns TRUE if a given input is a number else returns FALSE. Let’s look at an example where we input "abc" and 123 to the function.
=ISNUMBER("abc")
=ISNUMBER(123)
In the above example, the first formula returns FALSE as "abc" is not a number. The second formula returns TRUE as 123 is a number.
8. ISTEXT
This function returns TRUE if the given input is a string/text else returns FALSE. Let’s look at an example where we input "abc" and 123 to the function.
=ISTEXT("abc")
=ISTEXT(123)
In the above example, the first formula returns TRUE as "abc" is a string/text. The second formula returns FALSE as 123 is a number.
9. SUBSTITUTE
This function replaces the ‘nth’ occurence of a substring in a string with a replacement string. This function is case-sensitive and ‘n’ is 1 by default. Let’s look at an example to replace "Excel" with "MS Excel" in the string "Excel is cool".
=SUBSTITUTE("Excel is cool","Excel","MS Excel")
Let’s look at another example to replace the second occurrence of "Excel" by "It" in the string "Excel is cool. Excel is developed by Microsoft". The formula returns "Excel is cool. It is developed by Microsoft".
=SUBSTITUTE("Excel is cool. Excel is developed by Microsoft","Excel","It",2)
10. TRIM
This function removes the leading and trailing whitespaces from a string.
=TRIM(" a ") returns "a"
11. COUNTA
This function returns the count of non-empty cells in a range of cells.
=COUNTA(A1:A10)
12. COUNTBLANK
This function returns the count of the empty cells in a range of cells. This function helps in finding the count of missing values in a column.
=COUNTBLANK(A1:A10)
13. COUNTIF
This function returns the count of a specified value in a range of cells. This is useful to find the count of missing values which are masked by values like 999, -1000, -, ?, etc. Let’s look at an example to count the number of occurrences of "-" in a range of cells. We can also specify a cell address containing the search criteria instead of "-".
=COUNTIF(A1:A10,"-")
=COUNTIF(A1:A10,B1)
A few data preprocessing examples
- Find the number of occurrences of a word in a string
We may need to count the number of occurrences of a word/character in a string. Let’s look at an example to count the number of times the word "is" appeared in a string.

=(LEN(A2)-LEN(SUBSTITUTE(LOWER(A2),"is","")))/LEN("is")
Let’s break the formula into parts to understand it better and apply it to the name in the cell A2 of the above image.
- LEN(A2) calculates the length of the string. This returns 44.
- LEN(SUBSTITUTE(LOWER(A2),"is", "")) removes the word "is" from the string and calculates its length. This returns 40. The string in A2 is converted to lowercase to make the SUBSTITUTE function case-insensitive.
- We subtract the length of the string without "is" from the length of the original string. This gives us the number of times the word "is" occured in the string (i.e. 2) multiplied by the length of the word "is" (i.e. 2) so, we get 2 * 2 = 4. So, we need to divide the result by the length of the search string (i.e. length of "is", which is 2) and we finally get 4 / 2 = 2.
2. Count the number of words in a string
We may need to find out the number of words in a string. Let’s look at an example from the Titanic dataset to count the number of words in a passenger’s name.

=LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1
Let’s break the formula into parts to understand it better and apply it to the name in the cell A2 of the above image.
- LEN(A2) calculates the length of the string. This returns 23.
- LEN(SUBSTITUTE(A2," ", "")) removes spaces from the string and calculates the length of the string. This returns 20.
- We subtract the length of the string without spaces from the length of the original string. This gives us the number of spaces in the string i.e. 3. Adding 1 to the number of spaces gives us the number of words i.e. 4.
3. Extract the first word of a string
There might be cases where we need to create additional features from an existing feature. Let’s look at an example to extract titles of people from their names. We’ll assume the first word of a name to be the title. A name without a space in it is assumed to have no title.

=LEFT(A6,IFERROR(FIND(" ",A6)-1,0))
Let’s break the formula into parts to understand it better and apply it to the name in the cell A2 of the above image.
- FIND returns the position of first space in the name i.e. 4. We subtract 1 from the position to exclude the space itself. Now it becomes 3.
- IFERROR handles an error returned by FIND if the name has no space in it. IFERROR returns 0 if the name has no space. In this case the name has a space in it. Hence, there is no error returned by FIND.
- LEFT returns the first ‘n’ characters as returned by FIND i.e. first 3 characters. If the name has no space (as in the cell A6 of the above image) the IFERROR returns 0 and LEFT returns first 0 characters i.e. nothing.
4. Extract the last word of a string
Let’s look at an example to extract last names of people from their names. We’ll assume the last word of a name to be the last name. When a name has no spaces in it, then the name is returned as it is.

=MID(A2,IFERROR(FIND("~",SUBSTITUTE(A2," ","~",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))+1,1),LEN(A2))
Explaining the formula by breaking it into parts might make things complicated. So, let’s just discuss the process involved in it.
- The position of last occurrence of space is found out by finding the number of spaces in the string.
- The last occurrence of space is replaced by "~" (it can be any character which is not part of the string).
- We find the position of "~" in the string, this in turn is the position of the last occurrence of space in the string. Adding 1 to it increments the position by 1, so that the space is not included.
- IFERROR returns the length of the original string if the string has no spaces in it.
- MID returns the substring from the last occurrence of space till the end of a string. Since, the number of characters to be extracted is not known, we can give the maximum number possible i.e. the length of the original string.
5. Extract the ‘nth’ word of a string
Let’s look at an example to extract title of people from their names in the Titanic dataset. The second word of a name is the title. Since, the second word is the title in most of the names, using text to columns feature of Excel is recommended. However, we’ll discuss the formula as well.

=SUBSTITUTE(LEFT(MID(A2,IFERROR(FIND("~",SUBSTITUTE(A2," ","~",1))+1,1),LEN(A2)),IFERROR(FIND(" ",MID(A2,IFERROR(FIND("~",SUBSTITUTE(A2," ","~",1))+1,1),LEN(A2)))-1,LEN(A2))),".","")
Replacing 1 with ‘n-1’ in SUBSTITUTE(A2," ","~",1) extracts the ‘nth’ word of a string. In the above example, we extracted second word of a string so it is 1. The below formula extracts the third word of a string.
=SUBSTITUTE(LEFT(MID(A2,IFERROR(FIND("~",SUBSTITUTE(A2," ","~",2))+1,1),LEN(A2)),IFERROR(FIND(" ",MID(A2,IFERROR(FIND("~",SUBSTITUTE(A2," ","~",2))+1,1),LEN(A2)))-1,LEN(A2))),".","")

Explaining the formula by breaking it into parts might make things complicated. So, let’s just discuss the process involved to extract ‘nth’ word.
- The last ‘n-1’ occurrence of space is replaced by "~" (it can be any character which is not part of the string).
- We find the position of "~" in the string, this in turn is the position of the last occurrence of space in the string. Adding 1 to it increments the position by 1, so that the space is not included.
- IFERROR returns the length of the original string if the string has no spaces in it.
- MID returns the substring from the last occurrence of space till the end of a string. Since, the number of characters to be extracted is not known, we can give the maximum number possible i.e. the length of the original string.
- Extract the first word of the resultant string as seen in the example 3.
- Replace the periods (.) at the end of the title with a blank string ("").
6. Min-max normalization
Min-max normalization normalizes numeric features such that they remain in the range [0,1].

=(A2-MIN($A$2:$A$9))/(MAX($A$2:$A$9)-MIN($A$2:$A$9))
We can use the formula below to normalize the data within a range [new_min, new_max]. In the formula below new_max = 10 and new_min = 1. So, the data will be normalized such that they remain in the range [1,10].
=(((A2-MIN($A$2:$A$9))/(MAX($A$2:$A$9)-MIN($A$2:$A$9)))*(10-1))+1

7. Discretization
Discretization converts a numeric feature into a categorical feature. It converts a numerical feature into class intervals like 0–10, 11–20, etc. Once the class intervals/bins are decided, we can use Excel formulae to discretize the data. In the below example, we’ll discretize numeric values into 4 class intervals namely, "0 to 30", "31 to 60", "61 to 90" and "91 to 100". Any value above 100 is assigned the class "Undefined".
=IF(AND(A2>90,A2<=100),"91 to 100",IF(A2<=30,"0 to 30",IF(AND(A2>=31,A2<=60),"31 to 60",IF(AND(A2>=61,A2<=90),"61 to 90","Undefined"))))

These are a few Data Preprocessing techniques using Excel functions. There are other preprocessing/cleaning techniques which can be done in Excel without using functions, such techniques aren’t discussed in this article. If you have a programming background it is recommended to use a programming language like Python or R. For non-programmers, Excel is one of the best options to handle structured data.
Know more about my work at https://ksvmuralidhar.in/