
Mastering File And Text Manipulation With Awk Utility

Advanced text processing with just a one-line command

Photo by Maarten van den Heuvel on Unsplash

As software developers, system administrators, and data scientists, we work with files and manipulate them as a frequent and important part of our day-to-day jobs. Knowing how to edit text files quickly and efficiently can save us a considerable amount of time.

In this article I’m going to introduce you to the Awk command, a very powerful text-processing tool that can handle complex text-processing tasks with one or a few lines of code. You may tend to use your favorite programming language, such as Python, Java, or C, for these kinds of tasks, but after reading this tutorial you will see that many of them are simpler and more efficient to do with Awk.

I’ll try to demonstrate Awk’s usage by providing basic examples of how to solve common text processing tasks with it.


Installation

Awk is available by default on most Unix-like systems. In case you don’t have it, you can install it with one of the following commands.

For Debian-based distributions:

$ sudo apt-get install gawk

For RPM-based distributions:

$ sudo yum install gawk

If you are using Microsoft Windows, check the GNU manual for installation instructions.


Workflow

Awk’s workflow is simple. It reads a line from the input stream, executes the specified commands on it, and repeats this procedure until the end of the input.

There are also BEGIN and END blocks, which you can use to execute commands once before and once after the per-line loop.

Flowchart by Author
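
To make the workflow concrete, here is a minimal sketch that uses all three kinds of block. The file name sample.txt and its three lines are made up for this illustration, and NR is Awk's built-in record counter.

```shell
# Create a tiny sample file (contents are hypothetical, for illustration only).
printf 'alpha\nbeta\ngamma\n' > sample.txt

# BEGIN runs once before any input is read, the middle block runs once
# per input line, and END runs once after the last line.
result=$(awk 'BEGIN {print "start"} {print "line: " $0} END {print "done, " NR " lines"}' sample.txt)
printf '%s\n' "$result"
```

The BEGIN message appears first, then one line per input record, then the END message with the total line count.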

Let’s jump into it

The basic structure of an Awk command is:

awk [options] 'program' file ...

Let’s show some examples.

Example 1: Just print each line as it is

Consider a text file input.txt with the following content:

John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France

Run the command:

$ awk '{print}' input.txt

Awk runs {print} for each line, which simply prints that line. The output is therefore the same as the input.

Example 2: Print the first two columns

Here we want to print only the name (first word) and age (second word) of each person, separated by a tab.

Command:

$ awk '{print $1 "\t" $2}' input.txt

Output:

John    23
David   18
Sarah   21
Dan     42
Brian   50
Lewis   37
Ethan   12

In the above example, $1 and $2 represent the first and the second fields from each input line. $0 represents the whole line.
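
As an alternative sketch, the separator can be set once through the built-in OFS (output field separator) variable instead of being concatenated by hand. When print is given comma-separated arguments, Awk joins them with OFS. The heredoc below simply recreates the article's input.txt so the snippet is self-contained.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# With OFS set to a tab, "print $1, $2" joins the two fields with a tab.
result=$(awk -v OFS='\t' '{print $1, $2}' input.txt)
printf '%s\n' "$result"
```

This should produce the same tab-separated output as the command above, and it scales better when many fields are printed.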

Example 3: Add line numbers at the start of each line

Here we define a variable named count, increment it as each line is read, and print it at the start of the line.

Command:

$ awk -v count=0 '{print ++count " " $0}' input.txt

Note that we can also drop the -v count=0 part. An uninitialized Awk variable is treated as 0 in a numeric context, so the result is the same:

$ awk '{print ++count " " $0}' input.txt

Output:

1 John 23 Italy
2 David 18 Spain
3 Sarah 21 Germany
4 Dan 42 Germany
5 Brian 50 England
6 Lewis 37 France
7 Ethan 12 France
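
A hand-written counter is not strictly needed for this task. As a sketch of an alternative, the built-in variable NR already holds the number of the current record, so it can be printed directly. The heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# NR is the current record (line) number, starting at 1.
result=$(awk '{print NR " " $0}' input.txt)
printf '%s\n' "$result"
```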

Example 4: Only print people who are older than 30

The Awk programming language supports conditions too.

Command:

$ awk '{if ($2 > 30) print $0}' input.txt

Output:

Dan 42 Germany
Brian 50 England
Lewis 37 France
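
The same filter can also be written in Awk's pattern-action style, sketched below. A condition placed outside the braces acts as a pattern, and when no action is given, Awk prints every line for which the pattern is true. The heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# A bare pattern with no action block prints each matching line.
result=$(awk '$2 > 30' input.txt)
printf '%s\n' "$result"
```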

Example 5: Generate a report of how many people are from each country

We can achieve this by using dictionaries (which Awk calls associative arrays) and loops.

Command:

$ awk '{my_dict[$3] += 1} END {for (key in my_dict) {print key, my_dict[key]}}' input.txt

Here we have a dictionary named my_dict. For each line, the key is the third word (the country name), and we increase its value by 1. The commands after the END keyword form the END block described in the Workflow section. In the END block we loop over the dictionary and print its (key, value) pairs. Note that Awk does not guarantee the order in which for (key in my_dict) visits the keys, so the lines of your output may appear in a different order.

Output:

Spain 1
France 2
Germany 2
Italy 1
England 1
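
Because the iteration order of for (key in array) is unspecified, one simple way to get a stable report is to pipe Awk's output through sort. The sketch below assumes the standard sort utility is available and uses LC_ALL=C only to make the ordering independent of the locale. The heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# Count people per country, then sort the report alphabetically by country.
result=$(awk '{count[$3]++} END {for (c in count) print c, count[c]}' input.txt | LC_ALL=C sort)
printf '%s\n' "$result"
```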

Example 6: Calculate the average age

Command:

$ awk '{age_sum += $2} END {print age_sum/NR}' input.txt

NR is a built-in variable that holds the number of records read so far. In the END block it therefore equals the total number of lines.

Output:

29

Other Awk built-in variables are listed in the GNU Awk User’s Guide.
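
The average calculation divides by NR, which would be zero for an empty file. The following sketch adds a guard for that case and uses printf to control the number of decimal places. The one-decimal format is an arbitrary choice for illustration, and the heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# Sum the ages, then print the mean with one decimal place.
# The NR > 0 check avoids a division by zero on empty input.
result=$(awk '{sum += $2} END {if (NR > 0) printf "%.1f\n", sum / NR}' input.txt)
printf '%s\n' "$result"
```

For the sample data this prints 29.0, since the ages sum to 203 across 7 lines.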

Example 7: Only print people whose names contain the letter ‘s’

We can use regular expressions with Awk.

Command:

$ awk '$1 ~ /[sS]/ {print $1}' input.txt

Here we specify that the first word, $1, should match the regular expression [sS], that is, contain a lowercase or uppercase s.

Output:

Sarah
Lewis
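
Listing both cases in a character class works for one letter but gets tedious for longer patterns. As a sketch of another approach, the built-in tolower function can normalize the field before matching, so the regular expression only needs the lowercase form. The heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# Lowercase the name before matching, so /s/ matches both "s" and "S".
result=$(awk 'tolower($1) ~ /s/ {print $1}' input.txt)
printf '%s\n' "$result"
```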

Example 8: What if the input file is in another format, like CSV?

You can set the regular expression used to separate fields with the -F option.

Consider the following file, input.csv:

John,23,Italy
David,18,Spain
Sarah,21,Germany
Dan,42,Germany
Brian,50,England
Lewis,37,France
Ethan,12,France

Command:

$ awk -F "," '{print $1 " " $3}' input.csv

Output:

John Italy
David Spain
Sarah Germany
Dan Germany
Brian England
Lewis France
Ethan France
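
If the output should stay in CSV form as well, a possible variation is to set both the input separator FS and the output separator OFS in a BEGIN block. This is only a sketch: splitting on a plain comma does not handle quoted fields that themselves contain commas. The heredoc recreates the article's input.csv.

```shell
# Recreate the CSV sample file from the article.
cat > input.csv <<'EOF'
John,23,Italy
David,18,Spain
Sarah,21,Germany
Dan,42,Germany
Brian,50,England
Lewis,37,France
Ethan,12,France
EOF

# FS controls how input lines are split; OFS controls how print joins fields.
result=$(awk 'BEGIN {FS = ","; OFS = ","} {print $1, $3}' input.csv)
printf '%s\n' "$result"
```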

Example 9: By defining a function, add a new column showing whether the person is younger or older than 20

In this example we show how to create and use functions. We also put our code in a file named prog.awk instead of passing it as a command-line argument, and we write the output to a file named output.txt.

prog.awk:

# Returns "younger" if the person is under 20, otherwise "older"
function age_func(age) {
    if (age < 20) {
        return "younger"
    }
    return "older"
}
{print $0 " " age_func($2)}

Command:

$ awk -f prog.awk input.txt > output.txt

output.txt:

John 23 Italy older
David 18 Spain younger
Sarah 21 Germany older
Dan 42 Germany older
Brian 50 England older
Lewis 37 France older
Ethan 12 France younger

Awk also has a number of built-in functions, such as length, substr, and toupper, which are documented in the GNU Awk User’s Guide.
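
As a small illustration of built-in functions, the sketch below uses two standard string functions, toupper and length, to print each name in uppercase together with its number of characters. The choice of these two functions is just for demonstration, and the heredoc recreates the article's input.txt.

```shell
# Recreate the sample file used throughout the article.
cat > input.txt <<'EOF'
John 23 Italy
David 18 Spain
Sarah 21 Germany
Dan 42 Germany
Brian 50 England
Lewis 37 France
Ethan 12 France
EOF

# toupper() returns an uppercase copy; length() returns the string length.
result=$(awk '{print toupper($1), length($1)}' input.txt)
printf '%s\n' "$result"
```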


Summary

In this article we walked through the Awk command’s workflow, and through a series of examples we saw that Awk is a powerful and flexible text-processing tool that can be used in many scenarios. For more detailed instructions, read The GNU Awk User’s Guide.

Thank you and happy coding!

