Photo by NATHAN MULLET on Unsplash

Bash for Data Scientists, Data Engineers & MLOps Engineers

The Comprehensive Guide to Bash Programming

Towards Data Science
18 min read · Jun 15, 2022


Introduction:

Sooner or later, data scientists, machine learning engineers, and data engineers need to learn bash programming. In this article, I will walk through the basics, concepts, and code snippets of bash programming. If you are familiar with Python or any other language, bash will be easy to pick up. The article focuses on how data scientists, data engineers, and ML engineers use bash. Let us dive in.

Image by the author

Contents:

  1. Bash Overview
  2. File Management
  3. Data Analysis
  4. Understanding Dockerfile bash commands
  5. Conclusion

Bash Overview:

Tools: I have used the following tools to create the diagrams and code snippets.

->Excalidraw
->Gitmind
->Gist
->Carbon.now.sh

Data Set: The data set used in this article is the Adult Data Set (1996) from the UCI Machine Learning Repository. The Adult dataset has census information and contains about 32,000 rows with 4 numerical columns.

Acknowledgment: Blake, C.L. and Merz, C.J. (1998). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

What is Bash Programming?

  • Bash is an acronym for “Bourne Again Shell,” developed in 1989.
  • It is used as the default login shell for most Linux distributions.
  • Data scientists use bash for preprocessing large datasets.
  • Data Engineers need to know bash scripting for interacting with Linux and creating data pipelines, etc.
  • Bash is used mostly on Linux and macOS. Windows uses Command Prompt and PowerShell to perform the same operations. You can now also install bash on Windows and perform the same operations as on Linux and macOS.

What are the Command-Line Interface (CLI) and the Shell?

A CLI is a program that allows users to type text commands instructing the computer to do specific tasks. The shell is a user interface responsible for processing all commands typed on the CLI. The shell reads the commands, interprets them, and makes the OS perform the requested tasks.

Please check out this question for more details about CLI, Shell, and OS.

Image by the author

If you don’t have a Linux machine then you can try the following

Image by the author
  • Use Parallels or VMware to install Linux on your Mac or Windows machine. Check this article for detailed instructions.
  • If you have Docker, you can run a Linux container. Please check this article on how to use a Linux container.
  • The easy way is to use one of the cloud providers: AWS, GCP, or Azure.
Image by the author
  • Check the prompt: on Linux and macOS the prompt is $, and on Windows it is >.
Image by the author

Simple Linux Commands to Start with:

Here is a list of some basic and simple commands to use in Linux

Image by the author

For example, find the available shells in the system:

$ cat /etc/shells
Image by the author

My first script-Hello World:

Image by the author

Steps:

  • Create a directory bash_script (mkdir bash_script)
    * Create a file hello_world.sh (touch hello_world.sh)
    * Open the file hello_world.sh
    * Enter the shebang line
    * Enter the command: echo 'Hello World'
    * Save the file
    * Go to the terminal
    * Before executing, make the file executable: chmod +x hello_world.sh
    * Execute the file: ./hello_world.sh
Image by the author
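The steps above can be sketched as one terminal session. This is a minimal sketch, not the only way to do it: it works in a scratch directory created with mktemp (my assumption, so nothing is left in your home folder), and it writes the script with printf instead of opening an editor.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the example leaves no clutter behind.
workdir=$(mktemp -d)
mkdir "$workdir/bash_script"
cd "$workdir/bash_script"

# Write the two-line script: the shebang line, then the echo command.
printf '%s\n' '#!/bin/bash' "echo 'Hello World'" > hello_world.sh

# Make the file executable, then run it and keep what it printed.
chmod +x hello_world.sh
greeting=$(./hello_world.sh)
echo "$greeting"
```

If you skip the chmod step, the shell refuses to run ./hello_world.sh with a "Permission denied" error.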

What is Shebang Line?

Image by the author

Let's see the special characters used in bash :

We will understand the following special characters in this article.

Image by the author

Man command:

The man command in Linux displays the user manual of any command that we can run in the terminal. It provides a detailed view of the command, including NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXIT STATUS, RETURN VALUES, ERRORS, FILES, VERSIONS, EXAMPLES, AUTHORS and SEE ALSO.

For example, man ls shows the help for the ls command, as in the output below.

Image by the author

Bash command Structure:

command [options] [arguments]

For example:

$ ls [options] [file name or directory]
$ ls -lah ~

To check all the commands available in bash:

* find the directory, for example /usr/bin
* change to that directory: cd /usr/bin
* then use the ls command: ls -la

Image by the author

Then it lists all the available commands

Image by the author

You can use the man command to check the info.

Important 40 Commands you should know:

The below list contains the most important commands you should know if you work as a data engineer or data scientist. We can use the below commands later in this article.

Image by the author

What is Shell Piping?

Image by the author

A pipe connects the standard output of one command to the standard input of the next command. A pipeline can contain a single pipe or several.

Please check this answer on StackOverflow to understand more about pipes.

For example

1. We have a file with user names. Use the cat command to view the contents of the file.
2. Sort the file with the sort command.
3. Remove all the duplicates with the uniq command. uniq only removes adjacent duplicate lines, which is why the file is sorted first.

The standard output of cat is passed as input to sort, and the standard output of sort is passed as input to uniq. All are connected by the pipe operator (|).

cat user_names.txt | sort | uniq
Image by the author
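To try the pipeline without an existing file, here is a minimal sketch that first creates a small user_names.txt. The names in it are made up for illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# A made-up list of user names with duplicates.
printf '%s\n' maria alex maria chen alex > "$workdir/user_names.txt"

# cat feeds sort, and sort feeds uniq.
unique_users=$(cat "$workdir/user_names.txt" | sort | uniq)
echo "$unique_users"

# A pipeline can be extended: count the unique names with wc -l.
user_count=$(sort "$workdir/user_names.txt" | uniq | wc -l)
echo "unique users: $user_count"
```

Each extra stage is added with another |, so the same pattern scales from two commands to many.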

We will be using piping in later scripts.

Image by the author

What is Redirect?

  • The > is the redirect operator. It takes the output of the preceding command and writes it to a file. For example:
echo "This is an example for redirect" > file1.txt
Image by the author

Truncate Vs Append:

# In the below example the first line is replaced by the second line
$ echo "This is the first line of the file" > file1.txt
$ echo "This is the second line of the file" > file1.txt
# If you want to append the second line then use >>
$ echo "This is the first line of the file" > file1.txt
$ echo "This is the second line of the file" >> file1.txt

You can also redirect in the other direction with <, which feeds a file to a command's standard input:

# Redirect works both ways
$ echo "redirect works both ways" > my_file.txt
$ cat < my_file.txt
redirect works both ways
# which is equivalent to
$ cat my_file.txt
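To see the difference between truncating (>) and appending (>>) for yourself, the following sketch counts the lines in the file after each variant. The scratch directory and the variable names are my own choices for the illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Two writes with > : the second write replaces the first.
echo "This is the first line of the file" > file1.txt
echo "This is the second line of the file" > file1.txt
truncated_lines=$(wc -l < file1.txt)

# One write with > and one with >> : the second line is added at the end.
echo "This is the first line of the file" > file1.txt
echo "This is the second line of the file" >> file1.txt
appended_lines=$(wc -l < file1.txt)

echo "lines after > twice: $truncated_lines"
echo "lines after > then >>: $appended_lines"
```

The < in wc -l < file1.txt is the input redirect: wc reads the file from standard input, so it prints only the count without the file name.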

Bash Variables:

  • In bash, you don’t have to declare the type of the variable like string or integer, etc. It is similar to python.
  • Local variable: Local variables are declared at the command prompt. They are available only in the current shell and are not accessible by child processes or programs. All user-defined variables are local unless they are exported.
ev_car='Tesla'
# To access the local variable, use echo
echo 'The ev car I like is' $ev_car
  • Environment variables: The export command is used to create an environment variable. Environment variables are available to child processes.
export ev_car='Tesla'
# To access the environment variable, use echo
echo 'The ev car I like is' $ev_car
Image by the author
  • There must be no space around = when you assign the value
my_ev= 'Tesla' # Wrong: space after =
my_ev='Tesla'  # Correct: no space
Image by the author
  • It is best practice to use lowercase to declare local variables and uppercase to declare environment variables.
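The difference between a local variable and an environment variable shows up as soon as a child process is started. The sketch below starts a child shell with bash -c and asks it to print each variable; the ${name:-unset} expansion, which prints the word unset when the variable is empty or missing, is my addition for the demonstration.

```shell
#!/usr/bin/env bash
# Start from a clean slate in case these names already exist.
unset ev_car EV_CAR

# Local variable: visible only in the current shell.
ev_car='Tesla'
child_sees_local=$(bash -c 'echo "${ev_car:-unset}"')

# Environment variable: export makes it visible to child processes.
export EV_CAR='Tesla'
child_sees_env=$(bash -c 'echo "${EV_CAR:-unset}"')

echo "child sees ev_car: $child_sees_local"
echo "child sees EV_CAR: $child_sees_env"
```

This is why settings that other programs need, such as PATH, are exported rather than just assigned.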

What is apt-get:

  • apt-get is a command-line tool for interacting with the packaging system on Debian-based distributions such as Ubuntu.
  • It is part of APT (Advanced Package Tool), which installs, upgrades, and removes packages.
  • Some popular package managers include apt, pacman, yum, and Portage.

Let’s see how we can install, upgrade, update and remove the packages.

Image by the author

&& and ||:

  • && is the logical AND operator.
$ command_one && command_two

command_two is executed only when command_one succeeds. If the first command errors out, the second command is not executed.

For example, you want to do the following steps

$ cd my_dir         # change to the directory my_dir
$ touch my_file.txt # now create a file my_file.txt

In the above case, if no directory named my_dir exists, the first command fails, but the second command still runs and creates my_file.txt in the current directory instead.

Now you can combine both with the AND operator

$ cd my_dir && touch my_file.txt

In this case, my_file.txt will be created only if the first command is successful. You can check the exit status of the last command with echo $?. If it is 0, the command succeeded; if it is non-zero, the command failed.

Image by the author

Check this StackOverflow discussion on the && operator.

  • || is the logical OR operator.
  • In the below example, mkdir my_dir is executed only when the first command fails: if my_dir does not exist, create it.
$ cd my_dir || mkdir my_dir
Image by the author

For example, combining && and ||

cd my_dir && pwd || echo "No such directory exists. Check"
  • If my_dir exists, the current working directory is printed. If my_dir doesn't exist, the message “No such directory exists. Check” is printed.
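The behaviour of &&, || and $? can be checked in a scratch directory. This is a minimal sketch; the directory names and the result variables are placeholders I chose for the demonstration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# my_dir does not exist yet, so cd fails and || runs the fallback.
cd my_dir 2>/dev/null || mkdir my_dir
[ -d my_dir ] && dir_created=yes

# Now cd succeeds, so && runs pwd and the || branch is skipped.
result=$(cd my_dir && pwd || echo "No such directory exists. Check")

# A missing directory takes the || branch instead.
missing=$(cd no_such_dir 2>/dev/null && pwd || echo "No such directory exists. Check")

# $? holds the exit status of the last command: 0 means success.
ls my_dir > /dev/null
status_ok=$?

# For a failing command, capture $? on the || branch.
status_fail=0
ls no_such_dir > /dev/null 2>&1 || status_fail=$?

echo "$result"
echo "$missing"
echo "exit status: ok=$status_ok fail=$status_fail"
```

One caveat with the a && b || c pattern: c also runs when a succeeds but b fails, so it is not a full replacement for if/else.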

File Management:

Some basics

Image by the author

Let's see a few examples.

  1. To display all the files, use ls
Image by the author

To display the most recently modified files: -l is the long listing format, -t sorts by modification time (newest first), and head keeps the first 10 lines of the output.

ls -lt | head

To display the files sorted by file size.

$ ls -l -S

The options available for ls are

Image by the author

2. Create/Remove a directory:

Image by the author

3. Create/Remove a File:

Image by the author

4. To display the content of the file:

Image by the author

Head & Tail: To display the first few or last few lines of a file, use head or tail. The option -n sets the number of lines to be printed.

$ head -n5 adult_t.csv
$ tail -n5 adult_t.csv
Image by the author

CAT:

# Concatenate the files into one file
cat file1 file2 file3 > file_all
# Concatenate the files and pipe them to another command
cat file1 file2 file3 | grep error
# Print the contents of the file
cat my_file.txt
# Write the matching lines to a file
cat file1 file2 file3 | grep error > error_file.txt
# Append the matching lines to the end of the file
cat file1 file2 file3 | grep error >> error_file.txt
# cat can also read from standard input
cat < my_file.txt
# which is the same as
cat my_file.txt
Image by the author

TAC: tac is the reverse of cat: it prints the lines of the file in reverse order, last line first. Please check the below screenshot for details.

tac my_file.txt
Image by the author
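A quick way to compare cat and tac is to run both on a three-line file. The file contents here are made up for the illustration, and tac is assumed to be available (it ships with GNU coreutils on Linux).

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf '%s\n' first second third > "$workdir/my_file.txt"

# cat prints the lines top to bottom.
forward=$(cat "$workdir/my_file.txt")

# tac prints the same lines bottom to top.
backward=$(tac "$workdir/my_file.txt")

echo "$forward"
echo "---"
echo "$backward"
```

tac is handy for log files, where the newest entries are at the bottom.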

Less: If the text file is large, use less instead of cat. less shows one page at a time, whereas cat prints the whole file to the terminal at once.

less my_file.txt

Grep:

  • grep stands for “global regular expression print”. It is used to search for specific patterns in a file or in program output.
  • grep is a powerful command and is used heavily. Please check the below examples
Image by the author
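Here is a minimal sketch of some commonly used grep options, run against a small made-up log file. The file name app.log and its contents are hypothetical and only serve the demonstration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# A made-up log file to search in.
cat > app.log <<'EOF'
INFO  job started
ERROR disk full
info  retrying
ERROR timeout
INFO  job finished
EOF

# Print the lines that contain ERROR.
grep "ERROR" app.log

# -c counts the matching lines instead of printing them.
error_count=$(grep -c "ERROR" app.log)

# -i ignores case, so INFO and info both match.
info_any_case=$(grep -ci "info" app.log)

# -v inverts the match: lines that do NOT contain ERROR.
non_errors=$(grep -vc "ERROR" app.log)

# -n prefixes each match with its line number.
first_error_line=$(grep -n "ERROR" app.log | head -n 1)

echo "errors=$error_count info=$info_any_case non_errors=$non_errors"
echo "$first_error_line"
```

The options can be combined, as in grep -ci above, and grep reads from a pipe just as well as from a file.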

5. Move files:

#move single file
$ mv my_file.txt /tmp
#move multiple files
$ mv file1 file2 file3 /tmp
#you can also move a directory or multiple directories
$ mv d1 d2 d3 /tmp
#Also you can rename the file using move command
$ mv my_file1.txt my_file_newname.txt
Image by the author

6. Copy Files:

# Copy my_file.txt from /path/to/source/ to /path/to/target/folder/
$ cp /path/to/source/my_file.txt /path/to/target/folder/
# Copy my_file.txt from /path/to/source/ to /path/to/target/folder/ into a file called my_target.txt
$ cp /path/to/source/my_file.txt /path/to/target/folder/my_target.txt
# Copy my_folder to target_folder
$ cp -r /path/to/my_folder /path/to/target_folder
# Copy multiple directories: directories d1, d2 and d3 are copied to /tmp
$ cp -r d1 d2 d3 /tmp
Image by the author

7. Gzip/Tar :

Image by the author

Gzip Format

Image by the author

Tar format:

Image by the author
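As a hedged sketch of typical gzip and tar usage, the following compresses a single file with gzip and then packs a directory into a compressed tar archive. The file and directory names are placeholders I chose, and this is only one common combination of options.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'a,b\n1,2\n' > data.csv
cp data.csv original.csv        # keep a copy to compare against later

# gzip replaces data.csv with the compressed file data.csv.gz.
gzip data.csv
[ -f data.csv.gz ] && gz_created=yes

# gunzip restores the original data.csv.
gunzip data.csv.gz

# tar bundles a whole directory: c=create, z=compress with gzip, f=archive name.
mkdir reports
cp data.csv reports/
tar -czf reports.tar.gz reports

# t lists the archive contents without extracting.
archive_listing=$(tar -tzf reports.tar.gz)
echo "$archive_listing"

# x extracts; -C chooses the directory to extract into.
mkdir restore
tar -xzf reports.tar.gz -C restore
```

gzip works on single files, while tar collects many files and directories into one archive, which is why the two are usually combined.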

8. Locate and Find:

  • The find command searches for files or directories in real time by walking the file system. It is slower than locate.
  • It matches on a pattern, for example all *.sh files in the current directory.
Image by the author
# Find by name
$ find . -name "my_file.csv"
# Wildcard search
$ find . -name "*.jpg"
# Find all the files in a folder
$ find /temp
# Search only files
$ find /temp -type f
# Search only directories
$ find /temp -type d
# Find files modified in the last 3 hours
$ find . -mmin -180
# Find files modified in the last 2 days
$ find . -mtime -2
# Find files not modified in the last 2 days
$ find . -mtime +2
# Find files larger than 10 MB
$ find . -type f -size +10M
  • locate is much faster because it searches a pre-built database instead of scanning the file system in real time. It is used to find the locations of files and directories.
  • If the locate command is not available, install it first. Check your Linux distribution and install it
$ sudo apt update          # Debian/Ubuntu: refresh the package index
$ sudo apt install mlocate # Debian/Ubuntu: install locate

The database is usually refreshed automatically once a day. To find files created since the last refresh, update the database manually before using locate.

$ sudo updatedb
Image by the author
# To find all the csv files.
$ locate .csv

Check this article for installing the locate utility.
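The find options listed earlier in this section can be tried safely on a scratch directory. In this minimal sketch the directory layout and file names are made up for the illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# A small made-up project tree to search in.
mkdir -p project/data project/scripts
touch project/data/a.csv project/data/b.csv project/scripts/run.sh

# Only regular files whose name ends in .csv (sorted for stable output).
csv_files=$(find project -type f -name "*.csv" | sort)
echo "$csv_files"

# Only directories: project, project/data and project/scripts.
dir_count=$(find project -type d | wc -l)

# Files modified in the last 5 minutes (all three were just created).
recent_count=$(find project -type f -mmin -5 | wc -l)

echo "directories=$dir_count recent_files=$recent_count"
```

Quoting the pattern ("*.csv") matters: without the quotes the shell may expand the wildcard before find sees it.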

9. Split a File: If you have a large file, you might need to split it into smaller chunks. For splitting the file, you can use the split command.

Image by the author
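As one possible illustration of splitting a file, the sketch below uses split -l to cut a file into pieces of 1000 lines each. The input file, the chunk size and the chunk_ prefix are all assumptions made for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# A made-up "large" file with 2500 lines.
seq 1 2500 > big_file.txt

# -l 1000 puts at most 1000 lines in each piece; pieces are named
# chunk_aa, chunk_ab, chunk_ac, ...
split -l 1000 big_file.txt chunk_

chunk_count=$(ls chunk_* | wc -l)
last_chunk_lines=$(wc -l < chunk_ac)

# Concatenating the pieces in order gives back the original content.
rejoined_lines=$(cat chunk_* | wc -l)

echo "chunks=$chunk_count last=$last_chunk_lines rejoined=$rejoined_lines"
```

split can also cut by size instead of line count (the -b option), which is useful when a tool has an upload limit.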

Data Analysis:

  • I am using the below dataset for doing the EDA.
  • The data set is an Adult dataset from UCI.
  • This dataset is also known as the “Census Income” dataset.
  • Let’s try to do some EDA.
  • I chose the training dataset for the EDA.
  • The file name is adult_t.csv
  1. Check the first few lines of the dataset-use the head command.
head adult_t.csv
Image by the author

The output is not pretty. You can install csvkit (for example, with pip install csvkit) to work with CSV files more comfortably. Please check the documentation for more info.

2. To check the names of the columns

csvcut -n adult_t.csv
Image by the author

3. To check only a few columns

csvcut -c 2,5,6 adult_t.csv
#To check by column names
csvcut -c Workclass,Race,Sex adult_t.csv

4. You can use the pipe command and check the first few rows of the selected columns

csvcut -c Age,Race,Sex adult_t.csv| csvlook | head
Image by the author

5. To check the bottom records then use tail

csvcut -c Age,Race,Sex adult_t.csv| csvlook | tail

6. Use grep to find a pattern. The grep command prints the lines that match the pattern. Here I want to select everyone who has a doctorate, is a husband, and whose race is Black. Check the documentation for more information.

grep -i "Doctorate" adult_t.csv | grep -i "Husband" | grep -i "Black" | csvlook
# -i, --ignore-case: ignore case distinctions, so that characters that differ only in case match each other.
Image by the author

7. Check how many people in the dataset have completed a Ph.D. Use grep to search for Doctorate and then count the matching lines with wc -l (wc stands for word count; -l counts lines). 413 people in the dataset have a Ph.D.

grep -i "Doctorate" adult_t.csv | wc -l
Image by the author

8. Stats of the data: To find the statistics, use csvstat, which is similar to summary(). Here I am trying to find the stats for the Age, Education, and Hours/Week columns. For example, for Age it gives the data type, whether the column contains null values, the number of unique values, the smallest value, etc. Please refer to the below screenshot.

csvcut -c Age,Education,Hours/Week adult_t.csv | csvstat
Image by the author

9. Use Sort and Unique: Sort the file, keep only the unique records, and write them to a new file called sorted_list.csv. The cat command reads the whole file, sort orders the lines, uniq collapses the duplicate lines, and the result is redirected to sorted_list.csv. The -c option additionally prefixes each line with the number of times it occurred.

cat adult_t.csv | sort | uniq -c > sorted_list.csv
Image by the author

The file is

Image by the author

9. Merging files: In a lot of cases, you need to merge 2 files. You can use the csvjoin. This is very useful.

Image by the author
csvjoin -c cf data1.csv data2.csv > joined.csv
# cf is the common column between the 2 files.
# data1.csv: first file name
# data2.csv: second file name
# use the redirect '>' to write the result to a new file called joined.csv

Check out the csvkit documentation on merging multiple CSV files.

10. Find the difference between 2 files:

Image by the author

If you want to find the difference between 2 files, use the diff command. For example, file1 and file2 both contain customer IDs. To find the customers that are in file1 but not in file2, run diff file1 file2. In the output, lines prefixed with < appear only in file1, and lines prefixed with > appear only in file2.
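The customer ID comparison can be sketched as follows. The IDs are made up, and the grep and cut steps that pull the IDs out of the diff output are my addition for the illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up lists of customer IDs.
printf '%s\n' 101 102 103 104 > file1
printf '%s\n' 101 103 105 > file2

# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts that run with "set -e" from stopping here.
diff_output=$(diff file1 file2 || true)
echo "$diff_output"

# Lines starting with "<" exist only in file1, ">" only in file2.
# cut -c3- drops the two-character marker in front of each ID.
only_in_file1=$(echo "$diff_output" | grep '^<' | cut -c3-)
only_in_file2=$(echo "$diff_output" | grep '^>' | cut -c3-)

echo "only in file1: $only_in_file1"
echo "only in file2: $only_in_file2"
```

diff compares line by line, so for ID lists it gives the cleanest result when both files are sorted the same way.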

11. AWK: We can also use AWK. AWK stands for Aho, Weinberger, Kernighan (its authors). AWK is a scripting language used for text processing. Please check out the documentation for more information.

For example, if you want to print columns 1 and 2 and the first few records.

Image by the author

Here $1 and $2 are columns 1 and column 2. The output is

Image by the author
# $0 is the whole current line, so print $0 prints the whole file line by line.

Find the rows where hours/week > 98.

awk -F, '$13 > 98' adult_t.csv | head
Image by the author

Print the first 3 columns for everyone who has a Ph.D.

awk -F, '/Doctorate/ {print $1, $2, $3}' adult_t.csv
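The AWK patterns from this section can be tried on a tiny file before running them on the full dataset. The sketch below uses a made-up four-column sample (not the real adult_t.csv), so the hours column is $4 here rather than $13.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample rows: age, workclass, education, hours per week.
cat > sample.csv <<'EOF'
39,State-gov,Bachelors,40
50,Private,Doctorate,99
38,Private,HS-grad,45
EOF

# -F, sets the field separator to a comma; $1 and $2 are columns 1 and 2.
first_two=$(awk -F, '{print $1, $2}' sample.csv | head -n 2)

# A condition without an action prints every line where it is true.
long_hours=$(awk -F, '$4 > 98' sample.csv)

# A /regex/ pattern selects lines; the action prints three columns.
doctorates=$(awk -F, '/Doctorate/ {print $1, $2, $3}' sample.csv)

echo "$first_two"
echo "$long_hours"
echo "$doctorates"
```

Without -F, AWK splits on whitespace, which gives surprising columns for comma-separated data.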

12. SED: SED stands for Stream Editor. It is used for filtering and transforming text. SED works on input streams or text files. We can write the SED output to a file.

Let's see some examples.

  • Print in SED
sed '' adult_t.csv
# or use cat and a pipe
cat adult_t.csv | sed ''
  • Substitute a text: For example, in the file I want to replace Doctorate with Phd.
sed 's/Doctorate/Phd/g' adult_t.csv
# if you want to store the transformation in a new file
sed 's/Doctorate/Phd/g' adult_t.csv > new_phd.csv
# g stands for global: replace every match on each line, not just the first one

For more information about SED, please check the documentation.
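The effect of the substitution and of the g flag can be checked on a small made-up sample. The file sample.csv and the a-b-c string are placeholders for the illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' '50,Private,Doctorate' '38,Private,HS-grad' > sample.csv

# Replace Doctorate with Phd and print the result.
replaced=$(sed 's/Doctorate/Phd/g' sample.csv)
echo "$replaced"

# sed does not change the input file; redirect to keep the result.
sed 's/Doctorate/Phd/g' sample.csv > new_phd.csv

# Without g only the first match on each line is replaced.
first_only=$(echo 'a-b-c' | sed 's/-/_/')
every_match=$(echo 'a-b-c' | sed 's/-/_/g')
echo "$first_only $every_match"
```

Because the original file is left untouched, it is easy to compare the input and the output before overwriting anything.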

13. Transformation:

You can use the tr command. The most commonly used is converting to uppercase or lowercase. The below converts from uppercase to lowercase and again from lowercase to uppercase.

Image by the author
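The case conversion described above can be sketched with tr and the POSIX character classes [:lower:] and [:upper:]. The sample text is made up for the illustration.

```shell
#!/usr/bin/env bash
# Lowercase to uppercase.
upper=$(echo 'bash for data science' | tr '[:lower:]' '[:upper:]')

# Uppercase back to lowercase.
lower=$(echo "$upper" | tr '[:upper:]' '[:lower:]')

echo "$upper"
echo "$lower"
```

tr only reads from standard input, so it is always used with a pipe or an input redirect.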

Another example:

# replace spaces with underscores
$ echo 'Demo Transformation!' | tr ' ' '_'
Demo_Transformation!

Please check this StackOverflow discussion on transformation.

14. Curl:

According to the documentation, curl is a tool for transferring data from or to a server. It supports these protocols: DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET or TFTP. The command is designed to work without user interaction.

The syntax is

curl [options / URLs]

Check the documentation for more information on curl. Curl is a useful command if you want to fetch data from the web.

15. csvsql:

csvsql is used to generate SQL statements for a CSV file or execute those statements directly on a database. It also helps in creating the database table and querying the table. This helps a lot in the transformation of the data.

For example

csvsql --query "select county,item_name from joined where quantity > 5;" joined.csv | csvlook

We are using csvsql; the query selects the county and item_name fields from the joined.csv file where the quantity is greater than 5 and then outputs the result to the screen through csvlook. But querying a CSV file is slow compared to querying a SQL table. The entire CSV file is loaded into memory, so a large dataset will affect performance.

For more information on how to run a SQL query on the CSV file, please refer to the following documentation of the csvkit.

16. Truncating, and filtering the CSV columns: You can use the csvcut to select only the columns needed from the CSV file.

For example

Image by the author

Now you can select the columns you want and then write them to the file. For example, I need Age, Race, and Sex. I select the 3 columns and then write them to a new file called the csvcut file.

Image by the author

Dockerfile Analysis:

Let's check a Dockerfile and understand the bash commands used.

The docker file is available at this location.

Image by the author
  1. From:
Image by the author

2. apt-get and system utilities:

The code is

Image by the author
  1. The Dockerfile RUN instruction executes command-line programs inside the Docker image. RUN instructions are executed while the Docker image is being built.
  2. apt-get update refreshes the package index. Here the && logical AND operator is used: if the update is successful, then apt-get install is executed. (command 1 && command 2 → command 2 is executed only if command 1 is successful.)
  3. curl: curl (short for “Client URL”) is a command-line tool that enables data transfer over various network protocols. Here it is downloading apt-utils, apt-transport, etc.
  4. rm -rf /var/lib/apt/lists/*: rm stands for remove, and this removes all the files from /var/lib/apt/lists/. The rm command is executed only if the preceding command is successful, because the && logical AND operator is used.

3. Curl

Image by the author
  • RUN again, this time with the curl command. The URL is given and curl downloads the microsoft.asc file.
  • Then the | (pipe) operator is used: apt-key add reads the key from microsoft.asc through the pipe and adds it.

4. Environment Variable:

Image by the author
  • First, install the updates using apt-get update.
  • Once the update is successful (&& AND operator), set the environment variable ACCEPT_EULA=Y.
  • Then call the apt-get install and install all the required packages.

5. RUN again:

Image by the author
  • We can see a pattern here. There is an apt-get update first, which is chained with apt-get install. Once the update is done and successful, the installation happens.
  • \ is the escape character. At the end of a line it continues the command on the next line.
  • --no-install-recommends: by default, apt-get install installs recommended packages but not suggested packages. With --no-install-recommends, only the main dependencies (packages in the Depends field) are installed.
  • rm -rf removes files and directories. -f forces the removal of all files or directories, and -r removes directories and their content recursively. The removal happens only after the installation completes successfully, because the && logical AND operator is used.
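The same chaining pattern (a backslash at the end of each line, && between the steps, and rm -rf as the last step) can be tried outside Docker. This is a minimal sketch that uses harmless stand-in commands on a scratch directory instead of apt-get; the directory and file names are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# One logical command spread over several lines with "\".
# Each step runs only if the step before it succeeded.
mkdir -p "$workdir/lists" \
    && touch "$workdir/lists/index_a" "$workdir/lists/index_b" \
    && rm -rf "$workdir/lists"/* \
    && chain_finished=yes

# rm -rf emptied the directory but left the directory itself in place.
remaining_files=$(ls "$workdir/lists" | wc -l)

# If an early step fails, everything chained after it is skipped.
later_step_ran=no
cd "$workdir/does_not_exist" 2>/dev/null \
    && later_step_ran=yes

echo "finished=$chain_finished remaining=$remaining_files later_step_ran=$later_step_ran"
```

In a Dockerfile this style keeps the install and the cleanup in a single RUN instruction, so the cleanup happens in the same build step as the install.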

6. Install locales and Echo:

Image by the author
  • apt-get install locales: what are locales? Locales customize programs to your language and country, and they are installed based on your language and country.
  • After the successful installation of locales, the echo command is executed.
  • After the echo, locale-gen is executed. Check this for more information on locales and locale-gen.
  • The main point is that apt-get, echo, and locale-gen are all chained together with the && logical AND operator.

7. Echo and Append:

Image by the author
  • pecl is used to install PHP drivers.
  • The echo command here uses >> to append to the file /etc/php/7.0/cli/php.ini.

We saw how echo, &&, >>, \, rm -rf, environment variables, the pipe operator, etc. are used in the Dockerfile. So knowing bash is very useful for creating and understanding Dockerfiles. MLOps engineers and data engineers create Dockerfiles often.

Conclusion:

Thank you for reading my article about bash and Linux. Bash helps automate a lot of manual tasks and can also be used in data science activities like data preprocessing and data exploration. The CLI is very fast and easy to learn. Please feel free to connect on LinkedIn.

Reference:

  1. Acknowledgment, Data Set: The Adult dataset, from the UCI Machine Learning Repository, which has census information. Blake, C.L., and Merz, C.J. (1998). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  2. BibTeX: @misc{C.J:1998,
    author = "Blake, C.L. and Merz, C.J.",
    year = "1998",
    title = "{UCI} Machine Learning Repository",
    url = "http://archive.ics.uci.edu/ml",
    institution = "University of California, Irvine, School of Information and Computer Sciences" }
  3. AWK Documentation: https://www.gnu.org/software/gawk/manual/gawk.html
  4. SED Documentation: https://www.gnu.org/software/sed/manual/sed.html
  5. Data Science at the Command Line: https://github.com/jeroenjanssens/data-science-at-the-command-line
  6. csvkit: https://csvkit.readthedocs.io/en/latest/
  7. Dockerfile: https://docs.docker.com/engine/reference/builder/
  8. Linux: https://docs.kernel.org/


ML/DS - Certified GCP Professional Machine Learning Engineer, Certified AWS Professional Machine Learning Specialty, Certified GCP Professional Data Engineer.