Linux Shell Commands can be your time saver

When it comes to file parsing or data preprocessing, what would be the first programming language that comes to your mind?

It might be Python, R, or another scripting language. Granted, these modern, high-level languages are very powerful and usually let us achieve our goals in fewer than a few dozen lines of code. However, Linux Shell commands seem to be a forgotten pearl, because their syntax is relatively old and the tutorials available online are less intuitive.

In this article, I am going to give you a flavor of how powerful Shell commands can be in certain cases and, more importantly, how easily you can learn them and adopt them in your own day-to-day work. I will focus on one specific utility, awk, in this post. If you find these examples useful, I would like to refer you to my GitHub pages for more examples using other Linux utilities.

Without further ado, let’s get started!

I have a CSV file. How do I change the delimiter to tab?

It is quite common that the input file for a certain program needs to be a .tsv file, or a file delimited by tabs, whereas we only have a .csv file exported from Microsoft Excel (illustrated in the figures below).

Figure: CSV file we have

You just need to type one line of code in the terminal:

awk '{$1=$1}1' FS="," OFS="\t" file4.csv > file4.txt

Now we have:

Figure: the tab-delimited file we want

Let’s hold off on the explanation for a second. Even if you don’t know why it works, you can type this command in your Mac terminal or in a Unix terminal on Windows, and you will never have to worry about this type of conversion task again.

Now let’s delve a bit into the syntax itself. awk is a powerful text-processing utility in the Linux Shell environment. What this command does is:

{$1=$1}1 : Assigning $1 to itself forces awk to rebuild the line using the new output delimiter; the trailing 1 is an always-true condition that makes awk print the rebuilt line (it’s OK if you don’t understand the details)

FS="," : Tell awk the current delimiter is ,

OFS="\t" : Tell awk the wanted delimiter is \t

Then just specify your input file and the output file path where awk will write the result, and you are done!
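To try this without an Excel export at hand, here is a minimal sketch on a made-up three-line file (sample.csv and its contents are assumptions for illustration, not the file shown in the figure). It also shows what happens when the $1=$1 assignment is left out: awk prints every line unchanged, because it only applies OFS when a line is rebuilt.

```shell
# Hypothetical sample data, created in a temporary directory
workdir=$(mktemp -d)
printf 'name,age,city\nAnn,30,Paris\nBob,25,Rome\n' > "$workdir/sample.csv"

# Same command as above: $1=$1 forces awk to rebuild each line with OFS
awk '{$1=$1}1' FS="," OFS="\t" "$workdir/sample.csv" > "$workdir/sample.tsv"

# Without the assignment the line is never rebuilt, so the commas stay
awk '1' FS="," OFS="\t" "$workdir/sample.csv" > "$workdir/unchanged.txt"

cat "$workdir/sample.tsv"
```

Note that this splits on every comma, including commas inside quoted fields, so it is only suitable for simple CSV files without embedded commas.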

I admit it is not hard to do in Python or R, but think about how convenient it is to type one command in your terminal instead of opening your Python IDE, reading the file, and rewriting it with another delimiter. What’s cool about Shell commands is that they are usually pre-installed on your machine, so you don’t need to worry about setting up an environment or installing packages (the Python pandas package, for instance). To conclude a bit here, learning Shell is not meant to replace any other language; it just gives you some very handy functions that can be a time saver in some cases, so why not use them?

How to make my horizontal outputs vertical

To illustrate the problem, let’s say you have the following file:

Figure: original horizontal input

In order to make them vertically displayed, you only need one line of code:

awk '{for(i=1;i<=NF;i++){print $i}}' file5.txt > file5_new.txt

What does the file look like now?

Figure: new vertical output

What this awk command does is, for each line (in this input file, we only have one line), iterate over the columns, starting with index i=1, which is the first column, and ending with index i=NF. NF is a special variable in awk that holds the number of fields (columns) in the current line. Hence, the command simply says: for each item in the line, print it out one by one. Each item then occupies a whole line, which is exactly our goal here.

Then you may ask: what if I have a vertical output but would like it displayed horizontally? Can you reverse the operation? Sure!
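A quick way to see what NF holds is to print it next to the fields. The sketch below uses a made-up one-line file (its name and contents are assumptions for illustration).

```shell
# Hypothetical one-line input with three space-separated items
workdir=$(mktemp -d)
printf 'apple banana cherry\n' > "$workdir/row.txt"

# NF is the number of fields on the current line
field_count=$(awk '{print NF}' "$workdir/row.txt")

# Loop from field 1 to field NF, printing each field on its own line
awk '{for(i=1;i<=NF;i++){print $i}}' "$workdir/row.txt" > "$workdir/column.txt"

echo "fields: $field_count"
cat "$workdir/column.txt"
```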

cat file5_new.txt | tr '\n' ' ' | awk '{$1=$1}1' FS=" " OFS="\t" > file5_restore.txt

See the result:

Figure: restoring the original horizontal format

A bit of explanation of the command: we first read the file using cat, then we replace the newline character at the end of each line with a space using the tr command, which results in something like this:

Figure: intermediate result

Picking up from here, we use awk again to change the delimiter from space to tab (explained in the last section), and we are done!
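The round trip can be checked on a small made-up file (the file name and contents are assumptions for illustration). The sketch sets OFS with awk's -v option rather than after the program, which has the same effect here. It also shows paste -s, a standard utility that joins lines directly; this is an alternative to the pipeline above, not what the original command uses.

```shell
# Hypothetical vertical input, one item per line
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/column.txt"

# Same idea as the pipeline above: newlines -> spaces, then spaces -> tabs
tr '\n' ' ' < "$workdir/column.txt" | awk -v OFS='\t' '{$1=$1}1' > "$workdir/row.txt"

# Alternative: paste -s joins all lines of a file, -d sets the separator
paste -s -d '\t' "$workdir/column.txt" > "$workdir/row_paste.txt"

cat "$workdir/row.txt"
```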

A capstone task to understand awk

We have already gotten a flavor of how powerful awk is. There is a very classical task that can guide you through the basic syntax of an awk command.

Imagine you have a file like the one below:

Figure: input file

I want to know the sum of col3. Can you achieve it in one line of code? This is useful because, in real life, the input file may have 1 million rows instead of only the 4 rows in this toy example.

$ awk 'BEGIN{FS="\t";count=0}{if(NR>1){count+=$3}}END{print count}' file6.txt
171

Now let’s illustrate the mechanisms:

Figure: explanation of awk

awk processes the file line by line, but it executes the BEGIN block once before reading the first line and the END block once after the last line has been processed. This property makes it easy to compute a sum or a mean, because both are basically a running sum: we initialize a count variable, add a value to it on every line, and at the end either print the final count or divide it by the number of rows to get the mean. With that, you will find that a lot of tasks are manageable in the Linux Shell just by using awk.
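As a minimal sketch of the running-sum idea, the example below computes both the sum and the mean of the third column. The table and its values are made up for illustration and are not the data from the figure; the variable names sum and n are my own choices.

```shell
# Hypothetical tab-separated table with a header row
workdir=$(mktemp -d)
printf 'col1\tcol2\tcol3\na\tx\t10\nb\ty\t20\nc\tz\t30\nd\tw\t40\n' > "$workdir/table.txt"

# BEGIN runs once before the first line, END runs once after the last line.
# NR>1 skips the header row; n counts the data rows for the mean.
result=$(awk 'BEGIN{FS="\t"; sum=0; n=0}
              NR>1{sum+=$3; n++}
              END{if(n>0) print sum, sum/n}' "$workdir/table.txt")

echo "$result"   # sum and mean, separated by a space
```

For these sample values the output is 100 25.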

Why Linux and Shell are useful in general

One question that I am often asked is why we need to learn Shell commands, given that Python can solve most tasks in a more structured way. The answer is that Shell has its own unique advantages:

1. In Python, we work on variables: we manipulate dozens of variables in memory and get the desired outputs. But in Shell, we work on files, which allows you to automate the manipulation of thousands of files in several lines of code.

2. Shell commands allow you to perform cross-language tasks, or to glue several Python, R, or even Matlab scripts together as a meta-script.

3. In certain cases, Shell commands can be more convenient than other scripting languages (one line of code versus several lines).

4. Linux is prevalent on cloud computing platforms, high-performance supercomputers, and GPU servers, so you may well find yourself in a situation where no Python interpreter is available. Then Linux and Shell commands will be your only weapon to finish your tasks.
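The first point can be sketched with a short loop. The directory and file names below are made up for illustration; the loop applies the CSV-to-tab conversion from earlier in this post to every .csv file it finds.

```shell
# Hypothetical input files in a temporary directory
workdir=$(mktemp -d)
printf 'a,b\n1,2\n' > "$workdir/one.csv"
printf 'c,d\n3,4\n' > "$workdir/two.csv"

# Convert every .csv file in the directory to a tab-delimited .tsv file
for csv in "$workdir"/*.csv; do
    tsv="${csv%.csv}.tsv"          # same name, new extension
    awk '{$1=$1}1' FS="," OFS="\t" "$csv" > "$tsv"
done

ls "$workdir"
```

The same loop works unchanged whether the directory holds two files or two thousand.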