Unix tools for text file manipulation and exploration

This notebook collects examples of common Linux/Unix commands.

Objectives

  1. View and manipulate text and text files using fundamental Unix commands

  2. Combine fundamental Unix commands

Weird syntax ahead

To show the effect of each command without creating temporary files, the examples use pipes (|) and process substitution, that is, the syntax <( *command* ), which lets the output of a command be used where a file name is expected. If this looks obscure, you are invited to create the necessary temporary files instead.
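As a minimal sketch of this equivalence, the two variants below compare the same pair of command outputs with diff, first through temporary files and then through process substitution. It assumes a bash-compatible shell (process substitution is not part of plain POSIX sh); the scratch directory, the file names and the variable names are made up for illustration.

```shell
# Assumes bash (or zsh/ksh): <( ... ) is not available in plain POSIX sh.
workdir=$(mktemp -d)   # scratch directory; all names in it are illustrative

# Variant 1: temporary files.
printf 'b\na\nc\n' | sort > "$workdir/sorted.txt"
printf 'a\nb\nc\n' > "$workdir/expected.txt"
diff "$workdir/sorted.txt" "$workdir/expected.txt" && tmp_result=same

# Variant 2: process substitution. Each <( command ) behaves like a
# read-only file that holds the output of that command.
diff <(printf 'b\na\nc\n' | sort) <(printf 'a\nb\nc\n') && subst_result=same

echo "temporary files: $tmp_result, process substitution: $subst_result"
rm -r "$workdir"
```

Both variants should report no differences; the second one simply leaves nothing behind to clean up.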

To follow along

Please navigate to the examples/text_manipulation directory.

“Vertical” text manipulation

In this section, we have a look at commands that cut and sew together files “vertically”.

We are going to use a tab-separated values (TSV) file as an example:

cat example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x

Many tools understand the tab character as a column separator. We can slice the file vertically, that is, select a range of its lines, using the head and tail commands.

head -n 5 example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q

tail -n 4 example.tsv
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x

When a text file is large, the head and tail commands are a quick way to get an idea of its content. wc is also useful for exploring files:

wc example.tsv
 12  48 108 example.tsv

These numbers are, respectively, the number of lines, words and characters (more precisely, bytes) in example.tsv.
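Each of the three counts can also be requested on its own. The sketch below is only an illustration: it builds a small made-up sample file (a stand-in for example.tsv) and reads it from standard input, so that wc prints just the number without the file name. The variable names are arbitrary.

```shell
# Build a small tab-separated sample (a made-up stand-in for example.tsv).
sample=$(mktemp)
printf 'X\tY\na1\ta2\nn\t2\n' > "$sample"

# Reading from stdin makes wc print only the number, without a file name.
lines=$(wc -l < "$sample")   # number of newline characters
words=$(wc -w < "$sample")   # whitespace-separated words
bytes=$(wc -c < "$sample")   # bytes (same as characters for plain ASCII)

# Arithmetic expansion also strips the padding that some wc versions add.
echo "lines=$((lines)) words=$((words)) bytes=$((bytes))"
rm "$sample"
```

For this sample the counts are 3 lines, 6 words and 14 bytes.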

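Multiple files can be joined together vertically using the cat command. Here, tail -n +6 prints the file starting from line 6, so the two pieces together cover the whole file: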

cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
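The same building blocks can be combined to extract a range of lines from the middle of a file. The following is a sketch on a made-up six-line file; the file content and the variable names are assumptions chosen for illustration.

```shell
sample=$(mktemp)
printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > "$sample"

# tail -n +N starts printing at line N (here: everything except line 1).
without_first=$(tail -n +2 "$sample")

# Lines 3 to 4: keep the first 4 lines, then the last 2 of those.
middle=$(head -n 4 "$sample" | tail -n 2)

printf '%s\n' "$middle"
rm "$sample"
```

The tail -n +2 form is a common way to skip a header line before passing a table on to other tools.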

Break and concatenate

  1. Verify, using the diff command, that the output of the command above reproduces the original example.tsv.

    Hint

    You can save the output to a temporary file using >.

  2. What if you change the head command to take only 4 lines instead of 5?

  3. (Bonus points) Try again, but do not use temporary files.

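Another notable program that cuts files vertically is grep, which keeps only the lines that match a pattern. With the -E option, the pattern is interpreted as an extended regular expression. The pattern below matches every line that contains a, c or p anywhere, optionally followed by 1 or 2: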

grep -E '[acp][12]?' example.tsv
a1	a2	a3	a4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
p	2	n	o
p	1	n	o

Regular expressions are a very powerful tool to search, extract and replace text, implemented in many programming languages and supported by many tools in the shell.
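In the pattern above, the trailing [12]? is optional, so any line containing a, c or p is selected, including lines where p appears in the third column. Anchors make a pattern more selective. The sketch below uses a few made-up rows in the style of example.tsv; the variable names are illustrative.

```shell
sample=$(mktemp)
# A few tab-separated rows in the style of example.tsv (made up here).
printf 'a1\ta2\ta3\nn\t2\tp\np\t2\tn\nq\t2\to\n' > "$sample"

# Unanchored: any line containing a, c or p anywhere.
anywhere=$(grep -E '[acp]' "$sample")

# Anchored with ^: only lines whose first character is a, c or p.
at_start=$(grep -E '^[acp]' "$sample")

# -c counts the matching lines instead of printing them.
count=$(grep -c -E '^[acp]' "$sample")

printf '%s\n' "$at_start"
echo "lines starting with a, c or p: $count"
rm "$sample"
```

Here the unanchored pattern selects three lines, while the anchored one drops the row that only has p in its last column.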

Multiple grep commands can be combined with pipes, creating sophisticated filters. Here, the -v option inverts the match, discarding every line that contains a 2:

grep -E '[acp][12]?' example.tsv | grep -v '2'
n	1	p	q
p	1	n	o

View and search command history

  1. How would you print the last 10 entries in your command history into a text file?

  2. How many times have you used the ls command?

Another kind of vertical manipulation is done with the sort command. With -k2, lines are ordered by the text that runs from the second column to the end of the line:

tail -n 8 example.tsv | sort -k2
p	1	n	o
q	1	o	x
n	1	p	q
o	1	q	n
p	2	n	o
q	2	o	x
n	2	p	q
o	2	q	n
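To sort on the second column alone, the key can be closed with -k2,2, and a trailing n requests a numeric comparison. The sketch below assumes a small made-up table; the tab variable, used to pass a literal tab to -t, and the other names are illustrative.

```shell
sample=$(mktemp)
printf 'n\t2\tp\nn\t10\tp\no\t1\tq\np\t1\tn\n' > "$sample"
tab=$(printf '\t')   # a literal tab character for the -t option

# -t sets the field separator, -k2,2n sorts numerically on column 2 only,
# and -k1,1 breaks ties using column 1.
by_number=$(sort -t "$tab" -k2,2n -k1,1 "$sample")

# Without n the comparison is textual, so "10" sorts before "2".
by_text=$(sort -t "$tab" -k2,2 -k1,1 "$sample")

printf '%s\n' "$by_number"
rm "$sample"
```

The numeric sort places the row with 10 last, whereas the textual sort places it before the row with 2.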

“Horizontal” text manipulation

cut can extract columns from a file:

cut -f1,2 example.tsv
X	Y
a1	a2
b1	b2
c1	c2
n	2
n	1
o	2
o	1
p	2
p	1
q	2
q	1

and paste can join columns together:

paste <(cut -f1,2 example.tsv) <(cut -f3,4 example.tsv)
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
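One detail worth knowing is that cut always writes fields in the order in which they appear in the file, so reordering columns needs paste. The sketch below uses temporary files instead of process substitution and a made-up two-row table; all file and variable names are illustrative.

```shell
workdir=$(mktemp -d)
printf 'X\tY\tZ\na1\ta2\ta3\n' > "$workdir/sample.tsv"

# cut emits fields in file order, so -f2,1 gives the same result as -f1,2.
cut_attempt=$(cut -f2,1 "$workdir/sample.tsv")

# To swap columns, cut them separately and paste them in the desired order.
cut -f1 "$workdir/sample.tsv" > "$workdir/col1.txt"
cut -f2 "$workdir/sample.tsv" > "$workdir/col2.txt"
swapped=$(paste "$workdir/col2.txt" "$workdir/col1.txt")

# -d picks the delimiter that paste writes between columns (default: a tab).
as_csv=$(paste -d ',' "$workdir/col2.txt" "$workdir/col1.txt")

printf '%s\n' "$as_csv"
rm -r "$workdir"
```

The last command prints the two swapped columns separated by commas instead of tabs.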

Cut and paste

  1. Verify using the diff command that the output of the command above is identical to the original example.tsv file.

  2. What happens if you change the column choices in cut?

When cut does not cut it: awk

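cut is a simple tool that works when the columns of a file are separated by exactly one character (a tab by default, or another character chosen with -d).

If this is not the case, for example when columns are aligned with a variable number of spaces, one can resort to awk, a small programming language for processing text field by field.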

To print the first 2 columns of example.tsv with awk, we can use the following command (note that awk separates the printed fields with a single space by default):

awk '{print $1,$2}' example.tsv
X Y
a1 a2
b1 b2
c1 c2
n 2
n 1
o 2
o 1
p 2
p 1
q 2
q 1
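To see where cut struggles, consider columns aligned with a variable number of spaces. The sketch below is an illustration on made-up data, not part of the example files; the variable names are arbitrary.

```shell
sample=$(mktemp)
# Columns aligned with a variable number of spaces (made-up data).
printf 'n   2  p\no 1     q\n' > "$sample"

# cut treats every single space as a separator, so a run of spaces
# yields empty fields and "field 2" is not the second column.
with_cut=$(cut -d ' ' -f2 "$sample")

# awk splits on runs of blanks by default, so $2 is the second column.
with_awk=$(awk '{print $2}' "$sample")

# -F sets the input separator explicitly, here a tab, which keeps
# fields that contain spaces intact.
tabbed=$(printf 'Ada Lovelace\t36\n' | awk -F '\t' '{print $2}')

printf '%s\n' "$with_awk"
rm "$sample"
```

With this input, awk prints 2 and 1, while cut returns an empty field for the first row.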

awk can also be used in a pipe and can do arithmetic, which is handy for quick checks.
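As a sketch of such a quick check, the commands below skip a header line and then sum and count values in the second column. The sample table and the variable names are assumptions made for this illustration.

```shell
sample=$(mktemp)
printf 'X\tY\nn\t2\nn\t1\no\t2\n' > "$sample"

# Skip the header with tail, then let awk add up column 2.
total=$(tail -n +2 "$sample" | awk '{ sum += $2 } END { print sum + 0 }')

# A condition in front of the braces selects rows: count those where
# the second column equals 2.
twos=$(tail -n +2 "$sample" | awk '$2 == 2 { n++ } END { print n + 0 }')

echo "sum=$total rows_with_2=$twos"
rm "$sample"
```

The + 0 in the END blocks makes awk print 0 rather than an empty string when no row contributes.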

Bulk text manipulation with sed

A command that can be used to manipulate general text (not necessarily in tables) is sed.

The typical use case is “search and replace”:

sed 's/p/PI/g' example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	PI	q
n	1	PI	q
o	2	q	n
o	1	q	n
PI	2	n	o
PI	1	n	o
q	2	o	x
q	1	o	x

Instead of sending its output to stdout, sed can modify the file in place when given the -i option (the form shown here is the one accepted by GNU sed; BSD/macOS sed requires an argument after -i):

cp example.tsv example_copy.tsv
sed -i 's/p/PI/g' example_copy.tsv
cat example_copy.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	PI	q
n	1	PI	q
o	2	q	n
o	1	q	n
PI	2	n	o
PI	1	n	o
q	2	o	x
q	1	o	x
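Since an in-place edit overwrites the file, keeping a backup can be prudent. The sketch below attaches a suffix to -i, a form accepted by both GNU and BSD sed, and anchors the pattern so that only a p at the start of a line is replaced. The file and variable names are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'n\t2\tp\np\t1\tn\n' > "$workdir/copy.tsv"

# -i.bak edits copy.tsv in place and keeps the original as copy.tsv.bak.
# ^p restricts the replacement to a p at the beginning of a line.
sed -i.bak 's/^p/PI/' "$workdir/copy.tsv"

edited=$(cat "$workdir/copy.tsv")
backup=$(cat "$workdir/copy.tsv.bak")

printf '%s\n' "$edited"
rm -r "$workdir"
```

Only the second row changes; the p in the third column of the first row is left alone because of the anchor.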

Following a running program

If we have a process that is generating some output in a text file and we want to monitor its output, we have two possibilities.

The tee command

If we just want to see the output of a process and at the same time save it into a file, the tee command helps us to do that:

./generate.sh | tee growing_file 
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
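By default tee overwrites its output file on every run; the -a option appends instead. The sketch below uses a hypothetical generate function as a stand-in for ./generate.sh, whose contents are not shown here, and a temporary log file whose name is arbitrary.

```shell
# Hypothetical stand-in for ./generate.sh (the real script is not shown).
generate() {
    for i in 1 2 3; do
        echo "Generating line $i..."
    done
    echo "Done."
}

log=$(mktemp)

generate | tee "$log"      # first run: shown on screen, file overwritten
generate | tee -a "$log"   # second run: -a appends to the existing file

log_lines=$(wc -l < "$log")
echo "log now holds $((log_lines)) lines"
rm "$log"
```

Note that only standard output goes through the pipe; messages written to standard error reach tee only if they are redirected first, for example with 2>&1.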

The tail -f command

Alternatively, we can use tail -f (-f stands for follow). Example:

./generate.sh > growing_file &
[1] 11594

This command runs in the background (because of the trailing &), generating lines of text and adding them one by one to growing_file. To monitor the process, we can run

tail -f -s 5 growing_file
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.

We can terminate tail with CTRL+C whenever we want.
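If the process ID of the writer is known, tail can also stop on its own. The sketch below assumes GNU tail, since both --pid and -s are GNU extensions, and uses a hypothetical background loop as a stand-in for ./generate.sh; the variable names are illustrative.

```shell
# Assumes GNU tail: --pid and -s are GNU extensions.
log=$(mktemp)

# Hypothetical stand-in for ./generate.sh, running in the background.
( for i in 1 2 3; do echo "Generating line $i..."; sleep 1; done
  echo "Done." ) > "$log" &
writer=$!   # process ID of the background writer

# -n +1 prints the file from its first line, -s 1 polls once per second,
# and --pid makes tail exit after the writer has terminated.
tail -n +1 -f -s 1 --pid="$writer" "$log"
tail_status=$?

wait "$writer"
final_line=$(tail -n 1 "$log")
rm "$log"
```

With this approach there is no need to press CTRL+C: tail returns shortly after the writer has finished.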

Warning

tail -f can be nasty to other users!

When used to monitor files in a global filesystem (e.g., your home directory), the frequent polling by tail -f might strain the filesystem unnecessarily. Adding the option -s 10, for example, reduces the load by telling tail to check less frequently, in this case every 10 seconds.