Unix tools for manipulation, exploration and monitoring of text files
This notebook contains a list of examples of common Linux/Unix commands.
Objectives
- View and manipulate text and text files using fundamental Unix commands, selecting and joining rows and columns 
- Combine fundamental Unix commands 
- Inspect command output and output files with tee and tail -f
Weird syntax ahead
To limit the need for temporary files when showing the effect of the commands,
the examples use process substitution (that is, the syntax <( *command* )) and pipes (|).
If this looks obscure,
the reader is invited to create the necessary temporary files.
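For readers who prefer the temporary-file route, the sketch below shows how a command written with process substitution might be translated. The file names (a.txt, b.txt) and the scratch directory are invented for illustration and are not part of the lesson material; the process substitution form shown in the comment needs a shell such as bash.

```shell
# Scratch directory and two small unsorted files (hypothetical names).
workdir=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$workdir/a.txt"
printf 'fig\npear\napple\n' > "$workdir/b.txt"

# With process substitution (bash) this would be a single command:
#   diff <(sort a.txt) <(sort b.txt)
# The same comparison with explicit temporary files:
sort "$workdir/a.txt" > "$workdir/a.sorted"
sort "$workdir/b.txt" > "$workdir/b.sorted"

# diff prints nothing and exits with status 0 when its inputs are identical.
diff "$workdir/a.sorted" "$workdir/b.sorted" && same=yes || same=no
echo "identical after sorting: $same"
rm -rf "$workdir"
```

Each <( *command* ) simply stands in for a file that contains the output of that command.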
To follow along
Please navigate to the examples/text_manipulation directory.
“Vertical” text manipulation
In this section, we have a look at commands that cut and sew together files “vertically”.
We are going to use a Tab separated value file as an example:
cat example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
Tabs are characters that are nicely understood as column separators by many tools.
We slice the file vertically using the head and tail commands.
head -n 5 example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
tail -n 4 example.tsv
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
When a text file is large,
the head and tail commands are very useful
to get an idea of its content.
To explore files, wc is also useful:
wc example.tsv
 12  48 108 example.tsv
These numbers are, respectively, the number of lines, words and bytes in example.tsv (for a plain ASCII file like this one, bytes and characters coincide).
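Each of the three counts can also be requested on its own. The following is a minimal sketch on a made-up three-line sample rather than on example.tsv; reading from standard input keeps the file name out of the output, and tr removes the padding that some wc implementations add.

```shell
# A three-line sample with one tab-separated pair per line (made-up data).
sample=$(mktemp)
printf 'a\t1\nb\t2\nc\t3\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')   # number of newline characters
words=$(wc -w < "$sample" | tr -d ' ')   # whitespace-separated words
bytes=$(wc -c < "$sample" | tr -d ' ')   # bytes

echo "lines=$lines words=$words bytes=$bytes"   # prints: lines=3 words=6 bytes=12
rm -f "$sample"
```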
Multiple files can be joined together vertically
using the cat command:
cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
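The + sign in tail -n +6 changes the meaning of the number: instead of "the last 6 lines" it means "from line 6 onwards", which is why it complements head -n 5 exactly. A small sketch of the difference, using ten numbered lines as stand-in data (the file is created only for this illustration):

```shell
# Ten numbered lines as stand-in data (hypothetical file).
numbers=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$numbers"

last_three=$(tail -n 3 "$numbers")    # the last 3 lines: 8, 9, 10
from_third=$(tail -n +3 "$numbers")   # line 3 to the end: 3, 4, ..., 10

# Lines 4 to 6 only: keep the first 6 lines, then start from line 4 of those.
middle=$(head -n 6 "$numbers" | tail -n +4)
echo "$middle"
rm -f "$numbers"
```

Combining head and tail in a pipe like this is a common way to extract a block from the middle of a file.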
Break and concatenate
- Verify, using the diff command, that the output of the command above reproduces the original example.tsv.
  Hint: you can save the output to a temporary file using >.
- What if you change the head command to take only 4 lines instead of 5? 
- (Bonus points) Try again, but do not use temporary files. 
Solution
- Using temporary files:
  cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv) > reconstructed.tsv
  diff example.tsv reconstructed.tsv
  There should be no output.
- With head -n 4 the fifth line is no longer included in the reconstructed file, and diff reports it as missing.
- A possible solution in a single command, without temporary files, can be obtained by nesting process substitutions:
  diff <(cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)) example.tsv
  Again, there should be no output.
Another notable program that cuts files vertically is grep, which keeps only the lines that match a pattern.
The pattern can be a regular expression:
grep -E '[acp][12]?' example.tsv
a1	a2	a3	a4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
p	2	n	o
p	1	n	o
Regular expressions are a very powerful tool to search, extract and replace text, implemented in many programming languages and supported by many tools in the shell.
Multiple grep commands can be combined with pipes, creating sophisticated filters:
grep -E '[acp][12]?' example.tsv | grep -v '2'
n	1	p	q
p	1	n	o
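Because grep matches anywhere on a line, a pattern meant for the first column also selects rows where the same character appears in another column. The sketch below shows how anchoring with ^ and counting with -c can help; it uses a reduced copy of the table, created here only for illustration.

```shell
table=$(mktemp)
printf 'a1\ta2\ta3\ta4\nn\t2\tp\tq\np\t1\tn\to\nq\t1\to\tx\n' > "$table"

# Unanchored: counts the rows with a p in any column (rows n and p).
anywhere=$(grep -c 'p' "$table")

# Anchored with ^: counts only the rows that start with p.
first_col=$(grep -c '^p' "$table")

# -v inverts the selection: rows that do NOT contain a 1.
without_one=$(grep -v '1' "$table")

echo "anywhere=$anywhere first_col=$first_col"   # prints: anywhere=2 first_col=1
echo "$without_one"
rm -f "$table"
```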
View and search command history
- How would you print the last 10 entries in your command history into a text file? 
- How many times have you used the ls command?
Another kind of vertical manipulation is done with the sort command (in this case, with a sort key that starts at the second column and extends to the end of the line):
tail -n 8 example.tsv | sort -k2
p	1	n	o
q	1	o	x
n	1	p	q
o	1	q	n
p	2	n	o
q	2	o	x
n	2	p	q
o	2	q	n
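To restrict the sort key to a single column, an end field can be given as well (-k2,2), and appending n makes the comparison numeric rather than lexical. A minimal sketch with made-up data, since the second column of example.tsv has only single digits and would not show the difference:

```shell
scores=$(mktemp)
printf 'n\t10\no\t9\np\t100\n' > "$scores"

# Lexical order on column 2 only: "10" < "100" < "9".
lexical=$(sort -k2,2 "$scores" | cut -f1)

# Numeric order on column 2 only: 9 < 10 < 100.
numeric=$(sort -k2,2n "$scores" | cut -f1)

echo "lexical: $(echo $lexical)"   # prints: lexical: n p o
echo "numeric: $(echo $numeric)"   # prints: numeric: o n p
rm -f "$scores"
```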
“Horizontal” text manipulation
cut can extract columns from a file:
cut -f1,2 example.tsv
X	Y
a1	a2
b1	b2
c1	c2
n	2
n	1
o	2
o	1
p	2
p	1
q	2
q	1
and paste can join columns together:
paste <(cut -f1,2 example.tsv) <(cut -f3,4 example.tsv)
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
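Both cut and paste assume tab-separated columns unless told otherwise. For files with another single-character separator, the -d option can be used with either command. The sketch below assumes a small comma-separated file invented for this purpose, and uses the pair of commands to move the last column to the front.

```shell
csv=$(mktemp)
printf 'X,Y,Z\na1,a2,a3\nb1,b2,b3\n' > "$csv"

# -d sets the (single-character) field delimiter for cut.
first_two=$(cut -d ',' -f1,2 "$csv")

# paste accepts -d too; here the third column is placed before the other two.
cut -d ',' -f3 "$csv" > "$csv.left"
cut -d ',' -f1,2 "$csv" > "$csv.right"
reordered=$(paste -d ',' "$csv.left" "$csv.right")

echo "$reordered"
rm -f "$csv" "$csv.left" "$csv.right"
```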
Cut and paste
- Verify using the diff command that the output of the command above is equal to the original example.tsv file.
- What happens if you change the column choices in cut?
When cut does not cut it: awk
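cut is a simple tool that works when the columns of a file are separated by a single character (a tab by default; another one can be chosen with -d).
If this is not the case, one can resort to awk, which is a very powerful tool.
To print the first 2 columns of example.tsv with awk, we can use the following command (note that awk separates the printed fields with a single space):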
awk '{print $1,$2}' example.tsv
X Y
a1 a2
b1 b2
c1 c2
n 2
n 1
o 2
o 1
p 2
p 1
q 2
q 1
awk can also be used in a pipe and can do arithmetic, which is handy for quick checks.
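As an illustration of both points, the following sketch works on a made-up report whose columns are aligned with a variable number of spaces, a layout that cut cannot handle because every single space would count as a delimiter. The file and its contents are assumptions made for the example.

```shell
# Columns aligned with a variable number of spaces (made-up data).
report=$(mktemp)
printf 'alpha    3\nbeta  10\ngamma      4\n' > "$report"

# awk splits on runs of blanks by default, so $1 and $2 are the two columns.
names=$(awk '{print $1}' "$report")

# Sum the second column; the END block runs after the last line is read.
total=$(awk '{sum += $2} END {print sum}' "$report")

# awk on both sides of a pipe: average of the values greater than 3.
average=$(awk '$2 > 3' "$report" | awk '{s += $2} END {print s / NR}')

echo "total=$total average=$average"   # prints: total=17 average=7
rm -f "$report"
```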
Bulk text manipulation with sed
A command that can be used to manipulate general text
(not necessarily in tables) is sed.
The typical use case is “search and replace”:
sed 's/p/PI/g' example.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	PI	q
n	1	PI	q
o	2	q	n
o	1	q	n
PI	2	n	o
PI	1	n	o
q	2	o	x
q	1	o	x
With the -i option, sed modifies the file in place instead of sending its output to stdout:
cp example.tsv example_copy.tsv
sed -i 's/p/PI/g' example_copy.tsv
cat example_copy.tsv
X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	PI	q
n	1	PI	q
o	2	q	n
o	1	q	n
PI	2	n	o
PI	1	n	o
q	2	o	x
q	1	o	x
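In-place editing overwrites the file, so a mistake in the expression can destroy data. Attaching a suffix to -i asks sed to keep a backup of the original. The attached form -i.bak is accepted by both GNU sed and the BSD sed found on macOS, whereas a bare -i is the GNU form. The sketch below uses a scratch file with invented contents.

```shell
workdir=$(mktemp -d)
printf 'n\t2\tp\tq\np\t1\tn\to\n' > "$workdir/copy.tsv"

# Edit in place, keeping the original as copy.tsv.bak.
sed -i.bak 's/p/PI/g' "$workdir/copy.tsv"
edited=$(cat "$workdir/copy.tsv")
backup=$(cat "$workdir/copy.tsv.bak")

# A narrower substitution: ^p only matches a p at the start of a line.
first_only=$(sed 's/^p/PI/' "$workdir/copy.tsv.bak")

echo "$edited"
rm -rf "$workdir"
```

If the result is not what was intended, the .bak file can simply be copied back.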
Following a running program
If we have a process that is generating some output in a text file and we want to monitor its output, we have two possibilities.
The tee command
If we just want to see the output of a process
and at the same time save it into a file, the tee command helps us to do that:
./generate.sh | tee growing_file 
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
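Two details of tee are worth knowing: its standard output can feed further commands in the pipe, and by default it overwrites the target file unless -a (append) is given. A minimal sketch, with printf standing in for generate.sh (which is not reproduced here) and a scratch file as the target:

```shell
log=$(mktemp)

# tee copies its input to the file AND to stdout, so the pipe can continue.
count=$(printf 'line 1\nline 2\nline 3\n' | tee "$log" | wc -l | tr -d ' ')

# By default tee overwrites its target; -a appends instead.
echo 'Done.' | tee -a "$log" > /dev/null

saved=$(wc -l < "$log" | tr -d ' ')
echo "piped on: $count lines, saved: $saved lines"   # prints: piped on: 3 lines, saved: 4 lines
rm -f "$log"
```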
The tail -f command
Alternatively, we can use tail -f (-f stands for follow).
Example:
./generate.sh > growing_file &
[1] 29003
This command is generating lines of text and adding them one by one to growing_file.
To monitor the process, we can do
tail -f -s 5 growing_file
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
We can terminate tail with CTRL+C whenever we want.
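When tail -f runs inside a script, nobody is there to press CTRL+C. One possible pattern, sketched below, is to run tail in the background and stop it once the writer has finished. The loop is only a stand-in for generate.sh, whose actual contents are not shown in this lesson, and the scratch file replaces growing_file.

```shell
log=$(mktemp)

# Stand-in for ./generate.sh: appends one line per second (assumed behaviour).
( for i in 1 2 3; do echo "Generating line $i..."; sleep 1; done
  echo 'Done.' ) >> "$log" &
writer=$!

# Follow the file in the background and remember the process ID of tail.
tail -f "$log" &
follower=$!

wait "$writer"     # block until the writer has finished
sleep 2            # give tail time to print the final lines
kill "$follower"   # stop tail, as CTRL+C would do interactively

written=$(wc -l < "$log" | tr -d ' ')
rm -f "$log"
```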
Warning
tail -f can be nasty to other users!
When used to monitor files in a global filesystem (e.g., your home directory), the frequent polling by tail -f might strain the filesystem unnecessarily.
By adding the option -s 10, for example, we reduce the load by telling tail to check less frequently, in this case every 10 seconds.