Unix tools for text file manipulation and exploration
This notebook contains a list of examples of common Linux/Unix commands.
Objectives
View and manipulate text and text files using fundamental Unix commands
Combine fundamental Unix commands
Weird syntax ahead
To avoid creating temporary files just to show the effect of a command,
the examples use process substitution (that is, the syntax <( command )) and pipes (|).
If this looks obscure,
the reader is invited to create the necessary temporary files instead.
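As a minimal sketch of what this syntax replaces, the three variants below compare a file with its sorted version: once through a temporary file, once through process substitution, and once through a pipe. The file names and contents are hypothetical, and process substitution assumes bash (it is not available in plain POSIX sh).

```shell
# Requires bash: process substitution is not part of POSIX sh.
workdir=$(mktemp -d)                          # scratch directory for this sketch
printf 'b\na\nc\n' > "$workdir/letters.txt"   # hypothetical sample file

# With a temporary file: sort first, then compare.
sort "$workdir/letters.txt" > "$workdir/sorted.txt"
with_tempfile=$(diff "$workdir/letters.txt" "$workdir/sorted.txt" || true)

# With process substitution: <(sort ...) stands in for a file name.
with_substitution=$(diff "$workdir/letters.txt" <(sort "$workdir/letters.txt") || true)

# With a pipe: diff reads the sorted text from standard input ("-").
with_pipe=$(sort "$workdir/letters.txt" | diff "$workdir/letters.txt" - || true)

echo "$with_substitution"
```

All three variables should hold the same report; the `|| true` only keeps the sketch going, because diff exits with a non-zero status when the inputs differ.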
To follow along
Please navigate to the examples/text_manipulation
directory.
“Vertical” text manipulation
In this section, we have a look at commands that cut and sew together files “vertically”.
We are going to use a Tab separated value file as an example:
cat example.tsv
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 p q
n 1 p q
o 2 q n
o 1 q n
p 2 n o
p 1 n o
q 2 o x
q 1 o x
The tab character is understood as a column separator by many tools.
We slice the file vertically using the head and tail commands.
head -n 5 example.tsv
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 p q
tail -n 4 example.tsv
p 2 n o
p 1 n o
q 2 o x
q 1 o x
When a text file is large,
the head and tail commands are very useful
to get an idea of its content.
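Besides a plain count, tail also accepts a count with a leading plus sign, which means "start at this line" rather than "take this many lines from the end". The sketch below shows both forms on a hypothetical five-line sample, and combines head and tail in a pipe to pick out a range of lines.

```shell
# Hypothetical five-line sample: one header line followed by four data lines.
sample=$(printf 'header\nrow1\nrow2\nrow3\nrow4\n')

# tail -n 2: the last two lines.
last_two=$(printf '%s\n' "$sample" | tail -n 2)

# tail -n +2: everything from line 2 onwards, i.e. skip the header.
without_header=$(printf '%s\n' "$sample" | tail -n +2)

# head and tail combined in a pipe: lines 2 to 3 only.
middle=$(printf '%s\n' "$sample" | head -n 3 | tail -n 2)

printf '%s\n' "$without_header"
```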
To explore files, wc is also useful:
wc example.tsv
12 48 108 example.tsv
These numbers are, respectively, the number of lines, words and bytes in example.tsv (for plain ASCII text like this, one byte is one character).
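Each of the three counts can also be requested on its own. The following sketch uses a hypothetical two-line sample file and reads it from standard input, so that wc does not print the file name next to the number.

```shell
# Hypothetical two-line, tab-separated sample.
sample_file=$(mktemp)
printf 'X\tY\na1\ta2\n' > "$sample_file"

# Reading from standard input keeps the file name out of the output.
line_count=$(wc -l < "$sample_file")   # lines
word_count=$(wc -w < "$sample_file")   # whitespace-separated words
byte_count=$(wc -c < "$sample_file")   # bytes

echo "lines=$line_count words=$word_count bytes=$byte_count"
```

Some wc implementations pad the numbers with leading spaces, so the counts are best compared numerically rather than as strings.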
Multiple files can be joined together vertically using the cat command:
cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 p q
n 1 p q
o 2 q n
o 1 q n
p 2 n o
p 1 n o
q 2 o x
q 1 o x
Break and concatenate
Verify, using the diff command, that the output of the command above reproduces the original example.tsv.
Hint
You can save the output to a temporary file using >.
What if you change the head command to take only 4 lines instead of 5?
(Bonus points) Try again, but do not use temporary files.
Solution
Using temporary files:
cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv) > reconstructed.tsv
diff example.tsv reconstructed.tsv
And there should be no output.
A possible solution, in a single command, without temporary files, can be obtained by nesting process substitution:
diff <(cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)) example.tsv
And there should be no output.
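For the second question, a sketch of what to expect when head takes one line too few: the line that falls between the two pieces is lost, and diff reports it. The eight-line sample below is a hypothetical stand-in for example.tsv, and the nested process substitution assumes bash.

```shell
# Requires bash (process substitution). Hypothetical eight-line stand-in
# for example.tsv.
sample_file=$(mktemp)
printf 'l1\nl2\nl3\nl4\nl5\nl6\nl7\nl8\n' > "$sample_file"

# head keeps 4 lines and tail restarts at line 6, so line 5 is lost.
status=0
report=$(diff <(cat <(head -n 4 "$sample_file") <(tail -n +6 "$sample_file")) "$sample_file") || status=$?

echo "$report"
echo "diff exit status: $status"
```

The report names the missing line, and the exit status of diff is 1 because the two inputs differ.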
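Another notable program that selects lines from a file is grep, which prints only the lines matching a pattern.
grep can also be used with regular expressions (the -E option enables the extended syntax).
In the example below, [acp][12]? matches an a, c or p optionally followed by 1 or 2, so any line containing one of these three letters is printed: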
grep -E '[acp][12]?' example.tsv
a1 a2 a3 a4
c1 c2 c3 c4
n 2 p q
n 1 p q
p 2 n o
p 1 n o
Regular expressions are a very powerful tool to search, extract and replace text, implemented in many programming languages and supported by many tools in the shell.
Multiple grep commands can be combined with pipes, creating sophisticated filters (the -v option inverts the match, keeping only the lines that do not contain the pattern):
grep -E '[acp][12]?' example.tsv | grep -v '2'
n 1 p q
p 1 n o
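The patterns above match anywhere in the line. As a small sketch on a hypothetical sample in the style of example.tsv, the anchor ^ restricts the match to the beginning of the line, and -c counts matching lines instead of printing them.

```shell
# Hypothetical tab-separated sample in the style of example.tsv.
sample_file=$(mktemp)
printf 'n\t1\tp\tq\np\t1\tn\to\np\t2\tn\to\nq\t2\to\tx\n' > "$sample_file"

# Unanchored: any line containing a "p" anywhere.
contains_p=$(grep -E 'p' "$sample_file")

# Anchored with ^: only lines whose first column starts with "p".
starts_with_p=$(grep -E '^p' "$sample_file")

# -c counts matching lines instead of printing them.
count_starting_with_p=$(grep -c -E '^p' "$sample_file")

printf '%s\n' "$starts_with_p"
```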
View and search command history
How would you print the last 10 entries in your command history into a text file?
How many times have you used the ls command?
Another kind of vertical manipulation is done with the sort command. In this case, -k2 makes the sort key start at the second column and run to the end of the line:
tail -n 8 example.tsv | sort -k2
p 1 n o
q 1 o x
n 1 p q
o 1 q n
p 2 n o
q 2 o x
n 2 p q
o 2 q n
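To sort on the second column only, the key can be closed with -k2,2, and -n compares it as a number instead of as text. The sketch below assumes a hypothetical two-column sample; LC_ALL=C is set so that ties are broken byte by byte and the result does not depend on the locale.

```shell
# Hypothetical sample: first column a letter, second column a number.
sample_file=$(mktemp)
printf 'q\t2\nn\t10\np\t1\no\t2\n' > "$sample_file"

# -k2,2 limits the key to the second column; n compares it numerically
# (so 10 sorts after 2). LC_ALL=C makes tie-breaking byte-wise and reproducible.
by_second_column=$(LC_ALL=C sort -k2,2n "$sample_file")

# Only the first column, joined on one line, to make the order easy to read.
order=$(printf '%s\n' "$by_second_column" | cut -f1 | paste -s -d ' ' -)
echo "$order"
```

The two rows with value 2 compare equal on the key, so sort falls back to comparing the whole line to decide their order.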
Horizontal Manipulations
cut can extract columns from a file:
cut -f1,2 example.tsv
X Y
a1 a2
b1 b2
c1 c2
n 2
n 1
o 2
o 1
p 2
p 1
q 2
q 1
and paste can join columns together:
paste <(cut -f1,2 example.tsv) <(cut -f3,4 example.tsv)
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 p q
n 1 p q
o 2 q n
o 1 q n
p 2 n o
p 1 n o
q 2 o x
q 1 o x
Cut and paste
Verify using the diff command that the output of the command above is equal to the original example.tsv file.
What happens if you change the column choices in cut?
When cut does not cut it: awk
cut is a simple tool that works when the columns of a file have a one-character separator.
If this is not the case, one can resort to awk, which is a very powerful tool.
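A small sketch of the difference, on a hypothetical sample whose columns are aligned with a variable number of spaces: cut counts every single space as a separator, while awk splits on runs of whitespace by default.

```shell
# Hypothetical sample whose columns are aligned with a variable number of spaces.
sample_file=$(mktemp)
printf 'a1   a2 a3\nb1 b2   b3\n' > "$sample_file"

# cut treats every single space as a separator, so runs of spaces
# produce empty fields and the "second column" is not what we expect.
second_with_cut=$(cut -d ' ' -f2 "$sample_file")

# awk splits on runs of whitespace by default.
second_with_awk=$(awk '{print $2}' "$sample_file")

printf 'cut: [%s]\n' "$second_with_cut"
printf 'awk: [%s]\n' "$second_with_awk"
```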
To print the first 2 columns of example.tsv with awk, we can use the following command (by default, awk joins the printed fields with a single space):
awk '{print $1,$2}' example.tsv
X Y
a1 a2
b1 b2
c1 c2
n 2
n 1
o 2
o 1
p 2
p 1
q 2
q 1
awk can also be used in a pipe and can do arithmetic, which is handy for quick checks.
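As an illustration of such a quick check, the sketch below feeds a hypothetical three-row sample through a pipe, sums its second column and counts the rows where that column equals 2.

```shell
# Hypothetical sample in the style of the last rows of example.tsv.
sample=$(printf 'n\t2\tp\tq\nn\t1\tp\tq\no\t2\tq\tn\n')

# Sum the second column; the END block runs after the last line is read.
total=$(printf '%s\n' "$sample" | awk '{ sum += $2 } END { print sum }')

# Count the lines whose second column equals 2 (n + 0 prints 0 when none match).
twos=$(printf '%s\n' "$sample" | awk '$2 == 2 { n++ } END { print n + 0 }')

echo "sum=$total rows_with_2=$twos"   # prints "sum=5 rows_with_2=2"
```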
Bulk text manipulation with sed
A command that can be used to manipulate general text
(not necessarily in tables) is sed.
The typical use case is “search and replace”:
sed 's/p/PI/g' example.tsv
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 PI q
n 1 PI q
o 2 q n
o 1 q n
PI 2 n o
PI 1 n o
q 2 o x
q 1 o x
Instead of sending its output to stdout, sed can also modify the file in place when given the -i option:
cp example.tsv example_copy.tsv
sed -i 's/p/PI/g' example_copy.tsv
cat example_copy.tsv
X Y Z T
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
n 2 PI q
n 1 PI q
o 2 q n
o 1 q n
PI 2 n o
PI 1 n o
q 2 o x
q 1 o x
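Since an in-place edit overwrites the file, it can be worth keeping a backup. The sketch below, on a hypothetical scratch file, attaches a suffix to -i so that sed saves the original next to the edited file; this attached form is accepted by both GNU and BSD sed, although -i itself is not part of POSIX. The last command shows how a line address restricts a substitution to one line.

```shell
# Hypothetical scratch file, so that example.tsv itself is left untouched.
workdir=$(mktemp -d)
printf 'n\t2\tp\tq\np\t1\tn\to\n' > "$workdir/copy.tsv"

# -i.bak edits copy.tsv in place and keeps the original as copy.tsv.bak.
sed -i.bak 's/p/PI/g' "$workdir/copy.tsv"

# The backup still has the old text; diff shows what changed.
diff "$workdir/copy.tsv.bak" "$workdir/copy.tsv" || true

# An address restricts the substitution: here only line 1 is edited.
first_line_only=$(sed '1s/n/N/' "$workdir/copy.tsv.bak")
printf '%s\n' "$first_line_only"
```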
Following a running program
If we have a process that is generating some output in a text file and we want to monitor its output, we have two possibilities.
The tee command
If we just want to see the output of a process
and at the same time save it into a file, the tee command helps us to do that:
./generate.sh | tee growing_file
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
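By default tee overwrites its output file on every run. A minimal sketch of the -a option, which appends instead, is given below; the generate function is a hypothetical stand-in for ./generate.sh, which is not shown here, and the log file is a scratch file.

```shell
# Hypothetical stand-in for ./generate.sh, which is not shown here.
generate() {
    for i in 1 2 3; do
        echo "Generating line $i..."
    done
    echo "Done."
}

log_file=$(mktemp)

# First run: tee prints each line and overwrites log_file with the same text.
generate | tee "$log_file"

# Second run: -a appends instead of overwriting, so earlier output is kept.
generate | tee -a "$log_file"

# The log now holds the output of both runs.
wc -l < "$log_file"
```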
The tail -f command
Alternatively, we can use tail -f (-f stands for follow).
Example:
./generate.sh > growing_file &
[1] 11594
This command is generating lines of text and adding them one by one to growing_file.
To monitor the process, we can do
tail -f -s 5 growing_file
Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
We terminate tail with CTRL+C when we are done.
Warning
tail -f can be nasty to other users!
When used to monitor files in a global filesystem (e.g., your home directory), the frequent polling by tail -f
might strain the filesystem unnecessarily.
By adding the option -s 10, for example, we reduce the load by telling tail
to check less frequently, in this case every 10 seconds.