1. Simple examples
Let's begin simple
It is quite easy to use Awk from the command line to perform simple operations on text files. Suppose we have a file named "coins.txt" that describes a coin collection. Each line in the file contains the following information:
metal, weight in ounces, date minted, country of origin, description
The file has the contents:
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda - silver lined
silver 1 1986 USA Liberty dollar
gold 0.25 1986 USA Liberty 5-dollar piece
silver 0.5 1986 USA Liberty 50-cent piece
silver 1 1987 USA Constitution dollar
gold 0.25 1987 USA Constitution 5-dollar piece
gold 1 1988 Canada Maple Leaf
The command below searches the file for lines of text that contain the string "gold" and prints them.
$ awk '/gold/' coins.txt
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda - silver lined
gold 0.25 1986 USA Liberty 5-dollar piece
gold 0.25 1987 USA Constitution 5-dollar piece
gold 1 1988 Canada Maple Leaf
/gold/ defines a matching criterion that is tested against every line of the file. If no action is specified, the default action is to print the matching lines.
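As a minimal sketch of the default action described above, the three commands below should produce identical output. The three-line sample is a shortened stand-in for coins.txt, and the variable names are made up for this illustration.

```shell
# Shortened, illustrative sample (not the full coins.txt)
sample='gold 1 1986 USA American Eagle
silver 10 1981 USA ingot
gold 1 1988 Canada Maple Leaf'

# Pattern only: the default action prints the matching line
implicit=$(printf '%s\n' "$sample" | awk '/gold/')

# The same with the action written out, first as print, then as print $0
with_print=$(printf '%s\n' "$sample" | awk '/gold/ { print }')
with_dollar0=$(printf '%s\n' "$sample" | awk '/gold/ { print $0 }')

printf '%s\n' "$implicit"
# gold 1 1986 USA American Eagle
# gold 1 1988 Canada Maple Leaf
```

Writing the action explicitly only pays off once you want to print something other than the whole line.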
Exercise
- Try to run the above example for "silver". What is different? How can one fix it?
This mimics the use of grep or sed, so why would one possibly need awk?
How does this mimic grep or sed?
In grep, the command that does exactly the same is:
grep silver coins.txt
In sed, the command that does exactly the same is:
sed -n "/silver/p" coins.txt
awk starts to shine when the task is more complex than detecting lines that contain some text.
Now, suppose we want to list all the coins that were minted before 1980.
We invoke Awk as follows:
$ awk '$3 < 1980 {print $3, " ",$5,$6,$7,$8}' coins.txt
1908   Franz Josef 100 Korona
1979   Krugerrand
The matching criterion is $3 < 1980, so the commands enclosed in the {print $3, " ",$5,$6,$7,$8} block are executed only when the criterion is met, i.e. awk prints the value of column 3, the string " ", and the values of columns 5, 6, 7 and 8. Columns are separated by whitespace (one or more consecutive blanks or tabs) and addressed with the $ sign: $1 is the value of the first column, $2 the second, and so on. $0 contains the original (unparsed) line, including the separators.
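To make the field splitting concrete, here is a small sketch that feeds awk a single line with deliberately irregular spacing (several blanks and one tab). The line and the variable names are invented for the illustration.

```shell
# One record with uneven separators: three blanks, a tab, two blanks
line=$(printf 'gold   0.5\t1981 RSA  Krugerrand')

first=$(printf '%s\n' "$line" | awk '{ print $1 }')   # gold
third=$(printf '%s\n' "$line" | awk '{ print $3 }')   # 1981
whole=$(printf '%s\n' "$line" | awk '{ print $0 }')   # the line exactly as typed

echo "first=$first third=$third"
# first=gold third=1981
```

Runs of blanks and tabs count as a single separator, while $0 keeps them untouched.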
Discussion and exercises
- Can you find all "silver" coins older than 1986? One can use grep to filter the silver coins and pipe the result to awk or do it all together in awk.
- Unfortunately, awk does not have a way to print/address all fields after or before a selected one. How can one print all remaining fields?
- A TAB-separated version, 'coins.tab', is more appropriate in such cases. For the same reason, TAB-separated columns are common in many bioinformatics file formats: gff | bed | sam | vcf.
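For the TAB-separated variant mentioned in the last item, a possible approach is to set the field separator with -F. The sketch below assumes that coins.tab holds the same columns as coins.txt with a single TAB between them. Only one line is reproduced here, and the variable names are made up.

```shell
# One line as it might appear in coins.tab (assumed layout: TAB between columns)
tab_line=$(printf 'gold\t1\t1908\tAustria-Hungary\tFranz Josef 100 Korona')

# Default splitting breaks the description at every blank
default_split=$(printf '%s\n' "$tab_line" | awk '{ print $5 }')

# With TAB as the only separator, the whole description is field 5
tab_split=$(printf '%s\n' "$tab_line" | awk -F '\t' '{ print $5 }')

echo "$default_split"   # Franz
echo "$tab_split"       # Franz Josef 100 Korona
```

With an explicit TAB separator, blanks inside a column no longer create extra fields.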
What about some math? Can I manipulate or analyze the data?
Let's use the following simple file, 123.txt, which contains 3 lines with numbers and some text, just to make our life more difficult (or maybe not?):
1 2 3
5 4 6
7 8 9 10 text
$ awk ' {print $1+$2*$3}' 123.txt
7
29
79
$ awk '$1 > 4 {print $1+$2*$3}' 123.txt
29
79
$1 > 4 is our criterion for when to execute the command block {print $1+$2*$3}. Note that the matching criterion is placed outside the {} block.
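The expression $1+$2*$3 follows the usual arithmetic precedence, so the multiplication happens first. As a small sketch, the commands below reuse the 123.txt content inline and add a grouped variant for comparison. The variable names are invented for this example.

```shell
nums='1 2 3
5 4 6
7 8 9 10 text'

# Multiplication binds tighter than addition: 1+2*3 = 7
plain=$(printf '%s\n' "$nums" | awk '{ print $1 + $2 * $3 }')

# Parentheses change the grouping: (1+2)*3 = 9
grouped=$(printf '%s\n' "$nums" | awk '{ result = ($1 + $2) * $3; print result }')

# The pattern stays outside the braces and filters the lines first
filtered=$(printf '%s\n' "$nums" | awk '$1 > 4 { result = ($1 + $2) * $3; print result }')

echo "$plain"      # 7, 29, 79 (one per line)
echo "$grouped"    # 9, 54, 135
echo "$filtered"   # 54, 135
```

The intermediate variable result is only there to keep the grouped expression easy to read.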
Exercises
- Print the third column for each line where the value in the second column is smaller than the value in the first column.
- Print the original content of each line followed by the result of $1+$2*$3.
- Print the result of $1+$2*$3 as the 4th column, discarding the unnecessary data from the original file.
Awk command-line syntax:
$ awk ' /pattern/ {action} ' file1 file2 ... fileN
- action is performed on every line that matches pattern.
- If pattern is not provided, action is performed on every line.
- If action is not provided, then all matching lines are simply sent to standard output.
- Since patterns and actions are both optional, actions must be enclosed in braces to distinguish them from patterns.
- The statements in an awk program may be indented and formatted using spaces, tabs, and new lines.
- Two special patterns: BEGIN (execute an action before the first input line is read) and END (execute an action after all lines are read).
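The pieces of the syntax above can be sketched in one short program: a BEGIN action, an action without a pattern, and an END action. The input is the 123.txt content given inline, and the variable names sum and summary are chosen just for this example.

```shell
summary=$(printf '1 2 3\n5 4 6\n7 8 9 10 text\n' | awk '
BEGIN { print "first column:" }   # runs once, before any input is read
      { sum += $1; print $1 }     # no pattern: runs for every line
END   { print "sum:", sum }       # runs once, after the last line
')

echo "$summary"
# first column:
# 1
# 5
# 7
# sum: 13
```

The program is spread over several lines inside the single quotes, which awk accepts just like the one-line form.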
Simple output examples
{ print }
- will print the whole line to standard out
{ print $0 }
- will do the same thing
{ print $1, $3 }
- expressions separated by a comma are, by default, separated by a single space when output
{ print NF, $1, $NF }
- will print the number of fields, the first field, and the last field in the current record
{ print $(NF-2) }
- prints the third-to-last field
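The role of the comma in print is easy to miss, so here is a minimal sketch comparing it with plain juxtaposition, which concatenates the values without any separator. The input line is made up for the illustration.

```shell
# With a comma, the values are separated by a single space
with_comma=$(echo 'David 3 6' | awk '{ print $1, $3 }')

# Without a comma, the values are concatenated
no_comma=$(echo 'David 3 6' | awk '{ print $1 $3 }')

echo "$with_comma"   # David 6
echo "$no_comma"     # David6
```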
Exercises
- Run the last two examples on the 123.txt file. What is the difference between NF and $NF?
- Is there a difference between $(NF-2) and $NF-2?
Computing and Printing
{ print $1, $2 * $3 }
- you can also do computations on the field values and include the results in your output
{ print NR, $0 }
- will print each line prefixed with its line number
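NR can be used in a pattern as well as in the output. The following sketch, which goes slightly beyond the examples above, selects a single line by its number and counts the lines in an END block. The three input lines and the variable names are made up for the illustration.

```shell
records='David 3 6
Ana 5 7
Olla 4 4'

# NR as a pattern: act only on the second input line
second=$(printf '%s\n' "$records" | awk 'NR == 2 { print NR, $0 }')

# In the END block, NR still holds the number of the last line read
count=$(printf '%s\n' "$records" | awk 'END { print NR }')

echo "$second"   # 2 Ana 5 7
echo "$count"    # 3
```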
Exercises
- what happens if you provide the
123.txt
file twice for the second example? - compare the output of
NR
andFNR
when you run the above test.
Putting Text in the Output
{ print "total pay for", $1, "is", $2 * $3 }
you can also add other text to the output besides what is in the current record. Note that the inserted text needs to be surrounded by double quotes.
{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }
- when using printf, formatting is under your control, so awk adds no automatic spaces or newlines. You have to insert them yourself.
{ printf("%-8s %6.2f\n", $1, $2 * $3 ) }
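- well, this escalated quickly: %-8s prints a string left-aligned in a field 8 characters wide, and %6.2f prints a number right-aligned in a field 6 characters wide with 2 decimal places.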
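As a sketch of how these format strings line up in practice, the program below prints a header row in a BEGIN block using the same field widths as the data rows. The header texts "name" and "pay" are invented for the example, and the three input lines are the pay.txt content.

```shell
table=$(printf 'David 3 6\nAna 5 7\nOlla 4 4\n' | awk '
BEGIN { printf("%-8s %6s\n", "name", "pay") }     # header, same widths as the rows
      { printf("%-8s %6.2f\n", $1, $2 * $3) }     # one aligned row per input line
')

echo "$table"
# name        pay
# David     18.00
# Ana       35.00
# Olla      16.00
```

Using the same widths in the header and in the rows is what keeps the columns aligned.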
Exercises
David 3 6
Ana 5 7
Olla 4 4
- Run the examples above with the content of the pay.txt file (the three lines listed above).
- Remove the minus sign in the %-8s formatting to see the effect.
- More string manipulation exercises
More on format modifiers: gawk documentation
Files