1.Simple examples

Let's begin simple

It is quite easy to use Awk from the command line to perform simple operations on text files. Suppose we have a file named "coins.txt" that describes a coin collection. Each line in the file contains the following information:

metal weight in ounces date minted country of origin description

The file has the contents:

coins.txt

gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
silver  10    1981  USA                 ingot
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda - silver lined
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece
silver   0.5  1986  USA                 Liberty 50-cent piece
silver   1    1987  USA                 Constitution dollar
gold     0.25 1987  USA                 Constitution 5-dollar piece
gold     1    1988  Canada              Maple Leaf

The command bellow will search through the file for lines of text that contain the string "gold", and print them out.

$ awk '/gold/' coins.txt
gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
gold     0.25 1986  USA                 Liberty 5-dollar piece
gold     0.25 1987  USA                 Constitution 5-dollar piece
gold     1    1988  Canada              Maple Leaf

/gold/ defines a matching criteria that will be used on every line of the file. If no command is specified, the default action is to print the matched lines.

Exercise

Try to run the above example for "silver". What is different? How can one fix it?

This mimics the use of grep or sed, so why would one possibly need awk?

How does this mimic grep or sed?

In grep, the command that does exactly the same is:

grep silver coins.txt

In sed, the command that does exactly the same is:

sed -n "/silver/p" coins.txt

awk starts to shine when the thing you want to do is more complex then detecting lines with a text.

Now, suppose we want to list all the coins that were minted before 1980.
See Copilot's solution.

We invoke Awk as follows:

$ awk '$3 < 1980 {print $3, "    ",$5,$6,$7,$8}' coins.txt
1908      Franz Josef 100 Korona
1979      Krugerrand

In the example above, we defined a matching criteria $3 < 1980, so the commands enclosed in the {print $3, " ",$5,$6,$7,$8} block will be executed only when the criteria is met - i.e. awk will print the values of columns 3," ",5,6,7, and 8. The columns are separated (defined) by white space (one or more consecutive blanks) or tabulator and addressed by the $ sign i.e. $1 is the value of the first column, $2 - second etc. $0 contains the original (unparsed) line including the separators.

Discussion and exercises

Can you find all "silver" coins older than 1986? One can use grep to filter the silver coins and pipe the result to awk or do it all together in awk.
Unfortunately, awk does not have a way to print/address all fields after or before a selected one. How can one print all remaining fields?
A TAB separated version 'coins.tab' is more appropriate in such cases and rather common, for the same reason, in many bioinformatics file formats gff|bed|sam|vcf.

What about some math? Can I manipulate or analyze the data?

Let's use the following simple file that contains 3 lines with numbers and some text, just to make our life more difficult (or maybe not?)

123.txt

1 2 3
5 4 6
7 8 9 10 text

Let's try to make some calculations with the data in this file - summation and multiplication in this case.

$ awk '       {print $1+$2*$3}' 123.txt
7
29
79

Again, we did not provide any matching criteria, so the commands will be executed on every line. Try to run the command. What do you think? Does awk provide the right answer? I am serious! ;-) Let's do the math only if the value in the first column is greater than 4.

$ awk '$1 > 4 {print $1+$2*$3}' 123.txt
29
79

Now, $1 > 4 is our criteria on when to execute the command block {print $1+$2*$3}. Note that the matching criteria is outside the {} block.

Exercises

print the third column for each line where the value in the second column is smaller that the value in the first column.
print the original content on each line followed by the result of $1+$2*$3.
print the the result of $1+$2*$3 as 4^th column, discarding the unnecessary data from the original file.

Awk command-line syntax:

$ awk ' /pattern/ {action} ' file1 file2 ... fileN

action is performed on every line that matches pattern.
If pattern is not provided, action is performed on every line.
If action is not provided, then all matching lines are simply sent to standard output.
Since patterns and actions are optional, actions must be enclosed in braces to distinguish them from pattern.
The statements in an awk program may be indented and formatted using spaces, tabs, and new lines.
Two special patterns: BEGIN (execute an action before first input line) and END ( ... after all lines are read.)

Simple output examples

{ print } - will print the whole line to standard out

{ print $0 } - will do the same thing

{ print $1, $3 } - expressions separated by a comma are, by default, separated by a single space when output

{ print NF, $1, $NF } - will print the number of fields, the first field, and the last field in the current record

{ print $(NF-2) } prints the third to last field

Exercises

run the last two examples on 123.txt file. What is the difference between NF and $NF?
is there a difference between $(NF-2) and $NF-2?

Computing and Printing

{ print $1, $2 * $3 } You can also do computations on the field values and include the results in your output

{ print NR, $0 } - will print each line prefixed with its line number

Exercises

what happens if you provide the 123.txt file twice for the second example?
compare the output of NR and FNR when you run the above test.

Putting Text in the Output

{ print "total pay for", $1, "is", $2 * $3 } you can also add other text to the output besides what is in the current record. Note that the inserted text needs to be surrounded by double quotes.

{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) } when using printf, formatting is under your control, so no automatic spaces or newlines are provided by awk. You have to insert them yourself.

{ printf("%-8s %6.2f\n", $1, $2 * $3 ) } - well, this escalated too fast...

Exercises

pay.txt

David   3       6
Ana     5       7
Olla    4       4

run the examples above with the content of the pay.txt file.
remove the minus sign in the %-8s formatting to see the effect.
more string manipulations exercises

More on format modifiers: gawk documentation

Files

coins.txt