Skip to content

Multiple input files - first approach

The script will collect the data in sequence and store the data in memory, then print the arranged data at the end.

Here is an example problem that is easy to solve with awk.

Let's assume that we have collected some data about some persons. It is not systematic, so the data files are not complete and rows are not in the same order...

file: 1.dat

Daniel    10
Anders     7
Sven      56
Ali       17
Peter      6

file: 2.dat

Peter     Monday
Sven      Sunday
David     Tuesday

Let's put them together! If the files had all names with missing data marked - sorting the files and pasting them together will essentially be enough.

Below, it is just one possible way to do it. First we need to have a list of all names, collect the data, then try to print whatever we have collected.

Warning

awk under OS X is not fully compatible with Gawk (GNU awk) and the internal variable ARGIND is not available - the script will not work as intended.

script.awk

#!/usr/bin/awk -f

{ 
  data[$1][ARGIND]= $2
}

END {
  for (i in data) print i"\t"data[i][1]"\t"data[i][2]
}

Run the script like this:

./script.awk 1.dat 2.dat

The script runs over the two files in a row and on each line it uses associative arrays to collect the names from the first column in data[$1][ARGIND]. data[$1][ARGIND] is two dimensional array with indexes [name][ number of current file/argument]. At the end we will have elements like this:

...
data["Sven"][1] = 56
data["Sven"][2] = "Sunday"
...

The END section runs over the collected names and prints the collected data - so we recover as much as possible from both files.

Here is the output.

Anders          7
Daniel          10
Ali             17
Sven            56              Sunday
David                           Tuesday
Peter           6               Monday

Alternative, perhaps better solution in this case, might the specially written tool for this purpose i.e.

$ join -a1 -a2  -j 1 -o 0,1.2,2.2 -e "NULL"   <(sort 1.dat)  <(sort 2.dat)
Ali 17 NULL
Anders 7 NULL
Daniel 10 NULL
David NULL Tuesday
Peter 6 Monday
Sven 56 Sunday

Note

Note that the files need to be sorted by the field they will be joined, since the program is trying to avoid loading the whole data in the memory. If the data is sorted, awk also can join the data without loading it into the memory. http://unix.stackexchange.com/questions/194968/why-isnt-this-awk-command-doing-a-full-outer-join
Credits to Mahesh Panchal for the tip

Exercise

Copy/paste the text in to two files with the suggested names

scientific
2       |       Bacteria        |       Bacteria <bacteria>     |       scientific name |
29      |       Myxococcales    |               |       scientific name |
139     |       Borreliella burgdorferi |               |       scientific name |
161     |       Treponema pallidum subsp. pallidum      |               |       scientific name |
168     |       Treponema pallidum subsp. pertenue      |               |       scientific name |
356     |       Rhizobiales     |               |       scientific name |
638     |       Arsenophonus nasoniae   |               |       scientific name |
genbank
2       |       eubacteria      |               |       genbank common name     |
29      |       fruiting gliding bacteria       |               |       genbank common name     |
139     |       Lyme disease spirochete |               |       genbank common name     |
161     |       syphilis treponeme      |               |       genbank common name     |
168     |       yaws treponeme  |               |       genbank common name     |
356     |       rhizobacteria   |               |       genbank common name     |
638     |       son-killer infecting Nasonia vitripennis        |               |       genbank common name     |

Can you join the information from both files to collect the data in better format?

ID | scientific name | genbank common name

2 | Bacteria | eubacteria
29 | Myxococcales | fruiting gliding bacteria
139 | Borreliella burgdorferi | Lyme disease spirochete

Leave the extra blanks for the first attempt. We will use this problem (cleaning the remaining spaces before and after) to introduce user defined functions.

Hint

Using FILENAME might come handy.

Possible solution
#!/usr/bin/awk -f
BEGIN{ FS="|" }

{
  data[$1][FILENAME]= $2
}

END {
  for (i in data) print trim(i)"|"trim(data[i]["scientific"])"|"trim(data[i]["genbank"])
}

function trim (x) {
  sub(/^[ \t]*/,"",x);
  sub(/[ \t]*$/,"",x);
  return x
}
Solution usung join suggested by Amrei Binzer-Panchal, 2021.01.18
$ join -a1 -a2  -j 1 -o 0,1.2,2.2 -e "NULL" -t "|"  <(sort scientific)  <(sort genbank)

139     |       Borreliella burgdorferi |       Lyme disease spirochete 
161     |       Treponema pallidum subsp. pallidum      |       syphilis treponeme      
168     |       Treponema pallidum subsp. pertenue      |       yaws treponeme  
2       |       Bacteria        |       eubacteria      
29      |       Myxococcales    |       fruiting gliding bacteria       
356     |       Rhizobiales     |       rhizobacteria   
638     |       Arsenophonus nasoniae   |       son-killer infecting Nasonia vitripennis