Some useful awk functions

Search and replace

Documented here because the parameter order of awk’s string functions is inconsistent.

S string indicates the string to search (often can be skipped, in which case $0 is searched instead)
T replacement-text indicates text to replace /regexp/ with
A array indicates an array into which groups from the regexp are stored
R /regexp/ indicates a regular expression

count = split("string", array[, /regexp/])                              split SAR
pos = match("string", /regexp/[, array])                                match SRA
# Return a modified copy: N selects which occurrence to replace, "g" replaces all (gawk only)
new_string = gensub(/regexp/, "replacement-text", N or "g"[, string])  gensub RTgS
# Replace the first occurrence in-place
sub(/regexp/, "replacement-text"[, string])                               sub RTS
# Replace all occurrences in-place
gsub(/regexp/, "replacement-text"[, string])                             gsub RTS
pos = index("haystack", "needle")                                       index HS
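
As a rough illustration of the argument orders listed above, the following sketch runs the portable functions on one sample record. The sample text and the variable names (parts, word, where, first, every) are made up for the example. gensub() and the array argument to match() are left out because they are gawk extensions.

```sh
#!/bin/sh
# Exercise the portable (POSIX) string functions on one sample record.
out=$(printf '%s\n' 'alpha,beta,gamma' | awk '{
    n = split($0, parts, /,/)              # S A R: string, array, separator regexp
    pos = match($0, /b[a-z]+/)             # S R: also sets RSTART and RLENGTH
    word = substr($0, RSTART, RLENGTH)     # text of the match just found
    where = index($0, "gamma")             # haystack, needle (plain string, not a regexp)
    first = $0; sub(/a/, "A", first)       # R T S: replaces the first match, in place
    every = $0; c = gsub(/a/, "A", every)  # R T S: replaces all matches, returns the count
    print n, parts[2], pos, word, where
    print first
    print every, c
}')
printf '%s\n' "$out"
```

sub() and gsub() modify their third argument, so the sketch copies $0 into a scratch variable first to leave the record itself untouched.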

Date and time

int seconds_from_epoch = mktime("YYYY MM DD hh mm ss[ DST]")
string = strftime(["format"[, seconds_from_epoch]])
int seconds_from_epoch = systime()
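
A minimal sketch of a round trip through these functions follows. It assumes that awk is gawk, or another implementation that provides these extension functions, and it pins TZ to UTC so that the result does not depend on the local time zone. The timestamp is arbitrary.

```sh
#!/bin/sh
# Assumption: awk provides mktime(), strftime() and systime() (gawk does).
out=$(TZ=UTC awk 'BEGIN {
    t = mktime("2001 09 09 01 46 40")        # "YYYY MM DD hh mm ss" -> seconds since the epoch
    print t
    print strftime("%Y-%m-%d %H:%M:%S", t)   # seconds since the epoch -> formatted text
    now = systime()                          # current time, also in seconds since the epoch
    if (now > t) print "systime() is later than t"
}')
printf '%s\n' "$out"
```

Note that mktime() takes its fields separated by spaces, not the punctuation that strftime() formats usually produce.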

Function to replace commas with NUL characters

# Given a string with commas within and not within double-quote marks, returns
# a string with commas that are not within double-quotes changed to NUL.
# The double-quote marks themselves are dropped from the result.
function commas_to_nulls(s1,    s2, q, i, z) {  # Names after s1 are local variables
    s2 = ""; q = 1
    i = index(s1, "\"")                 # Find first quote mark
    if (i==0) { i = length(s1) + 1 }    # Process whole string if no quote mark
    while (s1 != "") {                  # While there's text to process:
        q = !q                              # Toggle within/not-within quotes
        z = substr(s1, 1, i-1)              # Extract leading text to quote (or EOL)
        if (!q) { gsub(/,/, "\000", z) }    # Not within quotes: change comma to NUL
        s2 = s2 z                           # Append this piece to the new string
        s1 = substr(s1, i+1)                # Remove the part we just processed,
        i = index(s1, "\"")                 #  then find next quote mark,
        if (i==0) { i = length(s1) + 1 }    #  or EOL
    }
    return s2
}

Calling example:

BEGIN {
    mystring = "42, \"This is text, and the comma is not significant\", \"<-- but this comma is\", 1,234,567, \"1,234,567\""
    c = split(commas_to_nulls(mystring), a, /\000 */)
    for (i=1; i <= c; i++) { print "a[" i "]=\"" a[i] "\"" }
}

Sorting arrays

#!/usr/bin/awk -f
BEGIN {
  a[7]=1667;   a[13]=2064;  a[27]=3028;  a[8]=4035
  a[43]=5098;  a[471]=6981; a[470]=7185; a[98]=8023
  a[100]=9392; a[71]=10035; a[65]=11025; a[163]=12075
  a[19]=13065; a[55]=14012; a[78]=15060
}

END {
    # Present 'a' sorted by value while preserving the index values
    # This is similar to Perl's 'print "$_\n" foreach sort @a', but this code
    # allows us to print the *index* as well as the value

    # Copy 'a' to 'x', swapping index and value. We add 1,000,000,000 to the
    # value because asorti() insists on doing an alphabetic sort.
    for (i in a) { x[a[i] + 1000000000] = i }

    # Duplicate 'x' into 'y' and sort 'y' by index
    count = asorti(x,y)

    # Print the results
    print "Sorted by value"
    print "Index  a-index  a[a-index]"
    for (i = 1; i <= count; i++) {
        idx = x[y[i]]
        printf("%3i.  %7i       %5i\n", i, idx, a[idx])
    }
    print ""

    # Present 'a' sorted by index
    # This essentially duplicates Perl's 'print "$_ $a{$_}\n" foreach sort keys %a'

    # Duplicate 'a' into 'x', then sort 'x' by index. This leaves us with
    # sequential index values in 'x' starting from 1, and the values in 'x'
    # being the index values from 'a'.
    count = asorti(a,x)

    # Convert the values in 'x' to numeric, then sort 'x' to arrange them
    # (recall that they're the index values from 'a') in ascending order
    for (i in x) { x[i] = x[i]+0 }; asort(x)

    # Now present 'a' ordered by index values
    print "Sorted by index"
    print "Index  a-index  a[a-index]"
    for (i=1; i<=count; i++) { printf("%3i.  %7i       %5i\n", i, x[i], a[x[i]]) }
}
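
Where asort() and asorti() are not available (they are gawk functions), one alternative is to hand the ordering to the external sort(1) command. This is a different technique from the script above, not a rewrite of it. The sketch below uses a four-element subset of the same sample data and assumes a POSIX sort on the PATH; the variable cmd is introduced only for the example.

```sh
#!/bin/sh
# Sort a small array by value by piping "index value" pairs through sort(1).
out=$(awk 'BEGIN {
    a[7] = 1667; a[471] = 6981; a[13] = 2064; a[8] = 4035
    cmd = "sort -k2,2n"                 # numeric sort on the second column (the value)
    for (i in a) print i, a[i] | cmd    # traversal order is unspecified; sort(1) fixes it
    close(cmd)                          # wait for sort to finish and flush its output
}')
printf '%s\n' "$out"
```

Changing the command to "sort -k1,1n" orders the same pairs by index instead of by value.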