New program: count-http-requests

While I use the Pi-hole ad blocker and, in Firefox, the NoScript and uMatrix add-ons to cut down on the number of ads I get when browsing the web, from time to time I like to see just how much traffic they save me. On Wednesday 21 August I wrote a little awk program that counts HTTP requests and the hosts they connect to, and gives me a listing.

Here’s the program:

#!/usr/bin/awk -f
##---------------------------------------------------------------------------##
#   Program: count-http-requests
#   Author:  Brian  <genius@groupbcl.ca> :)
#   Date:    August 2019
#
#   Reads an input file created from a Firefox Developer Tools Console
#   trace, counts URLs and their success or failure, and displays a
#   list of URLs, counts, and totals.
#
#   Requires GNU awk (gawk): the three-argument form of match() used
#   below is a gawk extension.
#
#   As of August 2019, do the following to create an input file and run
#   this program:
#   * Start Firefox
#   * Press F12 to open the Developer Tools
#   * In the Developer Tools pane or window, click "Console"
#   * Leave only "Requests" turned on; turn off Errors, Warnings, Logs,
#     Info, Debug, CSS, and XHR.
#   * Navigate to the URL of interest
#   * When the page finishes loading, right-click on the main body of the
#     developer tools window and select "Export visible messages to
#     clipboard"
#   * Edit a file (for example, "/r/http-requests.A.text") and paste the
#     clipboard contents into it.
#   * Run this program as follows:
#       count-http-requests FILENAME | sort | cut -f2 | less
##---------------------------------------------------------------------------##
# BUUS: This script is part of Brian's Useful Utilities Set
BEGIN { h_count=0; t_count_all=0; t_count_succeed=0; t_count_fail=0; max_h_len=0 }

# On a GET or POST request, get the FQDN and increment its request count
match($0, /^(GET|POST) *https?:\/\/([^\/]+)/, a) {
    t_count_all++

    # Reverse the host name (e.g. host.domain.tld --> tld.domain.host) to
    # group all hosts in a given domain together
    count = split(a[2], b, /\./)
    host = ""
    for (i=count; i>0; i--) { host = host (i==count ? "" : ".") b[i] }

    # Initialise counters if we haven't seen this host before
    if (!(host in h_name)) {
        h_count++
        h_name[host] = a[2]
        h_count_all[host] = 0
        h_count_succeed[host] = 0
        h_count_fail[host] = 0
    }
    h_count_all[host]++
    # Track the longest host name so the END block can align its columns
    if (length(a[2]) > max_h_len) max_h_len = length(a[2])
}

# On an HTTP response, count success or fail
match($0, /\[HTTP\/... ([0-9][0-9][0-9])/, a) {
    if (a[1] < 400) {
        h_count_succeed[host]++
        t_count_succeed++
    } else {
        h_count_fail[host]++
        t_count_fail++
    }
}

# Display results
END {
    for (host in h_name) {
        i = h_count_all[host] - (h_count_succeed[host] + h_count_fail[host])
        x = ""
        if (h_count_succeed[host])  x = h_count_succeed[host] " succeeded"
        if (h_count_fail[host]) x = x (x ? ", " : "" ) h_count_fail[host] " failed"
        if (i)                  x = x (x ? ", " : "" ) i " never connected"
        printf("%s\t%-" max_h_len+1 "s total %i; %s\n", host, h_name[host] ":",
            h_count_all[host], x)
    }

    print "z\t  Total " t_count_all " requests to " h_count " individual hosts: " \
        t_count_succeed " succeeded, " \
        t_count_fail " failed, " \
        t_count_all - (t_count_succeed + t_count_fail) " never connected"
}
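
The host-name reversal is what lets the `sort | cut -f2` pipeline group all hosts of a domain together: the reversed name is the sort key, and `cut` throws it away afterwards. Here is a minimal sketch of just that step, run from the shell. The names `group_by_domain` and `reverse_host` are made up for this illustration and are not part of the program above.

```shell
#!/bin/sh
# group_by_domain (hypothetical name): reads one host name per line and
# prints the names sorted so that hosts in the same domain are adjacent.
group_by_domain() {
    awk '
    # reverse_host (hypothetical helper): host.domain.tld -> tld.domain.host
    function reverse_host(name,    parts, n, i, out) {
        n = split(name, parts, /\./)
        out = parts[n]
        for (i = n - 1; i > 0; i--) out = out "." parts[i]
        return out
    }
    # Emit "reversed<TAB>original"; the reversed name is only a sort key
    { print reverse_host($0) "\t" $0 }
    ' | sort | cut -f2
}

printf '%s\n' i.dailymail.co.uk ib.adnxs.com www.dailymail.co.uk acdn.adnxs.com |
    group_by_domain
```

The two adnxs.com hosts come out together, followed by the two dailymail.co.uk hosts, even though the input interleaves them.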

The Daily Mail is possibly the worst site on the internet for advertising signal-to-noise ratio. For people running with no ad blockers at all, its front page makes nearly 400 individual requests to nearly a hundred different hosts. A lot of the additional requests are due to cascading JavaScript, where one JavaScript program makes requests that connect to additional sites and fetch more JavaScript, which in turn does the same thing ...

ad.360yield.com: total 2; 2 succeeded
acdn.adnxs.com: total 2; 2 succeeded
ib.adnxs.com: total 4; 4 succeeded
    ... (368) lines deleted ...
creative.dailymail.co.uk: total 1; 1 succeeded
crta.dailymail.co.uk: total 4; 4 succeeded
dailymail.co.uk: total 1; 1 succeeded
i.dailymail.co.uk: total 63; 63 succeeded
scripts.dailymail.co.uk: total 1; 1 succeeded
video.dailymail.co.uk: total 4; 4 succeeded
www.dailymail.co.uk: total 23; 23 succeeded
  Total 378 requests to 94 individual hosts: 377 succeeded, 1 failed, 0 never connected

With Firefox’s uMatrix and NoScript add-ons disabled, leaving only Pi-hole to block advertising sites, I got the following:

adservice.google.ca: total 2; 2 never connected
ad.360yield.com: total 1; 1 never connected
acdn.adnxs.com: total 2; 2 succeeded
    ... (183) lines deleted ...
creative.dailymail.co.uk: total 1; 1 never connected
crta.dailymail.co.uk: total 2; 2 succeeded
dailymail.co.uk: total 1; 1 succeeded
i.dailymail.co.uk: total 62; 10 succeeded, 52 never connected
scripts.dailymail.co.uk: total 1; 1 never connected
video.dailymail.co.uk: total 4; 4 succeeded
www.dailymail.co.uk: total 24; 7 succeeded, 17 never connected
  Total 193 requests to 57 individual hosts: 43 succeeded, 1 failed, 149 never connected

With Pi-hole ad blocking plus uMatrix and NoScript all enabled, loading the site looks like this:

d3tsytm1wtjqo2.cloudfront.net: total 6; 6 succeeded
dailymail.co.uk: total 1; 1 succeeded
i.dailymail.co.uk: total 56; 56 succeeded
scripts.dailymail.co.uk: total 1; 1 succeeded
video.dailymail.co.uk: total 2; 2 succeeded
www.dailymail.co.uk: total 20; 20 succeeded
  Total 86 requests to 6 individual hosts: 86 succeeded, 0 failed, 0 never connected

Like I said, the Daily Mail is probably the worst offender on the web. Here are a couple of other sites:

| Site | Full blocking | No blocking |
|---|---|---|
| cbc.ca/news | 65 requests (8 hosts): 65 / 0 / 0 | 190 requests (48 hosts): 189 / 0 / 1 |
| universetoday.com | 46 requests (8 hosts): 45 / 0 / 1 | 135 requests (40 hosts): 134 / 0 / 1 |
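
To put the request totals quoted in this post on one scale, here is a rough sketch that works out how many requests full blocking avoids for each site. The function name `saved_requests` and the three-column input layout are assumptions made for this example; the numbers are the totals reported above.

```shell
#!/bin/sh
# saved_requests (hypothetical name). Input columns, separated by spaces:
#   site   requests-with-no-blocking   requests-with-full-blocking
saved_requests() {
    awk '{
        saved = $2 - $3
        printf "%s: %d fewer requests (%.1f%% saved)\n", $1, saved, 100 * saved / $2
    }'
}

printf '%s\n' \
    'dailymail.co.uk 378 86' \
    'cbc.ca/news 190 65' \
    'universetoday.com 135 46' | saved_requests
```

This only compares request counts. It says nothing about bytes transferred, which the console export used here does not capture.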