Getting Windows to save files downloaded from the web in UTF-8 format

The problem and the fix

My original batch code, which called PowerShell to fetch the JSON files, was:

rem %1 = name of the signature list (fetched as %1.json), %2 = output file to append to
set URL=https://ae911truth.org/signatures
powershell "(Invoke-WebRequest -Uri %URL%/%1.json -UseBasicParsing).Content" | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2

However, the data coming out of the .Content property appears to be encoded in an internal Windows format rather than UTF-8. The fix was to pipe it to Set-Content, which translates the internal encoding to UTF-8. Set-Content writes to a file rather than to a pipe, so I sent the output to _temp.txt and processed that file:

set URL=https://ae911truth.org/signatures
powershell "(Invoke-WebRequest -Uri %URL%/%1.json -UseBasicParsing).Content | Set-Content -Path _temp.txt"
sed -f _ae911.sed _temp.txt | awk -f _ae911.awk -v list=%1 >>%2
del _temp.txt

As processed by Linux

 23823 ASCII lines
     0 ISO-8859-1 lines
     0 Non-ISO-ASCII lines
  1175 UTF-8 lines
 24998 total

As processed by Windows, but without piping to “Set-Content”

  23821 ASCII lines
    377 ISO-8859-1 lines
    573 Non-ISO-ASCII lines
    227 UTF-8 lines
  24998 total

As processed by Windows, piping to “Set-Content”

  23823 ASCII lines
    453 ISO-8859-1 lines
    140 Non-ISO-ASCII lines
    582 UTF-8 lines
  24998 total
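
The tallies above do not say how each line was assigned to a category. As a rough illustration only, the Python sketch below sorts lines into the same four buckets under assumed rules: all bytes below 80 hex is ASCII, a valid UTF-8 sequence is UTF-8, any remaining line containing a byte in the range 80 to 9F is Non-ISO-ASCII, and everything else is ISO-8859-1. The function names and the rules are assumptions, not the tool that produced the counts.

```python
from collections import Counter


def classify_line(raw: bytes) -> str:
    """Bucket one line of raw bytes (assumed rules, see the text above)."""
    if all(b < 0x80 for b in raw):
        return "ASCII"
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 has no printable characters at 0x80-0x9F, so a byte in
    # that range suggests some other 8-bit code page.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "Non-ISO-ASCII"
    return "ISO-8859-1"


def tally(data: bytes) -> Counter:
    """Count how many lines of a byte string fall into each bucket."""
    return Counter(classify_line(line) for line in data.splitlines())


if __name__ == "__main__":
    sample = b"plain\ncaf\xc3\xa9\ncaf\xe9\n\xe2??\n\x93quoted\x94\n"
    for category, count in sorted(tally(sample).items()):
        print(f"{count:7d} {category} lines")
```

Under these rules a damaged sequence such as E2 3F 3F is counted as an ISO-8859-1 line, because it is no longer valid UTF-8 and contains no byte in the range 80 to 9F.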

Bad UTF-8

How did an en dash or em dash end up as E2 3F 3F?

  • E2 80 94 is UTF-8 for an em dash
  • E2 80 94 read as ISO-8859-1 is â undef undef (80 and 94 are not printable characters in ISO-8859-1)
  • E2 3F 3F is Windows-1252 for â??
  • Ergo, the em dash started out as UTF-8, was incorrectly read in as ISO-8859-1, and was stored in the database using Windows-1252 encoding as (literally) â??. And that’s how it’s coming back out.

How did É end up as C3 3F?

  • In UTF-8, É is encoded as C3 89 (the 0A that follows it in the sample is a linefeed)
  • C3 89 0A read as ISO-8859-1 is Ã undef <lf>
  • Ergo, É started out as UTF-8, was incorrectly read in as ISO-8859-1 with the invalid second byte being changed to ?, and was stored as (literally) Ã?. The trailing linefeed character was probably dropped.
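
The two derivations above can be reproduced with a short Python sketch. It assumes the damage happens in exactly two steps (the UTF-8 bytes are decoded as ISO-8859-1, then re-encoded as Windows-1252 with unmappable characters replaced by ?). The function name is hypothetical, and Python's codecs stand in for whatever software actually did the conversion.

```python
def misread_and_store(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 bytes as ISO-8859-1, then re-encode as Windows-1252.

    Characters that Windows-1252 cannot represent become '?' (0x3F).
    """
    misread = utf8_bytes.decode("iso-8859-1")           # one character per byte
    return misread.encode("cp1252", errors="replace")   # unmappable -> b"?"


em_dash = "\u2014".encode("utf-8")   # E2 80 94
e_acute = "\u00c9".encode("utf-8")   # C3 89

print(misread_and_store(em_dash).hex(" ").upper())   # E2 3F 3F
print(misread_and_store(e_acute).hex(" ").upper())   # C3 3F
```

Bytes 80, 94 and 89 decode to the control code points U+0080, U+0094 and U+0089, which have no Windows-1252 encoding, so each is replaced by 3F.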

UTF-8 to ISO-8859-1 to Windows-1252

Character                     UTF-8 (hex)   Read as ISO-8859-1    Converted to Windows-1252
· (middle dot)                C2 B7         Â ·                   ÷
À                             C3 80 0A      Ã <undef> <lf>        Ã?
à                             C3 A0 0A      Ã <nbsp> <lf>         Ã<nbsp>
É                             C3 89 0A      Ã <undef> <lf>        Ã?
é                             C3 A9 0A      Ã © <lf>              é
U+2009 (thin space)           E2 80 89      â <undef> <undef>     â??
U+2014 (em dash)              E2 80 94      â <undef> <undef>     â??
U+2018 (left single quote)    E2 80 98      â <undef> <undef>     â??
U+201C (left double quote)    E2 80 9C      â <undef> <undef>     â??
U+201D (right double quote)   E2 80 9D      â <undef> <undef>     â??
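
One way to read the table is that a character is damaged whenever one of its UTF-8 bytes falls in the range 80 to 9F, and passes through unchanged otherwise. The Python sketch below checks that reading for the accented letters and punctuation listed above. It assumes the same ISO-8859-1 then Windows-1252 round trip with ? substituted for unmappable characters, and the function name is hypothetical.

```python
def survives_round_trip(ch: str) -> bool:
    """True if the UTF-8 bytes of ch come back unchanged after being
    misread as ISO-8859-1 and re-encoded as Windows-1252."""
    raw = ch.encode("utf-8")
    stored = raw.decode("iso-8859-1").encode("cp1252", errors="replace")
    return stored == raw


# A-grave, a-grave, E-acute, e-acute, thin space, em dash, and three quotes
for ch in "\u00c0\u00e0\u00c9\u00e9\u2009\u2014\u2018\u201c\u201d":
    raw = ch.encode("utf-8")
    verdict = "intact" if survives_round_trip(ch) else "damaged"
    print(f"U+{ord(ch):04X}  {raw.hex(' ').upper():<8}  {verdict}")
```

In this model à (C3 A0) and é (C3 A9) come back byte for byte, while À, É, the thin space, the em dash and the curly quotes all lose their bytes in the 80 to 9F range.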