Getting Windows to save files downloaded from the web in UTF-8 format
The problem and the fix
My original PowerShell code for getting the JSON files was:
set URL=https://ae911truth.org/signatures
powershell "(Invoke-Webrequest -Uri %URL%/%1.json -UseBasicParsing).content" | sed -f _ae911.sed | awk -f _ae911.awk -v list=%1 >>%2
However, the data coming from the .content property appears to be encoded using an internal Windows format and not UTF-8. The fix was to use Set-Content to translate the internal encoding to UTF-8. Set-Content, however, really prefers writing to a file, so I wrote the output to _temp.txt and processed that:
set URL=https://ae911truth.org/signatures
powershell "(Invoke-Webrequest -Uri %URL%/%1.json -UseBasicParsing).content | Set-Content -Path _temp.txt"
sed -f _ae911.sed _temp.txt | awk -f _ae911.awk -v list=%1 >>%2
del _temp.txt
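For comparison only, here is a minimal Python sketch (not part of the batch file above) of the same download step that never decodes the response, so there is nothing to re-encode afterwards. It assumes the server's response really is valid UTF-8, which the Linux counts seem to suggest. The function names `fetch_bytes` and `save_utf8` are hypothetical.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Return the response body exactly as the server sent it (no decoding)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_utf8(raw: bytes, path: str) -> None:
    """Write raw bytes to path unchanged, after checking they are valid UTF-8."""
    # decode() is used only as a validity check; it raises UnicodeDecodeError
    # if the bytes are not well-formed UTF-8. The decoded text is discarded.
    raw.decode("utf-8")
    Path(path).write_bytes(raw)
```

A call along the lines of `save_utf8(fetch_bytes(url), "_temp.txt")`, with `url` built from the same base URL and list name as in the batch file, would then stand in for the PowerShell line.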
Processed by | ASCII lines | ISO-8859-1 lines | Non-ISO-ASCII lines | UTF-8 lines | Total |
---|---|---|---|---|---|
Linux | 23823 | 0 | 0 | 1175 | 24998 |
Windows, but without piping to Set-Content | 23821 | 377 | 573 | 227 | 24998 |
Windows, piping to Set-Content | 23823 | 453 | 140 | 582 | 24998 |
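The tool that produced these counts is not named here. As a rough illustration, the Python sketch below sorts lines into the same four categories under assumed rules: pure 7-bit lines are ASCII, other lines that decode cleanly are UTF-8, lines containing bytes in the 0x80-0x9F range are Non-ISO-ASCII, and the rest are ISO-8859-1. These rules and the function names are assumptions, so the counts need not match the ones above exactly.

```python
from collections import Counter


def classify_line(line: bytes) -> str:
    """Assign one line of bytes to one of four categories (assumed rules)."""
    if all(b < 0x80 for b in line):
        return "ASCII"
    try:
        line.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # Assumption: bytes 0x80-0x9F are not printable characters in ISO-8859-1,
    # so any line containing one is counted as "Non-ISO-ASCII".
    if any(0x80 <= b <= 0x9F for b in line):
        return "Non-ISO-ASCII"
    return "ISO-8859-1"


def count_categories(data: bytes) -> Counter:
    """Count the lines of a byte string per category."""
    return Counter(classify_line(line) for line in data.splitlines())
```

Under these rules a line containing the damaged sequence C3 3F counts as ISO-8859-1, while an intact C3 89 counts as UTF-8.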
Bad UTF-8
How did an en dash or em dash end up as E2 3F 3F?
- E2 80 94 is UTF-8 for an em dash
- E2 80 94 as ISO-8859-1 is â undef undef
- E2 3F 3F is Windows-1252 for â??
- Ergo, the em dash started out as UTF-8, was incorrectly read in as ISO-8859-1, and was stored in the database using Windows-1252 encoding as (literally) â??. And that’s how it’s coming back out.
How did É end up as C3 3F?
- In UTF-8, É is encoded as C3 89; in the data it is followed by a linefeed (0A), giving C3 89 0A
- C3 89 0A as ISO-8859-1 is Ã undef <lf>
- Ergo, É started out as UTF-8, was incorrectly read in as ISO-8859-1 with the invalid second byte being changed to ?, and stored as (literal) Ã?. The trailing linefeed character was probably dropped.
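The two walk-throughs above can be reproduced with a short Python sketch. It assumes the chain was exactly "UTF-8 bytes, read as ISO-8859-1, written as Windows-1252 with unmappable characters replaced by ?", and it uses Python's own `iso-8859-1` and `cp1252` codecs as stand-ins for whatever software actually did the conversion. The function name `trace` is hypothetical.

```python
def trace(char: str) -> dict:
    """Follow one character through the suspected mis-decoding chain."""
    utf8 = char.encode("utf-8")              # bytes as originally sent
    misread = utf8.decode("iso-8859-1")      # each byte taken as one character
    # Characters with no Windows-1252 equivalent are replaced by "?" (0x3F).
    stored = misread.encode("cp1252", errors="replace")
    return {
        "utf8": utf8.hex(" ").upper(),
        "stored": stored.hex(" ").upper(),
    }


for ch in ("—", "É"):
    print(ch, trace(ch))
# — {'utf8': 'E2 80 94', 'stored': 'E2 3F 3F'}
# É {'utf8': 'C3 89', 'stored': 'C3 3F'}
```

In this model the lead byte survives because â and Ã exist in both encodings. The continuation bytes 80, 89 and 94 become characters that Python's Windows-1252 codec cannot encode, so each one turns into a question mark.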
UTF-8 to ISO-8859-1 to Windows-1252
Character | UTF-8 (hex) | Hex as ISO-8859-1 | Converted to Windows-1252 |
---|---|---|---|
· | C2 B7 | Â · | Â· |
À | C3 80 0A | Ã <undef> <lf> | Ã? |
à | C3 A0 0A | Ã <nbsp> <lf> | Ã<nbsp> |
É | C3 89 0A | Ã <undef> <lf> | Ã? |
é | C3 A9 0A | Ã © <lf> | Ã© |
’ | E2 80 99 | â <undef> <undef> | â?? |
— | E2 80 94 | â <undef> <undef> | â?? |
‘ | E2 80 98 | â <undef> <undef> | â?? |
“ | E2 80 9C | â <undef> <undef> | â?? |
” | E2 80 9D | â <undef> <undef> | â?? |
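One consequence of the table is that the damage cannot always be undone, because several different characters end up as the same stored bytes. The Python sketch below groups the table's characters by their converted form, under the same assumed chain (UTF-8 read as ISO-8859-1, then written as Windows-1252 with ? for anything unmappable). The helper names `mangle` and `collisions` are hypothetical.

```python
from collections import defaultdict


def mangle(char: str) -> bytes:
    """UTF-8 bytes, misread as ISO-8859-1, then written as Windows-1252."""
    misread = char.encode("utf-8").decode("iso-8859-1")
    return misread.encode("cp1252", errors="replace")


def collisions(chars: str) -> dict:
    """Map each converted byte string (as hex) to the characters producing it."""
    groups = defaultdict(list)
    for ch in chars:
        groups[mangle(ch).hex(" ").upper()].append(ch)
    return dict(groups)


for stored, originals in collisions("·ÀàÉé’—‘“”").items():
    print(stored, originals)
```

In this model À and É both come out as C3 3F, and all five punctuation marks come out as E2 3F 3F, so those originals cannot be told apart afterwards. The characters ·, à and é keep their original bytes.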