I was working on a small project for parsing Maven POM files.
The problem I ran into was that not all POM XML files are encoded in UTF-8. Most of them are, but some ancient projects, and very old versions of popular projects, don't strictly stick to UTF-8. I saw both ISO-8859-1 and Windows-1252 in the POM files I tested, and there might be more. For example:
<?xml version="1.0" encoding="ISO-8859-1"?>
Go doesn't directly support decoding or unmarshaling XML in a character encoding other than UTF-8. If you try, you get the error below:
xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil
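The error is easy to reproduce with the standard library alone. The snippet below is a minimal sketch: the project struct and the parsePOM helper are hypothetical names made up for illustration, and the XML is a trimmed-down stand-in for a real POM.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// project is a hypothetical, minimal stand-in for a POM document.
type project struct {
	ArtifactID string `xml:"artifactId"`
}

// parsePOM unmarshals data with the default decoder settings,
// that is, without a CharsetReader.
func parsePOM(data []byte) (project, error) {
	var p project
	err := xml.Unmarshal(data, &p)
	return p, err
}

func main() {
	data := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?>
<project><artifactId>demo</artifactId></project>`)

	_, err := parsePOM(data)
	// Prints the "declared but Decoder.CharsetReader is nil" error shown above.
	fmt.Println(err)
}
```

The same document with encoding="UTF-8" in the declaration parses without an error.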
However, the decoder lets you supply a custom Reader for non-UTF-8 encodings. Below is the relevant code from the XML decoder in the Go standard library:
// encoding/xml/xml.go
enc := procInst("encoding", content)
if enc != "" && enc != "utf-8" && enc != "UTF-8" && !strings.EqualFold(enc, "utf-8") {
	if d.CharsetReader == nil {
		d.err = fmt.Errorf("xml: encoding %q declared but Decoder.CharsetReader is nil", enc)
		return nil, d.err
	}
	newr, err := d.CharsetReader(enc, d.r.(io.Reader))
	if err != nil {
		d.err = fmt.Errorf("xml: opening charset %q: %v", enc, err)
		return nil, d.err
	}
	if newr == nil {
		panic("CharsetReader returned a nil Reader for charset " + enc)
	}
	d.switchToReader(newr)
}
The decoder calls Decoder.CharsetReader with the declared encoding name and the original input, and the function should return a Reader that yields the same content converted to UTF-8.
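To make the hook concrete, here is a minimal sketch of a hand-written CharsetReader that handles only ISO-8859-1, using nothing but the standard library. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point. The names latin1Reader, charsetReader and decodeName are hypothetical, and this is an illustration of the mechanism, not the approach this post ends up using.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1Reader converts ISO-8859-1 input to UTF-8 one byte at a time.
type latin1Reader struct {
	src *bufio.Reader
	buf bytes.Buffer // UTF-8 bytes not yet handed to the caller
}

func (r *latin1Reader) Read(p []byte) (int, error) {
	if r.buf.Len() == 0 {
		b, err := r.src.ReadByte()
		if err != nil {
			return 0, err
		}
		// An ISO-8859-1 byte value is also its Unicode code point.
		r.buf.WriteRune(rune(b))
	}
	return r.buf.Read(p)
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
func charsetReader(charset string, input io.Reader) (io.Reader, error) {
	if strings.EqualFold(charset, "ISO-8859-1") {
		return &latin1Reader{src: bufio.NewReader(input)}, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", charset)
}

// decodeName extracts the <name> child of the root element.
func decodeName(data []byte) (string, error) {
	var p struct {
		Name string `xml:"name"`
	}
	d := xml.NewDecoder(bytes.NewReader(data))
	d.CharsetReader = charsetReader
	err := d.Decode(&p)
	return p.Name, err
}

func main() {
	// "Caf\xe9" is "Café" encoded as ISO-8859-1.
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n" +
		"<project><name>Caf\xe9</name></project>")
	name, err := decodeName(data)
	fmt.Println(name, err) // prints: Café <nil>
}
```

Writing a converter like this for every charset does not scale, which is why a lookup by charset name is the more practical route.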
According to the W3C Recommendation for XML (Extensible Markup Language (XML) 1.0 (Fifth Edition)), encodings registered with IANA should be referred to by their registered names in IANA-CHARSETS.
The next question is: how do we create a Reader from the encoding (charset) name?
Fortunately, there is a Go package for this, golang.org/x/text/encoding/ianaindex. It provides a method that returns the Encoding for an IANA charset name:
// Encoding returns an Encoding for IANA-registered names. Matching is case-insensitive.
func (x *Index) Encoding(name string) (encoding.Encoding, error)
In this package, the variable IANA is the Index instance for IANA name mappings:
// IANA is an index that supports all names and aliases using IANA names as
// the canonical identifier.
IANA *Index = iana
We can call ianaindex.IANA.Encoding to retrieve the Encoding for the charset name, and wrap the input in its decoder:
decoder.CharsetReader = func(charset string, reader io.Reader) (io.Reader, error) {
	enc, err := ianaindex.IANA.Encoding(charset)
	if err != nil {
		return nil, fmt.Errorf("charset %s: %s", charset, err.Error())
	}
	return enc.NewDecoder().Reader(reader), nil
}
So far so good. But soon I hit another issue: I got a panic when parsing an XML document with the encoding US-ASCII:
<?xml version="1.0" encoding="US-ASCII"?>
panic: runtime error: invalid memory address or nil pointer dereference
It happened at this line: return enc.NewDecoder().Reader(reader), nil. What was going on? There was no error, but ianaindex.IANA.Encoding(charset) returned a nil Encoding for US-ASCII. It turns out the Go package doesn't implement every charset in the IANA registry. For charset names that are valid but unsupported, the Encoding method returns nil. There is a bug filed on GitHub for this issue.
As a workaround, I decided to assume that any unsupported encoding declared in an XML file is compatible with (or a subset of) UTF-8, and to pass the input through unchanged. US-ASCII is indeed a subset of UTF-8. The assumption may be wrong for other encodings, which you may have to handle differently.
decoder.CharsetReader = func(charset string, reader io.Reader) (io.Reader, error) {
	e, err := ianaindex.IANA.Encoding(charset)
	if err != nil {
		return nil, fmt.Errorf("encoding %s: %s", charset, err.Error())
	}
	if e == nil {
		// Assume it's compatible with (a subset of) UTF-8 encoding
		// Bug: https://github.com/golang/go/issues/19421
		return reader, nil
	}
	return e.NewDecoder().Reader(reader), nil
}
Putting it all together, to unmarshal XML data in Go: