How to Parse XML with Non-UTF8 Encoding in Go

I was working on a small project that parses Maven POM files.

The problem I encountered was that not all POM XML files are encoded in UTF-8. Most of them are, but some ancient projects, and very old versions of popular projects, don’t stick strictly to UTF-8.

I saw both ISO-8859-1 and Windows-1252 in the POM files I tested, and there may be more. A typical declaration looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>

Go doesn’t directly support decoding / unmarshaling XML in a character encoding other than UTF-8. If you try, you get the following error:

xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil
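The error is easy to reproduce with the standard library alone. The following is a minimal sketch; the `project` struct and the `parseDefault` helper are hypothetical names used only for illustration, and the struct is a cut-down stand-in for a real POM.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// project is a hypothetical, cut-down stand-in for a POM document.
type project struct {
	Name string `xml:"name"`
}

// parseDefault unmarshals data with the default decoder settings,
// that is, without a CharsetReader.
func parseDefault(data []byte) (project, error) {
	var p project
	err := xml.Unmarshal(data, &p)
	return p, err
}

func main() {
	data := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?><project><name>demo</name></project>`)
	_, err := parseDefault(data)
	// The decoder rejects the document as soon as it reads the declaration.
	fmt.Println(err)
}
```

Note that the document body here is plain ASCII. The decoder fails on the declared encoding name alone, before it looks at any content.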

But it lets the user provide a custom Reader for non-UTF-8 encodings. Below is the relevant code from the XML decoder in the Go standard library:

// encoding/xml/xml.go
enc := procInst("encoding", content)
if enc != "" && enc != "utf-8" && enc != "UTF-8" && !strings.EqualFold(enc, "utf-8") {
	if d.CharsetReader == nil {
		d.err = fmt.Errorf("xml: encoding %q declared but Decoder.CharsetReader is nil", enc)
		return nil, d.err
	}
	newr, err := d.CharsetReader(enc, d.r.(io.Reader))
	if err != nil {
		d.err = fmt.Errorf("xml: opening charset %q: %v", enc, err)
		return nil, d.err
	}
	if newr == nil {
		panic("CharsetReader returned a nil Reader for charset " + enc)
	}
	d.switchToReader(newr)
}

The decoder calls Decoder.CharsetReader with the declared encoding name and the current input, and the function should return a Reader that yields the same content converted to UTF-8.

According to the W3C Recommendation for XML (Extensible Markup Language (XML) 1.0 (Fifth Edition)), an encoding registered with IANA should be referred to in XML by its registered name in IANA-CHARSETS.
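To make that contract concrete, here is a minimal sketch of a CharsetReader that handles a single encoding by hand, using only the standard library. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point. The names `latin1Reader`, `charsetReader`, and `parseName` are hypothetical, and this is only an illustration of what the hook has to do, not a general solution.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1Reader converts ISO-8859-1 bytes read from src into UTF-8.
type latin1Reader struct {
	src io.Reader
	out bytes.Buffer // converted bytes not yet handed to the caller
}

func (r *latin1Reader) Read(p []byte) (int, error) {
	// Refill the output buffer until there is something to return.
	for r.out.Len() == 0 {
		var in [512]byte
		n, err := r.src.Read(in[:])
		for _, b := range in[:n] {
			// A Latin-1 byte value is also its Unicode code point.
			r.out.WriteRune(rune(b))
		}
		if n == 0 && err != nil {
			return 0, err
		}
	}
	return r.out.Read(p)
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
func charsetReader(charset string, input io.Reader) (io.Reader, error) {
	if strings.EqualFold(charset, "ISO-8859-1") {
		return &latin1Reader{src: input}, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", charset)
}

// parseName decodes the <name> element of a small XML document.
func parseName(data []byte) (string, error) {
	var p struct {
		Name string `xml:"name"`
	}
	dec := xml.NewDecoder(bytes.NewReader(data))
	dec.CharsetReader = charsetReader
	err := dec.Decode(&p)
	return p.Name, err
}

func main() {
	// 0xE9 is "é" in ISO-8859-1; on its own it is not valid UTF-8.
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><project><name>caf\xe9</name></project>")
	name, err := parseName(data)
	fmt.Println(name, err)
}
```

Writing a converter like this for every possible charset is not practical, which is why a lookup by encoding name is the more useful approach.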

The next question is, how do we create a Reader from the encoding name or charset name?

Fortunately, there is a Go package, golang.org/x/text/encoding/ianaindex, which provides a function that returns the Encoding for an IANA charset name.

// Encoding returns an Encoding for IANA-registered names. Matching is case-insensitive. 
func (x *Index) Encoding(name string) (encoding.Encoding, error)

In the package, the variable IANA is the Index that holds the IANA name mappings:

// IANA is an index that supports all names and aliases using IANA names as
// the canonical identifier.
IANA *Index = iana

We can call ianaindex.IANA.Encoding to look up the Encoding by charset name and use it to build the Reader:

decoder.CharsetReader = func(charset string, reader io.Reader) (io.Reader, error) {
	enc, err := ianaindex.IANA.Encoding(charset)
	if err != nil {
		return nil, fmt.Errorf("charset %s: %w", charset, err)
	}
	return enc.NewDecoder().Reader(reader), nil
}

So far so good. But soon I hit another issue: I got a panic when parsing an XML file that declares the encoding US-ASCII:

<?xml version="1.0" encoding="US-ASCII"?>
panic: runtime error: invalid memory address or nil pointer dereference

It happened at this line: return enc.NewDecoder().Reader(reader), nil. What’s happening? There was no error, but ianaindex.IANA.Encoding(charset) returned a nil Encoding for US-ASCII. It turns out the package doesn’t implement every charset / encoding it knows by name. For a valid but unsupported charset name, the Encoding function returns nil without an error. There is an issue on GitHub about this.

As a workaround, I decided to assume that any unsupported encoding declared in an XML file is compatible with (or a subset of) UTF-8; US-ASCII is indeed a subset of UTF-8. The assumption may be wrong in other cases, so you may have to handle other encodings differently.

decoder.CharsetReader = func(charset string, reader io.Reader) (io.Reader, error) {
	e, err := ianaindex.IANA.Encoding(charset)
	if err != nil {
		return nil, fmt.Errorf("encoding %s: %w", charset, err)
	}
	if e == nil {
		// Assume it's compatible with (a subset of) UTF-8 encoding
		// Bug: https://github.com/golang/go/issues/19421
		return reader, nil
	}
	return e.NewDecoder().Reader(reader), nil
}

Putting it all together, to unmarshal XML data in Go: