How to Handle UTF-8 URIs in Spring MVC

Introduction

UTF-8 is commonly used to encode non-ASCII characters in URIs, for example https://zh.wikipedia.org/wiki/進擊的巨人. However, the Servlet specification defines ISO-8859-1 as the default request character encoding. Because browsers do not send a Content-Type header with a character encoding for GET requests, the servlet engine falls back to ISO-8859-1 when decoding the URI. This causes problems when the decoded URI is extracted for further processing.

In Spring MVC and Spring Security, the decoded URI is used in many places. One use is matching a request handler in DispatcherServlet. If you use AntPathRequestMatcher, you may get unexpected results for request URIs that contain UTF-8 characters: some requests do not match patterns that they should match.

For example, the URI /wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA is supposed to match the Ant pattern /wiki/*, but it does not. The reason is that Spring internally decodes the URI with ISO-8859-1, so each UTF-8 byte becomes a separate character and the result is a garbled string instead of /wiki/進擊的巨人. Spring then tries to match the last segment with the regular expression .*, and the match fails because the garbled string contains invalid characters (ISO-8859-1 maps the bytes 0x80 to 0x9F to control characters).
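The effect of the wrong charset can be reproduced outside of Spring. The following is a minimal sketch, assuming Java 10 or later (for URLDecoder.decode(String, Charset)); the class and method names are made up for illustration and are not part of Spring.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Spring or the Servlet API.
final class UriDecodingDemo {

    // The percent-encoded path from the example above.
    static final String ENCODED_PATH =
            "/wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA";

    // Percent-decodes the path and interprets the resulting bytes in the given charset.
    // Note: URLDecoder also turns '+' into a space; the sample path contains no '+'.
    static String decode(String encodedPath, Charset charset) {
        return URLDecoder.decode(encodedPath, charset);
    }

    // Counts characters in the C1 control range U+0080..U+009F.
    static long countC1Controls(String s) {
        return s.chars().filter(c -> c >= 0x80 && c <= 0x9F).count();
    }

    public static void main(String[] args) {
        String asUtf8 = decode(ENCODED_PATH, StandardCharsets.UTF_8);
        String asLatin1 = decode(ENCODED_PATH, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());            // 11: "/wiki/" plus 5 characters
        System.out.println(asLatin1.length());          // 21: "/wiki/" plus 15 single-byte characters
        System.out.println(countC1Controls(asLatin1));  // 5 control characters in the garbled string
        System.out.println(asUtf8.equals(asLatin1));    // false
    }
}
```

The same 15 bytes produce five characters under UTF-8 and fifteen under ISO-8859-1, five of which are control characters.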

Solutions

There are three ways to solve the problem.

  1. Set the character encoding on the request with a Content-Type header
  2. Use a filter to manually set the character encoding for requests
  3. Set the default request character encoding to UTF-8 at the servlet engine level (web.xml)

The first solution may not work, because browsers do not set a Content-Type header with a character encoding for GET requests.

The second solution requires an additional filter defined in web.xml. The filter does only one thing: it calls request.setCharacterEncoding("UTF-8") to explicitly set the character encoding to UTF-8.
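One way to get such a filter without writing it by hand is Spring's own org.springframework.web.filter.CharacterEncodingFilter. The fragment below is a sketch, not the author's configuration: the filter name and the /* mapping are assumptions, and the mapping should appear before any filter that inspects the request path so that the encoding is set first.

```xml
<!-- in web.xml; the filter name and URL pattern are illustrative -->
<filter>
    <filter-name>characterEncodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>characterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```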

The third solution is probably the best choice. Add a request-character-encoding element (available since Servlet 4.0) to web.xml and set it to UTF-8. It becomes the default character encoding for all requests. Individual requests can still override it by using the first or second solution.

<!-- in web.xml -->
<request-character-encoding>UTF-8</request-character-encoding>

A Little More Detail

In Spring, the URI decoding code lives in UrlPathHelper:

// org.springframework.web.util.UrlPathHelper
public class UrlPathHelper {
	// ...... 
	private String decodeInternal(HttpServletRequest request, String source) {
		String enc = determineEncoding(request);
		try {
			return UriUtils.decode(source, enc);
		}
		catch (UnsupportedCharsetException ex) {
			if (logger.isWarnEnabled()) {
				logger.warn("Could not decode request string [" + source + "] with encoding '" + enc +
						"': falling back to platform default encoding; exception message: " + ex.getMessage());
			}
			return URLDecoder.decode(source);
		}
	}
	// ......
}

determineEncoding checks the encoding of the current request first. If it is not set, it uses the default character encoding (ISO-8859-1).

	protected String determineEncoding(HttpServletRequest request) {
		String enc = request.getCharacterEncoding();
		if (enc == null) {
			enc = getDefaultEncoding();
		}
		return enc;
	}

According to the Servlet specification, request.getCharacterEncoding() first checks whether the current request has an explicit character encoding. If not, it falls back to the servlet context setting (the request-character-encoding element in web.xml).
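The lookup order described above can be modeled in plain Java. This is a simplified sketch, assuming Java 10 or later; the class, its methods, and the null-means-unset convention are hypothetical stand-ins for the request, the servlet context, and UrlPathHelper, not their real APIs.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

// Hypothetical model of the fallback chain; not Spring or Servlet API code.
final class EncodingResolutionSketch {

    // The default that UrlPathHelper falls back to when nothing else is set.
    static final String HELPER_DEFAULT = "ISO-8859-1";

    // Resolution order: explicit request encoding, then the context-wide
    // default (request-character-encoding in web.xml), then the helper default.
    // A null argument means "not set".
    static String resolveEncoding(String requestEncoding, String contextDefault) {
        if (requestEncoding != null) {
            return requestEncoding;
        }
        if (contextDefault != null) {
            return contextDefault;
        }
        return HELPER_DEFAULT;
    }

    // Decodes a percent-encoded path with whichever encoding was resolved.
    static String decodePath(String encodedPath, String requestEncoding, String contextDefault) {
        Charset charset = Charset.forName(resolveEncoding(requestEncoding, contextDefault));
        return URLDecoder.decode(encodedPath, charset);
    }

    public static void main(String[] args) {
        String path = "/wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA";
        System.out.println(resolveEncoding(null, null));             // ISO-8859-1
        System.out.println(resolveEncoding(null, "UTF-8"));          // UTF-8
        System.out.println(decodePath(path, null, null).length());   // 21 (garbled)
        System.out.println(decodePath(path, null, "UTF-8").length()); // 11 (correct)
    }
}
```

With no encoding configured anywhere, the path decodes to the garbled 21-character string; setting only the context-wide default is enough to get the correct 11-character result.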

Conclusion

If we use any of these solutions, the URI /wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA is decoded correctly as /wiki/進擊的巨人.

We can handle UTF-8 encoded URIs correctly in Spring.