Introduction
UTF-8 is a commonly used encoding for URIs, for example, https://zh.wikipedia.org/wiki/進擊的巨人. However, in the Servlet specification, the default request character encoding is ISO-8859-1. Browsers do not send a content-type header with an encoding for GET requests, so the servlet engine will try to decode the URI using ISO-8859-1 by default. This causes problems when the decoded URI is extracted for further processing.
In Spring MVC and Spring Security, the decoded URI is used in many places. One use is to match a RequestHandler in DispatcherServlet. If you use AntPathRequestMatcher, you may get unexpected results for request URIs with UTF-8 characters: some requests may not match a pattern that they should match.
For example, the URI /wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA is supposed to match the Ant pattern /wiki/*, but it does not. The reason is that, internally, Spring uses ISO-8859-1 to decode URIs. The decoded URI is /wiki/é²æçå·¨äºº (unprintable control characters omitted). Spring then tries to use the regular expression .* to match the last path segment. However, the match fails because the segment contains invalid characters.
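The difference between the two decodings can be reproduced outside Spring. The sketch below uses java.net.URLDecoder from the JDK as a stand-in for Spring's UriUtils.decode (an assumption made to keep the example dependency-free); the class name UriDecodingDemo and the helper decode are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriDecodingDemo {

    // Percent-decodes a path using the named charset.
    // Note: URLDecoder also turns '+' into a space, which a path decoder
    // normally would not do; the sample path contains no '+', so the
    // difference does not matter for this illustration.
    public static String decode(String encodedPath, String charsetName) {
        try {
            return URLDecoder.decode(encodedPath, charsetName);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, ex);
        }
    }

    public static void main(String[] args) {
        String encoded = "/wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA";

        // Every escaped byte becomes its own character under ISO-8859-1.
        String asLatin1 = decode(encoded, "ISO-8859-1");
        // Each three-byte sequence becomes a single character under UTF-8.
        String asUtf8 = decode(encoded, "UTF-8");

        System.out.println("ISO-8859-1 length: " + asLatin1.length());
        System.out.println("UTF-8 length:      " + asUtf8.length());
        System.out.println("UTF-8 result:      " + asUtf8);
    }
}
```

The same 15 escaped bytes yield 15 separate characters under ISO-8859-1 but only 5 under UTF-8, which is why the two decoded paths look nothing alike.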
Solutions
There are 3 different ways to solve the problem.
- Set the character encoding in the request with the content-type header
- Use a filter to manually set the character encoding for requests
- Set the default request character encoding to UTF-8 at the servlet engine level (web.xml)
The first solution may not work, because browsers do not set a content-type header with a character encoding for GET requests.
The second one requires an additional filter defined in web.xml. The filter does only one thing: it calls request.setCharacterEncoding("UTF-8") to explicitly set the character encoding to UTF-8.
The third solution might be the best choice. Add a request-character-encoding element to web.xml (supported since Servlet 4.0) and set it to UTF-8. It becomes the default character encoding for all requests. Individual requests can still override it by using the first or second solution.
<!-- in web.xml -->
<request-character-encoding>UTF-8</request-character-encoding>
A Little More Detail
In Spring, the URI decoding code is in UrlPathHelper:
// org.springframework.web.util.UrlPathHelper
public class UrlPathHelper {
    // ......
    private String decodeInternal(HttpServletRequest request, String source) {
        String enc = determineEncoding(request);
        try {
            return UriUtils.decode(source, enc);
        }
        catch (UnsupportedCharsetException ex) {
            if (logger.isWarnEnabled()) {
                logger.warn("Could not decode request string [" + source + "] with encoding '" + enc +
                        "': falling back to platform default encoding; exception message: " + ex.getMessage());
            }
            return URLDecoder.decode(source);
        }
    }
    // ......
}
determineEncoding checks the encoding of the current request first. If it is not set, the method falls back to the default character encoding ('ISO-8859-1').
protected String determineEncoding(HttpServletRequest request) {
    String enc = request.getCharacterEncoding();
    if (enc == null) {
        enc = getDefaultEncoding();
    }
    return enc;
}
According to the Servlet spec, request.getCharacterEncoding() checks whether the current request has an explicit character encoding. If not, it falls back to the servlet context setting (the request-character-encoding element in web.xml).
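The lookup order described above can be sketched as a plain function. This is an illustration only, not Spring or servlet container code: the class EncodingResolutionSketch and the method resolve are hypothetical, and the two parameters stand in for the request's explicit encoding and the request-character-encoding value from web.xml.

```java
public class EncodingResolutionSketch {

    // The fallback used when nothing else is configured, as described in the text.
    public static final String FALLBACK_ENCODING = "ISO-8859-1";

    // requestEncoding: encoding set explicitly on the request (content-type
    //                  header or setCharacterEncoding), or null if absent.
    // contextDefault:  value of request-character-encoding in web.xml,
    //                  or null if the element is not declared.
    public static String resolve(String requestEncoding, String contextDefault) {
        if (requestEncoding != null) {
            return requestEncoding;   // an explicit per-request setting wins
        }
        if (contextDefault != null) {
            return contextDefault;    // then the application-wide default
        }
        return FALLBACK_ENCODING;     // finally the ISO-8859-1 fallback
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));
        System.out.println(resolve(null, "UTF-8"));
        System.out.println(resolve("Shift_JIS", "UTF-8"));
    }
}
```

Under these assumptions, declaring request-character-encoding only changes the outcome for requests that carry no explicit encoding, which matches the behavior wanted for GET requests.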
Conclusion
With any of these solutions in place, the URI /wiki/%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA is decoded correctly as /wiki/進擊的巨人, and UTF-8 encoded URIs are handled correctly in Spring.