The Definitive Guide to HTML, URL and Javascript Escaping

by @jehiah on 2010-08-27 19:01UTC
Filed under: All , Javascript , Python , HTML , Security , Programming

One thing I’ve learned in building web applications over the past 10 years is that building them securely and guarding against script injection attacks is hard. really hard.

The problem is that it’s easy to forget where you need to escape, it’s easy to be lazy, and it’s hard to know some of the edge cases. On top of this most web application frameworks do a bad job, or make it hard to do things the “correct” way.

In general though, escaping is simple: always escape appropriate to the context. This means you should always, and I mean ALWAYS, find yourself escaping for XHTML context, escaping for including a string in a URL, and escaping for a javascript context. Any time you output the contents of a variable, escape it.

XHTML escaping

The most common times you are dealing with building html is in a server side context, or in javascript when you are building raw html. In this context you care about escaping anything that breaks start/closing html tags, and anything that can break out of quoted attributes, and anything that can break the html escape character &.

A good mental check is to think about how to break this code:

<input value="{{value}}">

These are the things you should escape:

& ==> &amp;
" ==> &quot;
> ==> &gt;
< ==> &lt;
' ==> &#39; (optional)

If you are not careful about always building your html string with double quoted attributes, you should also escape the single quote (better safe than sorry).

In python you can accomplish this escape sequence with

xml.sax.saxutils.escape(value, {'"': "&quot;"})

In javascript you can do this with

string.replace('&', "&amp;").replace('"', "&quot;").replace("'", "&#39;").replace('>', "&gt;").replace('<', "&lt;")

javascript escaping

As part of web applications you are always outputting data into a <script> tag. It’s important that all the values you output are escaped to prevent breaking out of quoted strings, and from breaking the closing </script> tag.

A good mental check is to think about breaking this code:

<script>
    var a = "{{value}}";
</script>

Typically this is done by JSON encoding all data output into the script tag.

<script>
    var a = {{json_encode(value)}};
</script>

Unfortunately this isn’t sufficient. It really surprises people, myself included, that a properly escaped JSON string in a script tag that contains </script> will break processing of the script block.

For example, take this properly JSON encoded string:

<script>
    var a = "</script><script>alert('and now i control your page');</script>";
</script>

Notice you can break the script and double quoted string sandbox without a double quote!

In a page served as XML it’s possible to guard against this by making the whole script tag a PDATA block where you apply html escaping to the whole script contents, but in XHTML/HTML mode that is not an option.

The way to solve this is using backslash escape sequence for the forward slash. (Note: some JSON encoders do this by default, some do not.)

/ ==> \/

However you can’t directly apply this substitution after JSON encoding as this would break any existing backslash escapes. The following substitution can however be used as a post-processing step after JSON encoding a string

</ ==> <\/

In python you can do this with:

return json.dumps(value).replace("</", "<\\/")

URL escaping

Building URLs is tricky even though it seems very simple at first.

A good mental check is to think of how to keep any of these from working as expected.

url = "http://test.example/blog/" + post_name 
url = "http://test.example/page?msg=" + message

Always think of your values as having #, ?, ;, &, ../, % or any whitespace character. Any of these need to be % escaped.

In javascript use encodeURIComponent() (note: do not use escape() for this)

In python use urllib.quote() or urllib.urlencode()

Also note that when outputting a URL into html it also needs to be html encoded. remember to think of the url as containing <, >, ", ', & which are all things that need to be escaped in a html context.

This means that you need to use xhtml ampersand encoding and url percent escaping together:

<a href="{{xhtml_escape("http://this.example/" + urllib.quote(value) )}}>...</a>

I’ve not tried to describe how null characters or non printable characters like form feeds come into play. You may find it useful to escape unicode characters 0 - 20 from all user input re.sub(r"[\x00-\x20]+", " ", value).

If you have any comments, or find anything I’ve missed, let me know. If you are looking for a web framework that makes escaping straight forward, check out Tornado

More Resources:

Subscribe via RSS ı Email
Jehiah Czebotar