Sunday, 8 September 2013

NEVER sanitize your inputs!

I've seen this cartoon being linked-to in so many comment threads and forums. Anytime its even a little bit applicable, someone will post a link to this cartoon. It has become so pervasive that if you search Google for "327", it’ll be the third link returned, right after the Wikipedia pages for the year and the car.

Search "328" and the next XKCD is no-where to be seen.

The lesson, according to this character and so many real people on the internet, is to sanitize your inputs. The school in the cartoon didn't sanitize its inputs - and one of its database tables got deleted!

Ask anyone about developing websites and they will tell you the first lesson is always to sanitize your inputs. In this day and age you'd have to be crazy not to sanitize your inputs.

Trouble is, sanitizing your inputs is very bad advice.

What went wrong at the school?


A quick aside for what's going on in this cartoon. A new student named...
      Robert'); DROP TABLE Students; --
... joins a school and the administrators dutifully add a record to their database for the new student. The software takes the new student's name and builds an SQL instruction.

    string sqlcmd = "INSERT INTO Students (name) VALUES ('" + name + "')";
    // INSERT INTO Students (name) VALUES ('Wilhelm von Hackensplat')

With normal names, the string would be a perfectly valid SQL command which will add a new record into the table named Students. But what about our friend Bobby Tables?

   INSERT INTO Students (name) VALUES ('Robert'); DROP TABLE Students; --')

Because that single-quote character wasn't sanitized away, an extra command to drop the Students table crept in. This is what we know in the trade as an "SQL Injection" attack, as some unintentional SQL got injected in.

So let’s sanitize it?


We can’t allow people to go about running arbitrary SQL commands willy-nilly. Something must be done!

That single-quote character in the student's name is clearly the problem, so we'll take it out while building the SQL command. This fixes the command and you won’t find database tables disappearing. So why do I call this bad advice?

Trouble is, the single-quote character has a bit of a split personality. As well as being a quote, it's also an apostrophe. Real people have real names with apostrophes and if you've ever seen a name where one has clearly been dropped, you've seen the mark of the sanitizer.

Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter.

So forget sanitizing your inputs. What you need to do instead is to contain your inputs.

Contain my inputs?


The error made by the programmers at the school was that they failed to contain Bobby's name. A student’s name is just a sequence of characters, so you need to use it in a way that could only be a sequence of characters.

Lucky for us, all good SQL access libraries support parameters. Instead, you write the command but with placeholders for the values to be added in little boxes later.

   INSERT INTO Students (name) VALUES (@name)

Here, there's a clear demarcation between what's the SQL command and what's the value from outside. The student's name is inside the little box where the apostrophe is just another character. The name has been contained and that destructive command inside can't break out.

But that's what we mean by "sanitize"!


Then you should stop calling it that. The word "sanitize" is a common enough word and most people understand it as a word for cleaning - removing the bad stuff and keeping the good stuff.

  "Did you sanitize the kitchen worktop?"
  "Yes. I put it in that sealed box over there."
  "That’s not sanitizing!"

  "When I use a word, it means just what I choose it to mean. Neither more nor less."

There is a real problem with software not accepting names with apostrophes, as discussed earlier. Real software developers are listening to the advice to sanitize and interpreting it to mean they should have the bad characters removed.

Isn’t sanitization still needed with HTML?


HTML has a similar problem with injection. Say you’re building a website that can take comments from the public, like this one, you’d want to prevent people from leaving comments with bits of scripting code inside.

   "I <i>love</i> this website! <script>alert('Baron von Hackensplat Was Here');</script>"

Its fine to allow the emphasis, but if your website also publishes the script, anyone else visiting your site will end up running that script.

Unfortunately, HTML doesn’t support a nice little box from whence nothing can escape, so we need to provide that box of containment ourselves. Any HTML from the public should be parsed and rewritten as safe-HTML, where only a safe subset of tags are allowed.

You might argue that this amounts to sanitization, but it betrays a bad mental model. Okay, you've dealt with the big problem, but forgotten about the little problems.

Have you ever seen a comment thread where, starting part way down the page, everything is in italics? This is caused by someone opening italics but not closing them. If your mental model is to sanitize, your natural reaction would be remove the ability to use italics. If your mental model is instead to contain, you know that italics is really harmless and just needs to be closed when left open.

Cross-out Cross Site Scripting


In closing, I’d just like to appeal to the industry to drop the phrase "Cross-Site-Scripting" and call it "HTML Injection" instead.

Any scripting that you didn't write or don’t trust, cross-site or not, is a very bad thing to have on your website. Putting "Scripting" in the name makes people think of scripting as the problem but its so much more than that.

Calling it "HTML Injection" draws an obvious parallel with "SQL Injection". Its the same problem with the same solution.

Credits: XKCD 327 - Exploits of a mom by Randall Munroe.
"When I use a word..." is a quote from Lewis Carroll's "Through the Looking-Glass".
Second: sanitize the gloves by Thomas Cizauskas.
Fun with cling film by Elizabeth Gomm.

9 comments:

  1. Worst advice ever.

    ReplyDelete
    Replies
    1. Yeah, because logically separating data from commands is bad practice... Ha ha ha.

      Have you even tried a decent SQL wrapper/library, like SQLAlchemy?

      Delete
  2. I agreed with you right up to your suggestion to use HTML whitelisting. This is itself much harder than it seems.

    The better solution is what you already suggested for SQL - treat user input as non-interpretable text. The HTML equivalent of a parameterized statement is the createTextNode() function. Rendering all your user input that way, as text nodes, will absolutely stop any sort of injection. If you need formatting, then use some limited alternative like Markdown instead of allowing HTML.

    ReplyDelete
  3. Perhaps this is why some Irish people *prefer* to spell their name using the letter Ó

    They don't *prefer*, that's just their name

    ReplyDelete
    Replies
    1. Chief Engineer Miles O'Brien14 April 2014 at 19:43

      ...prefers to use the apostrophe.

      Delete
    2. In the U.S. the convention is to use the apostrophe for the anglicized spelling of those Irish surnames. It is exceedingly rare to see the acute-o. (Where "exceedingly rare = never".) It's also most common to see databases that simply drop the apostrophe rather than include it for the correct spelling.

      The results are that we see 4 different spellings of those Irish surnames - 1 approximately correct and 3 obviously misspelled - including different variants even in official government documentation. This hasn't yet got me stuck on one side of a border, but I expect it will happen one day as more computerization happens and with no programming convention on how to deal with the apostrophe. (Fortunately the human customs officials or border guards still recognize that the discrepancies are the result of unresolved programming practices.)

      Delete
  4. NOTE: I've had to replace all instances of < and > in this post with [ and ] to get past your "validator". Read the comment to see why.

    "Unfortunately, HTML doesn’t support a nice little box from whence nothing can escape, so we need to provide that box of containment ourselves."

    When you add a text node from JS via DOM or a library that maps to DOM like jQuery, that's precisely what happens: $('foo').text('hello [i]there[/i]'); // renders the [i] and [/i] as plain text in the page instead of making text italic.

    "Any HTML from the public should be parsed and rewritten as safe-HTML, where only a safe subset of tags are allowed. You might argue that this amounts to sanitization, but it betrays a bad mental model."

    Irony. Well it betrays a bad mental model, but not on my side, for sure! You don't *disallow* tags, because they should not be treated as tags. The sanitization mindset sees "disallowed script tag!", and the proper mindset sees "I see some text, I'll encode as HTML, so special chars like [ and ] becomes < and >".

    The fact you may have a little parser which sees the italic tag doesn't mean it has to "disallow" the script tag. Rather, it should simply see [script] as plain text and encode it accordingly.

    Also, if this comment itself would "disallow" the script tag, my own comment would be mangled! Let's see what happens... Click...

    Irony! I "SCRIPT" tag is disallowed! I can't post my innocent example because if your "sanitization" mindset. That's so sad.

    So that's why I had to replace all instances of "<" and ">" with "[" and "]". Try to follow your own advice. Don't sanitize. Don't "disallow". Escape.

    ReplyDelete
    Replies
    1. You know this is blogger and not his own coded comment engine, right?

      Delete
  5. Given that this is an entirely semantic argument, it seems prudent to mention that "sanitize" has two meanings:
    1:to make sanitary (as by cleaning or sterilizing)
    2:to make more acceptable by removing unpleasant or undesired features
    It seems to me that the first is more appropriate as the intent is to retain all information. In fact, outright stripping of characters results in "dirty" data that is incomplete. The second definition refers to attempts at censorship, which does not seem to me to be the use here - if our kitchen is the data, we certainly want the kitchen itself to remain intact after we clean it.
    As a result, it seems unnecessary to come up with a new term like "contain". Regardless of method, if your data comes out of your database different from how it went in, you have written bad code. This is the case with a sql injection as well as a stripped apostrophe.

    ReplyDelete