Search "328" and the next XKCD is nowhere to be seen.
The lesson, according to this character and so many real people on the internet, is to sanitize your inputs. The school in the cartoon didn't sanitize its inputs - and one of its database tables got deleted!
Ask anyone about developing websites and they will tell you the first lesson is always to sanitize your inputs. In this day and age you'd have to be crazy not to sanitize your inputs.
Trouble is, sanitizing your inputs is very bad advice.
What went wrong at the school?
A quick aside on what's going on in this cartoon. A new student named...
Robert'); DROP TABLE Students; --
... joins a school and the administrators dutifully add a record to their database for the new student. The software takes the new student's name and builds an SQL instruction.
string sqlcmd = "INSERT INTO Students (name) VALUES ('" + name + "')";
// INSERT INTO Students (name) VALUES ('Wilhelm von Hackensplat')
With normal names, the string is a perfectly valid SQL command that adds a new record to the table named Students. But what about our friend Bobby Tables?
INSERT INTO Students (name) VALUES ('Robert'); DROP TABLE Students; --')
Because that single-quote character wasn't sanitized away, an extra command to drop the Students table crept in. This is what we know in the trade as an "SQL Injection" attack, as some unintentional SQL got injected in.
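To see the attack happen for real, here is a minimal sketch in Python using the standard sqlite3 module and an in-memory database. The helper names add_student_unsafely and table_exists are made up for this illustration, and executescript is used because it runs every statement it finds in the string, which is assumed to be how the school's software behaved.

```python
import sqlite3

def add_student_unsafely(conn, name):
    # Vulnerable on purpose: the name is pasted straight into the SQL text.
    sqlcmd = "INSERT INTO Students (name) VALUES ('" + name + "')"
    # executescript runs every statement it finds in the string.
    conn.executescript(sqlcmd)

def table_exists(conn, table):
    # Look the table up in SQLite's own catalogue.
    rows = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchall()
    return len(rows) > 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT)")

add_student_unsafely(conn, "Wilhelm von Hackensplat")
exists_before = table_exists(conn, "Students")   # True: a normal name is fine

add_student_unsafely(conn, "Robert'); DROP TABLE Students; --")
exists_after = table_exists(conn, "Students")    # False: the table is gone
```

The second call produces exactly the command shown above: one INSERT, one DROP TABLE, and a trailing comment that swallows the leftover quote and bracket.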
So let’s sanitize it?
We can’t allow people to go about running arbitrary SQL commands willy-nilly. Something must be done!
That single-quote character in the student's name is clearly the problem, so we'll take it out while building the SQL command. This fixes the command and you won’t find database tables disappearing. So why do I call this bad advice?
Trouble is, the single-quote character has a bit of a split personality. As well as being a quote, it's also an apostrophe. Real people have real names with apostrophes and if you've ever seen a name where one has clearly been dropped, you've seen the mark of the sanitizer.
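A tiny sketch of that sanitizer in Python shows the damage. The function name sanitize_name and the example name are invented for illustration.

```python
def sanitize_name(name):
    # The naive "sanitizer": throw away every single-quote character.
    return name.replace("'", "")

# Bobby's name is defused, because the quote that ended the string is gone.
defused = sanitize_name("Robert'); DROP TABLE Students; --")
# defused == "Robert); DROP TABLE Students; --"

# But an innocent student gets mangled by the very same rule.
mangled = sanitize_name("Siobhan O'Brien")
# mangled == "Siobhan OBrien"
```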
Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter.
So forget sanitizing your inputs. What you need to do instead is to contain your inputs.
Contain my inputs?
The error made by the programmers at the school was that they failed to contain Bobby's name. A student’s name is just a sequence of characters, so you need to use it in a way that could only be a sequence of characters.
Lucky for us, all good SQL access libraries support parameters. Instead of pasting the values into the command text, you write the command with placeholders, little boxes for the values to be added later.
INSERT INTO Students (name) VALUES (@name)
Here, there's a clear demarcation between what's the SQL command and what's the value from outside. The student's name is inside the little box where the apostrophe is just another character. The name has been contained and that destructive command inside can't break out.
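As a rough sketch of the same idea in Python's standard sqlite3 module: the placeholder is written :name there rather than @name, and add_student is a made-up helper, but the principle is assumed to carry over to any library that supports parameters.

```python
import sqlite3

def add_student(conn, name):
    # The SQL text is fixed; the name travels separately as a bound parameter.
    conn.execute("INSERT INTO Students (name) VALUES (:name)", {"name": name})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT)")

# Bobby Tables and a student with an apostrophe both go in untouched.
for name in ("Robert'); DROP TABLE Students; --", "Siobhan O'Brien"):
    add_student(conn, name)

stored = [row[0] for row in conn.execute("SELECT name FROM Students ORDER BY rowid")]
# stored holds both names exactly as typed, and the table is still there.
```

Nothing was removed from either name. The apostrophe survives because it never gets a chance to be read as SQL.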
But that's what we mean by "sanitize"!
Then you should stop calling it that. The word "sanitize" is a common enough word and most people understand it as a word for cleaning - removing the bad stuff and keeping the good stuff.
"Yes. I put it in that sealed box over there."
"That’s not sanitizing!"
"When I use a word, it means just what I choose it to mean. Neither more nor less."
There is a real problem with software not accepting names with apostrophes, as discussed earlier. Real software developers are listening to the advice to sanitize and interpreting it to mean they should have the bad characters removed.
Isn’t sanitization still needed with HTML?
HTML has a similar problem with injection. Say you’re building a website that takes comments from the public, like this one. You’d want to prevent people from leaving comments with bits of scripting code inside.
"I <i>love</i> this website! <script>alert('Baron von Hackensplat Was Here');</script>"
It's fine to allow the emphasis, but if your website also publishes the script, anyone else visiting your site will end up running that script.
Unfortunately, HTML doesn’t support a nice little box from whence nothing can escape, so we need to provide that box of containment ourselves. Any HTML from the public should be parsed and rewritten as safe-HTML, where only a safe subset of tags is allowed.
You might argue that this amounts to sanitization, but it betrays a bad mental model. Okay, you've dealt with the big problem, but forgotten about the little problems.
Have you ever seen a comment thread where, starting part way down the page, everything is in italics? This is caused by someone opening italics but not closing them. If your mental model is to sanitize, your natural reaction would be to remove the ability to use italics. If your mental model is instead to contain, you know that italics are really harmless and just need to be closed when left open.
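Here is one way such a box of containment might be sketched, using Python's standard html.parser module. The allowed tag list, the class name SafeHtmlRewriter and the function contain_html are all assumptions made for this illustration, not a vetted library. It keeps a small set of harmless tags, drops their attributes, escapes all text, discards scripts entirely, and closes whatever the commenter left open.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"i", "b", "em", "strong"}  # assumed safe subset; attributes are dropped
DROP_CONTENT = {"script", "style"}         # tags whose contents are discarded too

class SafeHtmlRewriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the rewritten comment
        self.open_tags = []  # allowed tags that are currently open
        self.skipping = 0    # non-zero while inside script or style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skipping += 1
        elif tag in ALLOWED_TAGS and not self.skipping:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skipping = max(0, self.skipping - 1)
        elif tag in self.open_tags and not self.skipping:
            # Close anything opened after this tag, then the tag itself.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data))  # text can never become markup

def contain_html(untrusted):
    parser = SafeHtmlRewriter()
    parser.feed(untrusted)
    parser.close()
    # Close whatever the commenter left open, innermost first.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)

comment = contain_html(
    "I <i>love</i> this website! "
    "<script>alert('Baron von Hackensplat Was Here');</script>"
)
# comment == "I <i>love</i> this website! "

runaway = contain_html("This is <i>so cool")
# runaway == "This is <i>so cool</i>"
```

The emphasis survives, the script does not, and the runaway italics stop at the end of the comment instead of leaking down the page.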
Cross-out Cross Site Scripting
In closing, I’d just like to appeal to the industry to drop the phrase "Cross-Site-Scripting" and call it "HTML Injection" instead.
Any scripting that you didn't write or don’t trust, cross-site or not, is a very bad thing to have on your website. Putting "Scripting" in the name makes people think of scripting as the problem, but it's so much more than that.
Calling it "HTML Injection" draws an obvious parallel with "SQL Injection". It's the same problem with the same solution.
Credits: XKCD 327 - Exploits of a Mom by Randall Munroe.
"When I use a word..." is a quote from Lewis Carroll's "Through the Looking-Glass".
Second: sanitize the gloves by Thomas Cizauskas.
Fun with cling film by Elizabeth Gomm.