9

I'm running a comment system, and I want to accept regular, un-formatted text.

I don't want anything too complicated, so I thought of just searching and replacing all < to space (through regex or a simple for loop), so <script src="http://malicioussite.com"> would just appear as script src="http://malicioussite.com".

Is there any reason not to do this?

Can a hacker still get away with a XSS?

Vilican
  • 2,792
  • 8
  • 23
  • 35
xss
  • 107
  • 2
  • 1
    The hardest part about XSS is that protecting against it always depends on the context. There are a few characters to always escape like the less-than and greater-than, but depending on what you're doing with the information you may need to escape more or less characters. The safest path is to escape every character that is non-alphanumeric and to set your character encoding explicitly. – sethmlarson Jan 15 '16 at 13:19
  • 5
    What's wrong with good ol' < and >? – bjb568 Jan 16 '16 at 00:13

3 Answers3

26

Replacing < and > characters isn't enough in all cases. Sure, it will prevent any user to open a HTML tag, but that won't prevent him/her to inject HTML attributes in a HTML tag.

For example, let's take a parser which transforms [img=XXX] into <img src="XXX" />, only replacing < and >.

A malicious user could enter [img=X" onerror="alert(1)] and the parser would return:

<img src="X" onerror="alert(1)" />

And the alert will prompt, meaning that a XSS attack is possible.

Benoit Esnard
  • 14,694
  • 7
  • 69
  • 69
21

It may be enough, and it may not be, but it is definitely not a good idea.

Can a hacker still get away with a XSS?

Possibly, depending on the situation.

@BenoitEsnard already described one situation where filtering out < and > is not enough: When the user input is echoed inside attributes of existing HTML tags, because then an attacker could just add new attributes themselves.

Here is a list with different contexts and how to handle them when preventing XSS.

Is there any reason not to do this?

Yes.

Lets assume that you really only echo the comment inside <textarea>COMMENT</textarea> when editing a comment, and inside <div id=comment>COMMENT</div> when showing a comment, nowhere else, and you don't want any HTML formatting at all, just plain text as you said.

If you write your function correctly, it would be secure. But it wouldn't be very user friendly. Depending on the kind of website you have, users would want to use < and > in many situations, eg: Love you <3, 2 < 3, use this: this->exec(), <font> is deprecated, >.<, ...

So it is definitely a usability issue, and possibly a security issue depending on context and correctness of implementation.

Just use the functions which are commonly used instead of writing your own mechanism (eg in PHP use htmlentities when echoing user input in a HTML context where you do not want to parse given HTML, use some library such as HTMLPurifier if you do need HTML, and so on)

tim
  • 29,640
  • 7
  • 98
  • 121
0

No, it is not enough - if you just blindly insert user submitted text into your HTML output, there are many different context where other symbols can be treated as part of markup. The most safe and, arguably, correct way is to explicitly treat plain-text like, well, plain-text. Retrieve this text with separate JavaScript-based request from your server either as just text chunk or packed in JSON and assign it directly to data property of text node in DOM. This way you guarantee that it won't ever be processed as anything but just a plain text without impeding user ability to use whatever symbols they want.

// Obtain user somehow generated text. For this demonstration I'll just inline it.
var text = "some <funky> &text& with something that looks like </script> suspicious HTML"
// Find target text node in your page. For this demonstration I'll just create it myself.
var text_node = document.createTextNode('')
document.body.appendChild(text_node)

// Now just assign your text data to node
text_node.data = text
Oleg V. Volkov
  • 799
  • 5
  • 11
  • That is not an appropriate solution for non-security reasons - specifically, it has way too much overhead compared to just escaping the text. – user253751 Jan 16 '16 at 11:07
  • JSON has the same problem than HTML: you must escape some characters if you want it to work properly. – Benoit Esnard Jan 16 '16 at 11:43
  • 1
    Also, some web pages have a lot of user-generated content (forums, social networks, etc.), having one AJAX request per content is just way too slow / unpractical when a robust and fast solution exists. – Benoit Esnard Jan 16 '16 at 11:45
  • @BenoitEsnard, use JSON array. JSON is single context with much less corner cases than entire HTML. In pretty much any language you can just call encode_json on native array and be done with it. But no language have same for HTML because it is much more complex. – Oleg V. Volkov Jan 16 '16 at 14:21
  • @immibis, did you actually used id? I used it on several multi-billion hit-per-day sites without any problems (browser gaming/BMMO and RTB) in several different languages. Thanks to it's simplicity it is lightning fast in most implementations I've seen for different languages. – Oleg V. Volkov Jan 16 '16 at 14:27
  • @immibis, also, it is appropriate and ONLY correct solution. You must stop mixing markup/templates and data to stop any chances for later to be interpreted as former. Nobody argues that SQL binding is only correct way, despite it too bringing small overhead by transmitting field values separately instead of embedding them directly into text query. – Oleg V. Volkov Jan 16 '16 at 14:29
  • @OlegV.Volkov If escaping is not a correct way to separate data from commands, then do you emit un-escaped JSON in the AJAX response? If so, you're back to square one, except now your problem's with the AJAX response instead of the HTML page. – user253751 Jan 17 '16 at 00:13
  • @immibis, I don't even get what you're talking about - there's no such thing as "unescaped JSON". – Oleg V. Volkov Jan 17 '16 at 01:53
  • @OlegV.Volkov Suppose the text to be displayed is <foo""bar>. In HTML you need to escape it as &lt;foo""bar&gt; (assuming it's not meant to be an attribute value). In JSON you need to escape it as <foo\"\"bar>. So using JSON does not let you avoid escaping. – user253751 Jan 17 '16 at 07:57
  • @OlegV.Volkov By the way, an extra round trip is hardly "lightning fast". I have a 300ms ping time to almost anywhere, so for me your page will take 300ms longer to load. – user253751 Jan 17 '16 at 08:05
  • @immbis, You don't quote anything in message in some pre-defined format. You use encoding function that take language-native object and get some message buffer. – Oleg V. Volkov Jan 17 '16 at 11:45