Monday, September 27, 2010

org/owasp/esapi/codecs/HTMLEntityCodec.java (is it really correct ?)

org/owasp/esapi/codecs/HTMLEntityCodec.java

After reply from jwilliams, I have almost apologized for arogancy...
but...I gues jwilliams is the same person, that coded this:
org/owasp/esapi/codecs/HTMLEntityCodec.java
@author Jeff Williams

I have spent some more time to tune my implementation,
and to figure out what the OWASps HTMLEntityCodec does,
since the jwilliams comment did not match what I have saw in his code:

I have tested against ESAPI-2.0-rc6
the primitive code

return ESAPI.encoder().encodeForHTML(in);


This is what gets encoded (all green are "encoded somehow")


and please see HTML source code
for what is produced in markup !

So basically what you have posted as comment
is SOMETHING ELSE as YOUR code does:
Your post:
1) Encoding characters > 255 isn't useful, barring games with the character set.
2) There is no security problem with rendering named entities, although ESAPI uses hex entities to help performance.
3) Nobody is immune to charset switching
4) It's dangerous to remove characters entirely, you should replace with u+FFFD

What I think:
1) but you ARE encoding > 255, and incorrectly.
You ARE using ALSO, NAMED entities, not hex ! (and for huge ranges),
you print out standalone surrogates as hex which is error and violation of SGML def of HTML,
and correct surrage pairs are encoded as two hex codes instead of one hex int
2) esapi uses NAMED as well for wide range of chars !
3) no comment yet waiting for explanation
4) You are NOT using u+FFFD but whitespace " "

So please if I'm wrong "again", correct me but I do not want to waste ANY more time with
fixing external api,
I will stick to my fixed code,
and I warn the others to make their own versions as well (or use better code than OWASPs RI).

Please update DOCS to make things clear, if this is intended encoding and
OWASP considers this as safe, I will skip the libs just by reading docs, and
save some time on communication and testings.

Final Note ?

I hope this can open eyes a bit:


public static void main(String[] args) {
EscapeUtils2 eu=new EscapeUtils2();
String s1="<>abc123+-";
String s2=new String(new int[]{0xdc00,65823,65839,65855},0,4);
out.println(eu.escapeHtmlFull(s1+s2));
out.println(ESAPI.encoder().encodeForHTML(s1+s2));
}

a.in.the.k:
&#60;&#62;abc123+-&#xFFFD;&#65823;&#65839;&#65855;

with ESAPI gou get:
&lt;&gt;abc123&#x2b;-&#xdc00;&#xd800;&#xdd1f;&#xd800;&#xdd2f;&#xd800;&#xdd3f;

extra encoded + sign
&#xdc00 entity outputed (should not apear in HTML !!! by specification)
and 3 more &#xd800 entities (should not apear in HTML !!! by specification),
which happens because SMP should be encode(codePoint) not encode(char)+encode(char).


+ with my code you get all strange data situations logged:
27.9.2010 13:35:43 EscapeUtils2 log
WARNING: UNUSED DESCSET: codePoint=56320 at index:10
27.9.2010 13:35:43 EscapeUtils2 log
WARNING: SMP: codePoint=65823 at index:11
27.9.2010 13:35:43 EscapeUtils2 log
WARNING: SMP: codePoint=65839 at index:13
27.9.2010 13:35:43 EscapeUtils2 log
WARNING: SMP: codePoint=65855 at index:15



I do not expect that my data will contain SMPs or unpaired surrogates. But at least if they do, I produce correct markup.

1 comment:

  1. Stumbled onto this issue because Ironically I'm fixing an issue with non-BMP encoding with ESAPI.

    Long story short, I have no clue at all what your conversation with Jeff Williams looked like. I can say this however: security issues with non-bmp chars would really be if some code interpreter accepted it as legal syntax.

    By default, no popular language in the last 60 years has used characters in this range.

    That said, I'm coding my fix to this library to properly encode these codepoints, and deliberately prevent callers from making non-BMP Characters immune. (That's my intent at least.)

    I welcome you to write unit tests to help exercise this new (for ESAPI) functionality. Like you, the library will strive to create correct markup.

    ReplyDelete