document, test, and name function sanitizeBalance

Greg Weber 2010-11-01 16:21:31 -07:00
parent 018ee4889a
commit d501579948
3 changed files with 31 additions and 19 deletions


@@ -1,17 +1,20 @@
Summary
=======
provides a function Text.HTML.SanitizeXSS.sanitizeXSS that filters html to prevent XSS attacks.
provides 2 functions in the module Text.HTML.SanitizeXSS
* sanitizeXSS - filters html to prevent XSS attacks.
* sanitizeBalance - same as sanitizeXSS but makes sure there are no lone closing tags - this could prevent a user's html from messing up your page
Use Case
========
All html from an untrusted source (user of a web application) should be ran through this function.
If you trust the html (you wrote it), you do not need to use this.
HTML from an untrusted source (user of a web application) should be run through this library.
If you trust the HTML (you wrote it), you do not need to use this.
If you don't trust the html, you probably also do not trust that the tags are balanced, so you should use sanitizeBalance.
Detail
========
This is not escaping! Escaping html does prevent XSS attacks. Strings should be html escaped to show up properly and to prevent XSS attacks. However, escaping will ruin the display of the html.
This function removes any tags or attributes that are not in its white-list of. This may sound picky, but most html should make it through unchanged, making the process unnoticeable to the user but giving us safe html.
This function removes any tags or attributes that are not in its white-list. This may sound picky, but most html should make it through unchanged, making the process unnoticeable to the user but giving us safe html.
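
As a rough illustration of the white-list idea, here is a minimal sketch. It is not the library's code: the `Tag` type, the three lists, and `sanitizeSketch` are hypothetical names invented for this example, and the real white-list is far larger. It also treats any attribute value without an allowed scheme as unsafe, which would drop relative URLs; that is a simplification made here, not a claim about the library.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical, simplified tag type (the library works on TagSoup's Tag String).
data Tag
  = Open String [(String, String)]  -- tag name and attributes
  | Close String
  | Text String
  deriving (Eq, Show)

-- Tiny example white-lists, assumed for illustration only.
allowedTags, allowedAttrs, safeSchemes :: [String]
allowedTags  = ["a", "b", "br", "div", "img"]
allowedAttrs = ["href", "src"]
safeSchemes  = ["http", "https"]

-- Drop tags that are not white-listed; strip attributes that are not
-- white-listed or whose URL scheme is not considered safe.
sanitizeSketch :: [Tag] -> [Tag]
sanitizeSketch = mapMaybe keep
  where
    keep (Open name attrs)
      | name `elem` allowedTags = Just (Open name (filter safeAttr attrs))
      | otherwise               = Nothing
    keep (Close name)
      | name `elem` allowedTags = Just (Close name)
      | otherwise               = Nothing
    keep t = Just t

    safeAttr (key, value) =
      key `elem` allowedAttrs && takeWhile (/= ':') value `elem` safeSchemes

main :: IO ()
main = print $ sanitizeSketch
  [ Open "a" [("href", "unsafe://hack.com")], Text "anchor", Close "a"
  , Open "unsafe" [], Close "foo" ]
```

In this sketch the anchor survives without its `href`, and the unknown `<unsafe>` and `</foo>` tags disappear, which mirrors what test.hs expects from the real function.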
Integration
===========
@@ -19,12 +22,17 @@ It is recommended to integrate this so that it is automatically used whenever an
Credit
===========
This was taken from John MacFarlane's Pandoc (with permission) modified to be faster and parsing redone with TagSoup. html5lib is also being used as a reference (BSD style license).
Original code was taken from John MacFarlane's Pandoc (with permission), but modified to be faster and with parsing redone using TagSoup. html5lib is now being used as a reference (BSD style license).
Michael Snoyman added the balanced tags functionality.
Limitations
===========
Balancing - sanitizeBalance
---------------------------------
The goal of this function is to prevent your html from breaking when unknown html is placed inside it. I would expect it to work very well in practice and don't see a downside to using it unless you have an alternative approach. However, this function does not at all guarantee valid html. In fact, it is likely that the result of balancing will still be invalid HTML. This means there is still no guarantee what a browser will do with the html, so there is no guarantee that it will prevent your html from breaking. Other possible approaches would be to run the html through a library like libxml2, which understands html, or to first render the html in a hidden iframe or maybe a hidden div at the bottom of the page so that it is isolated, and then use javascript to insert it into the page where you want it.
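
To make the balancing idea concrete, here is a minimal sketch of one way it could work. It is an assumption-laden illustration, not the library's implementation: the `Tag` type, `voidElems`, and `balanceSketch` are hypothetical names, and the list of elements that never take a closing tag is deliberately incomplete. The sketch counts open tags per name, gives a lone closing tag a matching opening tag, and closes whatever is still open at the end.

```haskell
import qualified Data.Map as Map

-- Hypothetical, simplified tag type (attributes omitted for brevity).
data Tag = Open String | Close String | Text String
  deriving (Eq, Show)

-- Elements assumed never to need a closing tag; an incomplete example list.
voidElems :: [String]
voidElems = ["br", "img"]

-- Walk the tags while tracking how many elements of each name are open.
balanceSketch :: Map.Map String Int -> [Tag] -> [Tag]
balanceSketch open [] =
  -- End of input: emit a closing tag for everything still open.
  concat [replicate n (Close name) | (name, n) <- Map.toList open]
balanceSketch open (Close name : rest) =
  case Map.lookup name open of
    -- A matching open tag exists: keep the close and decrement the count.
    Just n | n > 0 -> Close name : balanceSketch (Map.insert name (n - 1) open) rest
    -- Lone closing tag: insert an opening tag right before it.
    _ -> Open name : Close name : balanceSketch open rest
balanceSketch open (Open name : rest)
  | name `elem` voidElems = Open name : balanceSketch open rest
  | otherwise = Open name : balanceSketch (Map.insertWith (+) name 1 open) rest
balanceSketch open (t : rest) = t : balanceSketch open rest

main :: IO ()
main = print $ balanceSketch Map.empty
  [Open "b", Text "Unbalanced", Close "div", Open "img"]
```

Here the lone `</div>` becomes `<div></div>` and the unclosed `<b>` is closed at the very end. Because only counts per tag name are tracked, nesting order is not checked, which is one reason the output can still be invalid HTML.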
TagSoup Parser
--------------
TagSoup is used to parse the HTML, and it does a good job. However TagSoup does not maintain all white space. TagSoup does not distinguish between the following cases:


@@ -1,6 +1,6 @@
module Text.HTML.SanitizeXSS
( sanitizeXSS
, sanitizeBalanceXSS
, sanitizeBalance
) where
import Text.HTML.TagSoup
@@ -14,8 +14,15 @@ import Codec.Binary.UTF8.String ( encodeString )
import qualified Data.Map as Map
sanitizeBalanceXSS :: String -> String
sanitizeBalanceXSS = renderTagsOptions renderOptions {
-- | sanitize the html to prevent XSS attacks. See README.md <http://github.com/gregwebs/haskell-xss-sanitize> for more details
sanitizeXSS :: String -> String
sanitizeXSS = renderTagsOptions renderOptions {
optMinimize = \x -> x `elem` ["br","img"] -- <img><img> converts to <img />, <a/> converts to <a></a>
} . safeTags . parseTags
-- | same as sanitizeXSS but makes sure there are no lone closing tags. See README.md <http://github.com/gregwebs/haskell-xss-sanitize> for more details
sanitizeBalance :: String -> String
sanitizeBalance = renderTagsOptions renderOptions {
optMinimize = \x -> x `elem` ["br","img"] -- <img><img> converts to <img />, <a/> converts to <a></a>
} . balance Map.empty . safeTags . parseTags
@@ -43,12 +50,6 @@ balance m (TagOpen name as : tags) =
Just i -> Map.insert name (i + 1) m
balance m (t:ts) = t : balance m ts
-- | santize the html to prevent XSS attacks. See README.md <http://github.com/gregwebs/haskell-xss-sanitize> for more details
sanitizeXSS :: String -> String
sanitizeXSS = renderTagsOptions renderOptions {
optMinimize = \x -> x `elem` ["br","img"] -- <img><img> converts to <img />, <a/> converts to <a></a>
} . safeTags . parseTags
safeTags :: [Tag String] -> [Tag String]
safeTags [] = []
safeTags (t@(TagClose name):tags)

test.hs

@@ -1,8 +1,11 @@
import Text.HTML.SanitizeXSS
main = do
let test = " <a href='http://safe.com'>safe</a><a href='unsafe://hack.com'>anchor</a> <img src='evil://evil.com' /> <unsafe></foo> <bar /> <br></br> <b>Unbalanced</div><img src='http://safe.com'>"
let actual = (sanitizeBalanceXSS test)
let expected = " <a href=\"http://safe.com\">safe</a><a>anchor</a> <img /> <br /> <b>Unbalanced<div></div><img src=\"http://safe.com\"></b>"
putStrLn $ "testing: " ++ test
testHTML = " <a href='http://safe.com'>safe</a><a href='unsafe://hack.com'>anchor</a> <img src='evil://evil.com' /> <unsafe></foo> <bar /> <br></br> <b>Unbalanced</div><img src='http://safe.com'>"
test actual expected = do
putStrLn $ "testing: " ++ testHTML
putStrLn $ if actual == expected then "pass" else "failure\n" ++ "\nexpected:" ++ (show expected) ++ "\nactual: " ++ (show actual)
main = do
test (sanitizeBalance testHTML) " <a href=\"http://safe.com\">safe</a><a>anchor</a> <img /> <br /> <b>Unbalanced<div></div><img src=\"http://safe.com\"></b>"
test (sanitizeXSS testHTML) " <a href=\"http://safe.com\">safe</a><a>anchor</a> <img /> <br /> <b>Unbalanced</div><img src=\"http://safe.com\">"