From d501579948ec4dc2c8811c4d314554967ab3d274 Mon Sep 17 00:00:00 2001
From: Greg Weber
Date: Mon, 1 Nov 2010 16:21:31 -0700
Subject: [PATCH] document, test, and name function sanitizeBalance

---
 README.md                |   18 +++++++++++++-----
 Text/HTML/SanitizeXSS.hs |   19 ++++++++++---------
 test.hs                  |   13 ++++++++-----
 3 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 982a95f..618cc91 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,20 @@
 Summary
 =======
-provides a function Text.HTML.SanitizeXSS.sanitizeXSS that filters html to prevent XSS attacks.
+provides 2 functions in the module Text.HTML.SanitizeXSS
+* sanitizeXSS - filters html to prevent XSS attacks.
+* sanitizeBalance - same as sanitizeXSS but makes sure there are no lone closing tags - this could prevent a user's html from messing up your page
 
 Use Case
 ========
-All html from an untrusted source (user of a web application) should be ran through this function.
-If you trust the html (you wrote it), you do not need to use this.
+HTML from an untrusted source (user of a web application) should be run through this library.
+If you trust the HTML (you wrote it), you do not need to use this.
+If you don't trust the html you probably also do not trust that the tags are balanced, so you should use sanitizeBalance.
 
 Detail
 ========
 This is not escaping! Escaping html does prevents XSS attacks. Strings should be html escaped to show up properly and to prevent XSS attacks. However, escaping will ruin the display of the html.
 
-This function removes any tags or attributes that are not in its white-list of. This may sound picky, but most html should make it through unchanged, making the process unnoticeable to the user but giving us safe html.
+This function removes any tags or attributes that are not in its white-list. This may sound picky, but most html should make it through unchanged, making the process unnoticeable to the user but giving us safe html.
 
 Integration
 ===========
@@ -19,12 +22,17 @@ It is recommended to integrate this so that it is automatically used whenever an
 
 Credit
 ===========
-This was taken from John MacFarlane's Pandoc (with permission) modified to be faster and parsing redone with TagSoup. html5lib is also being used as a reference (BSD style license).
+Original code was taken from John MacFarlane's Pandoc (with permission), but modified to be faster and with parsing redone using TagSoup. html5lib is now being used as a reference (BSD style license).
+Michael Snoyman added the balanced tags functionality.
 
 Limitations
 ===========
 
+Balancing - sanitizeBalance
+---------------------------------
+The goal of this function is to prevent your html from breaking when unknown html is placed inside it. I would expect it to work very well in practice and don't see a downside to using it unless you have an alternative approach. However, this function does not at all guarantee valid html. In fact, it is likely that the result of balancing will still be invalid HTML. This means there is still no guarantee what a browser will do with the html, so there is no guarantee that it will prevent your html from breaking. Other possible approaches would be to run the html through a library like libxml2 which understands html or to first render the html in a hidden iframe or maybe a hidden div at the bottom of the page so that it is isolated, and then use javascript to insert it into the page where you want it.
+
 TagSoup Parser
 --------------
 TagSoup is used to parse the HTML, and it does a good job. However TagSoup does not maintain all white space.
 TagSoup does not distinguish between the following cases:
diff --git a/Text/HTML/SanitizeXSS.hs b/Text/HTML/SanitizeXSS.hs
index 7b5493e..e168ec6 100644
--- a/Text/HTML/SanitizeXSS.hs
+++ b/Text/HTML/SanitizeXSS.hs
@@ -1,6 +1,6 @@
 module Text.HTML.SanitizeXSS
     ( sanitizeXSS
-    , sanitizeBalanceXSS
+    , sanitizeBalance
     ) where
 
 import Text.HTML.TagSoup
@@ -14,8 +14,15 @@
 import Codec.Binary.UTF8.String ( encodeString )
 import qualified Data.Map as Map
 
-sanitizeBalanceXSS :: String -> String
-sanitizeBalanceXSS = renderTagsOptions renderOptions {
+-- | sanitize the html to prevent XSS attacks. See README.md for more details
+sanitizeXSS :: String -> String
+sanitizeXSS = renderTagsOptions renderOptions {
+    optMinimize = \x -> x `elem` ["br","img"] -- converts to , converts to
+  } . safeTags . parseTags
+
+-- same as sanitizeXSS but makes sure there are no lone closing tags. See README.md for more details
+sanitizeBalance :: String -> String
+sanitizeBalance = renderTagsOptions renderOptions {
     optMinimize = \x -> x `elem` ["br","img"] -- converts to , converts to
   } . balance Map.empty . safeTags . parseTags
 
@@ -43,12 +50,6 @@ balance m (TagOpen name as : tags) =
             Just i -> Map.insert name (i + 1) m
 balance m (t:ts) = t : balance m ts
 
--- | santize the html to prevent XSS attacks. See README.md for more details
-sanitizeXSS :: String -> String
-sanitizeXSS = renderTagsOptions renderOptions {
-    optMinimize = \x -> x `elem` ["br","img"] -- converts to , converts to
-  } . safeTags . parseTags
-
 safeTags :: [Tag String] -> [Tag String]
 safeTags [] = []
 safeTags (t@(TagClose name):tags)
diff --git a/test.hs b/test.hs
index c55e261..81e9901 100644
--- a/test.hs
+++ b/test.hs
@@ -1,8 +1,11 @@
 import Text.HTML.SanitizeXSS
 
-main = do
-  let test = " safeanchor

Unbalanced"
-  let actual = (sanitizeBalanceXSS test)
-  let expected = " safeanchor
Unbalanced
"
-  putStrLn $ "testing: " ++ test
+testHTML = " safeanchor

Unbalanced"
+
+test actual expected = do
+  putStrLn $ "testing: " ++ testHTML
   putStrLn $ if actual == expected then "pass" else "failure\n" ++ "\nexpected:" ++ (show expected) ++ "\nactual: " ++ (show actual)
+
+main = do
+  test (sanitizeBalance testHTML) " safeanchor
Unbalanced
"
+  test (sanitizeXSS testHTML) " safeanchor
Unbalanced"