<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <id>tag:www.refactormycode.com,2007:users665</id>
  <link type="application/atom+xml" href="http://www.refactormycode.com/users/665" rel="self"/>
  <title>Nick</title>
  <updated>2008-06-23T00:35:49-07:00</updated>
  <entry>
    <id>tag:www.refactormycode.com,2007:Refactor11525</id>
    <published>2008-06-23T00:35:49-07:00</published>
    <title>[C#] On Sanitize HTML</title>
    <content type="html">&lt;p&gt;@Chris: it is actually simpler than you make it out to be. You don't create an expression for each tag you want to check; you create one expression that matches any tag and captures its name and attributes, and then you reconstruct the tag using a whitelist of attributes. I have posted the regex below. I don't want to spam the board with my code, but you can find the original, which does all of this for you, here: &lt;a href="http://refactormycode.com/codes/333-sanitize-html#refactor_11281" target="_blank"&gt;http://refactormycode.com/codes/333-sanitize-html#refactor_11281&lt;/a&gt;. The code is simple and easy to implement.&lt;/p&gt;

&lt;p&gt;Also, @Chris, you mentioned some fairly trivial things such as case and quote use. Both are easy to correct once you can pull the attribute names and values out of the tags. I am already reconstructing the tag without the attributes that are not on the whitelist, so I can just as easily call ToLower on the attribute name and wrap the value in quotes.&lt;/p&gt;

&lt;p&gt;Remember, you are not trying to make the system a burden on the users. If they enter a link tag with some attributes that are not on the whitelist, you shouldn't just ignore the whole link, because 99 times out of 100 they didn't really care about the extra attribute, but they most definitely cared about the link. The same can be argued for &amp;lt;blockquote /&amp;gt;, where somebody might use the cite attribute.&lt;/p&gt;

&lt;p&gt;Remember, the purpose of this is to sanitize the HTML, not to rule with an iron fist. An iron fist will just leave users confused as to why their link was not entered into the system.&lt;/p&gt;

&lt;pre&gt;@&amp;quot;(?'tag_start'&amp;lt;/?)(?'tag'\w+)((\s+(?'attr'(?'attr_name'\w+)(\s*=\s*(?:&amp;quot;&amp;quot;.*?&amp;quot;&amp;quot;|'.*?'|[^'&amp;quot;&amp;quot;&amp;gt;\s]+)))?)+\s*|\s*)(?'tag_end'/?&amp;gt;)&amp;quot;&lt;/pre&gt;</content>
    <author>
      <name>Nick</name>
      <email>nick@coderjournal.com</email>
    </author>
    <link type="text/html" href="http://www.refactormycode.com/codes/333-sanitize-html/refactors/11525" rel="alternate"/>
  </entry>
  <entry>
    <id>tag:www.refactormycode.com,2007:Refactor11462</id>
    <published>2008-06-22T12:10:04-07:00</published>
    <title>[C#] On Sanitize HTML</title>
    <content type="html">&lt;p&gt;Hi Jeff, point well taken from your last comment. I had forgotten that you were using WMD; I am used to working with the WYSIWYG editors on the web. While looking over your code I did find a bug in some of your assumptions. You assume that anchors are going to be written like &amp;lt;a href=&amp;quot;...&amp;quot; title=&amp;quot;...&amp;quot; /&amp;gt;, but what if somebody writes &amp;lt;a title=&amp;quot;...&amp;quot; href=&amp;quot;...&amp;quot; /&amp;gt; or &amp;lt;a href=&amp;quot;title&amp;quot; rel=&amp;quot;friend&amp;quot; /&amp;gt;? It doesn't look like your regex can handle that. The same goes for the &amp;lt;img /&amp;gt; tag, except that it can be written 25 different ways with just the attributes you allow, and hundreds of ways with all the valid attributes.&lt;/p&gt;

&lt;p&gt;I know I have said this before, and it was probably missed in the flurry of comments, but you might want to consider whitelisting the attributes as well, so that they can appear in any order.&lt;/p&gt;
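
&lt;p&gt;A minimal sketch of what such a whitelist could look like. The name ValidHtmlTags follows the lookup in my earlier refactor; the class name HtmlWhitelist and the tags and attributes listed are only assumptions for illustration, not your actual list. Because each tag maps to a set of names rather than to a pattern, the attributes can appear in any order:&lt;/p&gt;

&lt;pre&gt;using System;
using System.Collections.Generic;

// Hypothetical container class, named only for this example.
static class HtmlWhitelist
{
	// Illustrative contents only: tag name -&amp;gt; attribute names allowed on that tag.
	// Both lookups ignore case, so HREF and href are treated the same.
	public static readonly Dictionary&amp;lt;string, HashSet&amp;lt;string&amp;gt;&amp;gt; ValidHtmlTags =
		new Dictionary&amp;lt;string, HashSet&amp;lt;string&amp;gt;&amp;gt;(StringComparer.OrdinalIgnoreCase)
		{
			{ &amp;quot;a&amp;quot;, new HashSet&amp;lt;string&amp;gt;(StringComparer.OrdinalIgnoreCase) { &amp;quot;href&amp;quot;, &amp;quot;title&amp;quot; } },
			{ &amp;quot;img&amp;quot;, new HashSet&amp;lt;string&amp;gt;(StringComparer.OrdinalIgnoreCase) { &amp;quot;src&amp;quot;, &amp;quot;alt&amp;quot;, &amp;quot;title&amp;quot;, &amp;quot;width&amp;quot;, &amp;quot;height&amp;quot; } },
			{ &amp;quot;blockquote&amp;quot;, new HashSet&amp;lt;string&amp;gt;(StringComparer.OrdinalIgnoreCase) { &amp;quot;cite&amp;quot; } },
			// a tag that is allowed but may carry no attributes at all
			{ &amp;quot;p&amp;quot;, new HashSet&amp;lt;string&amp;gt;(StringComparer.OrdinalIgnoreCase) }
		};
}&lt;/pre&gt;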

</content>
    <author>
      <name>Nick</name>
      <email>nick@coderjournal.com</email>
    </author>
    <link type="text/html" href="http://www.refactormycode.com/codes/333-sanitize-html/refactors/11462" rel="alternate"/>
  </entry>
  <entry>
    <id>tag:www.refactormycode.com,2007:Refactor11373</id>
    <published>2008-06-21T13:59:48-07:00</published>
    <title>[C#] On Sanitize HTML</title>
    <content type="html">&lt;p&gt;Jeff, I think we are forgetting about something. All of the tags you allow are valid, but what if somebody copies and pastes from the web, where &amp;lt;h /&amp;gt; tags, &amp;lt;p /&amp;gt; tags, and probably everything else carry class, style, and similar attributes? You need to strip out the attributes that you don't care about while keeping the tag itself. I provided a solution to this a couple of items back.&lt;/p&gt;

</content>
    <author>
      <name>Nick</name>
      <email>nick@coderjournal.com</email>
    </author>
    <link type="text/html" href="http://www.refactormycode.com/codes/333-sanitize-html/refactors/11373" rel="alternate"/>
  </entry>
  <entry>
    <id>tag:www.refactormycode.com,2007:Refactor11286</id>
    <published>2008-06-20T15:16:52-07:00</published>
    <title>[C#] On Sanitize HTML</title>
    <content type="html">&lt;p&gt;I just updated it to use StringBuilder. What I like about this method is that I can add special processing in the code, such as adding rel=&amp;quot;nofollow&amp;quot; when an &amp;quot;a&amp;quot; tag is found. I am using this without much trouble on &lt;a href="http://www.ideapipe.com" target="_blank"&gt;http://www.ideapipe.com&lt;/a&gt;. It is very fast, because for the most part my WYSIWYG editor already limits most of the HTML. Personally I like the approach of whitelisting everything, even the attributes, because browsers allow you to add onclick, onfocus, onwhatever to every tag, and they will all execute JavaScript.&lt;/p&gt;

&lt;pre&gt;return HtmlTagExpression.Replace(text, new MatchEvaluator((Match m) =&amp;gt; {
	if (!ValidHtmlTags.ContainsKey(m.Groups[&amp;quot;tag&amp;quot;].Value))
		return String.Empty;

	StringBuilder generatedTag = new StringBuilder(m.Length);

	System.Text.RegularExpressions.Group tagStart = m.Groups[&amp;quot;tag_start&amp;quot;];
	System.Text.RegularExpressions.Group tagEnd = m.Groups[&amp;quot;tag_end&amp;quot;];
	System.Text.RegularExpressions.Group tag = m.Groups[&amp;quot;tag&amp;quot;];
	System.Text.RegularExpressions.Group tagAttributes = m.Groups[&amp;quot;attr&amp;quot;];

	generatedTag.Append(tagStart.Success ? tagStart.Value : &amp;quot;&amp;lt;&amp;quot;);
	generatedTag.Append(tag.Value);

	foreach (Capture attr in tagAttributes.Captures)
	{
		int indexOfEquals = attr.Value.IndexOf('=');

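		// don't proceed any further if there is no equals sign or the attribute starts with one
		if (indexOfEquals &amp;lt; 1)
			continue;

		// trim whitespace before the equals sign so the name matches its whitelist entry
		string attrName = attr.Value.Substring(0, indexOfEquals).Trim();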

		// check to see if the attribute name is allowed and write attribute if it is
		if (ValidHtmlTags[tag.Value].Contains(attrName))
		{
			generatedTag.Append(' ');
			generatedTag.Append(attr.Value);
		}
	}

	// add nofollow to all hyperlinks
	if (tagStart.Success &amp;amp;&amp;amp; tagStart.Value == &amp;quot;&amp;lt;&amp;quot; &amp;amp;&amp;amp; tag.Value.Equals(&amp;quot;a&amp;quot;, StringComparison.OrdinalIgnoreCase))
		generatedTag.Append(&amp;quot; rel=\&amp;quot;nofollow\&amp;quot;&amp;quot;);

	generatedTag.Append(tagEnd.Success ? tagEnd.Value : &amp;quot;&amp;gt;&amp;quot;);

	return generatedTag.ToString();
}));&lt;/pre&gt;</content>
    <author>
      <name>Nick</name>
      <email>nick@coderjournal.com</email>
    </author>
    <link type="text/html" href="http://www.refactormycode.com/codes/333-sanitize-html/refactors/11286" rel="alternate"/>
  </entry>
</feed>

