Saturday, July 25, 2009

Trust but Validate

The phrase "trust but verify" came out of the Cold War. I learned SGML while working at ATLIS Systems. For many years ATLIS was a leader in SGML. One phrase coined by ATLIS was, "If a system is not based on a DTD and checked by an SGML parser, it is NOT SGML." In essence, you had to validate each and every instance. HTML (up through version 4) is not XML; it is an SGML application. XML, a simplified subset of SGML, has largely replaced it. I am amazed at how many XML-aware applications never bother to create a DTD or Schema for their XML.

Years ago, I created a tool to help maintain my websites. The tool started out in Python. I really like the way the regular-expression library integrates with Python. Python treats everything, functions included, as an object. This allows putting functions into dictionaries and passing them to other functions, which was an important feature in the tool. I created the user interface in Delphi. There is a mechanism to integrate Python and Delphi, so I could quickly prototype the core regular-expression replacements in Python and use that code in the finished user interface application.
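To give a rough idea of what I mean by putting functions into a dictionary and handing them to the regular-expression library, here is a minimal sketch. It is not the code from my tool: the {{NAME}} placeholder syntax and the names HANDLERS, expand_year, expand_title, dispatch, and render are made up for illustration. The only library behavior it relies on is that re.sub accepts a function as its replacement argument.

```python
import re

# Hypothetical replacement functions; each one receives a regex match object
# and returns the text that should replace the match.
def expand_year(match):
    return "2009"

def expand_title(match):
    return "Trust but Validate"

# Functions are ordinary objects, so they can be stored in a dictionary.
HANDLERS = {
    "YEAR": expand_year,
    "TITLE": expand_title,
}

# Assumed placeholder syntax for this sketch: {{NAME}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def dispatch(match):
    # Look up the handler by the captured placeholder name.
    handler = HANDLERS.get(match.group(1))
    # Leave unknown placeholders untouched.
    return handler(match) if handler else match.group(0)

def render(text):
    # The pattern's sub method calls dispatch once for every match.
    return PLACEHOLDER.sub(dispatch, text)
```

Calling render("{{TITLE}} ({{YEAR}})") runs each placeholder through its handler. Adding a new replacement is then just a matter of adding one more entry to the dictionary.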

Later, I rewrote the Delphi code in C#. Although I thought the code was relatively clean, there were a few minor points that needed refactoring. I considered integrating my Python into C#. Instead, I found that the C# regular-expression library accepts delegates. C# delegates are similar to function pointers in C/C++, so you can pass a function to the regular-expression library just as you can in Python. The C# syntax is a bit more wordy than Python (verbosity never bothers me) but a lot more strict. Python's lack of static type checking gives you enough rope to hang yourself; C# checks types at compile time. In the end, I refactored my old Delphi and Python program entirely into C#.

Microsoft added a lot of support for XML and HTML to .NET. One feature I did not have in the old version of my tool was a mechanism to validate the document. The code below shows how to validate a document in C#.

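// Requires: using System.IO; using System.Xml; using System.Xml.Schema;
// WebPage is a string holding the text of the document.
StringReader sr = new StringReader(WebPage);
XmlReaderSettings settings = new XmlReaderSettings();
// Allow the DOCTYPE to be processed. In .NET 4.0 and later ProhibitDtd is
// obsolete; use settings.DtdProcessing = DtdProcessing.Parse instead.
settings.ProhibitDtd = false;
// Without this setting the DTD is read but the document is not validated.
settings.ValidationType = ValidationType.DTD;
settings.ValidationEventHandler += new ValidationEventHandler(PageValidationEvent);
XmlReader xvr = XmlReader.Create(sr, settings);
try
{
    while (xvr.Read())
    {
        // Nothing to do here: reading each node triggers validation.
    }
}
catch (XmlException e)
{
    // The document is not well formed. Validation errors do not arrive
    // here; they are reported to PageValidationEvent.
}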

The handler below is called once for each validation error.

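public static void PageValidationEvent(object sender, ValidationEventArgs args)
{
    // Code to handle each error goes here. For example, report the
    // severity (Error or Warning) and the message:
    Console.WriteLine("{0}: {1}", args.Severity, args.Message);
}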

Microsoft's validation is only mediocre. It catches most errors but not all. But for a few simple lines of code it is worth the effort.

I try to be very careful and ensure that my documents are well formed and valid. I recently started a major redesign of one of my websites. I originally built the site using my Delphi-Python tool and had not done any major work on it in years. When I started rebuilding the site using the C# version of my tool, I found a lot of validation errors. Even though I had tried to be careful, I had made some simple mistakes.

Those mistakes are the reason I decided to write this article, which I titled "Trust but Validate". Now that I think about it, don't bother with the trust; just validate your HTML and XML.
