Making XmlValidatingReader Assume a Default DTD

Recently, a deceptively simple problem came my way when one of our devs raised a flag and I jumped in to help out. After much searching, and eventually solving the problem with persistence and a fair amount of research, I thought I would describe the problem and solution here. The problem can be stated as: "Read XML content into a System.Xml.XmlDocument where the XML content contains named entity references that are defined in an external DTD. The zinger is that the DTD is known but only assumed: the XML does not contain the document type declaration, and we cannot change the XML content."

For example, even though the XML excerpt below uses only well-known XHTML/HTML entities, attempting to load it into a System.Xml.XmlDocument results in an exception with the message: Reference to undeclared entity, 'nbsp'. Line 3, position 22:

	<?xml version="1.0" encoding="utf-8" ?>
	<html>
		<p id="space">Hello&nbsp;World</p>
		<p id="pi">pi:&pi;</p>
		<p id="euro">euro:&euro;</p>
	</html>
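
The failure is easy to reproduce in isolation. The following is a minimal sketch, not code from the original project; the class and method names are made up for illustration. It loads a string with XmlDocument.LoadXml and returns the parser's message when loading fails:

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration only; not part of the original code.
public static class UndeclaredEntityRepro
{
    // Returns the parser's error message, or null if the XML loaded cleanly.
    public static string TryLoad(string xml)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            // Without a DTD that declares it, a named entity such as &nbsp; is rejected.
            return ex.Message;
        }
    }
}
```

Calling TryLoad with a fragment that contains &nbsp; should return an "undeclared entity" message, while the same fragment written with the numeric reference &#160; should load without error, since numeric character references need no DTD.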

The exception occurs because these entities are defined in an external DTD (one of the XHTML 1.0 DTDs) that the document never references. The key to the solution is using a constructor of XmlValidatingReader that accepts an instance of the XmlParserContext class, initialized so that the default document type is assumed to be XHTML 1.0 Strict. After some reading and searching I was pretty sure that XmlParserContext's DocTypeName, PublicId, and SystemId properties were what I wanted, but with descriptions of merely "Gets or sets the name of the document type declaration", "Gets or sets the public identifier", and "Gets or sets the system identifier" respectively, the documentation wasn't much help. Instead, it is necessary to understand the details of a document type declaration.

Reading through the Prolog and Document Type Declaration section of the XML 1.0 specification revealed how to decipher the XHTML 1.0 document type declaration shown below into its system identifier of http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, its public identifier of -//W3C//DTD XHTML 1.0 Strict//EN, and its document type name of html (essentially the root element's type).

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The code I used to initialize the XmlParserContext for XHTML 1.0 Strict was wrapped in the function below:

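private static XmlParserContext CreateXhtmlContext()
{
	XmlNameTable nt = new NameTable();
	XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
	// Passing null for the name table makes the context use the one from nsmgr.
	XmlParserContext context = new XmlParserContext(null, nsmgr, null, XmlSpace.None);
	// These three values stand in for the missing <!DOCTYPE ...> declaration.
	context.DocTypeName = "html";
	context.PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";
	// Relative system identifier; the custom XmlResolver supplies the DTD content.
	context.SystemId = "xhtml1-strict.dtd";
	return context;
}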

The function can be called to pass the XmlParserContext into an XmlValidatingReader constructor, as shown below:

// Prepare the validating reader:
XmlParserContext xhtmlContext = CreateXhtmlContext();
XmlValidatingReader rdr = new XmlValidatingReader(File.OpenRead(filePath), XmlNodeType.Document, xhtmlContext);
// No validation is performed; the DTD is only needed to resolve entities.
rdr.ValidationType = ValidationType.None;
// Custom XmlResolver that supplies the DTD contents (described below).
rdr.XmlResolver = CreateXhtmlResolver();
// Now use the handy XmlDocument to get the content:
XmlDocument doc = new XmlDocument();
doc.Load(rdr);

Then the only thing left is providing the actual contents of the DTDs to the XmlValidatingReader. I used a custom XmlResolver to load the DTDs from an assembly resource on demand. Once you have that hooked up, XmlValidatingReader will use the DTDs to resolve those pesky external entities. You can download the complete code here.
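
For readers without the download, here is a rough sketch of what such a resolver might look like. This is not the code from the download: the class name XhtmlResourceResolver, the ToResourceName helper, and the resource prefix "MyApp.Dtd." are all assumptions for illustration, and the prefix would have to match however the DTD files are actually embedded in your assembly.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical resolver for illustration; names and resource layout are assumptions.
public class XhtmlResourceResolver : XmlResolver
{
    // Assumed prefix (default namespace plus folder) of the embedded DTD files.
    private const string ResourcePrefix = "MyApp.Dtd.";

    // Embedded resources need no credentials, so the setter does nothing.
    public override ICredentials Credentials
    {
        set { }
    }

    // Maps a resolved URI to a resource name using only the file name part.
    public static string ToResourceName(Uri absoluteUri)
    {
        return ResourcePrefix + Path.GetFileName(absoluteUri.AbsolutePath);
    }

    // Called by the reader for the DTD and for each entity file it references.
    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (ofObjectToReturn != null && ofObjectToReturn != typeof(Stream))
        {
            throw new XmlException("Only Stream results are supported.");
        }

        string resourceName = ToResourceName(absoluteUri);
        Stream stream = typeof(XhtmlResourceResolver).Assembly
            .GetManifestResourceStream(resourceName);
        if (stream == null)
        {
            throw new FileNotFoundException(
                "No embedded resource for " + absoluteUri, resourceName);
        }
        return stream;
    }
}
```

Under these assumptions, a CreateXhtmlResolver() function could simply return new XhtmlResourceResolver(). Because xhtml1-strict.dtd itself pulls in the entity sets xhtml-lat1.ent, xhtml-symbol.ent, and xhtml-special.ent, those three files would need to be embedded alongside the DTD so that the same lookup can serve them.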

posted @ Wednesday, July 20, 2005 1:25 AM
