Since we were (OK: I was ;-) running into trouble integrating HTML-to-XML parsing into the Bison-based parser, I have written a specialized C++ class that does this before the actual parsing. It outputs only correct XML *structure* and, as far as I can tell, also follows the XHTML nesting rules (<tr> inside <table>, etc.).
"Broken" HTML will be changed into < / > entities, so only valid XML will reach the output. However, I took some care to automagically fix the "usual suspects" (obligatory 21C3 reference) of HTML ugliness, like not-closed <li> and various table chaos. Even a lonely <caption> (not closed) somewhere in the text will generate a full table. It might not be pretty, but it will be vaild XML.
While this is primarily intended for the wiki-to-XML parser, it might work for enforcing XML output for the current parser as well. We'd only have to run the wiki source through it before actually parsing.
Source: CVS HEAD, Module "flexbisonparse", file "html2xml.cpp". (GPL, of course)
Magnus