On 07/23/2013 06:02 PM, Subramanya Sastry wrote:
On 07/23/2013 05:28 PM, John Vandenberg wrote:
VE and Parsoid devs have put in a lot of effort to recognize broken wikitext source and fix it or isolate it,
My point was that you don't appear to be doing an analysis of how much of Wikipedia's content is broken; at least I don't see a public document listing which templates and pages are causing the parser problems, so that the communities on each Wikipedia can fix them ahead of deployment.
Unfortunately, this is much harder to do. What we can consider is periodically swapping out our test pages for a fresh batch of pages so that new kinds of problems show up in automated testing. In some cases, detecting problems automatically is equivalent to being able to fix them up automatically as well.
Actually, we do have the beginnings of a page for this that I had forgotten about: http://www.mediawiki.org/wiki/Parsoid/Broken_wikitext_tar_pit I don't think it is very helpful at this time, nor exactly what you are asking for, but I'm pointing it out for the record that we've thought about it some.
We are actually beginning to address some of these cases:
* fostered content in top-level pages (we already handle fostering from templates)
* templates that produce part of a table cell, multiple cells, or multiple attributes of an image
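To illustrate what "fostered content" means here: the HTML5 tree-construction algorithm moves ("fosters") any content that appears between a table start and its first legal row/cell markup out in front of the table, which then confuses round-trip editing. A minimal sketch of detecting such stray lines in table wikitext, assuming a hypothetical helper name and ignoring nested-table and pipe-in-template subtleties, might look like:

```python
def find_fostered_content(wikitext):
    """Flag lines sitting between a table start ('{|') and the first
    legal table-structure marker ('|-', '|+', '|', '!', '|}'); an
    HTML5 tree builder would foster such content out of the table.
    Hypothetical sketch, not Parsoid's actual implementation."""
    flagged = []
    expecting_structure = False  # True right after we see '{|'
    for n, line in enumerate(wikitext.split("\n"), start=1):
        s = line.strip()
        if s.startswith("{|"):
            # Table start: subsequent non-blank lines should be table markup.
            expecting_structure = True
        elif expecting_structure and s:
            if s.startswith(("|-", "|+", "|", "!", "|}")):
                expecting_structure = False
            else:
                flagged.append((n, s))  # would be fostered out of the table
    return flagged
```

For example, `find_fostered_content("{|\nstray text\n|-\n| cell\n|}")` flags line 2, while a well-formed table yields no hits.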
Ideally, we would not have to support these kinds of use cases, but given what we are seeing in production now, we might try to deal with some of them. Interestingly enough, we do a much better job of protecting against unclosed tables, content fostered out of tables, etc. when they come from templates than when such wikitext occurs in the page content itself. We have a couple of DOM analysis passes that detect those problems and protect them from editing, but that needs to be extended to top-level page content.
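The shape of such a DOM pass is simple: walk the tree, and when a node has been marked as fostered, wrap or flag its subtree so the visual editor treats it as read-only rather than risk corrupting it on save. A toy sketch over a dict-based tree, with made-up attribute names purely for illustration:

```python
def protect_fostered(node):
    """Recursively mark fostered nodes as non-editable so a visual
    editor won't touch them. Toy dict-based DOM; the 'fostered' flag
    and 'contenteditable' attribute are illustrative assumptions,
    not Parsoid's actual data model."""
    if node.get("fostered"):
        # Protect the whole subtree wholesale; no need to descend.
        node.setdefault("attrs", {})["contenteditable"] = "false"
        return
    for child in node.get("children", []):
        protect_fostered(child)
```

Running this over a tree where the detection pass has already set `fostered` on the problem nodes leaves clean content editable and seals off only the broken regions.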
Subbu.