Hi Bin,
Thanks for the reports. Please feel free to add yourselves to the relevant bug reports to track progress. You can also file additional bug reports against Parsoid here: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Parsoid
On 12/02/2013 01:39 AM, Bin Li (李斌) wrote:
Hi Parsoid developers,
I have compared Wikipedia HTML and Parsoid HTML (same title and oldid) for 500 random samples, and I found some bug examples and difference patterns that may help you. We hope these bugs can be fixed. Thanks! Below are the examples:
Bug examples:
- In
http://parsoid-lb.eqiad.wikimedia.org/enwiki/1913_Gettysburg_reunion?oldid=5..., Reference 18 is “(Pennsylvania Department of Health). http://books.google.com/books?id=swkTAAAAYAAJ&pg=PA72. Retrieved 2011-02-06.” But in http://en.wikipedia.org/w/index.php?title=1913_Gettysburg_reunion&oldid=..., it’s “(Pennsylvania Department of Health). Retrieved 2011-02-06.”
Looks like some differences in Cite template processing. We'll investigate and file a bug.
- The first external link in
http://en.wikipedia.org/w/index.php?title=...From_the_Hungry_i&oldid=555... is “The Kingston Trio Liner Notes album entry.”, but in http://parsoid-lb.eqiad.wikimedia.org/enwiki/...From_the_Hungry_i?oldid=5559... it’s “[http://www.lazyka.com/linernotes/trio_01(Guard,Rynolds,Shane)/recrdngs/LP_T1... http://www.lazyka.com/linernotes/trio_01%28Guard,Rynolds,Shane%29/recrdngs/LP_T1107.htm#.%20.%20.%20From%20the%20hungry%20i: The Kingston Trio Liner Notes album entry.]”. It’s an obvious bug.
We will investigate and file a bug.
- In
http://en.wikipedia.org/w/index.php?title=1973_CARIFTA_Games&oldid=47338..., every table has a header row: “Event Gold Silver Bronze”. But in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1973_CARIFTA_Games?oldid=473380..., the table header row disappears.
Bug 53139 (https://bugzilla.wikimedia.org/show_bug.cgi?id=53139) -- duplicates (53927, 57266). We'll probably have to tackle this sooner rather than later.
- In
http://en.wikipedia.org/w/index.php?title=Airdisco_Phi-Phi&oldid=5516488..., there is a table on the right: “Phi-Phi … Number built 1”. But it disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/Airdisco_Phi-Phi?oldid=55164880....
Related to Bug 53139.
- The figcaption is not displayed on Wikipedia, but it is displayed in Parsoid.
Example 1: see “Breg , the old part of Novo Mesto along the Krka River” in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C5%A0entjo%C5%A1t,_Novo_Mesto?...; it does not exist in http://en.wikipedia.org/w/index.php?title=%C5%A0entjo%C5%A1t,_Novo_Mesto&.... Example 2: “T-6 Texan IIs over Columbus Mississippi” appears twice in http://parsoid-lb.eqiad.wikimedia.org/enwiki/14th_Operations_Group?oldid=572... but only once in http://en.wikipedia.org/w/index.php?title=14th_Operations_Group&oldid=57....
There are various image parsing bugs in Bugzilla (Marc pasted the URL in an earlier email) that we haven't gotten around to fixing yet.
- The link “[1] [2] ...” in text or references disappears in Parsoid
HTML. Example 1: see “[1] [2] [3] [4]” in http://en.wikipedia.org/w/index.php?title=1982_PBA_Open_Conference&oldid...; it disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1982_PBA_Open_Conference?oldid=.... Example 2: “[1]” in http://en.wikipedia.org/w/index.php?title=2008%E2%80%9309_Barnsley_F.C._seas... disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/2008%E2%80%9309_Barnsley_F.C._s....
We'll investigate and file a bug.
Other difference patterns, with examples:
- Wikipedia pages have a table of contents, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.
- Wikipedia pages have an “[edit]” link after each section heading, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.
- The sign “^ ” in references of
http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 is replaced with “↑” in http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749.
- The superscript “a b c d” etc in references of
http://en.wikipedia.org/w/index.php?title=%C3%87a_plane_pour_moi&oldid=5... is replaced with “{num}.0 {num}.1 {num}.2 {num}.3” etc in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%87a_plane_pour_moi?oldid=58...
Parsoid doesn't generate a table of contents or edit links yet. We may not generate edit links in Parsoid at all and may instead rely on JS to render them. As for the latter two, we are thinking of handling wiki-specific styles with CSS/JS rather than generating different HTML for different rendering styles, so that core Parsoid code is not cluttered with these stylistic differences, which are not really core parse output issues.
- The audio player component may differ between
http://en.wikipedia.org/w/index.php?title=%C3%89tincelles_%28Moszkowski%29&oldid=555997335 (see “Problems playing this file?”) and http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%89tincelles_%28Moszkowski%29?oldid=555997335.
I haven't looked closely, but this could be Bug 49896 (https://bugzilla.wikimedia.org/show_bug.cgi?id=49896), which is on our list of things to fix.
Subbu.