Hi Parsoid developers,
I compared the Wikipedia HTML and the Parsoid HTML (same title and oldid) for 500 random samples, and found some bugs and difference patterns that may help you. We hope the bugs can be fixed. Thanks! The examples are below:
Bug examples:
1. In http://parsoid-lb.eqiad.wikimedia.org/enwiki/1913_Gettysburg_reunion?oldid=5..., reference 18 is “(Pennsylvania Department of Health). http://books.google.com/books?id=swkTAAAAYAAJ&pg=PA72. Retrieved 2011-02-06.”, but in http://en.wikipedia.org/w/index.php?title=1913_Gettysburg_reunion&oldid=..., it’s “(Pennsylvania Department of Health). Retrieved 2011-02-06.”
2. The first external link in http://en.wikipedia.org/w/index.php?title=...From_the_Hungry_i&oldid=555... is “The Kingston Trio Liner Notes album entry.”, but in http://parsoid-lb.eqiad.wikimedia.org/enwiki/...From_the_Hungry_i?oldid=5559... it’s “[http://www.lazyka.com/linernotes/trio_01(Guard,Rynolds,Shane)/recrdngs/LP_T1...: The Kingston Trio Liner Notes album entry.]”. It’s an obvious bug.
3. In http://en.wikipedia.org/w/index.php?title=1973_CARIFTA_Games&oldid=47338..., every table has a header row: “Event Gold Silver Bronze”. But in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1973_CARIFTA_Games?oldid=473380..., the header row disappears.
4. In http://en.wikipedia.org/w/index.php?title=Airdisco_Phi-Phi&oldid=5516488..., there is a table on the right: “Phi-Phi … Number built 1”. But it disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/Airdisco_Phi-Phi?oldid=55164880... .
5. A figcaption that is not displayed in Wikipedia is displayed in Parsoid. Example 1: “Breg , the old part of Novo Mesto along the Krka River” appears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C5%A0entjo%C5%A1t,_Novo_Mesto?..., but does not exist in http://en.wikipedia.org/w/index.php?title=%C5%A0entjo%C5%A1t,_Novo_Mesto&.... Example 2: “T-6 Texan IIs over Columbus Mississippi” appears twice in http://parsoid-lb.eqiad.wikimedia.org/enwiki/14th_Operations_Group?oldid=572... but only once in http://en.wikipedia.org/w/index.php?title=14th_Operations_Group&oldid=57... .
6. The links “[1] [2] ...” in text or references disappear in the Parsoid HTML. Example 1: see “[1] [2] [3] [4]” in http://en.wikipedia.org/w/index.php?title=1982_PBA_Open_Conference&oldid..., which disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1982_PBA_Open_Conference?oldid=.... Example 2: “[1]” in http://en.wikipedia.org/w/index.php?title=2008%E2%80%9309_Barnsley_F.C._seas..., which disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/2008%E2%80%9309_Barnsley_F.C._s... .
Other difference patterns, with examples:
1. http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 has a table of contents, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.
2. http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 has an “[edit]” link to click after each section, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.
3. The sign “^ ” in the references of http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 is replaced with “↑” in http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749.
4. The superscripts “a b c d” etc. in the references of http://en.wikipedia.org/w/index.php?title=%C3%87a_plane_pour_moi&oldid=5... are replaced with “{num}.0 {num}.1 {num}.2 {num}.3” etc. in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%87a_plane_pour_moi?oldid=58...
5. The audio player component may differ between http://en.wikipedia.org/w/index.php?title=%C3%89tincelles_(Moszkowski)&o... (see “Problems playing this file?”) and http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%89tincelles_(Moszkowski)?ol... .
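A comparison of this kind can be sketched roughly as follows. This is only an illustration in Python, not the script used for the 500 samples: it assumes both HTML documents have already been fetched as strings, it compares visible words only (ignoring markup, scripts, and styles), and the names `VisibleText`, `visible_words`, and `text_diff` are made up for the example.

```python
import difflib
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the words of all text nodes, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_words(html):
    """Return the visible text of an HTML string as a list of words."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words


def text_diff(php_html, parsoid_html):
    """Return (tag, php_words, parsoid_words) for every run that differs."""
    a, b = visible_words(php_html), visible_words(parsoid_html)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Each returned tuple names the kind of change (`insert`, `delete`, or `replace`) together with the words on each side, so a report like bug 1 above would show up as an `insert` of the extra URL on the Parsoid side.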
On Mon, Dec 2, 2013 at 7:39 AM, Bin Li (李斌) binli@google.com wrote:
- The first external link in
http://en.wikipedia.org/w/index.php?title=...From_the_Hungry_i&oldid=555... is “The Kingston Trio Liner Notes album entry.”, but in http://parsoid-lb.eqiad.wikimedia.org/enwiki/...From_the_Hungry_i?oldid=5559... it’s “[http://www.lazyka.com/linernotes/trio_01(Guard,Rynolds,Shane)/recrdngs/LP_T1...: The Kingston Trio Liner Notes album entry.]”. It’s an obvious bug.
It's the first external link in the external links section, not the first external link in the whole article.
minimized the test case a bit: the trailing colon makes the difference.
http://parsoid-lb.eqiad.wikimedia.org/enwiki/User%3AJeremyb%2Fparsoidtest1?o...
-Jeremy
Hi, and thanks for all that info!
We keep a bug database in Bugzilla; you can see the Parsoid bugs with a query like this one: https://bugzilla.wikimedia.org/buglist.cgi?list_id=254884&resolution=---...
We'll investigate all those you've pointed out, but in the meantime please feel free to use Bugzilla to see if we're already aware of them, and, if not, open new ones so you can track their status.
Thanks! Marc
On Mon, Dec 2, 2013 at 8:49 AM, Jeremy Baron jeremy@tuxmachine.com wrote:
On Mon, Dec 2, 2013 at 7:39 AM, Bin Li (李斌) binli@google.com wrote:
- The first external link in
http://en.wikipedia.org/w/index.php?title=...From_the_Hungry_i&oldid=555...
is “The Kingston Trio Liner Notes album entry.”, but in
http://parsoid-lb.eqiad.wikimedia.org/enwiki/...From_the_Hungry_i?oldid=5559...
it’s “[
http://www.lazyka.com/linernotes/trio_01(Guard,Rynolds,Shane)/recrdngs/LP_T1... :
The Kingston Trio Liner Notes album entry.]”. It’s an obvious bug.
It's the first external link in the external links section, not the first external link in the whole article.
minimized the test case a bit: the trailing colon makes the difference.
http://parsoid-lb.eqiad.wikimedia.org/enwiki/User%3AJeremyb%2Fparsoidtest1?o...
-Jeremy
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
Hi Bin,
Thanks for the reports. Please feel free to add yourselves to the relevant bug reports to track progress. You can also file additional bug reports against Parsoid here: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Parsoid
On 12/02/2013 01:39 AM, Bin Li (李斌) wrote:
Hi Parsoid developers,
I have compared Wikipedia HTML and Parsoid HTML (same title and oldid) for 500 random samples. And I found some bug examples and difference patterns that may help you. We also expect the bugs to be fixed. Thanks! Below are the examples:
Bug examples:
- In
http://parsoid-lb.eqiad.wikimedia.org/enwiki/1913_Gettysburg_reunion?oldid=5..., References 18 is “(Pennsylvania Department of Health). http://books.google.com/books?id=swkTAAAAYAAJ&pg=PA72. Retrieved 2011-02-06.”. But in http://en.wikipedia.org/w/index.php?title=1913_Gettysburg_reunion&oldid=..., it’s “(Pennsylvania Department of Health). Retrieved 2011-02-06.”
Looks like some differences in Cite template processing. We'll investigate and file a bug.
- The first external link in
http://en.wikipedia.org/w/index.php?title=...From_the_Hungry_i&oldid=555... is “The Kingston Trio Liner Notes album entry.”, but in http://parsoid-lb.eqiad.wikimedia.org/enwiki/...From_the_Hungry_i?oldid=5559... it’s “[http://www.lazyka.com/linernotes/trio_01%28Guard,Rynolds,Shane%29/recrdngs/LP_T1107.htm#.%20.%20.%20From%20the%20hungry%20i: The Kingston Trio Liner Notes album entry.]”. It’s an obvious bug.
We will investigate and file a bug.
- In
http://en.wikipedia.org/w/index.php?title=1973_CARIFTA_Games&oldid=47338..., every table have title line: “Event Gold Silver Bronze”. But in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1973_CARIFTA_Games?oldid=473380..., the table title line disappears.
Bug 53139 (https://bugzilla.wikimedia.org/show_bug.cgi?id=53139), with duplicates 53927 and 57266. We'll probably have to tackle this sooner rather than later.
- In
http://en.wikipedia.org/w/index.php?title=Airdisco_Phi-Phi&oldid=5516488..., there are a table on the right: “Phi-Phi … Number built 1”. But it disappers in http://parsoid-lb.eqiad.wikimedia.org/enwiki/Airdisco_Phi-Phi?oldid=55164880....
Related to Bug 53139.
- The figcaption not displays in wikipedia, but displays in parsoid.
Example 1: see “Breg , the old part of Novo Mesto along the Krka River” in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C5%A0entjo%C5%A1t,_Novo_Mesto?..., it not exist in http://en.wikipedia.org/w/index.php?title=%C5%A0entjo%C5%A1t,_Novo_Mesto&.... Example 2: “T-6 Texan IIs over Columbus Mississippi” appears twice in http://parsoid-lb.eqiad.wikimedia.org/enwiki/14th_Operations_Group?oldid=572... but one time in http://en.wikipedia.org/w/index.php?title=14th_Operations_Group&oldid=57....
There are various image parsing bugs in Bugzilla (Marc pasted the URL in an earlier email) that we haven't gotten around to fixing yet.
- The link “[1] [2] ...” in text or references disappears in Parsoid
HTML. Example1: see “[1] [2] [3] [4]” in http://en.wikipedia.org/w/index.php?title=1982_PBA_Open_Conference&oldid..., it disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1982_PBA_Open_Conference?oldid=.... Example2: “[1]” in http://en.wikipedia.org/w/index.php?title=2008%E2%80%9309_Barnsley_F.C._seas..., disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/2008%E2%80%9309_Barnsley_F.C._s....
We'll investigate and file a bug.
Other different patterns with examples:
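- http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 has a table of contents, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.
- http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 has an “[edit]” link to click after each section, but http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749 does not.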
- The sign “^ ” in references of
http://en.wikipedia.org/w/index.php?title=$pent&oldid=535219749 is replaced with “↑” in http://parsoid-lb.eqiad.wikimedia.org/enwiki/$pent?oldid=535219749.
- The superscript “a b c d” etc in references of
http://en.wikipedia.org/w/index.php?title=%C3%87a_plane_pour_moi&oldid=5... is replaced with “{num}.0 {num}.1 {num}.2 {num}.3” etc in http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%87a_plane_pour_moi?oldid=58...
Parsoid doesn't generate a table of contents or edit links yet. We may not generate edit links in Parsoid at all and may rely on JS to render them. As for the latter two, we are thinking of handling wiki-specific styles with CSS/JS rather than generating different HTML for different rendering styles, so that the core Parsoid code is not cluttered with these stylistic differences, which are not really core parse output issues.
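Until such styling exists, anyone diffing the two outputs may want to normalize these two stylistic differences first. A minimal Python sketch, assuming it is applied only to backlink tokens taken from the references list (a plain number such as "3.1" in body text would be converted too), and assuming the “{num}.N” form maps to the N-th lowercase letter as in the examples above; `normalize_backlink` is a made-up name, not a Parsoid or MediaWiki function:

```python
import re
import string

# Matches tokens of the form "<num>.<index>", e.g. "7.0" or "12.3".
_BACKLINK = re.compile(r"^\d+\.(\d+)$")


def normalize_backlink(token):
    """Map Parsoid-style backlink tokens to the style seen on en.wikipedia.org."""
    if token == "↑":
        return "^"
    match = _BACKLINK.match(token)
    if match:
        index = int(match.group(1))
        # Only a..z are handled; larger indexes are left unchanged.
        if index < len(string.ascii_lowercase):
            return string.ascii_lowercase[index]
    return token
```

Tokens that match neither pattern are returned as they are, so the function can be mapped over a whole list of reference tokens.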
- The voice playing component may be different between
http://en.wikipedia.org/w/index.php?title=%C3%89tincelles_%28Moszkowski%29&oldid=555997335 (See Problems playing this file?) and http://parsoid-lb.eqiad.wikimedia.org/enwiki/%C3%89tincelles_%28Moszkowski%29?oldid=555997335.
I haven't looked closely, but this could be Bug 49896 (https://bugzilla.wikimedia.org/show_bug.cgi?id=49896), which is on our list of things to fix.
Subbu.
On 12/02/2013 12:29 PM, Subramanya Sastry wrote:
- The link “[1] [2] ...” in text or references disappears in Parsoid
HTML. Example1: see “[1] [2] [3] [4]” in http://en.wikipedia.org/w/index.php?title=1982_PBA_Open_Conference&oldid..., it disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/1982_PBA_Open_Conference?oldid=.... Example2: “[1]” in http://en.wikipedia.org/w/index.php?title=2008%E2%80%9309_Barnsley_F.C._seas..., disappears in http://parsoid-lb.eqiad.wikimedia.org/enwiki/2008%E2%80%9309_Barnsley_F.C._s....
This is a deliberate change, and is documented in our DOM spec [1]. Auto-numbered external links are rendered as empty links with the numbering added using CSS counters. The PHP parser explicitly inserts numbers into the HTML DOM.
Gabriel
[1]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Numbered_external_...
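For anyone comparing the two outputs mechanically, the numbering can be emulated before diffing. The sketch below is a rough Python approximation, not Parsoid code: it assumes an auto-numbered link shows up as an `<a>` with an `href` and no content at all (the exact markup is defined in the DOM spec [1] and is not reproduced here), and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class AutoNumberedText(HTMLParser):
    """Collect visible words, emitting "[n]" for each completely empty link."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._counter = 0             # plays the role of the CSS counter
        self._anchor_words = None     # word list while inside an <a>, else None
        self._anchor_has_child = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._anchor_words = []
            self._anchor_has_child = False
        elif self._anchor_words is not None:
            # e.g. an <img> inside the link: not an auto-numbered link
            self._anchor_has_child = True

    def handle_data(self, data):
        inside = self._anchor_words is not None
        target = self._anchor_words if inside else self.words
        target.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_words is not None:
            if not self._anchor_words and not self._anchor_has_child:
                self._counter += 1
                self._anchor_words.append(f"[{self._counter}]")
            self.words.extend(self._anchor_words)
            self._anchor_words = None


def words_with_link_numbers(html):
    """Return visible words, with empty links replaced by "[1]", "[2]", ..."""
    parser = AutoNumberedText()
    parser.feed(html)
    parser.close()
    return parser.words
```

With this normalization, a PHP-parser fragment containing explicit “[1] [2]” link text and a fragment with the same links left empty yield the same word list, so the difference no longer shows up as a missing link.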