On 07/23/2013 06:55 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Very minor point: there are ~400 missing pages on the list; is that intentional? ;-)
One is 'Mos:time' which is in NS 0, and does actually exist as a redirect to the WP: manual of style: https://en.wikipedia.org/wiki/Mos:time
1. Some pages get deleted and then go 404. (http://parsoid.wmflabs.org:8001/failedFetches)
2. There are some (known) bugs in our rt testing infrastructure around recording results. These should be fixed once our testing infrastructure is updated and moved to MySQL (from SQLite).
... But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of pages had syntactic dirty diffs.
So 15% is 24000 pages which can bust, but may not if the edit doesn't touch the bustable part.
No, 15% of pages aren't bust. 15% of pages introduce meaning-preserving (hence purely syntactic) dirty diffs, depending on what piece of the page is edited. For example, whitespace diffs and the addition of quotes (") around attribute values are the most common ones.
For an example, see this: http://parsoid.wmflabs.org:8001/result/d5fe6c9052c23bcc0b63a4d0d1b3e5b68fd2e...
About 0.4% (~640) of pages are classified as having semantic diffs. We assign a numerical score in base 1000 (digit 3 = # errors, digit 2 = # semantic errors, digit 1 = # syntactic errors). When results are sorted in reverse order of score, it gives us the most egregious pages to focus on (crashers first, semantic errors next, purely dirty diffs next).
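To make the scoring scheme concrete, here is a minimal sketch of how a base-1000 score like the one described could be computed and sorted. The function name, the clamping of each count to 999, and the sample page names are assumptions for illustration only; the actual rt-testing code may encode scores differently.

```python
BASE = 1000  # each "digit" of the score holds one count, 0..999


def rt_score(errors, semantic, syntactic):
    """Pack three counts into one number, most severe count in the highest digit."""
    # Assumption: clamp each count so it cannot spill into the next digit.
    def clamp(n):
        return min(n, BASE - 1)

    return clamp(errors) * BASE**2 + clamp(semantic) * BASE + clamp(syntactic)


# Hypothetical results: (errors, semantic errors, syntactic errors) per page.
results = {
    "Page A": rt_score(0, 0, 12),  # purely syntactic dirty diffs
    "Page B": rt_score(0, 2, 0),   # semantic diffs
    "Page C": rt_score(1, 0, 0),   # crasher
}

# Reverse order of score: crashers first, semantic next, syntactic last.
ranked = sorted(results, key=results.get, reverse=True)
print(ranked)
```

With this packing, a single error outranks any number of semantic or syntactic diffs, which is what puts crashers at the top of the list.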
So, going to http://parsoid.wmflabs.org:8001/topfails and paging through that will give you what you are looking for. 16 pages with 40 entries each. We hang out on #mediawiki-parsoid and can help editors make sense of the diffs if anyone wants to look for broken wikitext and fix them.
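As a quick sanity check on the numbers in this thread, the sketch below redoes the arithmetic. The variable names are hypothetical, and the percentages are the rounded figures quoted above, so the counts are approximate.

```python
import math

TOTAL_PAGES = 160_000    # 20K pages from each of 8 wikipedias
ENTRIES_PER_PAGE = 40    # entries per /topfails page, as stated above

# Integer arithmetic avoids floating-point rounding surprises.
semantic_pages = TOTAL_PAGES * 4 // 1000   # 0.4% of pages
syntactic_pages = TOTAL_PAGES * 15 // 100  # 15% of pages

topfails_pages = math.ceil(semantic_pages / ENTRIES_PER_PAGE)
pages_for_all_dirty = math.ceil(syntactic_pages / ENTRIES_PER_PAGE)

print(semantic_pages, syntactic_pages, topfails_pages, pages_for_all_dirty)
```

The 16 pages of 40 entries line up with the ~640 semantic-diff count rather than the 24000 figure. This is an inference from the arithmetic, not something stated in the thread; listing all 24000 pages at 40 per page would take 600 pages.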
Subbu.
Does /topfails cycle through all 24000, 40 pages at a time?
Could you provide a dump of the list of 24000 bustable pages? Split by project? Each community could then investigate those pages for broken tables and, more critically, templates which emit broken wikisyntax that is causing your team grief.
Do you have stats on each of those eight wikipedias? i.e. are there noticeable differences in the percentages on different wikipedias? If so, can you report those percentages for each project? I'm guessing Chinese is an example where there are higher percentages..?
-- John Vandenberg
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l