Hi,
We've added MediaWiki parser content analysis to the content analysis report that the Reading web team performed last quarter.
We also added the option to see the gzipped (level 6) version of the report, to get more realistic numbers (since traffic is gzipped in prod); see the select box at the top.
http://chimeces.com/loot-content-analysis/
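(For anyone curious how the gzipped numbers could be reproduced: a minimal sketch along these lines would do it, assuming Python's gzip module; the actual tooling behind the report may differ.)

```python
import gzip

def gzipped_size(html, level=6):
    # Size in bytes of the gzip (level 6) representation, roughly what a
    # client would transfer when the response is gzipped in production.
    return len(gzip.compress(html.encode("utf-8"), compresslevel=level))
```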
No surprises: the results are pretty similar to the RESTBase analysis, in that navboxes are around 14% of the content and references around 50%.
*Request*: If you know of useless HTML markup emitted by the MediaWiki parser and would like to see what % of the content it accounts for, please answer here or in the task with examples and we'll add it to the report (like we did with RESTBase and the *extraneous markup*).
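For illustration, here is a rough sketch of how the percentage for a given kind of markup could be estimated, assuming Python with BeautifulSoup; the CSS selectors shown are illustrative guesses, not the report's actual methodology, and gzipping a fragment on its own only approximates its share of the compressed page:

```python
import gzip
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def markup_share(page_html, selector):
    # Fraction of the page taken up by elements matching `selector`,
    # measured both on the raw bytes and on the gzip (level 6) bytes.
    soup = BeautifulSoup(page_html, "html.parser")
    fragment = "".join(str(el) for el in soup.select(selector))
    raw = len(fragment.encode("utf-8")) / len(page_html.encode("utf-8"))
    gz = (len(gzip.compress(fragment.encode("utf-8"), compresslevel=6))
          / len(gzip.compress(page_html.encode("utf-8"), compresslevel=6)))
    return raw, gz

# e.g. markup_share(parser_html, ".navbox") or markup_share(parser_html, ".references")
```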
*Related Phabricator task: https://phabricator.wikimedia.org/T123325*
Thanks, Joaquin
Joaquin Oltra Hernandez, 20/01/2016 16:55:
We've added MediaWiki parser content analysis to the content analysis report that the Reading web team performed last quarter.
Thanks. It would be useful to understand what your dataset is: I see 9 page titles, presumably fetched from the English Wikipedia. Is this your dataset? How did you ensure it's representative of what users see?
Nemo
The list of articles was decided here: https://phabricator.wikimedia.org/T120504
There's a page for each of a few categories: a very short article (Campus honeymoon), a long one (Barack Obama).
As I mentioned in the task, it is trivial to run the report with a different sample set, and I'm happy to do so if somebody is interested in visualizing a different set of articles.
Adam, for example, posted different sets of articles in the task based on other criteria like pageviews (https://phabricator.wikimedia.org/T120504#1900287); it'd be interesting to run those and see whether the trends in navbox and reference sizes hold up.
We're also thinking about running a similar report across a bigger dataset in a more aggregated way, maybe the top 100,000 articles by pageviews with the sizes weighted by pageview counts, to get a more global understanding, but we haven't gotten around to it yet (it would be a new, more global, less per-page report).
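As a sketch of one way that pageview weighting could work (the variable names and data shapes here are assumptions, not a description of actual plans): each article's navbox or reference fraction would count in proportion to how often the article is viewed.

```python
def pageview_weighted_fraction(pages):
    # `pages` is an iterable of (pageviews, fraction) pairs, where `fraction`
    # is e.g. the share of a page taken up by navboxes. The result is the
    # average share weighted by how often each page is actually viewed.
    total_views = sum(views for views, _ in pages)
    return sum(views * fraction for views, fraction in pages) / total_views

# e.g. pageview_weighted_fraction([(1_000_000, 0.14), (50_000, 0.02)])
```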