Re: [Analytics] A new format for the pageview dumps

19 Mar 2015


Yep, but bytecounts as a proxy for information density or content
size are themselves not terribly useful (mobile web and desktop give
two different bytecounts for the same content). The goal with this
is more, I think, to provide a reference point for working out "okay,
what version of the article were these people probably looking at,
and what did it look like?"
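
To make that concrete, here is a rough sketch of the lookup I have in
mind, assuming you already have a page's revision history in hand,
sorted oldest first. The Revision tuple, the revision_at helper and the
sample history are all made up for illustration; nothing here is an
existing tool or part of the dump format.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Revision(NamedTuple):
    """Hypothetical record for one saved revision of a page."""
    timestamp: str       # 14-digit UTC timestamp, e.g. "20150316151400"
    rev_id: int
    wikitext_bytes: int  # length of the wikitext, not the rendered HTML


def revision_at(revisions: Sequence[Revision], when: str) -> Optional[Revision]:
    """Return the latest revision saved at or before `when`.

    Assumes `revisions` is sorted by timestamp, oldest first. Fixed-width
    digit timestamps sort correctly as plain strings.
    """
    timestamps = [r.timestamp for r in revisions]
    idx = bisect_right(timestamps, when)
    return revisions[idx - 1] if idx else None


# Invented sample history for a single pageID.
history = [
    Revision("20150101000000", 101, 5200),
    Revision("20150310120000", 102, 6100),
    Revision("20150318090000", 103, 5900),
]

probable = revision_at(history, "20150316151400")
print(probable)
```

This only says which revision was current at the time of the view. It
says nothing about what was actually rendered to the reader.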
On 20 March 2015 at 00:18, Gergo Tisza <gtisza@wikimedia.org> wrote:
...
On Mon, Mar 16, 2015 at 3:14 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
...
Kevin: I'm not sure what value there'd be. I mean, there's page-size,
maybe? But pageID gives us that (or should).
Time-traveling with MediaWiki is very hard. Calculating the length of
wikitext for a given pageID at a given time is cumbersome (instead of simple
text processing, you are now dealing with DB queries and need to set up a
local DB mirror, etc.). Finding out what title it had at the time is prohibitively
hard (you have to parse semi-structured objects which are serialized into
strings in the log table and follow the chain of renames). Finding out the
byte size of the rendered HTML is practically impossible (templates and
interface messages change; flagged revisions/pending changes might result in
older versions of articles being shown). If you omit the bytecounts, there
is no way people will be able to reconstruct them from the logs.
Not saying that's a problem - I personally don't see much use for them. Just
don't expect pageID to be very useful for "normalizing" logs.
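
As a rough illustration of the rename-chain step only, here is a sketch
that assumes the move events for one page have already been parsed out
of the log table into simple records. The Move tuple, the title_at
helper and the sample titles are hypothetical; real log entries are
messier, and this ignores deletions, restores and moves over redirects.

```python
from typing import NamedTuple, Sequence


class Move(NamedTuple):
    """Hypothetical, already-parsed rename event for a single page."""
    timestamp: str  # 14-digit UTC timestamp
    old_title: str
    new_title: str


def title_at(current_title: str, moves: Sequence[Move], when: str) -> str:
    """Walk the rename chain backwards from the current title.

    Every move that happened after `when` is undone, newest first,
    until we reach a move that had already taken place at `when`.
    """
    title = current_title
    for move in sorted(moves, key=lambda m: m.timestamp, reverse=True):
        if move.timestamp <= when:
            break  # this move and all older ones were in effect at `when`
        if move.new_title == title:
            title = move.old_title
    return title


# Invented rename history for one page.
moves = [
    Move("20140501000000", "Foo", "Foo (band)"),
    Move("20150317000000", "Foo (band)", "Foo (American band)"),
]

print(title_at("Foo (American band)", moves, "20150316151400"))
```

Even with this in place, the hard part remains getting clean move
records out of the serialized log parameters in the first place.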

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
