Hi Oliver,
Tab-separation would be welcomed. Title normalisation would be *very* useful too. Another thing that could potentially save a lot of space would be to throw out all malformed requests, pieces of javascript, and similar junk. Not sure how difficult that would be though, without doing an actual query on the DB for the page id.
For example, an excerpt from 20140101-000000.gz (with only the title and views fields):
'اÙ�ØاÙ�Â_Ù�شباب'_Â_Ù�Ù�اطعÂ_Ù�ضØÙ�ة 1
'/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value) 9
'03_Bonnie_&_Clyde 18
A_Night_at_the_Opera_(Queen_album) 57
'40s_on_4 2
'50s_on_5 1
'71_(film) 4
'74_Jailbreak 3
'77 1
'79-00_é�å�¶å�©åºÃ¯Â¿Â½å_±é��vol.8_ACå�¬å�±åºÃ¯Â¿Â½å��æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½ 1
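To make the "throw out the junk" idea concrete, here is a rough heuristic sketch that needs no DB lookup. It is only an illustration under stated assumptions: the function name `looks_malformed` and the specific rules (invalid UTF-8 after percent-decoding, a literal replacement character, control characters, obvious script fragments) are my own guesses at what counts as junk, not an agreed definition.

```python
from urllib.parse import unquote_to_bytes


def looks_malformed(encoded_title):
    """Heuristically flag a title that is unlikely to be a real page.

    Hypothetical rules, chosen for illustration only.
    """
    # Percent-decode to raw bytes so the UTF-8 check is done by us.
    raw = unquote_to_bytes(encoded_title)
    try:
        title = raw.decode("utf-8")
    except UnicodeDecodeError:
        return True  # not valid UTF-8 at all
    if "\ufffd" in title:
        return True  # replacement character: the title was already mangled
    if any(ord(ch) < 32 for ch in title):
        return True  # control characters never appear in page titles
    lowered = title.lower()
    # Assumption: these fragments indicate leaked script, not a title.
    return "javascript:" in lowered or "document." in lowered


if __name__ == "__main__":
    for sample in ["A_Night_at_the_Opera_(Queen_album)",
                   "'/javascript:document.location.href='/'"]:
        print(sample, looks_malformed(sample))
```

A filter like this would still miss doubly-encoded titles that happen to decode to valid UTF-8, and it could wrongly drop a real page whose title contains one of the script fragments, so checking the page id would remain the more reliable option.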
Cheers,
G
Giovanni Luca Ciampaglia
✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag@indiana.edu
2015-03-13 12:06 GMT-07:00 Oliver Keyes okeyes@wikimedia.org:
So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of em. But what does that mean for third-party researchers?
Well...not much, at the moment, because the data isn't being released anywhere. But one resource we do have, and that third parties use a heck of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
Due to historical size constraints and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms; project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs.
*The new format*
At the moment we have the format:
project_notation - encoded_title - pageviews - bytes
This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte-count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is "eeeeh-inducing".
What I'd like to use as a new format is:
full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
This file would:
- Include a header row;
- Be formatted as a tab-separated, rather than space-separated, file;
- Exclude bytecounts;
- Include desktop and mobile pageview counts on the same line;
- Use the full project URL ("en.wikivoyage.org") instead of the
pagecounts-specific notation ("en.v")
So, as a made-up example, instead of:
de.m.v Florence 32 9024
de.v Florence 920 7570
we'd end up with:
de.wikivoyage.org Florence 920 32
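The transformation in that example can be sketched roughly as follows. This is an illustrative sketch, not the actual conversion script. The suffix table holds only the single mapping given in this email ("v" to wikivoyage.org), and treating "m" and "zero" as the mobile/zero markers is an assumption based on the made-up example; the real pagecounts notation would need a complete, verified table. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumption: only the mapping from the email's example is included.
SUFFIX_TO_DOMAIN = {"v": "wikivoyage.org"}
# Assumption: these tokens mark mobile and zero traffic in the notation.
MOBILE_MARKERS = {"m", "zero"}
HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def parse_project(notation):
    """Turn e.g. 'de.m.v' into ('de.wikivoyage.org', True)."""
    language, *rest = notation.split(".")
    is_mobile = any(part in MOBILE_MARKERS for part in rest)
    suffixes = [part for part in rest if part not in MOBILE_MARKERS]
    if len(suffixes) != 1 or suffixes[0] not in SUFFIX_TO_DOMAIN:
        raise ValueError(f"unknown project notation: {notation!r}")
    return f"{language}.{SUFFIX_TO_DOMAIN[suffixes[0]]}", is_mobile


def convert(legacy_lines):
    """Merge legacy space-separated rows into the tab-separated format."""
    counts = defaultdict(lambda: [0, 0])  # [desktop, mobile_and_zero]
    for line in legacy_lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 4:
            continue  # skip rows that do not have exactly four fields
        notation, title, views, _byte_count = fields
        try:
            project, is_mobile = parse_project(notation)
            views = int(views)
        except ValueError:
            continue  # skip unknown projects and non-numeric counts
        counts[(project, title)][1 if is_mobile else 0] += views
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        writer.writerow([project, title, desktop, mobile])
    return out.getvalue()


if __name__ == "__main__":
    print(convert(["de.m.v Florence 32 9024", "de.v Florence 920 7570"]), end="")
```

On the two legacy rows above this prints the header row followed by the single merged `de.wikivoyage.org` row, with the byte counts dropped. Rows it cannot interpret are silently skipped here; a real conversion would probably want to log or count them instead.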
In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away.
I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below).
*The size constraints*
There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping byte-count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller.
*What I'm asking for*
Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need, here? How useful would normalisation be? How useful would backfilling be?
*What I'm not asking for*
WMF time! Like I said, this is a spare-time project; I've also got volunteers for Code Review and checking (Yuvi and Otto).
The replacement of the old files! Too many people depend on that format and that definition, and I don't want to make them sad.
Thoughts?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l