Thanks to Federico Leva (Nemo) for the follow-up question (i.e. "what
additional data would you need and why, given clarifications above?"). I
have the following suggestions for the Wikimedia staff to consider, and I
ask other Wikipedians and Wikipedia researchers to share their thoughts. I
organize my answers from the easiest to the most difficult.
Some suggestions for the current and future curation and presentation of
Wikipedia traffic data.
(I). What can be done quickly with relatively little pain but with
substantial gains?
(I-A). Tabulate the data points in absolute numbers first, not percentage
numbers
In terms of data points, this should be easy to do. It would be much more
useful if absolute numbers, not only percentages, were provided, so that
historical dynamics can be seen in absolute terms rather than in relative
percentage points.
Currently, say for Chinese Wikipedia, it is already possible to see the
trend regarding the *proportion* of Taiwan users versus that of Hong Kong.
But it would be much more helpful, for researchers and Wikipedians alike,
to see if there is a decline or increase in absolute numbers. That is to
say, for Chinese Wikipedia, we can know better if, for a specific region,
the viewing/editing traffic has increased or decreased.
We need to know whether there is growth or decline, and the current
percentage data does not allow us to do so.
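To illustrate why percentages alone are not enough, here is a minimal Python
sketch. The counts, region labels and quarter labels are all made up for
illustration (they are not real Wikipedia traffic figures), and the helper
names `shares` and `absolute_change` are hypothetical.

```python
# Hypothetical page-view counts (made up for illustration, not real data).
views = {
    "2013-Q1": {"TW": 500, "HK": 300, "other": 200},
    "2014-Q1": {"TW": 400, "HK": 200, "other": 200},
}


def shares(counts):
    """Return each region's percentage share of the total."""
    total = sum(counts.values())
    return {region: 100.0 * n / total for region, n in counts.items()}


def absolute_change(before, after):
    """Return the change in absolute counts per region."""
    return {region: after[region] - before[region] for region in before}


before, after = views["2013-Q1"], views["2014-Q1"]
print(shares(before)["TW"], shares(after)["TW"])  # share of TW in each quarter
print(absolute_change(before, after)["TW"])       # change in TW absolute views
```

In this made-up case the TW share stays at 50 percent in both quarters while
the TW absolute count drops by 100, which is exactly the kind of decline that
a percentage-only table hides.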
(I-B). Include all language versions for the *editing traffic* report as
well.
In terms of language version coverage, if the editing traffic report were
as comprehensive as the viewing traffic report, both Wikipedians and
researchers could compare *editing versus viewing* traffic and so identify
the gaps for development.
Currently, many language versions are reported with viewing traffic data
only, not with editing traffic data. Hindi, Kurdish, Uyghur, Wuu,
Cantonese, etc. are such examples:
http://users.ox.ac.uk/~kebl3178/wikipedia_traffic_hi.html
http://users.ox.ac.uk/~kebl3178/wikipedia_traffic_ku.html
http://users.ox.ac.uk/~kebl3178/wikipedia_traffic_ug.html
http://users.ox.ac.uk/~kebl3178/wikipedia_traffic_wuu.html
http://users.ox.ac.uk/~kebl3178/wikipedia_traffic_zh-yue.html
(I-C). Provide static data objects in more accessible format (i.e. csv
and/or json).
My life would be much easier if csv and/or json formats were provided, and I
believe that others' lives would be easier too. For the current outcome, I
had to scrape the data off the html pages, which was a lot of work.
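For those who have not done it, the scraping step looks roughly like the
Python sketch below. It uses only the standard library and assumes the report
is a plain html table of `<tr>`, `<th>` and `<td>` elements; the real report
pages may be structured differently, and the names `TableRows` and
`html_table_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Convert the table rows found in an html string to csv text."""
    parser = TableRows()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Even in this simplified form, every consumer of the data has to write and
maintain such a parser, which is why a directly published csv or json file
would save a lot of repeated work.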
(II). What should be done soon to provide more consistent and accessible
traffic data reports?
(II-A). Putting viewing traffic and editing traffic reports on the same
page.
For viewers' convenience, table presentation and visualization should allow
readers to compare editing traffic and viewing traffic *on the same page*
so that viewers do not have to switch between pages.
(II-B). Organizing and archiving the traffic reports for historical
comparison.
In terms of traffic data report releases (both the per-language and the
per-country ones), it would be of great help for Wikipedians and researchers
alike to organize past reports according to the period the data points cover
(i.e. annually or quarterly or even monthly) and to make the historical
pages accessible. (Currently I have to retrieve past data via the Internet
Archive.)
(II-C). Provide dynamic data objects in more accessible format (i.e. csv
and/or json).
It would be awesome to have an API developed to generate traffic data
reports in csv or json formats. Note that the infographics that I have
prototyped can be tweaked to load such data for a more interactive
experience.
(III). What should be discussed for the longer-term development to inform
Wikipedia policies and strategies using/curating the traffic reports?
(III-A). Shorter time aggregate units.
I notice that there seems to have been a shift lately from annual reports
to quarterly ones. This is a good direction, because others can compute the
annual averages themselves from the quarterly ones.
From a researcher's point of view, I would prefer more frequent and shorter
data release cycles (e.g. monthly if not weekly), and would then compute the
statistics (averages, etc.) myself so as to derive annual reports.
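Deriving the coarser figures from finer ones is simple, whereas the reverse
is impossible. A minimal Python sketch, assuming monthly data keyed by
"YYYY-MM" strings (a hypothetical format; the function name `annual_average`
is also my own):

```python
from collections import defaultdict
from statistics import mean


def annual_average(monthly):
    """Average monthly values per year; keys are assumed to be 'YYYY-MM'."""
    by_year = defaultdict(list)
    for month, value in monthly.items():
        by_year[month[:4]].append(value)  # the first four characters are the year
    return {year: mean(values) for year, values in by_year.items()}
```

The same grouping works for quarterly averages or for sums instead of means,
so a monthly release would serve all of these uses at once.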
(III-B). Smaller (i.e. more specific) geographic aggregate units.
The country (geographic) information is often based on geo-IP databases,
and sometimes provincial and city-level data are available. It would be
extremely useful if the aggregate units could be lowered one level, to the
first administrative level below the country.
This would enable important reports on the geographic distribution of
editing/viewing traffic across different provinces in mainland China or
India, or different states in the United States.
(III-C). Relevant geolinguistic and geocultural database for
country/language name/code disambiguation and queries.
First, the country codes and language codes should be provided and
maintained centrally in one place so as to help others to reuse the data
with data consistency and integrity.
Second, the country names and language names should also be provided and
maintained (preferably from Wikidata) so as to help others to localize the
traffic data reports in all languages! I believe that traffic reports in
different languages will help various Wikimedia outreach programs,
including fund-raising. Effectively, the country/language names/codes
together will provide an important "translation memory" for
identifying/converting/translating country and language names/codes.
(I know that the Unicode Common Locale Data Repository (CLDR Version 25
<http://cldr.unicode.org/index/downloads/cldr-25>) provides
"language-territory"
<http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html>
or "territory-language"
<http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html>
unit-based charts, but I believe that the Wikimedia projects can use these
and build a better one.)
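As a rough idea of what a centrally maintained code/name table would make
possible, here is a small Python sketch. The table `NAMES` and the function
`localize` are hypothetical; a real table would be drawn from a shared source
such as Wikidata or CLDR rather than typed in by hand, and would cover
language codes as well as country codes.

```python
# Hypothetical central table: ISO 3166-1 alpha-2 code -> names per language.
NAMES = {
    "TW": {"en": "Taiwan", "de": "Taiwan"},
    "HK": {"en": "Hong Kong", "de": "Hongkong"},
    "IN": {"en": "India", "de": "Indien"},
}


def localize(code, lang, fallback="en"):
    """Return the country name in `lang`, falling back to English."""
    names = NAMES[code]
    return names.get(lang, names[fallback])
```

With such a table, a traffic report keyed by codes could be rendered in any
language by swapping in the localized names at display time.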
The above suggestions are limited by my own experience and understanding of
Wikipedia and the content localization/language industry. They are of course
biased towards my own interpretations of geolinguistic methods. There are
certainly other suggestions worthy of consideration and discussion
regarding the reporting of viewing/editing traffic data. I would argue, however,
that the geolinguistic comparisons inform a more geocultural (and possibly
geopolitical) understanding that also matches how Wikipedia projects are
currently divided and governed (language versions with some regional
considerations).
Best,
han-teng liao
2014-05-17 14:50 GMT+08:00 Federico Leva (Nemo) <nemowiki(a)gmail.com>:
h, 17/05/2014 01:54:
> Thus, we might want to share what has been done
+1, but:
> and what could be done
> regarding the current traffic data provided by the Wikimedia Foundation
> while acknowledging the sensitivity of the traffic data release,
what additional data would you need and why, given clarifications above?
Have you considered using revision data instead, correlated with the
publicly available squid reports which already tell you what's the share of
each country for each language?
Nemo
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l