I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, defined as 100K+ Wikipedia articles in that language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
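One possible approach is a small lookup script: promote the plain ISO 639-1 codes with the standard 639-1 to 639-3 correspondence table, and keep a manual override list for Wikipedia codes that are not ISO 639-1 at all. The sketch below is illustrative only; both tables are excerpts, not complete lists, and the function name is ours:

    # Sketch: map a Wikipedia language code to an ISO 639-3 code.
    # Most Wikipedia codes are ISO 639-1 and can be promoted via the
    # standard 639-1 -> 639-3 table; the rest need manual overrides.

    ISO_639_1_TO_3 = {       # excerpt of the standard table
        "en": "eng",
        "fr": "fra",
        "ar": "ara",         # macrolanguage; Ethnologue may prefer an
                             # individual-language code such as 'arb'
    }

    WIKI_OVERRIDES = {       # illustrative, not a complete list
        "simple": "eng",     # Simple English
        "zh-yue": "yue",     # Cantonese
        "als": "gsw",        # Alemannic (ISO 'als' is Tosk Albanian)
    }

    def wiki_to_iso639_3(wiki_code):
        """Return the ISO 639-3 code for a Wikipedia code, or None."""
        return WIKI_OVERRIDES.get(wiki_code) or ISO_639_1_TO_3.get(wiki_code)

    print(wiki_to_iso639_3("zh-yue"))  # 'yue'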
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
My username is rbaasland and I would like to contribute to the analytics project. Could I please have access to the project, or could you tell me how to go about contributing to it?
Thank you very much,
Amir and Neta,
This is interesting research!
What was the visualization decision process like? I've often seen large sets of inter-connections visualized using chord or network diagrams; did you decide on a heat map because of some peculiarity of this dataset?
I've tracked down an unexplained EventLogging (EL) phenomenon that surfaced as a false trend in our global stats.
The data I'm looking at specifically comes from Media Viewer's EL instrumentation.
Have a look at this graph:
the big change is on Jan 7th/8th
It shows how many EL events we've recorded, per client-reported country,
over the last 90 days. The sampling factor we use has been constant for
each wiki over that period. Thus, the distribution shouldn't evolve
drastically, aside from seasonal/local trends. Apart from the Ukraine spike on a particular date (probably related to world events), the graph before Jan 7th looks like what you would expect. Then, following the outage that happened on Jan 7th, not only is the balance completely changed, but it also evolves over time (the US and China are keeping "higher than normal" levels, while the rest seems to slide down lower than pre-7th quantities), which tells me that something strange is happening and is probably unresolved.
This balance shifting over time is really problematic for tracking Media
Viewer client-side network performance, because Chinese traffic suddenly
accounting for a bigger or smaller share of the overall recorded events
creates big swings in the global averages/percentiles (since network
performance in China is bad).
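To make the effect concrete, here is a toy calculation (all numbers made up) showing how a change in each country's share of recorded events moves the global mean even though every country's own performance is flat:

    # Illustrative only: per-country latencies and shares are fake.
    latency_ms = {"US": 300, "CN": 1500, "DE": 350}

    def global_mean(shares):
        """Event-share-weighted global mean latency."""
        return sum(shares[c] * latency_ms[c] for c in shares)

    before = {"US": 0.40, "CN": 0.10, "DE": 0.50}  # pre-Jan-7 mix
    after  = {"US": 0.45, "CN": 0.25, "DE": 0.30}  # post-outage mix

    print(global_mean(before))  # 445.0 ms
    print(global_mean(after))   # 615.0 ms -- big swing, no real change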
I'm working on adding performance instrumentation to the Parsoid codebase
with statsd/node-txstatsd, and then visualizing the metrics via Grafana.
I'm at the stage where I'm looking to add the metrics' namespaces and
schema to the WMF Grafana configs.
It looks like WMF has Grafana working with Graphite/Carbon as the metrics database and Elasticsearch as the dashboard database. Where can I find the production Carbon config files to input the settings for my metrics?
Also, from my research, WMF's Carbon data-retention schema is set to '1m:1y, 10m:10y'; should I default to this as my retention schema?
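For reference, my understanding is that such a policy lives in a stanza of Graphite's storage-schemas.conf roughly like the one below; the stanza name and pattern are placeholders of mine, and only the retentions line reflects the values quoted above:

    [parsoid]
    pattern = ^parsoid\.
    retentions = 1m:1y,10m:10y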
Note that the metrics are fired off anytime the Parsoid API is used, so each datapoint doesn't necessarily represent a minute/second/etc. of data.
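For what it's worth, the wire format involved is simple. Here is a rough Python sketch of the kind of timing datagram node-txstatsd would emit (the metric name and statsd address are placeholders):

    # Sketch: time a call and fire a statsd timer, "<name>:<ms>|ms" over UDP.
    import socket, time

    STATSD_HOST, STATSD_PORT = "localhost", 8125  # assumed statsd address

    def time_and_report(metric, func, *args):
        """Run func, then send a statsd timing datagram for it."""
        start = time.time()
        result = func(*args)
        elapsed_ms = int((time.time() - start) * 1000)
        payload = "%s:%d|ms" % (metric, elapsed_ms)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))
        sock.close()
        return result

    # e.g. time_and_report("parsoid.api.parse", handle_parse, request)

Statsd then aggregates whatever arrives during each flush interval into a single Graphite datapoint, which is why firing per API call should still produce fixed-interval series on the Graphite side.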
Some of you are probably aware of the pagecounts-raw dataset hosted at http://dumps.wikimedia.org/other/pagecounts-raw/. This week, we are making a change to how this dataset is generated. This should be mostly transparent, but an announcement is needed just in case anyone notices any differences.
pagecounts-raw has historically been generated by piping the udp2log webrequest logs into a C program called webstatscollector. This code is fairly old, and the logic it uses to generate pagecounts is out of date. However, since this data has been public for so long, we made an effort to continue to support it as is.
We are still in the process of backfilling, but eventually all pagecounts-raw data after January 1 2015 will be generated from webrequest data stored in HDFS. This data is collected using Kafka, and pagecounts-raw is now generated by Hive.
You may see a slight increase in article counts. The webrequest data in HDFS is less lossy than the udp2log data.
By the way, do you know about the pagecounts-all-sites dataset? pagecounts-all-sites is in a similar format to pagecounts-raw, but comes with more up-to-date pagecount logic. Most importantly, it includes mobile site pagecounts. Perhaps you should use pagecounts-all-sites instead of pagecounts-raw, eh? :)
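If it helps anyone getting started, both datasets use the same simple line format ("<project> <page_title> <view_count> <bytes>", one gzipped file per hour). A rough Python reader might look like the following; the filename below is a placeholder:

    # Sketch: stream (project, title, views) out of an hourly pagecounts file.
    import gzip

    def read_pagecounts(path):
        """Yield (project, title, views) from one hourly pagecounts dump."""
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) != 4:
                    continue  # skip malformed lines
                project, title, views, _bytes = parts
                yield project, title, int(views)

    for project, title, views in read_pagecounts("pagecounts-20150101-000000.gz"):
        if project == "en" and views > 1000:
            print(title, views)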
https://github.com/wikimedia/analytics-webstatscollector
http://dumps.wikimedia.org/other/pagecounts-all-sites/