I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public list at http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does the count include stubs?
* Is it possible to get historic data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
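The language-code mapping asked about above could be approached roughly as follows. This is a minimal sketch, not an established Wikimedia tool: the function name wiki_to_iso639_3 and both tables are hypothetical, and the tables are small illustrative subsets. A real mapping would need the full ISO 639 code tables and a reviewed list of exceptions. The sketch checks hand-maintained exceptions first, then treats two-letter codes as ISO 639-1, and assumes any remaining three-letter code is already ISO 639-3.

```python
# Illustrative subset of ISO 639-1 -> ISO 639-3; the full table is much larger.
# Note that some targets (e.g. "ara") are macrolanguage codes in ISO 639-3.
ISO_639_1_TO_3 = {"en": "eng", "fr": "fra", "de": "deu", "ar": "ara", "cs": "ces"}

# Wikipedia codes that are not plain ISO 639 codes, or that collide with an
# unrelated ISO 639-3 code. Hand-maintained and incomplete (assumption).
WIKI_EXCEPTIONS = {
    "simple": "eng",        # Simple English Wikipedia
    "zh-yue": "yue",        # Cantonese
    "zh-min-nan": "nan",    # Min Nan
    "zh-classical": "lzh",  # Classical Chinese
    "bat-smg": "sgs",       # Samogitian
    "fiu-vro": "vro",       # Voro
    "roa-rup": "rup",       # Aromanian
    "als": "gsw",           # Alemannic; "als" in ISO 639-3 is Tosk Albanian
}


def wiki_to_iso639_3(code):
    """Return an ISO 639-3 code for a Wikipedia language code, or None."""
    code = code.lower()
    if code in WIKI_EXCEPTIONS:
        return WIKI_EXCEPTIONS[code]
    if len(code) == 2:
        # Two-letter codes are looked up as ISO 639-1.
        return ISO_639_1_TO_3.get(code)
    if len(code) == 3 and code.isalpha():
        # Assumed to already be an ISO 639-3 code.
        return code
    # Anything else needs a manual decision.
    return None


if __name__ == "__main__":
    for wiki_code in ["en", "zh-yue", "als", "ceb", "be-x-old"]:
        print(wiki_code, "->", wiki_to_iso639_3(wiki_code))
```

Codes that come back as None are the ones that need manual review. Checking the exception table first matters because a code such as "als" looks like valid ISO 639-3 but refers to a different language.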
I'm inquiring about the delay for publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta. I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
My username is rbaasland and I would like to contribute to the analytics
project. I was wondering if I could have access to the project, or how I go
about contributing to this project?
Thank you very much,
Thank you! Would you mind posting a note on Analytics(a)lists.wikimedia.org
when it is working normally again?
On Wed, Feb 11, 2015 at 1:36 PM, Henrik Abelsson <henrik(a)abelsson.com> wrote:
> Hi Kevin,
> Looking into it!
> On 11/02/15 16:36, Kevin Leduc wrote:
> Hi Henrik,
> stats.grok.se has missing data in the last week. Can you restart the
> service to see if that helps?
> Kevin Leduc
> Analytics Product Manager
TL;DR: If you think your Hive queries are currently taking longer than
usual, please find qchris in IRC, and if he is not responsive, kindly
ask someone with root on stat1002 (like Ops) to kill the process
java -Dproc_balancer -Xmx1000m [...]
Data in the Analytics cluster is not evenly distributed. Some data
nodes are >90% full, while others are half empty.
Data nodes that are >90% full are considered unhealthy and no longer
contribute to the pool of available resources. So unhealthy data nodes
no longer contribute to the total available memory in the cluster.
There are other motivations too, but the loss of memory alone is reason
enough to keep the data nodes balanced and hence healthy.
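As a rough illustration of the 90% threshold described above, the sketch below splits a set of data nodes into healthy and unhealthy groups. The node names, the usage figures, and the function split_by_health are all hypothetical; this is not how the cluster itself evaluates node health, only the same rule applied to made-up numbers.

```python
# Nodes above this fill ratio are treated as unhealthy (per the text above).
UNHEALTHY_THRESHOLD = 0.90


def split_by_health(nodes):
    """Split {name: (used, capacity)} into (healthy, unhealthy) name lists."""
    healthy, unhealthy = [], []
    for name, (used, capacity) in sorted(nodes.items()):
        if used / capacity > UNHEALTHY_THRESHOLD:
            unhealthy.append(name)
        else:
            healthy.append(name)
    return healthy, unhealthy


if __name__ == "__main__":
    # Hypothetical usage figures in arbitrary units: (used, capacity).
    example_nodes = {
        "node-a": (95, 100),
        "node-b": (50, 100),
        "node-c": (91, 100),
        "node-d": (60, 100),
    }
    healthy, unhealthy = split_by_health(example_nodes)
    print("healthy:", healthy)
    print("unhealthy:", unhealthy)
```

In this made-up example, half of the nodes are above the threshold, so half of the nodes' memory would be unavailable to the cluster until rebalancing moves data off them.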
Rebalancing has been running since 2015-02-26, but the situation is
getting worse faster than the rebalancer can keep up.
We've been up to 5 unhealthy nodes.
Since we're missing their memory, I decided that we should rebalance
more aggressively. Hence, I bumped the rebalancer's capacity, and
nodes are recovering and getting healthy again.
I am monitoring the increased-capacity rebalancer closely, but in case
you're getting blocked by it without me noticing, please find me in
IRC and let me know, so I can turn the rebalancer's capacity down.
Or if you find me unresponsive, please find someone with root on
stat1002 (like Ops) and ask thon to kill the process
java -Dproc_balancer -Xmx1000m [...]
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
I'm a student of computational physics from the Czech Republic, and I have sometimes used the data displayed at http://stats.wikimedia.org/wikimedia/squids/SquidReportCountryData.htm for my personal analysis of Wikipedia, just to see how it is used and what is trending. But it went silent during January and there are no updates for 2015. Do you plan to publish country data somewhere?