I just noticed that stats.grok.se doesn't have any data beyond Saturday May
23. Wondering if Henrik or others know what the issue is (are the Wikimedia
dumps not up-to-date, or has stats.grok.se not been running the updating
scripts?)
Vipul
EventLogging suffered from performance problems and data loss from Tuesday
2015-05-05 22:00 UTC to Wednesday 2015-05-06 20:00 UTC (22 hours).
During that period, an exceptionally large number of events was sent to the EL
server for a single schema. The system could not handle them properly, which
caused data loss (30%-40% during the period) and some small gaps in the db.
All schemas were affected.
The missing data will be backfilled during this week.
Phab Task: https://phabricator.wikimedia.org/T98588
Incident documentation:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150506-EventLo…
Cheers,
Marcel
Hi,
Are there statistics about the number of people who click on red links in
Wikimedia projects?
And about what they do as the next step - go back, close the page, create
an article, something else?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Dan – thanks for the thorough update, hope you don’t mind if I repost this to the analytics list – I bet several people on this list are eager to know where this is going.
Dario
Begin forwarded message:
>
> From: Milimetric <no-reply(a)phabricator.wikimedia.org>
> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format
> Date: May 21, 2015 at 9:31:36 AM PDT
> To: dario(a)wikimedia.org
> Reply-To: T44259+public+a4a5010c21d15736(a)phabricator.wikimedia.org
>
> Milimetric added a comment.
>
> I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:
>
> February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests were "page views". The research is here <https://meta.wikimedia.org/wiki/Research:Page_view> and the code is here <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…>
> March 2015: we used this page view definition to create a raw pageview table in Hadoop. This is queryable by Hive but it's about 3 TB per day of data. So we don't have the resources to expose it publicly
> April 2015: we used this data internally to query but it overloaded our cluster and queries were slow
> May 2015: we're working on an intermediate aggregation that would total up page counts by hour over the dimensions that we think most people care about. We estimate this will cut down size by a factor of 50
> Progress has been slow mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to mostly focus on the Pageview API.
>
> Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:
>
> Pipeline 1:
>
> Put daily aggregates into PostgreSQL. We think per article hourly data would be too big for PostgreSQL.
> Pipeline 2:
>
> Query the Hive tables directly with Impala. Impala only handles medium to small data, but it is much faster than Hive. We might be able to query the hourly data if we use this method.
> Common Pipeline after we make the choice above:
>
> Mondrian builds OLAP cubes and handles caching which is very useful with this much data
> point RESTBase to Mondrian and expose API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around
> point Saiku to Mondrian and make a new public website for exploratory analytics. Saiku is an open source OLAP cube visualization and analysis tool
> Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
>
>
> TASK DETAIL
> https://phabricator.wikimedia.org/T44259 <https://phabricator.wikimedia.org/T44259>
> EMAIL PREFERENCES
> https://phabricator.wikimedia.org/settings/panel/emailpreferences/ <https://phabricator.wikimedia.org/settings/panel/emailpreferences/>
> To: Milimetric
> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
+analytics
On Tue, May 19, 2015 at 3:23 PM, Brian Gerstle <bgerstle(a)wikimedia.org>
wrote:
> +search
>
> On Tue, May 19, 2015 at 3:14 PM, Brian Gerstle <bgerstle(a)wikimedia.org>
> wrote:
>
>> The subject hints at a question that's been nagging me for a while, and
>> now that I'm going to be hacking on testing in Lyon I wanted to ask:
>>
>> Do we have a list of articles we usually run tests against?
>>
>> If not, do we have any processes for curating such a list? Would anyone
>> be interested in a brainstorming session at Lyon to discuss this further?
>>
>> Basically, as a developer, I would love to have more confidence that some
>> code I wrote doesn't break on our most popular articles. Or, if we can get
>> more sophisticated, that *certain properties of my code hold true for
>> certain kinds of generated pages*.*
>>
>> Please respond with your thoughts and whether you think I should create a
>> phab task for the hackathon about this. In either case, ping me anytime or
>> grab me at Lyon to discuss further!
>>
>> Regards,
>>
>> Brian
>>
>> * Yes, I'm talking about using property-based testing generators to
>> create random, shrinkable MW pages that we can run tests on. Not sure if
>> it's practical, but could be an interesting experiment.
>>
>> --
>> EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
>> IRC: bgerstle
>>
>
>
>
> --
> EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
> IRC: bgerstle
>
--
EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
IRC: bgerstle
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]
Reid and his team spearheaded the use of the public Wikipedia pageview dumps
to monitor and forecast the spread of influenza and other diseases [2], using
language as a proxy for location. This proposal describes an aggregation
strategy that adds a geographical dimension to the existing dumps.
Feedback on the proposal is welcome on the lists or on the project talk page
on Meta [3].
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
Hello,
I am publishing a paper reporting the impact of distributing information on
Wikipedia.
One of the values which I am reporting is the pageviews of a set of English
Wikipedia articles as measured from 2013 to 2015. I get pageviews from
stats.grok.se. As I understand, numbers there have not always included
mobile device pageviews.
What is the best estimate of the count of mobile device pageviews that can be
derived from the stats.grok.se pageview count? I think I read somewhere that,
for this range, mobile device pageviews are supposed to be anywhere from 40%
of the grok.se views to 120% of that value.
What is the most reasonable range to report for mobile device pageviews of
English Wikipedia articles from 2013-2015? Is 40-120% of the stats.grok.se
count the most reasonable range to report?
I need to report something. If there is any precedent for expressing this
somewhere then I would like to follow the precedent and cite whatever paper
described it.
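In case it helps make the question concrete, here is a minimal sketch of the
arithmetic, assuming the 40%/120% bounds above (the function name and the
example figure are illustrative, not from any published source):

```python
# Sketch: reportable range for mobile pageviews, assuming mobile traffic was
# anywhere from 40% to 120% of the desktop count shown on stats.grok.se.
# The 0.40/1.20 bounds are the heuristic quoted above, not an official figure.
def mobile_view_range(grok_views, low=0.40, high=1.20):
    """Return (lower, upper) integer estimates for mobile pageviews."""
    return (round(grok_views * low), round(grok_views * high))

# e.g. an article with 100,000 stats.grok.se views over the period
print(mobile_view_range(100_000))  # (40000, 120000)
```

So for any article, the reported range would simply be 0.4x to 1.2x of the
stats.grok.se count, with the caveat that the bounds themselves are uncited.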
Thanks,
--
Lane Rasberry
user:bluerasberry on Wikipedia
206.801.0814
lane(a)bluerasberry.com
Hi everyone,
The next research showcase will be live-streamed this Wednesday, May 13, at
11:30 PT. The streaming link will be posted on the lists a few minutes
before the showcase starts and as usual, you can join the conversation on
IRC at #wikimedia-research.
We look forward to seeing you!
Leila
This month
*The people's classifier: Towards an open model for algorithmic
infrastructure*
By Aaron Halfaker <https://www.mediawiki.org/wiki/User:Halfak_(WMF)>
Recent research has suggested that Wikipedia's algorithmic infrastructure is
perpetuating social issues. However, these same algorithmic tools are
critical to maintaining the efficiency of open projects like Wikipedia at
scale. Rather than simply critiquing algorithmic wiki-tools and calling for
less algorithmic infrastructure, I'll propose a different strategy -- an open
approach to building this algorithmic infrastructure. In this presentation,
I'll demo a set of services that are designed to open up a critical part of
Wikipedia's quality control infrastructure -- machine classifiers. I'll also
discuss how this strategy unites critical/feminist HCI with more dominant
narratives about efficiency and productivity.
*Social transparency online*
By Jennifer Marlow <http://www.aboutjmarlow.com/> and Laura Dabbish
<http://www.lauradabbish.com/>
An emerging Internet trend is greater social transparency, such as the use
of real names in social networking sites, feeds of friends' activities,
traces of others' re-use of content, and visualizations of team
interactions. There is a potential for this transparency to radically
improve coordination, particularly in open collaboration settings like
Wikipedia. In this talk, we will describe some of our research identifying
how transparency influences collaborative performance in online work
environments. First, we have been studying professional social networking
communities. Social media allows individuals in these communities to create
an interest network of people and digital artifacts, and get
moment-by-moment updates about actions by those people or changes to those
artifacts. It affords an unprecedented level of transparency about the
actions of others over time. We will describe qualitative work examining
how members of these communities use transparency to accomplish their
goals. Second, we have been looking at the impact of making workflows
transparent. In a series of field experiments we are investigating how
socially transparent interfaces, and activity trace information in
particular, influence perceptions and behavior towards others and
evaluations of their work.
I just killed 100+ 3-day-old unindexed research queries on dbstore1002.
All replication streams were lagging by nearly 1 day, and /tmp usage had
grown to hundreds of GB.
This looks similar to the problem from ~2 weeks ago, when /tmp did fill
up. The queries were of the following form (with some variation). We
need an indexing scheme for the MobileWebEditing* tables, or a new
approach.
SELECT
    Month.Date,
    COALESCE(Web.Web, 0) AS Web
-- http://stackoverflow.com/a/6871220/365238
-- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns
FROM (
    SELECT DATE_FORMAT(
        ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1),
        '%Y-%m-%d'
    ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT
        DATE(timestamp) AS Date,
        SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_5644223
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_6077315
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_6637866
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_7675117
        UNION SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount FROM MobileWebEditing_8599025
    ) AS MobileWebEditing
    WHERE
        event_action = 'error' AND
        wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
EXPLAIN:
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
| id   | select_type  | table                    | type   | possible_keys | key     | key_len | ref        | rows     | Extra                                        |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
|    1 | PRIMARY      | <derived2>               | ALL    | NULL          | NULL    | NULL    | NULL       |       30 |                                              |
|    1 | PRIMARY      | <derived4>               | ref    | key0          | key0    | 4       | Month.Date |   563154 | Using where                                  |
|    4 | DERIVED      | <derived5>               | ALL    | NULL          | NULL    | NULL    | NULL       | 56315405 | Using where; Using temporary; Using filesort |
|    5 | DERIVED      | MobileWebEditing_5644223 | ALL    | NULL          | NULL    | NULL    | NULL       |  1152600 |                                              |
|    6 | UNION        | MobileWebEditing_6077315 | ALL    | NULL          | NULL    | NULL    | NULL       |   685212 |                                              |
|    7 | UNION        | MobileWebEditing_6637866 | ALL    | NULL          | NULL    | NULL    | NULL       |  1528269 |                                              |
|    8 | UNION        | MobileWebEditing_7675117 | ALL    | NULL          | NULL    | NULL    | NULL       |  1663281 |                                              |
|    9 | UNION        | MobileWebEditing_8599025 | ALL    | NULL          | NULL    | NULL    | NULL       | 51286043 |                                              |
| NULL | UNION RESULT | <union5,6,7,8,9>         | ALL    | NULL          | NULL    | NULL    | NULL       |     NULL |                                              |
|    2 | DERIVED      | <derived3>               | system | NULL          | NULL    | NULL    | NULL       |        1 |                                              |
|    2 | DERIVED      | seq_1_to_100             | index  | NULL          | PRIMARY | 8       | NULL       |      100 | Using index                                  |
|    3 | DERIVED      | NULL                     | NULL   | NULL          | NULL    | NULL    | NULL       |     NULL | No tables used                               |
+------+--------------+--------------------------+--------+---------------+---------+---------+------------+----------+----------------------------------------------+
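One possible indexing scheme, sketched here with SQLite as a stand-in for
MariaDB (the demo table and the index name ix_action_ts are illustrative,
not the production schema): since the query filters on event_action and
groups by day of timestamp, a composite index on (event_action, timestamp)
would let the equality filter avoid full scans of the large tables.

```python
# Hedged sketch of one possible indexing scheme, using SQLite in place of
# MariaDB. The demo table mirrors the columns the query touches; the index
# name ix_action_ts is illustrative, not part of the production schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE MobileWebEditing_demo (
    timestamp TEXT, wiki TEXT, event_username TEXT,
    event_action TEXT, event_namespace INTEGER, event_userEditCount INTEGER)""")

query = ("SELECT DATE(timestamp), COUNT(*) FROM MobileWebEditing_demo "
         "WHERE event_action = 'error' AND wiki != 'testwiki' "
         "GROUP BY DATE(timestamp)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail.
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)  # full table scan to evaluate the event_action filter
con.execute("CREATE INDEX ix_action_ts ON MobileWebEditing_demo "
            "(event_action, timestamp)")
after = plan(query)   # the equality filter is now served by the index

print(before)
print(after)
```

On the real MariaDB tables the equivalent would be an ALTER TABLE ... ADD
INDEX per MobileWebEditing_* table; whether that is acceptable for the
EventLogging replicas is exactly the open question.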
---
DBA @ WMF