When I was working on related stuff, I found that the value of
x_analytics_map is null on the wmf.webrequest table on stat1002 when
is_pageview is filtered for true and agent_type is user. I'm wondering why
that would be.
These are the things I found:
For 28th April 2015, of 741,858,511 requests, 28,827,374 have
x_analytics set to '-' and x_analytics_map set to null. That's about 3.9%
of all requests that day.
You can find these counts by doing something like this on hive in the
WHERE x_analytics_map IS NULL
AND agent_type = 'user'
AND is_pageview = TRUE
AND YEAR = 2015
AND MONTH = 4
AND DAY = 28;
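A minimal sketch of how the quoted share follows from the two counts, together with one way the count query might be completed. The WHERE clause is the one from the message; the SELECT and FROM lines are assumptions (the table name is taken from the first paragraph), since the start of the original statement is not shown.

```python
# Counts quoted in the message for 2015-04-28.
TOTAL_REQUESTS = 741_858_511
NULL_MAP_REQUESTS = 28_827_374

# The WHERE clause comes from the message. SELECT COUNT(*) and the FROM line
# are assumptions, because the beginning of the original statement is missing.
QUERY = """
SELECT COUNT(*)
FROM wmf.webrequest
WHERE x_analytics_map IS NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND year = 2015
  AND month = 4
  AND day = 28;
"""


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(QUERY.strip())
    share = share_percent(NULL_MAP_REQUESTS, TOTAL_REQUESTS)
    print(f"{share:.1f}% of requests have a null x_analytics_map")
```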
Does anyone have ideas on why this might be and if something underlying is
Hi Analytics dev team,
just a heads up that it has been a week since the 20150409-160000 file
failed to get generated for pagecounts-all-sites (and pagecounts-raw).
To ease data quality assurance and avoid faulty aggregates, the
pageview aggregator scripts that do the aggregation for dashiki's
“Reader / Daily Pageviews” block for a week on missing data (unless
they are told that missing data is ok for a given day).
For the above hourly pagecounts-all-sites file, this week of blocking
has now passed without action.
Hence, the aggregator scripts will start aggregating again (to some
degree), but the undeclared hole in the data for 2015-04-09 will
naturally start to bubble up.
If that hour's file cannot get generated, adding this date to the
BAD_DATES.csv of the aggregator data repository will unblock the
aggregator cron job and make the weekly and monthly aggregates treat
2015-04-09 as a day without data.
If that hour's file gets generated, be aware that the aggregator by
default only backfills automatically for a week. So from today on, you
need to run the script explicitly to backfill for 2015-04-09.
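The blocking and backfill rules described above can be sketched roughly as follows. This is not the aggregator's actual code: the function names are hypothetical, BAD_DATES.csv is modelled as a plain set of dates because its format is not described here, and the exact seven-day boundary is an assumption (the message only says "a week").

```python
from datetime import date, timedelta

# Assumption: both the blocking period and the automatic backfill period are
# seven days; the message only says "a week".
WEEK = timedelta(days=7)


def is_blocking(missing_day: date, today: date, bad_dates: set) -> bool:
    """Return True while a day with missing data still holds back aggregation."""
    if missing_day in bad_dates:
        # Declared as a day without data, so aggregation may proceed.
        return False
    return today - missing_day < WEEK


def needs_manual_backfill(day: date, today: date) -> bool:
    """Return True once a day is too old to be backfilled automatically."""
    return today - day >= WEEK


if __name__ == "__main__":
    hole = date(2015, 4, 9)
    print(is_blocking(hole, date(2015, 4, 12), set()))
    print(is_blocking(hole, date(2015, 4, 12), {hole}))
    print(needs_manual_backfill(hole, date(2015, 4, 17)))
```

Listing the date in bad_dates is the sketch's stand-in for adding it to BAD_DATES.csv: it lifts the block immediately instead of waiting for the week to pass.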
P.S.: Since I guess the question of monitoring will arise ... the
missing pagecounts file has alerted people at least twice by email.
The subsequent aggregator blocking has been logged.
But you can add yourself in the MAILTO of the aggregator cron at
in puppet, if you want an additional notification for that.
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Hi Wikipedia analysts,
I work in product development for the Insights team at Way To Blue. We provide major film studios with reports on how their films are doing on social platforms over the course of the marketing campaign, up to a film's release in cinemas.
We typically focus on social media volumes (traditionally split by Twitter, Forums, Blogs and News). However, I just stumbled across the free Wikipedia article traffic statistics site (http://stats.grok.se/) and am thinking that daily Wikipedia traffic could be an interesting additional metric for us to report. I have a few questions about daily Wikipedia traffic data:
- Are you affiliated with the site above, or do you independently cover traffic volume?
- Do you know if it is possible to attribute a source country to the traffic count? E.g. I can see that the daily traffic on the Tomorrowland (film) page was 5773 yesterday; it would be interesting to know what proportion of that comes from the US vs. the UK, etc.
We have never used Wikipedia traffic as a source, so if you are the wrong team to be asking about this please let me know!
s1-analytics-slave has been struggling recently (SUL finalization
load, plus some other stuff). I've had to pause EventLogging
replication there tonight in order to let S1 catch up, as well as do
some table maintenance.
I estimate 24 hours impact.
analytics-store is not affected.
Now that all data that is generated by udp2log is also being generated by the Analytics Cluster, we are finally ready to turn off analytics udp2log instances. I will start with the ones that are used to generate the logs on stat1002 at /a/squid/archive. The (identical) cluster generated logs can be found on stat1002 at /a/log/webrequest/archive. I will paste the contents of the README file in /a/squid/archive describing the differences at the bottom of this email.
If you use any of the logs in /a/squid/archive for regular statistics, you will need to switch your code to use files in /a/log/webrequest/archive instead. I plan to start turning off udp2log instances on Monday April 27th (that’s next week!).
From the README:
[@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
* This directory will run stale once udp2log will get turned off. *
* Please use the corresponding TSVs from /a/log/webrequest/archive/ *
* instead. *
The TSV files in this directory underneath /a/squid/archive get
generated by udp2log and suffer from
* Sub-par data quality (E.g.: udp2log had an inherent loss).
* Lack of a way to backfill/fix data.
* Some files consuming https requests twice, which made filtering
* Confusing naming scheme, where each file covered 24 hours, but not
midnight to midnight, but ~06:30 previous day to ~06:30 current day.
The new TSVs at /a/log/webrequest/archive/ contain the same
information but get generated by Hive, and address the above four
issues:
* By using Hive's webrequest table as input, the inherent loss is
gone. Also statistics on the hour's data quality are available.
* Hive data allows backfilling/fixing data.
* Only data from the varnishes gets picked up. So https traffic no
longer gets duplicated.
* The files now cover 24 hours from midnight to midnight. No more
stitching/cutting is needed to get the logs for a given day.
Please migrate to using the Hive-generated TSVs from
Thanks! I’ll keep you updated as this happens.
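To illustrate the README's point about file coverage, here is a small sketch of the time windows involved. It assumes each old file is labelled with the day on which its window ends and uses 06:30 as the cutover, although the README only gives an approximate time; the function names are made up for this example.

```python
from datetime import date, datetime, time, timedelta

# Assumption: the README says "~06:30", so the exact cutover time varied.
CUTOVER = time(6, 30)


def old_file_window(file_day: date):
    """Window of an old udp2log file: ~06:30 previous day to ~06:30 file day."""
    end = datetime.combine(file_day, CUTOVER)
    return end - timedelta(days=1), end


def new_file_window(file_day: date):
    """Window of a Hive-generated file: midnight to midnight."""
    start = datetime.combine(file_day, time(0, 0))
    return start, start + timedelta(days=1)


def old_files_for_day(day: date):
    """Old files that had to be stitched together to cover one calendar day."""
    return [day, day + timedelta(days=1)]


if __name__ == "__main__":
    day = date(2015, 4, 28)
    print(old_file_window(day))
    print(new_file_window(day))
    print(old_files_for_day(day))
```

Under these assumptions, covering one calendar day from the old files takes two files plus cutting at midnight, while a single new file covers it directly.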
And here are the minutes and slides from the remaining two quarterly
review meetings from this round:
Legal, Finance, Talent & Culture (HR), Communications:
Analytics, User Experience, Team Practices, Product Management
Much of the content of the slides will also (in somewhat more polished
form) feature in the WMF quarterly report, which is planned to be
published by May 15.
On Tue, Apr 21, 2015 at 9:53 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
> Hi all,
> the quarterly reviews for the past quarter (January-March 2015) took
> place last week. Minutes and slides are now available for the
> following meetings:
> Community Engagement, Advancement (Fundraising and Fundraising Tech):
> Mobile Web, Mobile Apps, Wikipedia Zero:
> Parsoid, Services, MediaWiki Core, Tech Ops, Release Engineering,
> Multimedia, Labs, Engineering Community:
> Editing (covering VisualEditor), Collaboration (covering Flow),
> Language Engineering:
> As mentioned in February, the quarterly review process has been
> extended to basically all groups in the Foundation since Lila took the
> helm last year, and it was further refined this quarter, reducing the
> number of meetings to six overall, each combining several areas.
> Minutes and slides from the remaining two meetings should come out
> soon, too. (And naturally, all the engineering team names above refer
> to the structure before the reorganization that has just been
>  https://lists.wikimedia.org/pipermail/wikimedia-l/2015-February/076835.html
> On Wed, Dec 19, 2012 at 6:49 PM, Erik Moeller <erik(a)wikimedia.org> wrote:
>> Hi folks,
>> to increase accountability and create more opportunities for course
>> corrections and resourcing adjustments as necessary, Sue's asked me
>> and Howie Fung to set up a quarterly project evaluation process,
>> starting with our highest priority initiatives. These are, according
>> to Sue's narrowing focus recommendations which were approved by the
>> Board:
>> - Visual Editor
>> - Mobile (mobile contributions + Wikipedia Zero)
>> - Editor Engagement (also known as the E2 and E3 teams)
>> - Funds Dissemination Committee and expanded grant-making capacity
>> I'm proposing the following initial schedule:
>> - Editor Engagement Experiments
>> - Visual Editor
>> - Mobile (Contribs + Zero)
>> - Editor Engagement Features (Echo, Flow projects)
>> - Funds Dissemination Committee
>> We’ll try doing this on the same day or adjacent to the monthly
>> metrics meetings, since the team(s) will give a presentation on
>> their recent progress, which will help set some context that would
>> otherwise need to be covered in the quarterly review itself. This will
>> also create open opportunities for feedback and questions.
>> My goal is to do this in a manner where even though the quarterly
>> review meetings themselves are internal, the outcomes are captured as
>> meeting minutes and shared publicly, which is why I'm starting this
>> discussion on a public list as well. I've created a wiki page here
>> which we can use to discuss the concept further:
>> The internal review will, at minimum, include:
>> Sue Gardner
>> Howie Fung
>> Team members and relevant director(s)
>> Designated minute-taker
>> So for example, for Visual Editor, the review team would be the Visual
>> Editor / Parsoid teams, Sue, me, Howie, Terry, and a minute-taker.
>> I imagine the structure of the review roughly as follows, with a
>> duration of about 2 1/2 hours divided into 25-30 minute blocks:
>> - Brief team intro and recap of team's activities through the quarter,
>> compared with goals
>> - Drill into goals and targets: Did we achieve what we said we would?
>> - Review of challenges, blockers and successes
>> - Discussion of proposed changes (e.g. resourcing, targets) and other
>> action items
>> - Buffer time, debriefing
>> Once again, the primary purpose of these reviews is to create improved
>> structures for internal accountability, escalation points in cases
>> where serious changes are necessary, and transparency to the world.
>> In addition to these priority initiatives, my recommendation would be
>> to conduct quarterly reviews for any activity that requires more than
>> a set amount of resources (people/dollars). These additional reviews
>> may however be conducted in a more lightweight manner and internally
>> to the departments. We’re slowly getting into that habit in
>> As we pilot this process, the format of the high priority reviews can
>> help inform and support reviews across the organization.
>> Feedback and questions are appreciated.
>> All best,
>>  https://wikimediafoundation.org/wiki/Vote:Narrowing_Focus
>>  https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>> Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
We've just released a count of pageviews to the English-language
Wikipedia from 2015-03-16T00:00:00 to 2015-04-25T15:59:59, grouped by
timestamp (down to a one-second resolution level) and site (mobile or
The smallest number of events in a group is 645; because of this, we
are confident there should not be privacy implications of releasing
this data. We checked with legal first ;p. If you're interested in
getting your mitts on it, you can find it at DataHub
Would you be so kind as to document the mobile apps info that should be
present in the X-Analytics header for apps requests?
I thought the uuid that identifies a unique user of the app was "uuid", but
I also see a "wmfuuid" and I am not sure whether these two are the same. I
need to clarify this to be able to calculate mobile sessions.
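For context, a minimal sketch of how such a header could be turned into a map when looking for either key. It assumes the header is a semicolon-separated list of key=value pairs and that '-' marks an unset header (as in the x_analytics report earlier in this digest). The function names and sample values are hypothetical, and the fallback order between "wmfuuid" and "uuid" is an arbitrary choice, not an answer to the question.

```python
from typing import Dict, Optional


def parse_x_analytics(header: str) -> Optional[Dict[str, str]]:
    """Parse an X-Analytics style header into a dict, or None if it is unset."""
    header = header.strip()
    if header in ("", "-"):
        return None
    fields: Dict[str, str] = {}
    for pair in header.split(";"):
        key, sep, value = pair.partition("=")
        # Skip fragments that are not of the form key=value.
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def app_install_id(fields: Optional[Dict[str, str]]) -> Optional[str]:
    """Pick an app identifier from the parsed header, if one is present."""
    if not fields:
        return None
    # Assumption: try "wmfuuid" first, then "uuid". Whether the two keys
    # carry the same identifier is exactly the open question.
    return fields.get("wmfuuid") or fields.get("uuid")


if __name__ == "__main__":
    sample = "uuid=abc123;wmfuuid=def456"  # made-up values
    parsed = parse_x_analytics(sample)
    print(parsed)
    print(app_install_id(parsed))
    print(parse_x_analytics("-"))
```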