Hi,
in the week from 2014-11-03 to 2014-11-09, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* More research around making xmldumps available in the Analytics cluster
* 'research' database user
* Altering data type of time_firstbyte, and adding Range header to webrequest table
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
* Per-wiki CSVs with daily aggregates of webstatscollector numbers
(details below)
Have fun,
Christian
* More research around making xmldumps available in the Analytics cluster
In order to make the xmldumps easily accessible from within the
cluster, we did more research around WikiHadoop as InputFormat and
Avro as serialization format. The proof of concept allowed us to
stream the xmldumps through WikiHadoop and write them into Avro
files. This approach chunks the xmldumps into records containing the
text (and metadata) for a revision and its parent revision. Those
records could be consumed directly from the cluster's processing
platforms.
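As a sketch of what consuming those records could look like, the Avro
files could be mapped into a Hive table along these lines (the field
names, the location, and STORED AS AVRO being available on our Hive
version are all assumptions):

  CREATE EXTERNAL TABLE xmldumps_revisions (
    page_id     BIGINT,
    rev_id      BIGINT,
    rev_text    STRING,
    parent_id   BIGINT,
    parent_text STRING
  )
  STORED AS AVRO
  LOCATION '/wmf/data/xmldumps/avro';  -- path is an assumption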
* 'research' database user
The code needed to make password changes of the 'research' database
user easier got merged, so future password changes should be more
frictionless. And the password was finally changed :-)
* Altering data type of time_firstbyte, and adding Range header to webrequest table
To be able to (at least mostly) disambiguate “seeking within a video
file” from “starting to watch a video file” in the logs, we needed to
add the Range header to the webrequest table.
Additionally, the data type of the time_firstbyte column needed to be
changed.
While both migrations should just work according to Hive's
documentation, testing them beforehand in labs showed that they
“sometimes” screw up the table. So we prepared scripts to resurrect
the table in case the migration blew it up in the Analytics cluster.
We migrated the table.
The table exploded (Bug 73095).
And the prepared scripts helped to rebuild it within a few minutes.
Now, webrequest has the needed Range header, a more granular data
type for time_firstbyte, and all of its partitions re-added.
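For reference, the two migrations boil down to statements of roughly
this shape (a sketch; the exact types are assumptions, e.g. that
time_firstbyte went from FLOAT to DOUBLE):

  ALTER TABLE webrequest ADD COLUMNS (`range` STRING);
  ALTER TABLE webrequest
    CHANGE time_firstbyte time_firstbyte DOUBLE;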
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
File logs of EventLogging data on stat1002 and stat1003 are now
automatically cleaned up after 90 days, as required by the data
retention guidelines.
* Per-wiki CSVs with daily aggregates of webstatscollector numbers
In order to ease the upcoming plotting of webstatscollector data in
Dashiki, we wrote code to automatically aggregate webstatscollector's
hourly projectcounts files into per-wiki CSVs with daily numbers. And
we backfilled with data back to 2008. The code and data are still
under code review.
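Purely as an illustration of the aggregation step (the actual code
works directly on the flat projectcounts files; the table and column
names below are made up):

  SELECT
    project,
    CONCAT(year, '-', month, '-', day) AS date,
    SUM(view_count) AS daily_views
  FROM hourly_projectcounts
  GROUP BY project, year, month, day;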
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
(Moving this discussion to analytics@ and localization-team@ based on Nuria’s suggestion below.)
Hi Leila,
The output I posted in the message is the only output I am seeing. I do not see the URL-encoded section or the validation section. I think there may be something wrong with my testing setup.
Niklas Laxström has checked what is happening with our event logging in beta and he confirmed that we are sending events and the events are valid. The issue seems to be that we are logging events to the beta event logging db, while what we checked earlier was the production event logging db.
Can you (or anyone who is available) check the event logging db in beta to see if the table has been created and has data? The schema name again is ContentTranslation. If you don’t find anything, let us know and we will do some more investigation.
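Assuming the usual EventLogging convention of one table per schema revision in the log database (and taking the revision number from the sample event quoted below in this thread), a check along these lines should tell:

  SHOW TABLES IN log LIKE 'ContentTranslation%';
  SELECT COUNT(*) FROM log.ContentTranslation_7146627;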
If there is data in the beta db, the next step would be to follow Dan’s instructions to get a dashboard set up on limn1. I believe that most of Dan’s instructions need to be handled by someone on the analytics team, but let me know if there is anything I can help with.
Thanks again for your help!
Joel
Joel Sahleen, Software Engineer
Language Engineering
Wikimedia Foundation
jsahleen(a)wikimedia.org
On Nov 11, 2014, at 11:47 PM, Leila Zia <leila(a)wikimedia.org> wrote:
> Hi Joel,
>
> When you log events, the output will be the URL-encoded JSON sent by the browser, the event record (similar to what you pasted in your email), and whether the event validates against the schema. For the sample output you pasted earlier, or another sample output, can you let us know if the validation section shows Valid?
>
> Leila
>
> On Mon, Nov 10, 2014 at 3:24 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> Joel,
>
> For questions like these, going forward you can contact analytics@ as you will be getting a more prompt response. Both Dan and Leila are OOTO the next couple of days.
>
> >There are configuration options for the dev server that need to be added. Do similar options need to be added when not using the dev server?
> No, there is no need.
>
> You would need sampling rates to determine at which rate you are logging, that is, if you are not logging all events.
>
> Thanks,
>
> Nuria
>
> On Mon, Nov 10, 2014 at 2:39 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
> Adding Nuria as she can probably help
>
> On Monday, November 10, 2014, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
> Hi Leila,
>
> I have tested our EventLogging code and it seems to be working fine with the event logging dev server. I can see the events coming through and they are valid. Here is some sample output:
>
> {"wiki": "wiki", "uuid": "e9dde14cf18552269ae81a7897f45d0c", "webHost": "localhost", "timestamp": 1415651367, "clientValidated": true, "recvFrom": "1.0.0.127.in-addr.arpa", "seqId": 2, "clientIp": "80f7683f3565e3d365740a1c8d1771ba95caaaaa", "schema": "ContentTranslation", "event": {"action": "create-translated-page", "targetLanguage": "ca", "token": "Tester", "version": 1, "contentLanguage": "es"}, "revision": 7146627}
>
> Are there additional configuration options we need to add to get EL working, aside from just requiring the main extension file? There are configuration options for the dev server that need to be added. Do similar options need to be added when not using the dev server?
>
> Any help on this would be much appreciated.
>
> Thanks,
>
> Joel
>
> On Nov 7, 2014, at 3:52 PM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>
>> No problem, Dan. Enjoy your vacation!
>>
>> I will read through the document at the link you sent. I still need to fix our event logging code so it may be a couple days before we are ready anyway. If I have any questions I will contact Leila or Nuria.
>>
>> Thanks,
>>
>> Joel
>>
>> Joel Sahleen, Software Engineer
>> Language Engineering
>> Wikimedia Foundation
>> jsahleen(a)wikimedia.org
>>
>>
>>
>>
>> On Nov 7, 2014, at 3:10 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
>>
>>> Joel, re: visualization,
>>>
>>> I'm going on vacation tomorrow and will be back on November 19th. If that's not too late, I can set up a limn instance then. If it's too late, that's ok, I wrote up the steps needed. Someone with access to the limn1.eqiad.wmflabs instance can perform them: https://wikitech.wikimedia.org/wiki/Analytics/Dashboards
>>>
>>> If you have the data or are generating the data in some other way, then you don't need half of that setup, you just need the part that sets up the limn dashboard which is only an hour or so of work. Sorry I'm running out the door and can't take care of that for you.
>>>
>>> Dan
>>>
>>> On Fri, Nov 7, 2014 at 7:37 AM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>>> Thank you for the information, Pau. Very helpful. As you say, this does not change our current plans or hold us up in any way. I just wasn't clear about the relationship between the "high priorities" and "other metrics" sections. Knowing these came from different people at different times clarifies things a lot.
>>> Joel
>>>
>>> On Nov 7, 2014, at 3:44 AM, Pau Giner <pginer(a)wikimedia.org> wrote:
>>>
>>>> @Pau, @Amir There is a section called High priorities for product management on the Content translation analytics page. Did these priorities come from outside the team or does this just represent our own internal view of the high priorities?
>>>>
>>>> Here is the story of that page as I'm aware of it:
>>>>
>>>> In September 2013, I was in a meeting with the analytics team in SF presenting an initial proposal for metrics. In that meeting, Dario recommended creating a hierarchy of metrics based on the project goals. I created such an image and a description for those metrics (the image is on top of our analytics page, and the metrics are described in what is now the "Other metrics for created articles" section).
>>>>
>>>> In a meeting between Amir and Howie, they captured the most important metrics from the product perspective in the "High priorities for product management" section. If I recall correctly, as an outcome of later meetings between Howie and Amir, Howie was happy focusing on articles published as a single (initial?) metric for success. Amir can provide more details since I was not in those meetings.
>>>>
>>>> In short: The analytics page has pieces contributed by different people over the last year, and although there are many ideas to organise and detail, measuring the number of published articles seems to be the solid candidate to get started with, so we can learn from the value we get from it and polish the rest of our goal-to-signal process for detecting better metrics.
>>>>
>>>>
>>>> Pau
>>>>
>>>> On Fri, Nov 7, 2014 at 1:57 AM, Joel Sahleen <jsahleen(a)wikimedia.org> wrote:
>>>> Hi All,
>>>>
>>>> I have been reviewing our requirements for Content translation analytics and I have a few questions/requests. I am sending them to the language team list and Leila and Dan in the hopes of getting some more clarity. I will add the same content to the Trello card.
>>>>
>>>> In the weekly team meeting earlier today we agreed that the first metric we want to collect data for is the number of articles created in each language over time. This is something Amir has already set up our current Event Logging to track. Now that Kartik has enabled EL in beta, that part should be done. Since we are only just turning it on, there will be very little data until people create more articles using CX. However, we should be set up to collect any new data that comes in.
>>>>
>>>> @Leila, can you verify that the db table now exists for the ContentTranslation schema? If it doesn’t, can you point us to the right people we need to work with to troubleshoot the issue? Also, you mentioned in our meeting that personal data may soon be purged after 90 days as part of a new privacy policy. Could you explain that a bit more or point us to more information? If this is the case, it may affect some of the metrics we would like to collect in the future.
>>>>
>>>> @Dan, what do we need to do next in order to set up a very simple visualization that would show the number of articles created per week by language? Pau has an image of what he would like on the Trello card. You mentioned something about being able to host a dashboard for us on one of the Limn servers you already have set up.
>>>>
>>>> @Santhosh, I believe you said earlier that you have a script you use to export the data for the ULS analytics. If so, can you please share it in case we need a similar script for CX, so I don’t have to write a new one from scratch?
>>>>
>>>> @Pau, @Amir There is a section called High priorities for product management on the Content translation analytics page. Did these priorities come from outside the team or does this just represent our own internal view of the high priorities? If the latter, have these priorities been reviewed by anyone outside the team? I think we are safe to proceed with our current plan, but it would be good to have product sign off on things more generally.
>>>>
>>>> Thanks,
>>>>
>>>> Joel
>>>>
>>>> Joel Sahleen, Software Engineer
>>>> Language Engineering
>>>> Wikimedia Foundation
>>>> jsahleen(a)wikimedia.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Pau Giner
>>>> Interaction Designer
>>>> Wikimedia Foundation
I am trying to migrate Limn graphs to our own handling. Currently Zero
graphs are generated as Limn dashboards, and after I applied this filter
(taken from the HQL for counting article pageviews), I got a rough match
(about 10% discrepancy) between our graphs and Limn. Yet, one partner has
a discrepancy of 10 times, and I would like to see where that mismatch
comes from. I looked at https://github.com/wikimedia/analytics-wp-zero
but it seems there is other code that's missing from that repo. Any
suggestions are welcome. Thanks!
WHERE
webrequest_source IN ('text', 'mobile')
AND year=${year}
AND month=${month}
AND day=${day}
AND x_analytics LIKE '%zero=%'
AND SUBSTR(uri_path, 1, 6) = '/wiki/'
AND (
(
SUBSTR(ip, 1, 9) != '10.128.0.'
AND SUBSTR(ip, 1, 11) NOT IN (
'208.80.152.',
'208.80.153.',
'208.80.154.',
'208.80.155.',
'91.198.174.'
)
) OR x_forwarded_for != '-'
)
AND SUBSTR(uri_path, 1, 31) != '/wiki/Special:CentralAutoLogin/'
AND http_status NOT IN ( '301', '302', '303' )
AND uri_host RLIKE
'^[A-Za-z0-9-]+(\\.(zero|m))?\\.[a-z]*\\.org$'
AND NOT (SPLIT(TRANSLATE(SUBSTR(uri_path, 7), ' ', '_'),
'#')[0] RLIKE '^[Uu]ndefined$')
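A sketch of how the filter above gets wrapped into a per-partner count
(the table name and the zero= value extraction from x_analytics are
assumptions on my side):

SELECT
    regexp_extract(x_analytics, 'zero=([^;]*)', 1) AS carrier,
    COUNT(*) AS pageviews
FROM wmf_raw.webrequest
WHERE
    ...  -- the full filter quoted above goes here
GROUP BY
    regexp_extract(x_analytics, 'zero=([^;]*)', 1);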
Is it being used for anything at all? Puppet has been failing for a
while because it's trying to run the kraken role, which no longer
exists. Can it be deleted?
--
Yuvi Panda T
http://yuvi.in/blog
Streaming for this event will be starting shortly at
https://www.youtube.com/watch?v=-FQ-TtTCdJo
On Thu, Nov 13, 2014 at 10:51 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
> Hey folks,
>
> This month we're holding a special edition of the Research and Data
> showcase
> <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase>.
> We've invited Dr. Yan Chen, Professor at the UMich iSchool, to present her
> work studying community dynamics on Kiva (a micro-lending platform) and
> what her results might imply for Wikimedia's sites. To take advantage of
> her travel schedule, we'll be holding the event on *Friday, November 14
> at 11:30 PST (UTC-8)*, rather than the usual 3rd Wednesday. The event
> will be live streamed and recorded as usual. You can join the conversation
> via IRC on freenode.net in the #wikimedia-research channel.
>
> We look forward to seeing you there,
>
> -Aaron
>
> *Does Team Competition Increase Pro-Social Lending? Evidence from
> Online Microfinance.* By Yan Chen <http://yanchen.people.si.umich.edu/>
>
> In the first half of the talk, I will present our empirical analysis
> of the effects of team competition on pro-social lending activity on
> Kiva.org, the first microlending website to match lenders with
> entrepreneurs in developing countries. Using naturally occurring field
> data, we find that lenders who join teams contribute 1.2 more loans
> per month than those who do not. Furthermore, teams differ in activity
> levels. To investigate this heterogeneity, we run a field experiment
> by posting forum messages. Compared to the control, we find that
> lenders from inactive teams make significantly more loans when exposed
> to a goal-setting message and that team coordination increases the
> magnitude of this effect.
>
> In the second part of the talk, I will discuss a randomized field
> experiment we ran in May 2014, in which we recommended teams to
> lenders on Kiva. We find that lenders are more likely to join teams in
> their local area. However, after joining teams, those who join popular
> teams (on the leaderboard) are more active in lending.
>
Hi,
a test to move the m2 database cluster behind a proxy towards high
availability (ah, the irony!) failed for EventLogging, and left
EventLogging unable to write events to the database between 00:59 and
01:21.
The data for that period is not lost, but is available in backup
files, waiting to be injected back into the database.
Just wanted to let you know what happened, in case you notice drops in
graphs / dashboards during that period.
Sorry for the inconvenience,
Christian
P.S.: Documentation is at the (currently still rather empty) Incident
Report:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLo…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hey folks,
This month we're holding a special edition of the Research and Data showcase
<https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase>.
We've invited Dr. Yan Chen, Professor at the UMich iSchool, to present her
work studying community dynamics on Kiva (a micro-lending platform) and
what her results might imply for Wikimedia's sites. To take advantage of
her travel schedule, we'll be holding the event on *Friday, November 14
at 11:30 PST (UTC-8)*, rather than the usual 3rd Wednesday. The event
will be live streamed and recorded as usual. You can join the conversation
via IRC on freenode.net in the #wikimedia-research channel.
We look forward to seeing you there,
-Aaron
*Does Team Competition Increase Pro-Social Lending? Evidence from
Online Microfinance.* By Yan Chen <http://yanchen.people.si.umich.edu/>

In the first half of the talk, I will present our empirical analysis
of the effects of team competition on pro-social lending activity on
Kiva.org, the first microlending website to match lenders with
entrepreneurs in developing countries. Using naturally occurring field
data, we find that lenders who join teams contribute 1.2 more loans
per month than those who do not. Furthermore, teams differ in activity
levels. To investigate this heterogeneity, we run a field experiment
by posting forum messages. Compared to the control, we find that
lenders from inactive teams make significantly more loans when exposed
to a goal-setting message and that team coordination increases the
magnitude of this effect.

In the second part of the talk, I will discuss a randomized field
experiment we ran in May 2014, in which we recommended teams to
lenders on Kiva. We find that lenders are more likely to join teams in
their local area. However, after joining teams, those who join popular
teams (on the leaderboard) are more active in lending.
Hi,
in the week from 2014-10-27 to 2014-11-02, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Hive UDF to parse user agents with ua_parser
* More kafkatee issues
* Database replication getting stuck on 'Duplicate entry'
* Ganglia's Views broke
* Fixing sync of “aggregate-datasets” rsync
* Turning down logstash logging
* 'research' database user
(details below)
Have fun,
Christian
* Hive UDF to parse user agents with ua_parser
A Hive UDF to parse User-Agent strings with ua_parser was merged and
deployed to the Analytics cluster. People with Hive access can now use
this UDF to automatically extract browser, OS, and device
information.
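A sketch of how it can be used (the jar path and the class name are
assumptions; check the refinery documentation for the exact ones):

  ADD JAR /path/to/refinery-hive.jar;
  CREATE TEMPORARY FUNCTION ua_parse
    AS 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
  SELECT ua_parse(user_agent)  -- browser, OS, and device information
  FROM wmf_raw.webrequest
  LIMIT 10;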
* More kafkatee issues
After the previous week's deployment of the new kafkatee build, we
took a closer look at the generated files. While no partitions have
been dropped so far, it turned out that kafkatee is losing lines when
other processes have heavy disk activity.
Even then, the kafkatee output files are still better than what
udp2log can produce, but we're investigating whether the kafkatee
output is good enough for users that need to stream data.
(For non-streaming needs, it currently looks like Hive would be the
more reliable choice.)
* Database replication getting stuck on 'Duplicate entry'
This week we had two more replication lag issues. Of October's five
lag issues, the last three were cases of replication stopping on a
'Duplicate entry' error. Since this seems to be an emerging pattern,
it has been called out with Ops, and while they are aware of it, there
is currently no fix.
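For reference, the usual manual way to get a slave past such an error
looks roughly like this (standard MySQL commands; whether Ops uses
exactly this procedure is an assumption):

  SHOW SLAVE STATUS\G  -- Last_SQL_Error shows the 'Duplicate entry'
  STOP SLAVE;
  SET GLOBAL sql_slave_skip_counter = 1;  -- skip the offending statement
  START SLAVE;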
* Ganglia's Views broke
Ganglia allows custom predefined dashboards (see Ganglia's “Views”
tab), which we use to watch Kafka's and varnishkafka's key metrics.
Some puppet refactorings seem to have broken the existing Ganglia
dashboards. As we appear to be one of the few teams using Ganglia
dashboards regularly, we fixed puppet's Ganglia View setup.
* Fixing sync of “aggregate-datasets” rsync
Some weeks back, work was started to have stat1002's
“aggregate-datasets” directory automatically publish its content to
the website at
http://datasets.wikimedia.org
. Now the final tweaks have been put into place, and automatic
publishing works as expected.
* Turning down logstash logging
It turned out that the combination of the Analytics cluster and other
new log producers accounts for more traffic than the current logstash
setup can handle nicely. So the log level from the Analytics cluster
got turned down until logstash itself gets scaled up.
* 'research' database user
Many researchers and other WMFers are using the 'research' credentials
to access the analytics databases, and the time came to switch those
credentials to a new password. Since the password was not properly
puppetized, discussions were started on how disruptive a change would
be and how best to make it. Puppetization work around it also
started.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
"How I store 200k messages/sec... Using Kafka & librdkafka" :D
Infrastructure for Data Streams
Tadas Vilkeliskis – Nov 10
Three identical queries from the 'research_prod' user have just passed
one month of execution time on s1-analytics-slave:
select count(*)
from staging.ourvision r
where exists (
select *
from staging.ourvision r1
inner join
staging.ourvision r2
on r2.sha1 = r1.sha1
where r1.page_id = r.page_id
and r2.page_id = r.page_id
and r1.timestamp between r.timestamp and DATE_ADD(r.timestamp,
INTERVAL 1 HOUR)
and r2.timestamp between r.timestamp and DATE_SUB(r.timestamp ,
INTERVAL 1 HOUR)
and r1.sha1!= r.sha1
);
I haven't checked to see if the queries are just that amazingly slow,
or if they're part of a larger ongoing transaction. In any case, three
month-long transactions are pushing the resource limits of the slave
and will soon result in either mass replication lag or some other
interesting lockup that may in turn take days to roll back :-)
Can we kill these? Can we optimize and/or redesign the jobs? Happy to
help...
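For the killing part, the standard MySQL commands would be (a sketch;
<id> is a placeholder):

  SHOW FULL PROCESSLIST;  -- find the Id of each runaway query
  KILL <id>;

And as a starting point for a redesign, one possible rewrite as plain
joins, assuming the intent is to count pages where a base revision has
a later revision and an earlier revision sharing a sha1 within an hour
on either side (the intent is an assumption, and note this counts
distinct pages rather than rows, so it is not an exact drop-in):

  SELECT COUNT(DISTINCT r.page_id)
  FROM staging.ourvision r
  JOIN staging.ourvision r1
    ON  r1.page_id = r.page_id
    AND r1.sha1 != r.sha1
    AND r1.timestamp BETWEEN r.timestamp
                         AND DATE_ADD(r.timestamp, INTERVAL 1 HOUR)
  JOIN staging.ourvision r2
    ON  r2.page_id = r.page_id
    AND r2.sha1 = r1.sha1
    AND r2.timestamp BETWEEN DATE_SUB(r.timestamp, INTERVAL 1 HOUR)
                         AND r.timestamp;
  -- An index covering (page_id, timestamp) would likely be needed
  -- for this to finish in reasonable time (assumption).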