Hi all,
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when launching the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
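To give a sense of what the wrapper saves you from typing, here is a rough sketch; the JDBC URL below is a placeholder, not the real cluster connection string (that lives in the puppet module linked above):

```shell
# Without the wrapper, every Beeline session needs the full JDBC URL
# (hostname and port here are illustrative only):
beeline -u "jdbc:hive2://hive-server.example.org:10000/default" -e "show databases;"

# With the wrapper on stat1002/1004, it is simply:
beeline
```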
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering what stat1004 is - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes both legacy and current pageview data,
daily unique devices, and new editors data. Pageview and devices data are
updated daily, but editor data is still updated ad hoc.
The team is currently working on revamping the way we compute edit data,
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
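The first two uses above amount to grouping and sorting the (referer, article, count) rows. A quick sketch over a tiny made-up sample in the dataset's shape (the column order, names, and values here are illustrative, not taken from the actual dump):

```shell
# Tiny hypothetical sample: referer <TAB> article <TAB> count
printf 'other-google\tLondon\t1000\nParis\tLondon\t300\nLondon\tParis\t250\n' \
    > /tmp/clickstream_sample.tsv

# Most common referers for the article "London", highest count first:
awk -F'\t' '$2 == "London" { c[$1] += $3 } END { for (r in c) print c[r] "\t" r }' \
    /tmp/clickstream_sample.tsv | sort -rn
```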
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
FYI
The deployment-kafka-jumbo-1 VM in Beta Cluster will be offline for a
while next week. Please be aware and take appropriate action.
Greg
----- Forwarded message from Andrew Bogott <abogott(a)wikimedia.org> -----
> Date: Fri, 29 Sep 2017 15:57:54 -0500
> From: Andrew Bogott <abogott(a)wikimedia.org>
> To: Cloud-announce(a)lists.wikimedia.org
> Subject: [Cloud] [Cloud-announce] Downtime for select VMs next week, 2017-10-04
> Reply-To: cloud(a)lists.wikimedia.org
>
> In order to rebuild a server of questionable stability, I'm going to move
> the following instances on Wednesday:
>
> +--------------------------+---------------------+--------+
> | Name                     | Tenant ID           | Status |
> +--------------------------+---------------------+--------+
> | cindy                    | pluggableauth       | ACTIVE |
> | deployment-kafka-jumbo-1 | deployment-prep     | ACTIVE |
> | oidc-google              | pluggableauth       | ACTIVE |
> | proton-staging           | reading-web-staging | ACTIVE |
> | search-jessie            | search              | ACTIVE |
> | smtp-test1               | project-smtp        | ACTIVE |
> | suggestbot-prod          | suggestbot          | ACTIVE |
> | twlight-prod             | twl                 | ACTIVE |
> | twlight-staging          | twl                 | ACTIVE |
> | wikibrain-embeddings-02  | wikibrain           | ACTIVE |
> | wikikids                 | wmam                | ACTIVE |
> | zim-proto                | mobile              | ACTIVE |
> +--------------------------+---------------------+--------+
>
> Migration will cause the affected instances to be offline for some time
> (potentially more than an hour depending on the size of the instance) and
> rebooted. If you need me to work on your server at a particular time of day,
> or need a stay of execution, please let me know. Otherwise I'll start going
> down the list at the beginning of my workday on Wednesday, around 14:00 UTC.
>
> Sorry for the inconvenience!
>
> -Andrew
>
>
> _______________________________________________
> Cloud-announce mailing list
> Cloud-announce(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/cloud-announce
> _______________________________________________
> Cloud mailing list
> Cloud(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/cloud
----- End forwarded message -----
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager A18D 1138 8E47 FAC8 1C7D |
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
20, 2017, at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=VR5JwqyVGSk
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2017>.
This month's presentation:
A Glimpse into Babel: An Analysis of Multilinguality in Wikidata
By *Lucie-Aimée Kaffee*
Multilinguality is an important topic for knowledge bases,
especially Wikidata, which was built to serve the multilingual requirements
of an international community. Its labels are the way for humans to
interact with the data. In this talk, we explore the state of languages in
Wikidata as of now, especially in regard to its ontology, and the
relationship to Wikipedia. Furthermore, we set the multilinguality of
Wikidata in the context of the real world by comparing it to the
distribution of native speakers. We find an uneven distribution of
languages, which is less pronounced in the ontology, and promising
results for future improvements. We close with an outlook on how users
interact with languages on Wikidata.
Science is Shaped by Wikipedia: Evidence from a Randomized Control Trial
By *Neil C. Thompson and Douglas Hanley*
As the largest encyclopedia in the world, it
is not surprising that Wikipedia reflects the state of scientific
knowledge. However, Wikipedia is also one of the most accessed websites in
the world, including by scientists, which suggests that it also has the
potential to shape science. This paper shows that it does. Incorporating
ideas into a Wikipedia article leads to those ideas being used more in the
scientific literature. This paper documents this in two ways:
correlationally across thousands of articles in Wikipedia and causally
through a randomized experiment where we added new scientific content to
Wikipedia. We find that fully a third of the correlational relationship is
causal, implying that Wikipedia has a strong shaping effect on science. Our
findings speak not only to the influence of Wikipedia, but more broadly to
the influence of repositories of scientific knowledge. The results suggest
that increased provision of information in accessible repositories is a
very cost-effective way to advance science. We also find that such gains
are equity-improving, disproportionately benefitting those without
traditional access to scientific information.
Many kind regards,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org
Hello everyone,
I'm having trouble with the entries in wmf.wdqs_extract on 23.08.2017,
specifically the hours 9-17. The call
hive -e "insert overwrite local directory 'temp'
    row format delimited fields terminated by '\t'
    select uri_query, uri_path, user_agent, ts, agent_type, hour, http_status
    from wmf.wdqs_extract
    where uri_query <> \"\" and year='2017' and month='8' and day='23' and hour='8'"
works fine, as do all hours in 0-8 and 18-23, but if I change hour to
anything between 9 and 17 it fails with the following message:
Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:449)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:421)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:147)
    ... 22 more
Caused by: java.lang.RuntimeException: Hive internal error: conversion of string to array<string> not supported yet.
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.<init>(ObjectInspectorConverters.java:313)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:158)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:374)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:155)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:374)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:155)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initObjectInspector(MapOperator.java:199)
    at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:355)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:116)
    ... 22 more
Can anybody help me out with this?
Greetings,
Adrian
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
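If it helps, migrating your own cron jobs boils down to something like the following sketch (file name and copy method are up to you):

```shell
# On stat1002 or stat1003: save your current cron jobs to a file.
crontab -l > ~/my-crontab.txt

# Copy the file over (e.g. with scp), then on stat1005 or stat1006:
crontab ~/my-crontab.txt

# Finally, remove the jobs on the old host so they don't run twice:
crontab -r
```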
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv; /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
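One way to follow the /tmp advice above, sketched with sort (any tool that honors TMPDIR behaves similarly; the paths are illustrative):

```shell
# Create a scratch directory in your (now much larger) home directory:
mkdir -p "$HOME/tmp"
export TMPDIR="$HOME/tmp"

# Many tools honor TMPDIR; GNU sort can also be pointed at it explicitly:
printf '3\n1\n2\n' > "$TMPDIR/unsorted.txt"
sort --temporary-directory="$TMPDIR" "$TMPDIR/unsorted.txt"
```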
Thanks all! I’ll send another email in about a month's time to remind you
of the impending deadline of Sept 1.
-Andrew Otto