For all Hive users on stat1002/1004: you may have seen a deprecation
warning when launching the Hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper set up to make this easier. The old
Hive CLI will continue to exist, but we encourage moving over to Beeline.
You can use it by logging into the stat1002/1004 boxes as usual and
launching `beeline`.
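For illustration, here is a minimal Python sketch of the kind of thing such a wrapper might do: build the JDBC connection string once so users don't have to type it. The host and port below are placeholders, not the real cluster values.

```python
# Hypothetical sketch of a Beeline wrapper: prebuild the JDBC URL so the
# user can just run `beeline`. Host/port here are illustrative placeholders.

def beeline_command(host="analytics-hive.example.org", port=10000,
                    extra_args=()):
    """Return the argv list for launching Beeline with a prebuilt JDBC URL."""
    jdbc_url = f"jdbc:hive2://{host}:{port}/default"
    return ["beeline", "-u", jdbc_url, *extra_args]

# Example: pass through any extra flags the user supplied.
cmd = beeline_command(extra_args=["-e", "SHOW DATABASES;"])
```

A real wrapper would then exec this command (e.g. via os.execvp), inheriting the user's Kerberos credentials.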
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering "stat1004, what?", there should be an announcement
about it coming soon!)
We are in the process of upgrading the analytics infrastructure to Debian
Stretch. Along the way, we will be moving
(superset|turnilo|yarn|hue).wikimedia.org to new VMs. Superset and Turnilo
will be upgraded as well.
I plan to move hue and yarn today or tomorrow, and superset and turnilo
either this week or next. The move should be transparent to you all (you
might have to log in again). If you do encounter any issues, please report
them here: https://phabricator.wikimedia.org/T202011
- Andrew Otto
Systems Engineer, WMF
If you are not an Archiva user (https://archiva.wikimedia.org/), you can
stop reading this email. Tomorrow morning EU time I am going to move
archiva.wikimedia.org to a new host, as explained in detail in T192639.
- Archiva gets upgraded to the latest upstream version, 2.2.3 (four years
of upstream development ahead of 2.0.0, the current version).
- The archiva-deploy user will no longer be active; people belonging to
the (new) archiva-deployers LDAP group will be able to log in with their
LDAP credentials and get the same permissions in Archiva (for example, to
upload jars).
- The admin user (used by the SRE team) will no longer be needed, since
anybody belonging to the 'ops' LDAP group will be able to log in and have
the same permissions.
I have already added some people to archiva-deployers (everyone I knew had
worked on it in the past), but if you want to make sure you are on it,
please ping me on IRC or comment on T192639.
As part of T198623, the Analytics and Traffic teams worked on a better set
of firewall rules for IPv4/IPv6 traffic generated within the Analytics
VLAN and going towards Production. For example, we are now enforcing the
use of https://wikitech.wikimedia.org/wiki/HTTP_proxy for all HTTP/HTTPS
connections originating from the Analytics VLAN, so if you have any
important cron job that runs periodically on any Analytics host (most
likely the stat boxes), please check that it complies with this policy as
soon as possible. Please note that the policy itself is not new, but it
will start to be enforced within the next couple of days. We have run
several tcpdump sessions to check the current traffic (we are reasonably
sure that nothing will break), but better safe than sorry :)
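As an illustration, a periodic script can route its outbound requests through the proxy by reading the standard proxy environment variables. The proxy URL below is a placeholder; the real value is documented on the wikitech HTTP_proxy page linked above.

```python
import os
import urllib.request

# Placeholder proxy URL; check the wikitech HTTP_proxy page for the real one.
DEFAULT_PROXY = "http://webproxy.example.org:8080"

def proxied_opener(proxy_url=None):
    """Build a urllib opener that sends HTTP/HTTPS traffic via the proxy.

    Falls back to the https_proxy environment variable, then the placeholder.
    """
    proxy_url = proxy_url or os.environ.get("https_proxy", DEFAULT_PROXY)
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    return urllib.request.build_opener(handler)

# In a cron job you would then do, for example:
# opener = proxied_opener()
# opener.open("https://example.org")  # request goes via the proxy
```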
For any comments, suggestions, or questions, please follow up in the task
or with me in the #wikimedia-analytics IRC channel.
Thanks in advance!
Luca (on behalf of the Analytics team)
More changes are coming for dumps, this time for Hungarian Wikipedia
(approximately 436,000 articles) and Arabic Wikipedia.
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Aug 20, 2018 at 10:27 AM
Subject: [Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>,
Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Starting September 1, huwiki and arwiki, which both take several days to
complete the revision history content dumps, will be moved to the 'big
wikis' list, meaning that they will run jobs in parallel, as frwiki,
ptwiki and others do now, for a speedup.
Please update your scripts accordingly. Thanks!
Task for this: https://phabricator.wikimedia.org/T202268
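In practice, "run jobs in parallel" means the revision history output is split into numbered part files rather than one file. A hedged Python sketch of how a download script might match all parts; the filename pattern here is illustrative, so check the actual dump listing for the exact naming:

```python
import fnmatch

# Illustrative filename pattern for a multi-part history dump; the real
# names on the dumps site may differ in detail.
def history_part_pattern(wiki, date):
    """Glob pattern matching every parallel part of the history dump."""
    return f"{wiki}-{date}-pages-meta-history*.xml*.bz2"

# Hypothetical directory listing after the switch to parallel jobs:
files = [
    "huwiki-20180901-pages-meta-history1.xml.bz2",
    "huwiki-20180901-pages-meta-history2.xml.bz2",
]
parts = fnmatch.filter(files, history_part_pattern("huwiki", "20180901"))
```

A script that previously fetched a single `pages-meta-history.xml.bz2` file would iterate over all matched parts instead.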
I’d like to announce that we’ve done a bit of work to make Jupyter
Notebooks in SWAP <https://wikitech.wikimedia.org/wiki/SWAP> support Spark
kernels. This means that you can now run Spark shells in either local mode
(on the notebook server) or YARN mode (distributed on the Hadoop cluster)
inside of a Jupyter notebook. You can then take advantage of fancy Jupyter
plotting libraries to make graphs directly from data in Spark.
See https://wikitech.wikimedia.org/wiki/SWAP#Spark for documentation.
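The difference between the two modes comes down to the Spark master setting. The SWAP kernels configure this for you, so the following Python sketch is purely illustrative of what local vs YARN mode means:

```python
# Illustrative only: the SWAP Spark kernels handle this configuration
# for you. This just shows what distinguishes local mode from YARN mode.

def spark_settings(mode="local"):
    """Return SparkSession config for local vs YARN (cluster) mode."""
    if mode == "yarn":
        # Executors are distributed across the Hadoop cluster.
        return {"spark.master": "yarn",
                "spark.submit.deployMode": "client"}
    # All work runs inside the notebook server process.
    return {"spark.master": "local[*]"}

# In a notebook cell you would then do something like:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("swap-demo")
# for k, v in spark_settings("yarn").items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```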
This is a new feature, and I’m sure there will be kinks to work out. If
you encounter issues or have questions, please respond on this phabricator
ticket <https://phabricator.wikimedia.org/T190443>, or create a new one and
add the Analytics tag.
-Andrew Otto & Analytics Engineering
The next Wikimedia Research Showcase will be live-streamed Wednesday,
August 13, 2018, at 11:30 AM PDT (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=OGPMS4YGDMk
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here.
Hope to see you there!
This month's presentation is:
*Quicksilver: Training an ML system to generate draft Wikipedia articles
and Wikidata entries simultaneously*
John Bohannon and Vedant Dharnidharka, Primer
The automatic generation and updating of Wikipedia articles is usually
approached as a multi-document summarization task: Given a set of source
documents containing information about an entity, summarize the entity.
Purely sequence-to-sequence neural models can pull that off, but getting
enough data to train them is a challenge. Wikipedia articles and their
reference documents can be used for training, as was recently done
<https://arxiv.org/abs/1801.10198> by a team at Google AI. But how do you
find new source documents for new entities? And besides having humans read
all of the source documents, how do you fact-check the output? What is
needed is a self-updating knowledge base that learns jointly with a
summarization model, keeping track of data provenance. Lucky for us, the
world’s most comprehensive public encyclopedia is tightly coupled with
Wikidata, the world’s most comprehensive public knowledge base. We have
built a system called Quicksilver that uses them both.
On Monday August 6 we are making EventStreams multi-DC, and this should be
transparent to users.
Due to a recent outage of our main eqiad Kafka cluster, we want to make
the EventStreams
service support multiple datacenters for better high availability. To do
so, we need to hide the Kafka cluster message offsets from the
SSE/EventSource clients. On Monday August 6th, we will deploy a change to
EventStreams that will make it use message timestamps instead of message
offsets in the SSE/EventSource id field that is returned for every received
message. This will allow EventStreams to be backed by any Kafka cluster,
with auto-resuming during reconnect based on timestamp instead of Kafka
cluster based logical offsets.
This deployment should be transparent to clients. SSE/EventSource clients
will reconnect automatically and begin to use timestamps instead of offsets
in the Last-Event-ID.
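The exact wire format of the id field is internal to the service, but the idea of resuming by timestamp instead of offset can be sketched in Python (field names and values below are illustrative assumptions, not the service's actual schema):

```python
import json

# Illustrative only: the real Last-Event-ID format used by EventStreams
# is defined by the service, not by this sketch. The idea is that each
# topic/partition carries a resume *timestamp* rather than a Kafka offset,
# so any backing Kafka cluster can serve the reconnect.

def make_event_id(assignments):
    """Encode per-topic/partition resume positions as a JSON id field."""
    return json.dumps(assignments)

def resume_positions(last_event_id):
    """Decode a Last-Event-ID value back into resume positions."""
    return json.loads(last_event_id)

event_id = make_event_id([{"topic": "eqiad.mediawiki.recentchange",
                           "partition": 0,
                           "timestamp": 1533513600000}])
positions = resume_positions(event_id)
```

On reconnect, a client sends its last received id back in the Last-Event-ID header, and the service resumes each partition from the recorded timestamp.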
You can read more about this work here:
- Andrew Otto, Systems Engineer, WMF