wmfdata-python <https://github.com/wikimedia/wmfdata-python> (a package
that streamlines access to private analytics data) has been updated to
version 1.1. Here's what's new:
- The new presto module supports querying the Data Lake using Presto.
- The spark module has been refactored to support local and custom sessions.
- A new utils.get_dblist function provides easy access to wiki database
lists, which is particularly useful with mariadb.run.
- The hive.run_cli function now creates its temp files in a standard
location, to avoid creating distracting new entries in the current working
directory.
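As background on why get_dblist pairs well with mariadb.run: a dblist is simply a list of wiki database names, which can then be passed to a query runner. The parser below is a hypothetical sketch of that file format for illustration only, not wmfdata's actual code.

```python
# Hypothetical sketch of the dblist format: one wiki database
# name per line, with optional "#" comments. This is NOT
# wmfdata's implementation, just an illustration of the data
# that utils.get_dblist returns.
def parse_dblist(text):
    wikis = []
    for line in text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            wikis.append(line)
    return wikis

sample = """
# Wikipedias in the sample
enwiki
dewiki  # German Wikipedia
frwiki
"""
print(parse_dblist(sample))  # ['enwiki', 'dewiki', 'frwiki']
```

The resulting list is the kind of value you could hand to mariadb.run to query many wikis at once.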
Many thanks to:
- Andrew Otto and Adam Roses Wight for writing significant new code
- Mikhail Popov, Andrew Otto, and Luca Toscano for careful code review
As always, if you have questions or feedback about wmfdata-python, please
email Product Analytics at product-analytics@wikimedia.org.
senior data scientist, Product Analytics
Wikimedia Foundation <https://wikimediafoundation.org/>
I am back with reboots, please be patient with me :)
I am going to reboot stat1004 / stat1006 / stat1007 (only these three for
the moment) on Wednesday Feb 17 at 9AM CET for Linux Kernel upgrades.
Please let me know if this impacts your work; if it does, we'll find
another maintenance window :)
This scheduled maintenance is also outlined in:
Luca (on behalf of the Data Engineering / Analytics team)
The upgrade day has been scheduled: we are going to migrate Hadoop to the
Apache Bigtop distribution on February 9th, during the EU morning. This
will require 2 to 4 hours of Hadoop downtime, since the upgrade is very
delicate and complex.
I created https://phabricator.wikimedia.org/T273711 to track timings and
updates more precisely. Please use it to ask questions and to tell us if
this impacts your work or important deadlines for your team (if it does,
we'll try to find a different time window).
Since we are upgrading software that was released years ago, some tools or
workflows may not work as expected right after the upgrade. We have tested
a wide variety of use cases in our testing environment, but some corner
cases might have been missed. If you notice something weird right after
the upgrade, please let us know how to reproduce it in the task; we'll
follow up and hopefully fix it promptly.
Thanks a lot for the support!
We just finished <https://phabricator.wikimedia.org/T269160> setting up an
internal instance of EventStreams called eventstreams-internal. This
instance is not public, but does expose all streams* declared in stream
configuration.
I've added documentation about how to access this here:
This instance isn't particularly useful for building any services (in
production you should just consume from Kafka), but it may be very useful
for debugging and troubleshooting events in production. EventStreams has a
GUI that lets you watch events in Kafka as they flow in, so in production
you can see events right after they are emitted, without having to wait a
few hours for them to be ingested into Hive. You
can use this to make sure events you trigger in production make it through
EventGate into Kafka as you expect.
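For scripted checks, one option is to read the stream directly rather than through the GUI. The snippet below is a minimal sketch using only the standard library: it parses the Server-Sent Events format that EventStreams responses use, and the sample payload is invented for illustration (real events carry many more fields).

```python
# Minimal sketch of parsing an EventStreams Server-Sent Events
# body. Each event arrives as a "data: <json>" line; blank
# lines separate events. The sample below is an invented
# payload, not a real production event.
import json

def parse_sse(lines):
    """Yield JSON payloads from SSE 'data:' lines."""
    for line in lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# Example of what a few lines of an EventStreams response
# body might look like:
body = [
    "event: message",
    'data: {"wiki": "enwiki", "type": "edit"}',
    "",
]
events = list(parse_sse(body))
print(events[0]["wiki"])  # enwiki
```

In practice you would point an HTTP client at the eventstreams-internal endpoint (see the documentation linked above) and feed its response lines into a parser like this to confirm your events made it through EventGate into Kafka.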
Big thanks to Marcel and Luca for their work on this! :)
- Andrew Otto
* i.e. those that use Event Platform, not legacy EventLogging events.