For all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it's being replaced
with Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
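
For context, here is a minimal sketch of what "supplying a database connection string every time" amounts to when Beeline is launched without a wrapper. The helper name `beeline_command`, the host name, and the default port and database are assumptions for illustration; only the `-u` (JDBC URL) and `-n` (user name) options and the `jdbc:hive2://` URL form are standard Beeline usage. This is not the wrapper deployed on the stat boxes.

```python
import getpass
import shlex


def beeline_command(host, port=10000, database="default", user=None):
    """Build the argv for a direct Beeline invocation (no wrapper)."""
    user = user or getpass.getuser()
    # HiveServer2 JDBC URLs have the form jdbc:hive2://<host>:<port>/<database>.
    jdbc_url = f"jdbc:hive2://{host}:{port}/{database}"
    # -u passes the JDBC URL, -n the user name to connect as.
    return ["beeline", "-u", jdbc_url, "-n", user]


if __name__ == "__main__":
    # "hive-server.example.org" is a placeholder, not a real host.
    argv = beeline_command("hive-server.example.org", user="alice")
    print(shlex.join(argv))
```

A wrapper presumably fills in these values so that typing `beeline` alone is enough.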
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Forwarding this question to the public Analytics list, where it's good to
have these kinds of discussions. If you're interested in this data and how
it changes over time, do subscribe and watch for updates, notices of
Ok, so on to your question. You'd like the *total # of articles for each
wiki*. I think the simplest way right now is to query the AQS (Analytics
Query Service) API, documented here:
To get the # of articles for a wiki, let's say en.wikipedia.org, you can
get the timeseries of new articles per month since the beginning of time:
And to get a list of all wikis, to plug into that URL instead of
"en.wikipedia.org", the most up-to-date information is here:
https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via the
Sometimes new sites won't have data in the AQS API for a month or two until
we add them and start crunching their stats.
The way I figured this out is to look at how our UI uses the API:
So if you were interested in something else, you can browse around there
and take a look at the XHR requests in the browser console. Have fun!
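
A rough sketch of the approach described above: build the per-wiki URL, fetch the monthly timeseries, and sum it. The URLs from the original message are not preserved here, so the endpoint path (`metrics/edited-pages/new/...`), the `all-editor-types`/`content` filters, and the `new_pages` field name are assumptions that should be checked against the AQS documentation. The sample payload is made up purely to show the shape. Note that summing created pages is only an approximation of an article count.

```python
import json
import urllib.request

# Assumed endpoint shape; verify against the AQS docs before relying on it.
AQS_NEW_PAGES = (
    "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
    "{project}/all-editor-types/content/monthly/{start}/{end}"
)


def new_pages_url(project, start="20010101", end="20180401"):
    """URL for the monthly new-pages timeseries of one wiki."""
    return AQS_NEW_PAGES.format(project=project, start=start, end=end)


def total_new_pages(payload):
    """Sum the monthly counts of a response into a single total."""
    return sum(
        point["new_pages"]
        for item in payload["items"]
        for point in item["results"]
    )


def fetch_total(project):
    """Live HTTP request; only runs when called explicitly."""
    request = urllib.request.Request(
        new_pages_url(project),
        headers={"User-Agent": "article-count-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return total_new_pages(json.load(response))


if __name__ == "__main__":
    # Made-up numbers, only to illustrate the assumed response shape.
    sample = {"items": [{"results": [{"new_pages": 3}, {"new_pages": 4}]}]}
    print(new_pages_url("en.wikipedia.org"))
    print(total_new_pages(sample))
```

Looping `fetch_total` over the project names from the site matrix would give one number per wiki. New wikis may return no data for a month or two, as noted above.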
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com
> Hi Dan,
> How are you? This is Victor. It's been a while since we met at the 2018
> Wikimedia Dev Summit. I hope you are doing great.
> As I mentioned to you, my team works on extracting knowledge from
> Wikipedia. We are currently working on a project that expands language
> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
> project. She plans to *monitor the list of all currently available
> Wikipedia sites and the number of articles for each language*, so that
> we can compare with our extraction system's output to sanity-check whether
> there is a massive breakage of the extraction logic, or whether we need to
> add/remove languages when a new Wikipedia site is introduced to or removed
> from the Wikipedia family.
> I think your team at Wikimedia Analytics probably knows best where
> we can find this data. Here are 4 places we already know of, but none seems
> to have the data.
> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
> information we need, but the list is manually edited, not automatic
> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
> the information seems pretty out of date (last updated almost a month ago)
> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I can't
> find the full list or the number of articles
> - API https://wikimedia.org/api/rest_v1/ suggested by elukey on the
> #wikimedia-analytics channel, which doesn't seem to have the # of articles
> Do you know what is a good place to find this information? Thank you!
> Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> Software Engineer, Data Engine
> Google Inc.
> zzn(a)google.com <ecarmeli(a)google.com> - 650.336.5691
> 1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
> ---------- Forwarded message ----------
> From: Yuan Gao <gaoyuan(a)google.com>
> Date: Wed, Mar 28, 2018 at 4:15 PM
> Subject: Monitor the number of Wikipedia sites and the number of articles
> in each site
> To: Zainan Victor Zhou <zzn(a)google.com>
> Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
> Hi Victor,
> as we discussed in the meeting, I'd like to monitor:
> 1) the number of Wikipedia sites
> 2) the number of articles in each site
> Can you help us contact WMF to get real-time or at least daily
> updates of these numbers? What we can find now is
> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
> Wikipedia sites is manually updated, and possibly out-of-date.
> The monitor can help us catch such bugs.
> Yuan Gao
Page Previews is now fully deployed to all but 2 of the Wikipedias. In
deploying it, we've created a new way to interact with pages without
navigating to them. This impacts the overall and per-page pageviews metrics
that are used in myriad reports, e.g. to editors about the readership of
their articles and in monthly reports to the board. Consequently, we need
to be able to report a user reading the preview of a page just as we report
a user navigating to it.
Readers Web are planning to instrument Page Previews such that when a
preview is available and open for longer than X ms, a "page interaction" is
recorded. We're aware of a couple of mechanisms for recording something
like this from the client:
1. All files viewed with the media viewer are recorded by the client
requesting the /beacon/media?duration=X&uri=Y URL at some point – as
Nuria points out in that thread, requests to /beacon/... are already
filtered and a canned response is sent immediately by Varnish.
2. Requesting a URL with the X-Analytics header set to "preview". In
this context, we'd make a HEAD request to the URL of the page with the
IMO #1 is preferable from the operations and performance perspectives as
the response is always served from the edge and includes very few headers,
whereas the request in #2 may be served by the application servers if the
user is logged in (or in the mobile site's beta cohort). However, the
requests in #2 are already
We're currently considering recording page interactions when previews are
open for longer than 1000 ms. We estimate that this would increase overall
web requests by 0.3%.
Are there other ways of recording this information? We're fairly confident
that #1 is the best choice here, but it's referred to as the
"virtual file view hack". Is this really the case? Moreover, should we
request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we
consolidate the URLs as both represent the same thing essentially?
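
To make the distinct-URL option concrete, here is a small sketch of the decision and the request path. The production instrumentation would be client-side JavaScript; Python is used here only for illustration. The helper names `should_record` and `beacon_path` are hypothetical, and the 1000 ms threshold is the value under consideration above, not a settled one.

```python
from urllib.parse import urlencode

# Threshold under consideration in this thread, not a final value.
PREVIEW_THRESHOLD_MS = 1000


def should_record(open_ms, threshold_ms=PREVIEW_THRESHOLD_MS):
    """A preview counts as a page interaction only if it stayed open
    longer than the threshold."""
    return open_ms > threshold_ms


def beacon_path(page_uri, open_ms, kind="preview"):
    """Build /beacon/<kind>?duration=X&uri=Y for the given page."""
    # urlencode percent-escapes the page URI so it survives as one parameter.
    query = urlencode({"duration": open_ms, "uri": page_uri})
    return f"/beacon/{kind}?{query}"


if __name__ == "__main__":
    uri = "https://en.wikipedia.org/wiki/Cat"
    for open_ms in (400, 1500):
        if should_record(open_ms):
            print(beacon_path(uri, open_ms))
```

Passing `kind="media"` yields the existing media viewer path shape, which is one way to picture the "consolidate the URLs" alternative.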
IRC (Freenode): phuedx
*tl;dr stop using notebook1001 by Monday April 2nd, use notebook1003
*(If you don’t have production access, you can ignore this email.)*
As part of https://phabricator.wikimedia.org/T183145, we’ve ordered new
hardware to replace the aging notebook1001. The new servers are ready to
go, so we need to set a deprecation date for notebook1001. That
date is Monday April 2nd. After that, your work on notebook1001 will
no longer be accessible. Instead you should use notebook1003 (or
But there is good news too! Last week I rsynced everyone’s home
directories from notebook1001 over to notebook1003. I also upgraded the
default virtualenv your notebooks run from. Your notebook files should all
be accessible on notebook1003. However, the version of Python 3 changed
from 3.4 to 3.5 during this upgrade. Dependencies that your notebook uses
and that you installed on notebook1001 may not be available at first. You
might need to pip install those dependencies again into the new notebook
Python 3.5 virtualenv. (I can’t really give you explicit instructions to
do that, as I don’t know what you use for your notebooks.)
I’ll do a final rsync of any newer files in home directories from notebook1001
on Monday April 2nd. If you’ve been working on notebook1001 since
March 15th, this should get everything up to date on notebook1003 before
notebook1001 goes away. BUT! *Do not work on both notebook1001 and
notebook1003*! My final rsync will keep the most recently modified version
of files from either server.
OOooOo and there’s even more good news! I’ve made the notebooks able to
access system site packages, and installed a ton of useful packages
by default: pandas, scipy, requests, etc. If there’s something else you
think you might need, let us know. Or just pip install it into your
Additionally, pyhive has been installed too, so you should be able to more
easily access Hive directly from a python notebook.
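
As a hedged illustration of what querying Hive from a notebook with pyhive can look like: the host name, user name, and the helper names below are placeholders, and the port (10000) is only the common HiveServer2 default. The actual connection details for this cluster are in the SWAP docs, not here.

```python
def hive_connection_kwargs(host, username, database="default", port=10000):
    """Collect keyword arguments for pyhive's hive.Connection.
    All values passed in here are placeholders chosen by the caller."""
    return {
        "host": host,
        "port": port,
        "username": username,
        "database": database,
    }


def run_query(sql, **conn_kwargs):
    """Run one statement and return all rows as a list of tuples."""
    # Imported lazily so this file also loads where pyhive is not installed.
    from pyhive import hive

    conn = hive.Connection(**conn_kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # "hive-server.example.org" and "alice" are hypothetical values.
    kwargs = hive_connection_kwargs("hive-server.example.org", "alice")
    print(kwargs)
    # rows = run_query("SHOW DATABASES", **kwargs)  # needs a reachable server
```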
I’ve updated docs at https://wikitech.wikimedia.org/wiki/SWAP#Usage, please
take a look.
If you have any questions, please don’t hesitate to ask, either here on or
- Andrew Otto & Analytics Engineering
Hey stat1005|6 users!
The underlying host that currently serves all of your dumps and datasets
over NFS (at /mnt/data) is being replaced soon. All datasets will
continue to be accessible on the stat boxes at the current path, but there
will be a transition time of a few hours. During that time, you may
encounter stale data or the files may simply be inaccessible. Please
schedule your work accordingly.
Dates: The migration is scheduled for April 2nd starting at 14:30 UTC, and
is expected to last a few hours.
Thanks! We'll send more updates closer to the migration date. If you have
any questions, just let us know.
Madhumitha Viswanathan & Ariel Glenn
General notification in case there are others consuming from
eventlogging_NavigationTiming: The performance team recently instituted
oversampling of data based on configurable criteria. This means that in
some cases, the data stream on this topic may not be representative of wiki
If you wish to parse NavigationTiming data in a representative way, you
should check the attribute 'is_oversample' in the event object, and filter
out the message if true. (In the event that a single page load is part of
the regular sample as well, two messages will be emitted with the same
data, but with different values for the is_oversample parameter.)
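
A minimal sketch of that filter for a consumer reading JSON messages, one per line. It assumes the schema fields sit under an `event` key in each message and that a missing `is_oversample` means the message belongs to the regular sample. Both are assumptions about the message layout, and the sample messages are made up.

```python
import json


def is_representative(message):
    """True unless the message is flagged as part of an oversample."""
    # Assumed layout: schema fields live under the "event" key.
    return not message.get("event", {}).get("is_oversample", False)


def representative_events(lines):
    """Yield only the regular-sample messages from raw JSON lines."""
    for line in lines:
        message = json.loads(line)
        if is_representative(message):
            yield message


if __name__ == "__main__":
    # Made-up messages: the same page load emitted twice, once per sample.
    raw = [
        '{"event": {"is_oversample": false, "responseStart": 120}}',
        '{"event": {"is_oversample": true, "responseStart": 120}}',
    ]
    for message in representative_events(raw):
        print(message["event"]["responseStart"])
```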
Please let me know if you have any questions.
The next Research Showcase will be live-streamed this Wednesday, March 21,
2018 at 11:30 AM PDT (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=ACevHs0sMMw
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
Over the past years, the Research team at Wikimedia Foundation and some of
our formal collaborators have been focused on doing research and building
technologies that can help editors across Wikimedia languages find tasks
for contributions. While the early effort was heavily focused on article
recommendation for creation (horizontal expansion), in 2016 we started a
new direction of research with a focus on vertical expansion of Wikipedia
articles. The two talks in the March 2018 Research Showcase will share some
of what we have learned from this research. More specifically, we will talk
about the Wikipedia category network as a great signal for creating
templates/structures for Wikipedia articles as well as ongoing research to
learn which content (sections) is missing from Wikipedia across its many
languages. The two corresponding abstracts with more details are below.
Join us! :)
Using Wikipedia categories for research: opportunities, challenges, and
solutions
By *Tiziano Piccardi, EPFL*
The category network in Wikipedia is
used by editors as a way to label articles and organize them in a
hierarchical structure. This manually created and curated network of 1.6
million nodes in English Wikipedia generated by arranging the categories in
a child-parent relation (i.e., Scientists-People, Cities-Human Settlement)
allows researchers to infer valuable relations between concepts. A clean
structure in this format would be a valuable resource for a variety of
tools and applications, including automatic reasoning tools. Unfortunately,
the Wikipedia category network contains some "noise", since in many cases the
association as a subcategory does not define an is-a relation (Scientists
is-a People vs. Billionaires is-a Wealth). Motivated by developing a model for
recommending sections to be added to existing Wikipedia
articles, we developed a method to clean this network and to keep only the
categories that have a high chance of being associated with their children by
an is-a relation. The strategy is based on the concept of "pure"
categories, and the algorithm uses the types of the attached articles to
determine how homogeneous the category is. The approach does not rely on any
linguistic features and is therefore suitable for all Wikipedia languages.
In this talk, we will discuss the high-level overview of the algorithm and
some of the possible applications for the generated network beyond article
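
The abstract does not spell out how homogeneity is measured. Purely as an illustration of the idea, and not the speakers' algorithm, one simple reading is to score a category by the share of its attached articles that have the most common type, then keep categories above a cutoff. The type labels, the `purity` and `pure_categories` names, and the 0.9 threshold are all assumptions.

```python
from collections import Counter


def purity(article_types):
    """Share of articles that carry the category's most common type.
    Returns 0.0 for a category with no typed articles."""
    if not article_types:
        return 0.0
    counts = Counter(article_types)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(article_types)


def pure_categories(category_to_types, threshold=0.9):
    """Names of categories whose purity reaches the (assumed) threshold."""
    return {
        name
        for name, types in category_to_types.items()
        if purity(types) >= threshold
    }


if __name__ == "__main__":
    # Made-up type labels for two categories.
    example = {
        "Scientists": ["human"] * 10,
        "Wealth": ["human", "concept", "organization", "concept"],
    }
    print(sorted(pure_categories(example)))
```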
Beyond Automatic Translation: Aligning Wikipedia sections across multiple
languages
By *Diego Saez-Trumper*
Sections are the building blocks of
Wikipedia articles. For editors, they can be used as an entry point for
creating and expanding articles. For readers, they enhance readability of
Wikipedia content. In this talk, we present ongoing research to align
article sections across Wikipedia languages. We show how the available
technology for automatic translation is not good enough for translating
section titles. We then show a complementary approach for section
alignment, using Wikidata and cross-lingual word embeddings. We will
present some of the use-cases of a methodology for aligning sections across
languages, including improved section recommendation, especially in medium
to smaller size languages where the language itself may not contain enough
signal about the structure of the articles and signals can be inferred from
other larger Wikipedia languages.
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
As part of a lecture on Information Retrieval that I am giving, we work a lot
with Simple Wikipedia articles. It's a great data set because it's
comprehensive and not domain-specific, so when building search on top of it
humans can easily judge result quality, and it's still small enough to be
handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The
idea is to look at result clicks from an internal search engine,
feed them into a Machine Learning model and adjust search accordingly so that
the top-clicked results actually rank best. We will be using Solr LTR for
I would love to base this on Simple Wikipedia data since it would fit well
into the rest of the lecture. Unfortunately, I could not find that data.
The closest I came is
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this
neither covers Simple Wikipedia nor specifies internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it
myself from raw data? Ideally I would need (query-document) pairs with the
number of occurrences.
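
To make the requested format concrete, here is a small sketch of how raw click records could be reduced to (query, document) pairs with occurrence counts. The input layout (one `(query, clicked_document)` tuple per click), the function name, and the lowercase/strip normalization are assumptions, and the sample clicks are made up.

```python
from collections import Counter


def count_query_document_pairs(click_log):
    """Aggregate (query, clicked_document) records into occurrence counts.
    Queries are stripped and lowercased so trivial variants merge."""
    return Counter(
        (query.strip().lower(), document)
        for query, document in click_log
    )


if __name__ == "__main__":
    # Made-up click records for illustration only.
    clicks = [
        ("Cat", "Cat"),
        ("cat ", "Cat"),
        ("cat", "Cat (disambiguation)"),
    ]
    for (query, document), n in count_query_document_pairs(clicks).most_common():
        print(query, document, n, sep="\t")
```

Counts in this shape can then be turned into relevance labels for learning-to-rank training.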
*Georg M. Sorst I CTO*
Jakob-Haringer-Str. 5a | 5020
T.: +43 662 456708