Hi all,
If you are a Hive user on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
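If you want to run a one-off HiveQL query non-interactively (e.g. from a
script), something like the sketch below should work, assuming the wrapper
passes standard beeline flags such as -e and --outputformat straight
through (check the docs above if it does not):

    #!/usr/bin/env python3
    # Minimal sketch: run a HiveQL query through the beeline wrapper on
    # the stat boxes. Assumes the wrapper accepts the standard beeline
    # flags -e (execute a query) and --outputformat.
    import subprocess

    result = subprocess.run(
        ["beeline", "--outputformat=tsv2", "-e", "SHOW DATABASES;"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)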
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes both legacy and current pageview data,
daily unique devices, and new editors data. Pageview and devices data is
updated daily, but editor data is still updated ad hoc.
The team is currently working on revamping the way we compute edit data,
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (a short analysis sketch follows the list below):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
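As a rough illustration of the second use case above, here is a small
pandas sketch; the column names (prev_title, curr_title, n) and the local
filename are assumptions on my part, so check the README on figshare for
the exact schema:

    # Sketch: most common links people followed to reach a given article.
    # Column names and the TSV filename below are assumptions; adjust to
    # the actual schema/file from the figshare release.
    import pandas as pd

    df = pd.read_csv("2015_01_clickstream.tsv", sep="\t",
                     usecols=["prev_title", "curr_title", "n"])

    article = "London"
    top_referers = (df[df["curr_title"] == article]
                    .sort_values("n", ascending=False)
                    .head(10))
    print(top_referers[["prev_title", "n"]])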
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi,
while trying to improve the mess of our docs for developers on mediawiki.org,
I've been wondering whether anyone is aware of a visualization tool that
draws a graph showing which wiki pages are linked from which other wiki
pages (up to a certain depth), ignores pages which include {{Outdated}}
or {{Historical}} templates, ignores pages in certain namespaces like
"Talk:" or "User:", and ignores pages which are just translations (like
"PageName/qqx").
Or at least some of all this. :)
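To make it a bit more concrete, here is a rough sketch of the kind of crawl
I have in mind, against the MediaWiki action API (depth-limited, main
namespace only, skipping translation subpages); filtering on
{{Outdated}}/{{Historical}} would need an extra prop=templates query per
page, and API continuation is not handled:

    # Rough sketch (not an existing tool): crawl page links up to a fixed
    # depth via the MediaWiki action API and emit graphviz-style edges.
    # Main namespace only; "/" subpages (translations) are skipped.
    import requests

    API = "https://www.mediawiki.org/w/api.php"
    MAX_DEPTH = 2

    def get_links(title):
        params = {
            "action": "query", "format": "json", "prop": "links",
            "titles": title, "plnamespace": 0, "pllimit": "max",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        return [l["title"] for p in pages.values() for l in p.get("links", [])]

    seen = set()
    frontier = ["How to become a MediaWiki hacker"]  # example start page
    for depth in range(MAX_DEPTH):
        next_frontier = []
        for title in frontier:
            if title in seen or "/" in title:  # skip seen pages and subpages
                continue
            seen.add(title)
            for target in get_links(title):
                print(f'"{title}" -> "{target}"')  # graphviz edge
                next_frontier.append(target)
        frontier = next_frontier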
Thanks in advance for any ideas!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi,
I've been using the very helpful pagecount dumps described at:
https://dumps.wikimedia.org/other/pagecounts-ez/
And it describes:
Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
In the wiki code field, the subproject is the language code (fr, el,
ja, etc) or meta, commons etc.
The project is one of b (wikibooks), k (wiktionary), n (wikinews), o
(wikivoyage), q (wikiquote), s (wikisource), v (wikiversity), z (wikipedia).
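For context, this is roughly how I parse each line, based only on the field
order described above (I leave the hourly-counts field alone, since its
encoding isn't covered here):

    # Sketch: parse one line of a pagecounts-ez monthly file, based only
    # on the documented field order. The hourly-counts field is kept as a
    # raw string since its encoding is not described above.
    def parse_line(line):
        wiki_code, title, monthly_total, hourly = line.split(" ", 3)
        subproject, project = wiki_code.split(".", 1)
        return {
            "subproject": subproject,   # language code, or meta/commons/...
            "project": project,         # b, k, n, o, q, s, v, z, ...
            "title": title,
            "monthly_total": float(monthly_total),  # may be interpolated
            "hourly_counts": hourly,    # left unparsed
        }

    print(parse_line("en.z Main_Page 12345 <hourly-counts-field>"))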
However, I've been coming across a large number of entries with the wiki
code "en.m". The "m" code is undocumented. It appears to be the mobile
version of Wikipedia; can anyone confirm that? Should the page be updated
with this information?
Thanks,
Michael
Hi everybody,
the Analytics team needs to make some changes to the current configuration
and deployment of the Analytics databases. Before starting, here is a little
refresher so we are all on the same page:
- db1046 - eventlogging master database
- db1047 - also known as analytics-slave.eqiad.wmnet - replicates s1/s2 via
MySQL, and the log database (on db1046) using a custom replication script.
- dbstore1002 - also known as analytics-store.eqiad.wmnet and
x1-analytics-slave.eqiad.wmnet - replicates most of the S shards and X1 via
MySQL, and the log database using a custom replication script.
- db1108 (brand new host) - replicates the log database using a custom
replication script.
We have been suffering from space and performance issues on dbstore1002
over the past months (https://phabricator.wikimedia.org/T168303), so we
came up with the following plan:
- db1108, a brand new host with SSD disks, replaces db1047 and becomes the
target of the analytics-slave.eqiad.wmnet CNAME. This new host will be a
replica of the log database only; no other database will be replicated.
- dbstore1002 will lose the log database, which will be dropped from the
host.
- db1047 will eventually be decommissioned (after backing up its data and
alerting people beforehand - T156844).
This will allow us to:
1) Reduce the load on dbstore1002 and free a lot of space on the host.
2) Offer a more performant way to query eventlogging analytics data.
3) Reduce the current performance issues that we have been experiencing
while trying to sanitize/purge old event-logging data
(https://phabricator.wikimedia.org/T156933)
The plan is the following:
- November 13th: the analytics-slave CNAME moves from db1047 to db1108
- November 20th: the log database will be dropped from
dbstore1002/analytics-store together with the event-logging replication
script
- December 4th: shutdown of db1047 (after backing up the non-log database tables)
More info in https://phabricator.wikimedia.org/T156844
To summarize what will change from the users' perspective (a short
connection sketch follows the list):
- dbstore1002 (analytics-store) will keep replicating all the S/X shards
(wikis) and all the databases like staging that everybody is used to
working with. It will only lose the log database.
- db1108 will offer the log database replication and a staging database.
- db1047's (analytics-slave) staging database will be moved or copied to
dbstore1002 under a different name (like staging_db1047).
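As a concrete example of what querying will look like after the change
(a sketch only: the hostnames are the CNAMEs above, but the client library
and the credentials-file path are just assumptions, so use whatever
client/config you normally do):

    # Sketch: after the change, event-logging ('log') data is queried via
    # analytics-slave (db1108); wikis and staging stay on analytics-store.
    import pymysql

    conn = pymysql.connect(
        host="analytics-slave.eqiad.wmnet",  # new home of the log database
        db="log",
        read_default_file="~/.my.cnf",       # assumed credentials location
    )
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        print(cur.fetchall())

    # Wiki replicas (S/X shards) and the staging database remain on
    # analytics-store.eqiad.wmnet (dbstore1002), as before.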
Please let us know your opinion in T156844; we'd love to hear some
feedback before proceeding, especially about extra requirements that we
haven't thought of.
Thanks!
Luca (on behalf of the Analytics team)
Hi everybody,
the Analytics team needs to do the following maintenance operations:
1) migrate the Event-Logging master db ('log', currently on db1046) to the
new host db1107 (T156844). This should happen on *Wed Nov 15th (EU morning)*,
and it should be transparent to all the Event Logging users. The only
drawback that might be observed is a delay in getting the latest records on
the analytics db replicas (db1108, db1047, dbstore1002).
2) Reboot thorium and all the stat boxes for Linux kernel updates.
- Thorium hosts all the analytics websites like pivot.wikimedia.org,
yarn.wikimedia.org, analytics.wikimedia.org, etc., and will be rebooted on
*Wed Nov 15th (EU morning)*; the websites' downtime should be minimal (in
the range of minutes).
- stat boxes (stat1004, stat1005, stat1006) are usually running a lot of
screen/tmux sessions with various data crunching activities, so I'll try to
follow up with all the users currently running something on them to check
whether I can proceed. I'd tentatively schedule the reboots for *Thu Nov
16th (EU morning)*, but please follow up with me asap if this needs to be
postponed.
Thanks in advance and sorry for the trouble!
Luca (on behalf of the Analytics team)
Google Code-in is an annual contest for 13-17 year old students. It
will take place from Nov 28 to Jan 17 and is not only about coding tasks.
While we wait to hear whether Wikimedia will get accepted:
* You have small, self-contained bugs you'd like to see fixed?
* Your documentation needs specific improvements?
* Your user interface has small design issues?
* Your Outreachy/Summer of Code project welcomes small tweaks?
* You'd enjoy helping someone port your template to Lua?
* Your gadget code uses some deprecated API calls?
* You have tasks in mind that welcome some research?
Also note that "Beginner tasks" (e.g. "Set up Vagrant") and "generic"
tasks (e.g. "Choose & fix 2 PHP7 issues from the list in
https://phabricator.wikimedia.org/T120336") are very welcome, because we
will need hundreds of tasks. :)
And we also have more than 400 unassigned open 'easy' tasks listed:
https://phabricator.wikimedia.org/maniphest/query/HCyOonSbFn.z/#R
Would you be willing to mentor some of those in your area?
Please take a moment to find / update [Phabricator etc.] tasks in your
project(s) which would take an experienced contributor 2-3 hours. Check
https://www.mediawiki.org/wiki/Google_Code-in/Mentors
and please ask if you have any questions!
For some achievements from last round, see
https://blog.wikimedia.org/2017/02/03/google-code-in/
Thanks!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi,
We are three graduate students at UC Berkeley, and we are currently working on a machine learning project for a class that we’re taking.
We’re using the page views data that we believe you maintain: https://dumps.wikimedia.org/other/pagecounts-raw/
We have two quick questions that we were hoping you could answer:
1) We found views with a size of -1 or 0. Does this mean the page doesn’t exist?
2) We found that some articles have a `size` that varies widely across the hourly snapshots of a day. Is that legitimate, or is there something odd with the data?
Thanks,
Ugur
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, November
15, 2017, at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=nMENRAkeHnQ
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#November_2017>.
This month's presentation:
Conversation Corpora, Emotional Robots, and Battles with Bias
By *Lucas Dixon (Google/Jigsaw)*
I'll talk about interesting experimental setups for
doing large-scale analysis of conversations in Wikipedia, and what it even
means to grapple with the concept of conversation when one is talking about
revisions on talk pages. I'll also describe challenges with having good
conversations at scale, some of the dreams one might have for AI in the
space, and I'll dig into measuring unintended bias in machine learning and
what one can do to make ML more inclusive. This talk will cover work from
the WikiDetox <https://meta.wikimedia.org/wiki/Research:Detox> project as
well as ongoing research on the nature and impact of harassment in
Wikipedia discussion spaces
<https://meta.wikimedia.org/wiki/Research:Study_of_harassment_and_its_impact> –
part of a collaboration between Jigsaw, Cornell University, and the
Wikimedia Foundation. The ML model training code, datasets, and the
supporting tooling developed as part of this project are openly available.
Many kind regards,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org