I wrote this query to find all page moves done in fawiki in 2017, and
determine how many edits the performing user had prior to that page move.
The query tries to use indexes, as much as I could think of, and yet it
runs for a very long time (more than 20 min, at which point it gets killed).
Any ideas on how to further optimize this query are appreciated!
case when ug_group = 'bot' then 1 else 0 end as user_is_bot,
rev_user = log_user
and rev_timestamp < log_timestamp
) as rev_count_before_move
on page_id = log_page
left join user_groups
on log_user = ug_user
and ug_group = 'bot'
where log_action = 'move'
and log_timestamp > '20170101000000'
I feel like Quarry is slower than before (before being last week or last
month). Queries almost always get queued, and once executed, simple queries
take longer to return results.
I have no idea how to investigate this though. Any thoughts?
I've been hacking on a new tool and I thought I'd share what (little) I
have so far to get some comments and learn of related approaches from the
The basic idea would be to have a browser extension that tells the user if
the current page they're viewing looks like a good reference for a
Wikipedia article, for some whitelisted domains like news websites. This
would hopefully prompt casual/opportunistic edits, especially for articles
that may be overlooked normally.
As a proof of concept for a backend, I built a simple bag-of-words model of
the TextExtracts of enwiki's
Category:All_articles_needing_additional_references. I then set up a tool
to receive HTML input and retrieve the 5 most similar articles to that
input. You can try it out in your browser, or on the command line.
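
To make the bag-of-words idea above concrete, here is a minimal sketch of
how such a lookup could work. It is not the tool's actual code: the
function names, the cosine-similarity scoring and the three toy articles
are all assumptions made for illustration, and a real backend would work
on the TextExtracts of the category members rather than on a hard-coded
dictionary.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, articles, k=5):
    """Return the titles of the k articles most similar to query_text."""
    query = bag_of_words(query_text)
    scored = [
        (cosine_similarity(query, bag_of_words(body)), title)
        for title, body in articles.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Hypothetical stand-ins for article extracts, for illustration only.
articles = {
    "Volcano": "A volcano is a rupture in the crust that lets lava and ash "
               "escape during an eruption",
    "Guitar": "The guitar is a fretted string instrument played by plucking",
    "Glacier": "A glacier is a persistent body of dense ice moving under "
               "its own weight",
}

print(most_similar("lava flows after the eruption of the volcano", articles))
```

Raw word counts let common words such as "the" influence the ranking, so
a weighting scheme like TF-IDF would probably be a sensible next step.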
The results could definitely be better, but having tried it on a few
different articles over the past few days, I think there's some potential.
I'd be interested in hearing your thoughts on this. Specifically:
* If such a backend/API were available, would you be interested in using it
for other tools? If so, what functionality would you expect from it?
* I'm thinking of just throwing away the above proof of concept and using
ElasticSearch, though I don't know a lot about it. Is anyone aware of a
similar dataset that already exists there, by any chance? Or any reasons
not to go that way?
* Any other comments on the overall idea or implementation?
3- Example: curl
| curl -X POST http://tools.wmflabs.org/similarity/search --form "text=<-"
Guilherme P. Gonçalves
In the new database setup, user databases are no longer possible on the
same servers as the production databases. I noticed on
https://phabricator.wikimedia.org/T142807 Daniel saying "Death blow for
GHEL coordinate extraction and WikiMiniAtlas." and on
https://phabricator.wikimedia.org/T183066 several tools broke down.
Do we have an overview of tools that are now broken? Did the database
admins actually contact the tool maintainers about the loss of
functionality or was this just sent to the -announce list?
One of tools.dplbot's daily tasks has been having repeated problems
since yesterday. A script that ran without errors and completed in about
10 minutes on Friday ran for over 90 minutes on Saturday, and died with
a "MySQL server has gone away" error. There were no edits to the script
in between Friday and Saturday, so I have to assume that something
changed on the server side.
The script reads from enwiki.analytics.db.svc.eqiad.wmflabs, and both
reads from and writes to tools.labsdb. All of the errors occurred on
writes to the user database. I was able to work around the errors by
dropping the database connection and opening a new one immediately
before writing (I have no idea why this works, since the timeout setting
on the database for inactive connections is 8 hours, and this script was
not even running for two hours; but it did work). However, the script
continues to run for an order of magnitude longer than it did on Friday
(~100 minutes vs. ~10 minutes). Is anyone else experiencing similar
Given that Trusty was released almost 4 years ago, are there any plans for
getting a newer platform for grid users? This is partially in relation to
T183090; there are some areas where k8s just fails. What prospects are
there for moving to newer grid exec nodes? I would expect that
we will be seeing more and more cases of software incompatibility or
security issues arise as time passes, and given the glacial speed at
which such a move would proceed, I am surprised we have not seen the first
stages of a migration already in progress.