Hey everyone,
As part of T195491 <https://phabricator.wikimedia.org/T195491>, Erik has
been looking into the details of our regex processing and ways to handle
ridiculously long-running regex queries. He pulled all the regex queries
over the last 90 days to get a sense of what features people are using and
what impact certain changes he was considering would have on users. Turns
out there are a lot more users than I would have thought—which is good
news! And a lot of them look like bots.
He also made the mistake of pointing me to the data and highlighting a
common pattern: searches for interwiki links. I couldn't help myself; I
started digging around and found that the majority of the searches are
looking for those interwiki links, and that the vast majority of regex
searches fall into three types: interwiki links, URLs, and Library of
Congress collection IDs.
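For a purely illustrative example: CirrusSearch exposes regex matching
through the insource keyword, so a search for interwiki links to German
Wikipedia might look like insource:/\[\[de:/, which scans the raw wikitext
for a literal "[[de:" prefix.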
Overall, there are 5,613,506 regexes total across all projects and all
languages, over a 90-day period. That comes out to ~62K/day—which is a lot
more than I'd expected, though I hadn't thought about bots using regexes.
Read more on MediaWiki
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Ex…>
.
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
At the Barcelona Hackathon, one of my projects was to carry around a sign
that said, “Tell Me Why Your Search Sucks!” in about 20 languages. A number
of people shared their thoughts, which I've summarized on Phab ticket
T189791 <https://phabricator.wikimedia.org/T189791#4226596>.
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hello,
We recently announced the new Wikimedia Technical Conference (TechConf)
during the closing session of the Barcelona Hackathon on May 20, 2018. We
are sending this email to give an update on the planning and organization,
and also to let everyone know how the nomination process will work for
those interested in attending.
The Wikimedia Technical Conference will take place in Portland, OR, USA on
October 22-25, 2018. As mentioned in previous emails [1][2] and on the wiki
page [3], this conference will be focused on the cross-departmental program
called Platform Evolution. We will be providing more information and
context as we go along in the process.
For this conference, we are looking for diverse stakeholders, perspectives,
and experiences that will help us make informed decisions for the future
evolution of the platform. We need people who can create and architect
solutions, as well as those who actually make decisions on funding and
prioritization for the projects.
Later this week, we will send out a form with more detailed information on
the nomination process and how to nominate people (including yourself) to
attend this conference, along with the skills, experiences, and/or
backgrounds that we are looking for. Due to the time needed for visa
applications and other constraints, the deadline for nominations will be
June 8th. Please make sure that you don't miss the deadline!
If you have any questions, please post them on the talk page [4].
[1] https://lists.wikimedia.org/pipermail/mediawiki-l/2018-April/047367.html
[2] https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089738.html
[3] https://mediawiki.org/wiki/Wikimedia_Technical_Conference/2018
[4] https://www.mediawiki.org/wiki/Talk:Wikimedia_Technical_Conference/2018
Cheers from the Program Committee:
Kate, Corey, Joaquin, Greg, Birgit and TheDJ
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
Hi everyone,
I just finished putting together an annotated list of potential
applications of natural language processing to on-wiki search
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…>.
There are dozens and dozens of ideas there, including many that are
interesting but probably not practical. If you have any additional ideas,
questions, suggestions, recommendations, or preferences, please share
them, either on the mailing list or on the talk page!
The goal is to narrow it down to one or two things to pursue over the next
two to four quarters, along with other projects we are working on.
Thanks!
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Привет! (Hello!)
Another update from the Search Platform team for the week starting 2018-05-07
**Programming note:** Due to the upcoming Wikimedia Hackathon and some
(personal) holiday time, the next update will be the week of
2018-05-28. Until then, and as always, feedback and questions are
welcome.
== Highlights ==
* Map internationalization launched everywhere, and embedded maps
(mapframe) are now live on 276 Wikipedias [0]
* ''"Hello, my name is _____"'' is an in-depth blog post by Trey that
was published earlier this week where he details the irony that
searching for names is not always as straightforward as you might
think. [1]
== Discussions ==
=== Search ===
* Erik updated a script that was generating lots of 500 errors in the logs [2]
* Erik also did a lot of research to evaluate the impact of adding ~2,700
new shards to the production cluster (a PDF attached to the last comment
on the ticket contains more information) [3]. There is also a follow-up
ticket for the next steps [4]
* Trey worked on the analysis config for the new Slovak stemmer, which
was deployed this week; the plugin still needs to be deployed and the
wikis re-indexed. [5]
* Stas and others worked on looking up entities by external
identifiers; the work is done for now, but it needs a re-index to be
fully ready [6]
* David worked on externalizing the parsing logic from
SimpleKeywordFeature and FullTextQueryStringQueryBuilder, which was
pushed into production in April 2018 [7]
== Other Noteworthy Stuff ==
* Trey's most recent updates to transliteration on the Crimean Tatar
Wikipedia are live; after a year of part-time 10% project work, the
transliteration infrastructure for Crimean Tatar is done and the
accuracy is in the high 90% range. [8]
== Did you know? ==
* The English word “dove”, as the past tense of “dive”, is one of the
rare cases where a conjugation has become more irregular over time.
The verb “dive” picked up the strong conjugation [9] by analogy with
other strong verbs, particularly “drive/drove”. [10] Going in the more
typical direction of regularization, Swedish strong verbs slowly lost
some of their distinctive plural forms. [11] The change started in the
16th century, and was still in progress as late as the 1940s. From
the search perspective, regular forms are easier to deal with—so, way
to go Swedish!
[0] https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/089964.html
[1] https://blog.wikimedia.org/2018/05/08/searching-for-names-is-not-always-str…
[2] https://phabricator.wikimedia.org/T179266
[3] https://phabricator.wikimedia.org/T192972
[4] https://phabricator.wikimedia.org/T193654
[5] https://phabricator.wikimedia.org/T191544
[6] https://phabricator.wikimedia.org/T99899
[7] https://phabricator.wikimedia.org/T188530
[8] https://phabricator.wikimedia.org/T188321
[9] https://en.wikipedia.org/wiki/Germanic_strong_verb
[10] https://en.wiktionary.org/wiki/dove#Etymology_2
[11] https://en.wikipedia.org/wiki/Swedish_grammar#Historical_plural_forms
---
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
Hi all,
sorry for the mix-up; it somehow fell through the cracks that deepcategory
is not yet working on all wikis. On wikis that are not yet indexed,
deepcategory currently seems to return all results from just the category
whose name was entered.
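For an illustrative example: on an indexed wiki, a search like
deepcategory:"Maps" should match pages in the category "Maps" or any of
its subcategories; on a wiki that is not yet indexed, it currently returns
only pages directly in the named category, as described above.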
>Now, since we are now indexing categories only for select wikis - the
>list is here:
> https://noc.wikimedia.org/conf/dblists/categories-rdf.dblist
<https://noc.wikimedia.org/conf/dblists/categories-rdf.dblist> - we may
>consider adding more wikis to it. E.g. see:
>https://phabricator.wikimedia.org/T194139
<https://phabricator.wikimedia.org/T194139>
>So which wikis need to be added?
Deep category search should work on all wikis, so can we enable it on all
of them? Stas, I saw that you already added all wikis with more than 1,000
categories for indexing. Do you know when this will start working, and
when we can have it on all wikis?
For the time being, we will add a note to the "Pages in this category"
info text, so people don't get confused.
Best,
Lea
--
Lea Voget
Product Manager Technical Wishlist
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
(cross-post)
Dear all,
we are really happy to announce that the new AdvancedSearch interface was
just deployed as a beta feature to all wikis. [1]
The search has great options for performing advanced queries, e.g. by
using keywords like "hastemplate" or "intitle", but often even experienced
editors don't know about them. This is what we found out in a workshop
series on advanced searches in 2016, and it is why we have built the
AdvancedSearch extension. [2]
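As a purely illustrative example of such a query:
intitle:lighthouse hastemplate:"Coord" finds pages with "lighthouse" in
the title that also use the Coord template; the new form builds queries
like this without users having to type the keywords by hand.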
AdvancedSearch enhances Special:Search with an advanced parameters form.
It serves as an interface to some of the search options that the Wikimedia
Foundation's search team has implemented over the past years. With this
interface, users don't have to know the syntax behind each search field,
but they can learn about it if they want to.
*From small beta to full beta*
The feature has already been available as a beta feature on deWP, arWP,
huWP, faWP and mediawiki.org for more than 5 months. During this "small
beta" phase (a base version with a set of features, deployed to a few
wikis, both LTR and RTL), support for more search options was added:
searches in categories and subcategories, searches for content in a
specific language on wikis that have the Translate extension enabled, and
searches for subpages of a page. The way namespaces are selected and
configured was also improved, and several bugs were fixed.
Everyone is invited to test the feature, now in full beta!
If you want to give us feedback or if you find a bug, please use the main
feedback page (or file a ticket in phabricator):
https://www.mediawiki.org/wiki/Help_talk:Extension:AdvancedSearch
If you want to learn more about the project, the functional scope of the
AdvancedSearch extension and the usage, please see
* the help page: https://www.mediawiki.org/wiki/Help:AdvancedSearch
* the main project page:
https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch
* the list of supported search options:
https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch/Functi…
*Thanks, thanks, thanks :-)*
A huge thanks to everyone who has tested the feature and given feedback
over the last 5 months, and to everyone who has translated software
messages and announcements - this is much appreciated! And a huge thanks
to the WMF's search team, who did all the backend work and built great
options for advanced search queries that can now be accessed through the
AdvancedSearch interface. It was and is great to work with you :-)
Looking forward to more testing and feedback to further improve the feature,
Thanks a lot,
Birgit
(for WMDE's Technical Wishes team)
[1] https://phabricator.wikimedia.org/T193182 (deployment ticket)
[2] https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch/Workshop
--
Birgit Müller
Community Communications Manager
Software Development and Engineering
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
With the hackathon coming up, I thought we could ponder what could be done
while there. I've been constructing a list of horrible ideas over the last
couple of weeks:
Web UI for cirrus debug/devel features:
- Settings dump
- Mappings dump
- Copy version of settings+mappings suitable to create index with curl
- cirrusDumpQuery
- cirrusDumpResult
- cirrusExplain
- cirrusUserTesting
The top-level idea is to make it easy to access all of these things. It
could be a userscript run on-page in the wiki, or an SPA run from Tool
Labs (or even people.wikimedia.org). A sketch of fetching one of these
debug outputs is below.
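As a starting point, here is a minimal sketch in Python, assuming the
cirrusDumpQuery debug parameter is accepted as a URL parameter on a search
request and returns JSON (the wiki and search term are placeholders):

    import json
    import requests

    # Fetch the Elasticsearch query CirrusSearch would build for a search,
    # via the cirrusDumpQuery debug parameter.
    def dump_cirrus_query(wiki, term):
        resp = requests.get(
            wiki + "/w/index.php",
            params={"search": term, "cirrusDumpQuery": 1},
            headers={"User-Agent": "cirrus-debug-sketch/0.1 (hackathon toy)"},
        )
        resp.raise_for_status()
        return resp.json()

    print(json.dumps(dump_cirrus_query("https://en.wikipedia.org",
                                       "intitle:lighthouse"), indent=2))

The same shape would presumably work for cirrusDumpResult and friends by
swapping the parameter name.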
============
docker setup to initialize elasticsearch, import the latest cirrus dump,
and attach a kibana instance for the UI. Probably with a modified mapping
more amenable to kibana inspection. (A sketch of the import step follows.)
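The import step could look roughly like this sketch, assuming a local
Elasticsearch on port 9200, an index that already exists with a compatible
mapping, and one of the public cirrus dumps from
dumps.wikimedia.org/other/cirrussearch/ (the filename and index name are
placeholders; the dump files are already in Elasticsearch bulk format,
with alternating action and document lines):

    import gzip
    import requests

    DUMP = "simplewiki-cirrussearch-content.json.gz"             # placeholder
    BULK_URL = "http://localhost:9200/simplewiki_content/_bulk"  # placeholder
    BATCH = 1000  # even, so action/document line pairs are never split

    def batches(path, size):
        buf = []
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                buf.append(line)
                if len(buf) >= size:
                    yield "".join(buf)
                    buf = []
        if buf:
            yield "".join(buf)

    for body in batches(DUMP, BATCH):
        resp = requests.post(BULK_URL, data=body.encode("utf-8"),
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()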
============
Some script to manage elasticsearch shard allocation manually via the API?
Pointless, but perhaps fun. (A sketch is below.)
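The manual piece could be as small as this sketch, using Elasticsearch's
cluster reroute API to move a single shard between nodes (the index, shard
number, and node names are placeholders):

    import requests

    # Ask the cluster to move one shard copy from one node to another.
    command = {
        "commands": [{
            "move": {
                "index": "enwiki_content",   # placeholder
                "shard": 0,
                "from_node": "elastic1001",  # placeholder
                "to_node": "elastic1002",    # placeholder
            }
        }]
    }
    resp = requests.post("http://localhost:9200/_cluster/reroute", json=command)
    resp.raise_for_status()
    print(resp.json()["acknowledged"])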
===========
phabricator formatted export for jupyter
- problem: images?
-- seems we would need to upload them separately and then reference them
in the final output
-- there is an API for this, but then we can't just emit something to
paste into a field; the whole export would need to happen over the API
- better, but worse: data URIs would be great, but I don't know if Phab is
built for megabyte-sized posts, and it doesn't support data URIs anyway.
Browsers also hate it when you copy/paste excessive amounts of data.
(A text-only starting point is sketched below.)
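Ignoring images entirely, a text-only export could start from something
like this sketch, which uses nbformat to read a notebook and emits
markdown cells as-is plus code cells wrapped in Remarkup-style
triple-backtick blocks (the filename is a placeholder, and how far plain
markdown diverges from Remarkup is an open question):

    import nbformat

    def to_remarkup(path):
        nb = nbformat.read(path, as_version=4)
        chunks = []
        for cell in nb.cells:
            if cell.cell_type == "markdown":
                chunks.append(cell.source)
            elif cell.cell_type == "code":
                chunks.append("```\n" + cell.source + "\n```")
        return "\n\n".join(chunks)

    print(to_remarkup("analysis.ipynb"))  # placeholder filename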
==========
Custom implementation to find similar images in commons:
-
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=rep1&…
-
http://www.deepideas.net/building-content-based-search-engine-quantifying-s…
- Convert each image into a feature vector
- Use clustering to generate an image signature
- Find k-nearest-neighbors via Earth Mover's Distance (EMD); the pyemd
library can be used
- It's not obvious how the signature + weights get plugged into pyemd
(see the sketch after this list)
- EMD is expensive; no clue how this would scale to millions of images
- This would probably perform poorly, but it's interesting as a way to
understand some of the history of similar-image retrieval
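On the pyemd question: its emd() function takes two fixed-length float64
histograms plus a precomputed ground-distance matrix between bins, so
variable-size signatures would first have to be mapped onto a shared set
of bins. A toy sketch under that assumption, using 16-bin grayscale
histograms:

    import numpy as np
    from pyemd import emd

    def emd_distance(hist_a, hist_b, bin_centers):
        # Normalize the histograms and compute pairwise bin distances.
        hist_a = (hist_a / hist_a.sum()).astype(np.float64)
        hist_b = (hist_b / hist_b.sum()).astype(np.float64)
        dist = np.abs(bin_centers[:, None] - bin_centers[None, :])
        return emd(hist_a, hist_b, dist.astype(np.float64))

    bins = np.linspace(0.0, 1.0, 16)  # toy 16-bin grayscale space
    a, b = np.random.rand(16), np.random.rand(16)
    print(emd_distance(a, b, bins))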
=========
https://github.com/beniz/deepdetect.git ?
- Use a pre-trained ML model to detect objects in images and then label
those objects.
- Can compare the sets of detected objects to find similar images. Could
probably be extended with color information.
- Do we actually have a use case for finding images similar to other
images? Perhaps on upload? (A sketch of querying it is below.)
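For reference, querying a DeepDetect server over its REST API could look
like this sketch, assuming a server is already running locally with an
image classification service created beforehand (the service name, port,
and image URL are placeholders; the exact parameters should be checked
against the DeepDetect docs):

    import requests

    payload = {
        "service": "imageserv",  # placeholder: a previously created service
        "parameters": {"output": {"best": 5}},  # top-5 predicted labels
        "data": ["https://example.org/some-image.jpg"],  # placeholder
    }
    resp = requests.post("http://localhost:8080/predict", json=payload)
    resp.raise_for_status()
    print(resp.json())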
==========
Elasticsearch cluster balance simulator
- Simulate and evaluate how the cluster balancing performs under various
conditions
- No way this could be done in a weekend hackathon. It would probably be
completely wrong as well, simulating some idealized cluster that doesn't
act like ours.
==========
Prototype Lire plugin for elasticsearch
- Lire = Lucene Image REtrieval
- I know nothing about it, other than that it exists
- A plugin already exists that plugs it into Solr, so how hard could it be?
- Maybe try it out standalone with a small test set to see what it does