Happy Friday the 13th,* everyone!
I realized that I've gotten more lax about defining terms in my write ups
over time, because I am always talking about types (pre- and post-analysis)
and tokens and monolithic and unpacked analyzers, etc, etc. So, I
reorganized the Search Glossary a bit into topical sections, and added a
big section on Language Analysis, which I will point to in my write ups.
Please review the new section if you have time and interest. If you know
this stuff, please correct any errors. If you don't know this stuff, please
ask questions about anything that's unclear so I can improve it. Thanks!
The new section of the glossary is under "Language Analysis"
<https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary#Language…>.
Cheers,
—Trey
*Word of the day: friggatriskaidekaphobia
<https://en.wiktionary.org/wiki/friggatriskaidekaphobia>.
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hi everyone,
I've got an update on the NLP project selection. We've narrowed things down
to a handful of projects we could work on with a consultant, and a handful
we could work on internally.
David, Erik, and I reviewed a selection of the most promising-seeming
and/or most interesting projects and gave them a very rough cost estimate
based on how big of a relative impact they would have, technologically how
hard they would be, and how difficult the UI aspect would be. The scores
are not definitive, but helped guide the discussion. You can see the list
of projects we looked at and more details of the scoring on MediaWiki
<https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potenti…>.
For the possibility of working with an outside consultant, we also
considered how easily separated each project would be from our overall
system (making it easier for someone new to get up to speed), how projects
feed into each other, how easily we could work on projects ourselves (like,
we know pretty much what to do, we just have to do it), etc.
Our current *recommendation for an outside consultant* would be to start
with (1) *spelling correction/did you mean improvements,* with an option to
extend the project to include either (2) *"more like" suggestion
improvements,* or (3) *query reformulation mining,* specifically for typo
corrections.
For spelling correction (#1), we are envisioning an approach that
integrates generic intra-word and inter-word statistical models, optional
language-specific features, and explicit weighted corrections. We believe
we could mine redirects flagged as typo correction for explicit
corrections, and the query reformulation mining (#3) would also provide
frequency-weighted explicit corrections. Our hope is that a system built
initially for English would be readily applicable to other alphabetic
languages, most probably other Indo-European languages, based on statistics
available from Elastic; and that some elements of the system could be
applied to other non-alphabetic languages and languages that are
typologically <https://en.wikipedia.org/wiki/Morphological_typology> dissimilar
to Indo-European languages.
Looking at the rest of the list, (a) *wrong keyboard detection* seems like
something we should work on internally, since we already have a few good
ideas on how to approach it. (b) *Acronym support* is a pet peeve for
several members of the team, and seems to be straightforward to improve. (c)
*Automatic stemmer building* and (d) *automatic stop word* generation
aren't so much projects we should work on as things we should research to
see if there are already tools or lists out there we could use to make the
projects much easier.
Comments and questions here or on the talk page are welcome.
Cheers,
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Hi everyone,
>
> I just finished putting together an annotated list of potential
> applications of natural language processing to on-wiki search
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…>.
> There are dozens and dozens of ideas there—including many that are
> interesting but probably not practical. If you have any additional ideas,
> questions, suggestions, recommendations, or preferences, please
> share!—either on the mailing list or on the talk page.
>
> The goal is to narrow it down to one or two things to pursue over the next
> two to four quarters, along with other projects we are working on.
>
> Thanks!
> —Trey
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
>
Hello again,
This is the weekly update from the Search Platform team for the weeks
starting 2018-07-02 and 2018-07-09.
Programming Note: With the Wikimania Hackathon, Wikimania proper, and
resulting travel for folks in coming weeks, the next update will be
for the week starting 2018-07-30.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* David and Stas worked on fine tuning of search configs to
mediawiki-config for Wikidata [0]
* Stas and Addshore helped to catch and clean up some bad lookups and
report them properly [1]
* A "Wrong document type" error was corrected by Erik by fixing
Sanitizer MetaStore integration [2]
* Erik worked on tracking queries that run on the Elasticsearch
clusters longer than both the server-side and client-side timeouts by
fixing some slow logging functionality [3]
* There was a Meta-wiki error where search suggested a non-existent
title due to a namespace/redirect mixup. Erik's note: "it's a bit
awkward, but typing Help:Glo into autocomplete on metawiki suggests
'Global Account' from the main namespace, and selecting it takes you
to Help:Unified login" [4]
* In order to dispatch queries to a particular search setup (cirrus
defaults vs wikibase custom query builder), David created a flexible
way to classify queries, meant to replace the 'getSyntaxUsed' approach
currently in SearchContext. [5]
* Trey and a community volunteer, Athena, created a basic Mirandese
analysis chain. It was tested on RelForge and pushed into production
this week [6]. Trey kicked off, completed and tested the re-indexing
of the Mirandese Wikis [7].
* The Re-Re-Index of the Serbian Wikis after refactored plugins were
deployed has been completed [8] and the re-index of the Croatian,
Serbo-Croatian, and Bosnian Wikis was also done [9]
* We currently mix a tiny number of namespace documents into the
regular indices, which seems inefficient; so Erik built a unified
namespace index [10]
* Erik updated the 'OtherIndex' to operate on a cluster other than the
one holding the wiki [11]
* Trey updated a variety of things on the Analysis Tools with lots of
little fixes and improvements, and fixed a few small errors in the
analysis code that conflated post-analysis types and pre-analysis
types [12]
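
For readers unfamiliar with the terms in the last item, here is a small sketch of the distinction, assuming the usual sense in which tokens are individual word occurrences, pre-analysis types are the distinct original strings, and post-analysis types are the distinct outputs of the analyzer. The `toy_analyzer` function is a hypothetical stand-in, not the analysis code or any real analysis chain.

```python
from collections import Counter


def toy_analyzer(token):
    """Stand-in analyzer: lowercase, then strip a final 's' from longer words."""
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t


tokens = "Dogs dog DOG dog cats Cat".split()

# Pre-analysis types: distinct strings as they appear in the text.
pre_types = Counter(tokens)
# Post-analysis types: distinct strings after the analyzer has run.
post_types = Counter(toy_analyzer(t) for t in tokens)

print(len(tokens), len(pre_types), len(post_types))
```

Counting one kind of type where the other is meant gives different totals for the same text, which is why conflating them matters.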
== Did you know? ==
* The period of this status update includes Friday, July 13, 2018. The
fear of the number thirteen is called "triskaidekaphobia" [13]. There
are two words for fear of Friday the 13th: "paraskavedekatriaphobia"
[14] and "friggatriskaidekaphobia" [15]—the first maintains a
consistent etymology with the Greek word for Friday, "Paraskeví",
while the second invokes "Frigg", the Norse Goddess after whom Friday
is named in English.
[0] https://phabricator.wikimedia.org/T182717
[1] https://phabricator.wikimedia.org/T198091
[2] https://phabricator.wikimedia.org/T197446
[3] https://phabricator.wikimedia.org/T196180
[4] https://phabricator.wikimedia.org/T115756
[5] https://phabricator.wikimedia.org/T197774
[6] https://phabricator.wikimedia.org/T194941
[7] https://phabricator.wikimedia.org/T197890
[8] https://phabricator.wikimedia.org/T196404
[9] https://phabricator.wikimedia.org/T196658
[10] https://phabricator.wikimedia.org/T192699
[11] https://phabricator.wikimedia.org/T194678
[12] https://phabricator.wikimedia.org/T199273
[13] https://en.wiktionary.org/wiki/triskaidekaphobia
[14] https://en.wiktionary.org/wiki/paraskavedekatriaphobia
[15] https://en.wiktionary.org/wiki/friggatriskaidekaphobia
---
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" [1] or
"Volunteer needed" [2] in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Relations Specialist
Wikimedia Foundation
Cross-posting...!
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
---------- Forwarded message ---------
From: Yaron Koren <yaron(a)wikiworks.com>
Date: Tue, Jul 10, 2018 at 11:01 AM
Subject: [MediaWiki-l] New episode of "Between the Brackets": Stas Malyshev
To: MediaWiki announcements and site admin list <
mediawiki-l(a)lists.wikimedia.org>
Hi,
A new episode of the MediaWiki podcast "Between the Brackets" has been
released, featuring an interview with Wikimedia Foundation developer Stas
Malyshev, who works on search in both MediaWiki and Wikidata. You can
listen to the interview here:
http://betweenthebrackets.libsyn.com/episode-12-stas-malyshev
-Yaron
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Hello!
As you might already know, Wikidata Query Service has been misbehaving
in the last 24 hours. Our public SPARQL endpoint [1] was slow and
throwing timeouts. Sadly, exposing a public SPARQL endpoint is a hard
problem and we don't have a final solution to it. Still, we have some
improvements. Have a look at the incident report [2] if you want
details.
I also started to write a runbook for WDQS [3]. This should be
interesting mostly to our SRE team, but feel free to also have a look
and suggest improvements / clarifications.
Note that our internal WDQS endpoint was stable during that time (as expected).
Thanks for your help and your patience!
Guillaume
[1] https://query.wikidata.org/
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180625-wdqs
[3] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST
I thought that this video, published in May 2018, was somewhat interesting
and I am sharing it in case others are also interested. The presenter uses
a change of design of Wikipedia's front page search box from 2010 (see
https://blog.wikimedia.org/2010/06/15/usability-why-did-we-move-the-search-…)
as an example, though I would hope that the lesson from this video isn't
that it's okay to frequently disrupt the workflows of existing users with
design changes regardless of the amount of complaints from existing users.
The main points that I drew from this presentation are that interfaces
should be intuitive and should have relatively light cognitive load. Those
points may sound obvious to experienced UX designers, but may be of
interest to people whose areas of expertise are in other domains.
I also appreciated that the presenter shared an example of a situation in
which people said one thing in surveys but behaved in the opposite way in
practice.
Here is the link to the video: https://www.youtube.com/watch?v=mxzK4sWfvH8
Regards,
Pine
( https://meta.wikimedia.org/wiki/User:Pine )