Hello,
This is the weekly update from the Search Platform team for the week
starting 2019-01-14.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* Trey updated TextCat with models for detecting Russian typed on an
English keyboard and vice-versa, and UTF-8 Russian text improperly
encoded as Windows-1251, [0] as a precursor to providing
wrong-keyboard/encoding detection and suggestion. [1]
* Erik and the team did a lot of work on an epic ticket (with several
sub tasks) to explore and figure out next steps in using user click
data to tune Wikidata search parameters [2] and [3]. The team will
ship the newly tuned wbsearchentities profile for en soon with de, fr,
es afterward.
* The team also had lots of discussions and exploration on how to
transform Wikidata autocomplete click logs into a useful dataset. They
are now transformed: Relevance Forge now has a utility for taking in
the Wikidata completion search logs and tuning the parameters of
search based on those logs. [4]
* David fixed a minor regression where search requests fail when
offset+limit is out of bounds (cirrussearch-backend-error) [5]
* Mathew discovered that the required metrics were exposed by the
Prometheus exporter but were not displaying, and fixed the issue with
help from David and Gehel [6]
* David reconfigured the ElasticSearch crosscluster on production
search servers to have persistent configs [7]
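
To illustrate the two kinds of garbled input mentioned in the TextCat
item above, here is a minimal Python sketch. It is not how TextCat
detects them (the update above describes models for detection); it only
shows what such input looks like and a naive way to reverse it. The
names EN_KEYS, RU_KEYS, fix_wrong_keyboard and fix_wrong_encoding are
hypothetical, and the key mapping assumes the standard US QWERTY and
Russian ЙЦУКЕН layouts, lowercase letters only.

```python
# Hypothetical helpers, for illustration only; not TextCat's method.

# Keys in the same physical positions on the two layouts (assumed:
# US QWERTY and Russian ЙЦУКЕН, lowercase only).
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)


def fix_wrong_keyboard(text):
    """Map Latin keystrokes to the Russian letters on the same keys."""
    return text.translate(EN_TO_RU)


def fix_wrong_encoding(text):
    """Undo UTF-8 bytes that were decoded as Windows-1251.

    Returns None when the text cannot be reinterpreted as UTF-8.
    """
    try:
        return text.encode("cp1251").decode("utf-8")
    except UnicodeError:
        return None


# Simulate the encoding mix-up: UTF-8 bytes read as Windows-1251.
garbled = "привет".encode("utf-8").decode("cp1251")
```

With these definitions, fix_wrong_keyboard("ghbdtn") gives "привет",
and fix_wrong_encoding(garbled) recovers the original word. Deciding
when such a rewrite should be suggested at all is the hard part, and
the sketch does not attempt it.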
=== WDQS ===
* Stas & Guillaume finished moving categories namespace into a
separate Blazegraph instance [8]
== Did you know? ==
English text, like many others, is written left-to-right (LTR), but
some languages—most notably Arabic, Hebrew, Persian, and Urdu, but
also many others [9]—are written right-to-left (RTL). To handle
different writing directions—especially in mixed LTR and RTL
texts—Unicode classifies characters as having "strong", "weak", or
"neutral" directionality. Strong characters definitely go in a
particular direction, like ABC or אבג. Weak characters, mostly numbers,
have a "vague" directionality that can change with context. Neutral
characters pick up their directionality from context, like punctuation
and whitespace characters used across scripts.
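
As a rough illustration of the three groups, Python's standard
unicodedata module reports each character's bidirectional class. The
grouping of classes into strong, weak, and neutral below follows the
Unicode bidirectional algorithm's categories as I understand them; the
function name bidi_strength is hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_strength(ch):
    """Group one character's bidi class into strong, weak, or neutral."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    # Explicit formatting characters (LRE, RLO, PDI, ...) or unassigned
    # code points, for which the module returns an empty string.
    return "other"
```

Here bidi_strength("A") and bidi_strength("א") are both "strong",
bidi_strength("7") is "weak", and bidi_strength(" ") is "neutral".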
Mirrored characters change the way they display based on context. For
example, "A>B>C" and "א>ב>ג" both use only the greater-than character
(>), but if you are reading this somewhere that follows the
Unicode bidirectional algorithm, the ones between Latin letters point
to the right and those between Hebrew letters point to the left.
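
The mirrored property can be checked the same way. This small sketch
uses unicodedata.mirrored from the Python standard library; the helper
name mirrored_chars is hypothetical.

```python
import unicodedata


def mirrored_chars(text):
    """Return the characters in text that Unicode marks as mirrored."""
    return [ch for ch in text if unicodedata.mirrored(ch)]
```

Both mirrored_chars("A>B>C") and mirrored_chars("א>ב>ג") return the
same two ">" characters: the stored text is identical, and only the
rendering differs with the surrounding direction.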
The algorithms are complicated [10], and when they don't work, there
are explicit characters that indicate things like "text should flow
left to right from here". The explicit formatting characters have the
most potential to cause trouble for search because they are usually
invisible, and you can pick one up without realizing it. For example,
when copying an Arabic word from a page in English, or a French word
from a page in Hebrew, the word that is "the other way around" from
the main text might have one of these marks at the beginning or end of
it. Fortunately, we can usually identify them and strip them out.
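
A minimal sketch of that stripping step in Python follows. It is an
illustration only, not the code used in production search; the names
BIDI_CONTROLS and strip_bidi_controls are hypothetical, and the
character list is my assumption of which explicit marks, embeddings,
overrides, and isolates to cover.

```python
# Explicit bidi formatting characters (assumed list): LRM, RLM, ALM,
# LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = (
    "\u200e\u200f\u061c"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)

# Mapping a code point to None makes str.translate delete it.
_STRIP_TABLE = dict.fromkeys(map(ord, BIDI_CONTROLS))


def strip_bidi_controls(text):
    """Remove invisible bidi formatting characters from a query string."""
    return text.translate(_STRIP_TABLE)
```

For example, strip_bidi_controls("\u200fעברית\u200e") returns the bare
word "עברית", which then matches the same text typed by hand.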
Finally, there are some scripts that have been written in other
interesting directions. Vertical text includes Chinese, Japanese, and
Korean [11], and Mongolian [12]. Hanunó'o [13] and Ogham [14] were
written bottom-to-top! My [Trey's] favorite "direction" is
"boustrophedon," [15] which means "like an ox ploughs" and alternates
left-to-right and right-to-left, and was used particularly in old
manuscripts and inscriptions in many writing systems. Why jump from one
side of the page to the other when you can just curve around where you
are or flip to mirrored letters and keep going?!
[0] https://phabricator.wikimedia.org/T213931
[1] https://phabricator.wikimedia.org/T138958
[2] https://phabricator.wikimedia.org/T193701
[3] https://phabricator.wikimedia.org/T213105
[4] https://phabricator.wikimedia.org/T205111
[5] https://phabricator.wikimedia.org/T213745
[6] https://phabricator.wikimedia.org/T210592
[7] https://phabricator.wikimedia.org/T213150
[8] https://phabricator.wikimedia.org/T213212
[9] https://en.wikipedia.org/wiki/Right-to-left#List_of_RTL_scripts
[10] https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
[11] https://en.wikipedia.org/wiki/Horizontal_and_vertical_writing_in_East_Asian…
[12] https://en.wikipedia.org/wiki/Mongolian_script
[13] https://en.wikipedia.org/wiki/Hanun%C3%B3%27o_alphabet
[14] https://en.wikipedia.org/wiki/Ogham
[15] https://en.wikipedia.org/wiki/Boustrophedon
----
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
This is the weekly update from the Search Platform team for the week
starting 2019-01-07.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* David discovered an issue with the click-through rate on one of the
Search dashboards for mobile apps [0] and enlisted Chelsy's help in
fixing it quickly (done!) [1]
* Mathew worked on increasing the number of shards for enwiki_general [2]
* David helped augment the list of known clusters using cluster
conf for Mjolnir [3]
* David updated the completion suggester: TP50 [Top percentile 50%]
was increased from 9ms to 24ms [4]
* The Search team worked on supporting searching multiple filetypes at
once, based on input from the Multimedia team [5]
* David and Mathew worked on allowing ElasticSearch machines to
communicate with each other on ports 9500 and 9700 [6]
* We found that most of the dashboards in Grafana are designed to have
a cluster per DC, and we needed to refactor them so that we can select
a specific cluster (by adding chi, psi and omega selectors) [7]
* The multi-instance support code added for ExternalIndex was designed
without the group+replica concepts in mind, so we fixed ExternalIndex
to support groups & replica topology [8]
* There was a recent spike of fatal timeouts from API search
suggestions (prefixsearch) that caused a number of user queries to
become stalled for 60 seconds and then receive a generic error page
without any results. We fixed this by merging a patch for language
detection to not be run when rewriting is not enabled [9]
=== WDQS ===
* We have added new keyboard shortcuts to the WDQS UI, for those systems
where Ctrl-Space is already taken: Ctrl-Alt-Space and Alt-Enter [10]
* Stas found an issue where the WDQS puppet/hiera configs were too
distributed. Mathew and Gehel worked on it with assistance from SRE
(thanks!) [11]
* Our database in WDQS seems to hit Blazegraph internal limits, which
requires some careful work of rearranging the data to stay away from
the limit. This work now has started [12]
* Stas fixed an issue where a large update could crash Updater [13]
* Stas fixed an issue where, due to database replication delay,
Updater could read an old version of the data from Wikidata [14]
* Stas fixed an issue where SERVICE SILENT construct was producing
errors despite standards saying it should not do that [15]
[0] http://discovery.wmflabs.org/metrics/#app_events
[1] https://phabricator.wikimedia.org/T211306
[2] https://phabricator.wikimedia.org/T212224
[3] https://phabricator.wikimedia.org/T211752
[4] https://phabricator.wikimedia.org/T212768
[5] https://phabricator.wikimedia.org/T212776
[6] https://phabricator.wikimedia.org/T212434
[7] https://phabricator.wikimedia.org/T211956
[8] https://phabricator.wikimedia.org/T212120
[9] https://phabricator.wikimedia.org/T212455
[10] https://phabricator.wikimedia.org/T203320
[11] https://phabricator.wikimedia.org/T210431
[12] https://phabricator.wikimedia.org/T213210
[13] https://phabricator.wikimedia.org/T210235
[14] https://phabricator.wikimedia.org/T210901
[15] https://phabricator.wikimedia.org/T196859
Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
Hello all!
We are having some issues with 2 of the Wikidata Query Service
servers. So far, the issue looks like data corruption, probably
related to an issue in Blazegraph itself (the database engine behind
Wikidata Query Service). The issue prevents updates to the data, but
reads are unaffected as far as we can tell.
The 2 affected servers are part of the internal WDQS cluster, so the
public wdqs endpoint [1] is not affected. Data is lagging on the
internal eqiad endpoint, so MediaWiki functionalities that use WDQS
are at the moment not seeing the latest updates to Wikidata.
We are reaching out to the Blazegraph team via GitHub [2] and via
private contacts that we have. We hope to identify the root cause of
the issue so that we can fix it for good, but this looks like a hard
problem. Failing that, we will reimport the full data set.
You can follow the upstream issue on GitHub [2] and on Phabricator on
our side [3].
Sorry for the inconvenience and thank you for your patience!
Have fun,
Guillaume
[1] https://query.wikidata.org/
[2] https://github.com/blazegraph/database/issues/114
[3] https://phabricator.wikimedia.org/T213134
--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET
The Search Platform Team
<https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
office hours the first Wednesday of each month—but since this month that
would have been Jan 2nd, we’ve delayed for a week. Come ask us anything
about Wikimedia search!
We’re particularly interested in:
* Opportunities for collaboration—internally or externally to the Wikimedia
Foundation
* Challenges you have with on-wiki search, in any of the languages we
support
But we're happy to talk about anything search-related. Feel free to add
your items to the Etherpad Agenda for the next meeting.
Details for our next meeting:
Date: Wednesday, January 9th, 2019
Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww
*N.B.:* Google Meet System Requirements
<https://support.google.com/meet/answer/7317473>
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hi,
The previously announced schedule for Search Platform Team office hours was
that these office hours would happen on the first Wednesday of each month.
My guess is that January 1st is the last day of WMF's end of year holidays,
but maybe WMF's holiday break extends further than the 1st. There has been
no announcement of an office hour January 2nd. Am I correct in guessing
that the office hour will occur on January 9th?
Pine
( https://meta.wikimedia.org/wiki/User:Pine )