Chiming in as a member of the Wikimedia Foundation Research team
<https://research.wikimedia.org/> (which likely biases the examples I'm
aware of). I'd say the most common type of NLP that shows up in our
applications is tokenization / language analysis -- i.e., splitting
wikitext into words/sentences. As Trey said, this tokenization is
non-trivial for English and gets much harder in languages that have more
complex constructions or don't use spaces to delimit words
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects>.
These tokens often then become inputs to other types of models that aren't
necessarily NLP. There are also a number of more complex NLP technologies
that don't just identify words but try to identify similarities between
them, translate them, etc.
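To make the tokenization step a bit more concrete, here's a toy sketch in
Python (not any of our production code; the regex-based splitting below only
roughly works for space-delimited languages like English):

import re

def tokenize(text):
    # Toy word/sentence splitter for space-delimited languages. Real
    # pipelines (like the search analysis chains Trey describes below)
    # handle punctuation, abbreviations, and spaceless scripts much more
    # carefully; this is only meant to illustrate the idea.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[\w']+", s) for s in sentences]

print(tokenize("Tokenization is non-trivial. It gets harder in other languages!"))
# [['Tokenization', 'is', 'non', 'trivial'],
#  ['It', 'gets', 'harder', 'in', 'other', 'languages']]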
Some examples below. Additionally, I indicated whether each application is
rule-based (follows a series of deterministic heuristics) or ML (a learned,
probabilistic model), in case that's of interest:
- Copyedit
<https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task>:
identifying potential grammar/spelling issues in articles (rule-based). I
believe there are a number of volunteer-run bots on the wikis, as well as
the under-development tool I linked to, which is a collaboration between
the Wikimedia Foundation Research team <https://research.wikimedia.org/>
and Growth team <https://www.mediawiki.org/wiki/Growth> that builds on an
open-source tool
<https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool>.
- Link recommendation
<https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm>:
detecting links that could be added to Wikipedia articles. The NLP aspect
mainly involves accurately parsing wikitext into sentences/words
(rule-based) and comparing the similarity of the source article and pages
that are potential target links (ML). Also a collaboration between the
Research and Growth teams.
- Content similarity: various tools such as SuggestBot
<https://en.wikipedia.org/wiki/User:SuggestBot>, RelatedArticles
Extension <https://www.mediawiki.org/wiki/Extension:RelatedArticles>, or
GapFinder <https://www.mediawiki.org/wiki/GapFinder> use the morelike
functionality of the CirrusSearch backend
<https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting>
maintained by the Search team to find Wikipedia articles with similar
topics -- this is largely finding keyword overlap between content, with
clever pre-processing/weighting as described by Trey.
- Readability
<https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research>:
scoring content based on its readability. Under development by the
Research team.
- Topic classification: predicting which high-level topics are associated
with Wikipedia articles. The current model for English Wikipedia
<https://www.mediawiki.org/wiki/ORES#Topic_routing> uses word embeddings
from the article text to make predictions (ML), and a proposed model
<https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card>
from the Research team will use a similar approach but with article links
instead of text, to support more (all) language editions.
- Citation needed <https://meta.wikimedia.org/wiki/Citation_Detective>:
detecting sentences in need of citations (ML). Prototype developed by the
Research team.
- Edit Types
<https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types>:
summarizing how much text changed between two revisions of a Wikipedia
article -- e.g., how many words/sentences changed (rule-based). Prototype
developed by the Research team.
- Vandalism detection: a number of different approaches are in use on the
wikis. They generally have some form of a "bad word" list (usually a mix of
automatically and manually generated entries), extract words from new
edits, compare those words to the bad word list, and use the result to help
judge how likely the edit is to be vandalism (see the toy sketch after this
list). Examples include many filters in AbuseFilter
<https://www.mediawiki.org/wiki/Extension:AbuseFilter>, volunteer-led
efforts such as ClueBot NG
<https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers>
(English Wikipedia) and Salebot
<https://fr.wikipedia.org/wiki/Utilisateur:Salebot> (French Wikipedia), as
well as the Wikimedia Foundation ORES edit quality models
<https://www.mediawiki.org/wiki/ORES/BWDS_review> (many wikis).
- Sockpuppet detection
<https://www.mediawiki.org/wiki/User:Ladsgroup/masz>: finding editors
who have similar stylistic patterns in their comments (volunteer tool).
- Content Translation was mentioned -- there are numerous potential
translation models available
<https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients>,
of which some are rule-based and some are ML. The tool is maintained by the
Wikimedia Foundation Language team
<https://www.mediawiki.org/wiki/Wikimedia_Language_engineering> but depends
on several external APIs.
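Here's the toy sketch promised in the vandalism-detection item above. The
word list and scoring are invented for illustration; real tools like ClueBot
NG combine many more signals (in its case, a Bayesian classifier):

import re

# Invented for illustration; real lists are much longer and partly
# auto-generated from labeled edits.
BAD_WORDS = {"poop", "stupid", "hahaha"}

def vandalism_score(added_text):
    # Fraction of the added words that appear on the bad word list.
    words = re.findall(r"[\w']+", added_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in BAD_WORDS)
    return hits / len(words)

print(vandalism_score("John Smith is stupid hahaha"))  # 0.4 -> worth flagging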
I've also done some thinking that might be of interest about what a natural
language modeling strategy looks like for Wikimedia that balances
effectiveness of models with equity/sustainability of supporting so many
different language communities:
Hope that helps.
Best,
Isaac
On Wed, Jun 22, 2022, 10:43 Trey Jones <tjones(a)wikimedia.org> wrote:
Do you have examples of projects using NLP in
Wikimedia communities?
I do! Defining NLP is something of a moving target, and the most common
definition, which I learned when I worked in industry, is that "NLP" has
often been used as a buzzword that means "any language processing you do
that your competitors don't". Getting away from profit-driven buzzwords, I
have a pretty generous definition of NLP, as any software that improves
language-based interactions between people and computers.
Guillaume mentioned CirrusSearch in general, but there are lots of
specific parts within search. I work on a lot of NLP-type stuff for search,
and I write a lot of documentation on MediaWiki, so this is biased towards
stuff I have worked on or know about.
Language analysis is the general process of converting text (say, of
Wikipedia articles) into tokens (approximately "words" in English) to be
stored in the search index. There are lots of different levels of
complexity in the language analysis. We currently use Elasticsearch, and
they provide a lot of language-specific analysis tools (link to Elastic
language analyzers
<https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>),
which we customize and build on.
Here is part of the config for English, reordered to be chronological,
rather than alphabetical, and annotated:
"text": {
"type": "custom",
"char_filter": [
"word_break_helper", — break_up.words:with(uncommon)separators
"kana_map" — map Japanese Hiragana to Katakana (notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese
)
],
"tokenizer": "standard" — break text into tokens/words; not
trivial
for English, very hard for other languages (blog post
<https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/
)
"filter": [
"aggressive_splitting", —splitting of more likely *multi-part*
*ComplexTokens*
"homoglyph_norm", —correct typos/vandalization which mix Latin
and Cyrillic letters (notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>)
"possessive_english", —special processing for *English's*
possessive forms
"icu_normalizer", —normalization of text (blog post
<https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/
)
"stop", —removal of stop words (blog post
<https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>,
section "To be or not to be indexed")
"icu_folding", —more aggressive normalization
"remove_empty", —misc bookkeeping
"kstem", —stemming (blog post
<https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/
)
"custom_stem" —more stemming
],
},
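If you want to poke at an analysis chain like this yourself, Elasticsearch's
_analyze API runs text through an analyzer and returns the resulting tokens.
A minimal sketch, assuming a local Elasticsearch instance on the default port
and using the stock "english" analyzer rather than our customized chain:

import json
import requests

# Run a sentence through an analyzer and print the tokens it produces.
# Assumes Elasticsearch is listening on localhost:9200.
resp = requests.post(
    "http://localhost:9200/_analyze",
    headers={"Content-Type": "application/json"},
    data=json.dumps({
        "analyzer": "english",
        "text": "The English Wikipedia's articles aren't tokenized trivially.",
    }),
)
for token in resp.json()["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])

Against an index that defines the analyzer above, you would POST to
/<index>/_analyze with "analyzer": "text" to see the custom chain in action.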
Tokenization, normalization, and stemming can vary wildly between
languages. Some other elements (from Elasticsearch or custom-built by us):
- Stemmers and stop words for specific languages, including some
open-source ones that we ported, and some developed with community help.
- Elision processing (*l'homme* == *homme*)
- Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123) (a toy sketch of
these two appears after this list)
- Custom lowercasing—Greek, Irish, and Turkish have special processing
(notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization>)
- Normalization of written Khmer (blog post
<https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/>)
- Notes on lots more
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis>
...
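A toy sketch of two of the items above (digit normalization and elision),
using Python's unicodedata for the digit mapping. This is just the general
idea, not how the production token filters are implemented:

import re
import unicodedata

def normalize_digits(text):
    # Map any character in Unicode category Nd (decimal digit) to 0-9.
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def strip_elision(token):
    # l'homme -> homme, d'eau -> eau (French-style elision)
    return re.sub(r"^[ldjtnsmc]'", "", token, flags=re.IGNORECASE)

print(normalize_digits("١٢٣ / १२३ / 123"))  # 123 / 123 / 123
print(strip_elision("l'homme"))             # homme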
We also did some work improving "Did you mean" suggestions, which
currently uses both the built-in suggestions from Elasticsearch (not always
great, but there are lots of them) and new suggestions from a module we
called "Glent
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions>"
(much better, but not as many suggestions).
We have some custom language detection available on some Wikipedias, so
that if you don't get very many results and your query looks like it is in
another language, we show results from that other language. For example,
searching for Том Хэнкс on English Wikipedia
<https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1>
will show results from Russian Wikipedia. (too many notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.>)
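The detection linked above is based on TextCat, which compares character
n-gram profiles of the query against profiles built for each language. A
very stripped-down sketch of the idea (the "training" snippets below are
made up and tiny; real profiles come from much larger corpora and use ranked
n-gram lists rather than simple overlap):

from collections import Counter

def ngram_profile(text, n=3, top=300):
    # Most common character n-grams in the text.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(top)}

# Made-up, tiny "training" text; real profiles use much larger corpora.
PROFILES = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog tom hanks"),
    "ru": ngram_profile("съешь же ещё этих мягких французских булок том хэнкс"),
}

def guess_language(query):
    # Pick the language whose profile overlaps the query's n-grams the most.
    q = ngram_profile(query.lower())
    return max(PROFILES, key=lambda lang: len(q & PROFILES[lang]))

print(guess_language("Том Хэнкс"))  # ru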
Outside of our search work, there are lots more. Some that come to mind:
- Language Converter supports languages with multiple writing systems,
which is sometimes easy and sometimes really hard. (blog post
<https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/>)
- There's a Wikidata gadget on French Wikipedia and others that appends
results from Wikidata and generates descriptions in various languages based
on the Wikidata information. For example, searching for Molenstraat Vught
on French Wikipedia
<https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1>
gives no local results, but shows two "Results from Wikidata" / "Résultats
sur Wikidata" (if you are logged in you get results in your preferred
language, if possible, otherwise the language of the project):
   - Molenstraat ; hameau de la commune de Vught (in French, when I'm not
   logged in; roughly "hamlet in the municipality of Vught")
   - Molenstraat ; street in Vught, the Netherlands (fallback to English
   for some reason)
- The whole giant Content Translation project that uses machine
translation to assist translating articles across wikis. (blog post
<https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/>)
There's lots more out there, I'm sure—but I gotta run!
—Trey
Trey Jones
Staff Computational Linguist, Search Platform
Wikimedia Foundation
UTC–4 / EDT