Hi all,
Do you have examples of projects using NLP in Wikimedia communities?
Kind regards
Ilario Valdelli
Hi Ilario,
On Tue, 2022-06-21 at 13:18 +0200, Ilario Valdelli wrote:
Do you have examples of projects using NLP in Wikimedia communities.
https://meta.wikimedia.org/wiki/Abstract_Wikipedia comes to my mind.
Cheers, andre
--
Andre Klapper (he/him) | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/
To some extent, Search [1] is all about making sense of natural language. So depending on your definition of NLP, I think that qualifies.
[1] https://www.mediawiki.org/wiki/Extension:CirrusSearch
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Do you have examples of projects using NLP in Wikimedia communities.
I do! Defining NLP is something of a moving target, and the most common definition, which I learned when I worked in industry, is that "NLP" has often been used as a buzzword that means "any language processing you do that your competitors don't". Getting away from profit-driven buzzwords, I have a pretty generous definition of NLP, as any software that improves language-based interactions between people and computers.
Guillaume mentioned CirrusSearch in general, but there are lots of specific parts within search. I work on a lot of NLP-type stuff for search, and I write a lot of documentation on MediaWiki, so this is biased towards stuff I have worked on or know about.
Language analysis is the general process of converting text (say, of Wikipedia articles) into tokens (approximately "words" in English) to be stored in the search index. There are lots of different levels of complexity in the language analysis. We currently use Elasticsearch, and they provide a lot of language-specific analysis tools (link to Elastic language analyzers https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html), which we customize and build on.
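As a rough illustration of what an analysis chain does, here is a toy Python sketch that runs text through a character filter, a tokenizer, and a few token filters. Every function name, the stop-word list, and the plural-stripping rule are made up for the example; this is not CirrusSearch or Elasticsearch code, and the real components are far more careful.

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def word_break_helper(text: str) -> str:
    """Char filter: turn a few uncommon separators into spaces."""
    return re.sub(r"[_.:()]", " ", text)


def tokenize(text: str) -> list[str]:
    """Rough tokenizer: runs of letters/digits, optionally with one apostrophe part."""
    return re.findall(r"[^\W_]+(?:'[^\W_]+)?", text)


def strip_possessive(token: str) -> str:
    """Drop a trailing English possessive 's."""
    return token[:-2] if token.endswith("'s") else token


def naive_stem(token: str) -> str:
    """Crude plural stripping; a real stemmer handles many more cases."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text: str) -> list[str]:
    """Apply the stages in order: char filter, tokenizer, then token filters."""
    tokens = tokenize(word_break_helper(text))
    tokens = [strip_possessive(t) for t in tokens]
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("The dog's break_up.words"))
```

The point is only the shape of the pipeline: each stage takes the previous stage's output, so the order of the filters matters.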
Here is part of the config for English, reordered to be chronological rather than alphabetical, and annotated:

    "text": {
      "type": "custom",
      "char_filter": [
        "word_break_helper",   // break_up.words:with(uncommon)separators
        "kana_map"             // map Japanese Hiragana to Katakana (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese )
      ],
      "tokenizer": "standard", // break text into tokens/words; not trivial for English, very hard for other languages (blog post https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/ )
      "filter": [
        "aggressive_splitting", // splitting of more likely *multi-part* *ComplexTokens*
        "homoglyph_norm",       // correct typos/vandalization which mix Latin and Cyrillic letters (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs )
        "possessive_english",   // special processing for *English's* possessive forms
        "icu_normalizer",       // normalization of text (blog post https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/ )
        "stop",                 // removal of stop words (blog post https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/ , section "To be or not to be indexed")
        "icu_folding",          // more aggressive normalization
        "remove_empty",         // misc bookkeeping
        "kstem",                // stemming (blog post https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/ )
        "custom_stem"           // more stemming
      ],
    },
Tokenization, normalization, and stemming can vary wildly between languages. Some other elements (from Elasticsearch or custom-built by us):
- Stemmers and stop words for specific languages, including some open-source ones that we ported, and some developed with community help.
- Elision processing (*l'homme* == *homme*)
- Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123)
- Custom lowercasing: Greek, Irish, and Turkish have special processing (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization )
- Normalization of written Khmer (blog post https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/ )
- Notes on lots more https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis ...
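To make a few of the items above concrete, here is a small Python sketch of elision stripping, digit normalization, and Turkish-aware lowercasing. The function names and the list of elided prefixes are assumptions made for the example; the production filters live in Elasticsearch/ICU and our own plugins and handle many more cases.

```python
import re
import unicodedata

# Hypothetical, incomplete list of French elided prefixes.
ELISION = re.compile(r"^(?:l|d|j|qu|n|s|t|m|c)['’]", re.IGNORECASE)


def strip_elision(token: str) -> str:
    """Remove a leading elided article or pronoun, so l'homme matches homme."""
    return ELISION.sub("", token)


def normalize_digits(text: str) -> str:
    """Map any character with a Unicode digit value to its ASCII digit."""
    out = []
    for ch in text:
        value = unicodedata.digit(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless i rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


print(strip_elision("l'homme"))
print(normalize_digits("١٢٣ / १२३ / ①②③"))
print(turkish_lower("ISPARTA"), turkish_lower("İstanbul"))
```

Plain `str.lower()` would turn "I" into "i", which is wrong for Turkish, and that is why lowercasing needs language-specific handling.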
We also did some work improving "Did you mean" suggestions, which currently use both the built-in suggestions from Elasticsearch (not always great, but there are lots of them) and new suggestions from a module we called "Glent" (much better, but not as many suggestions; notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions ).
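For readers who have not seen how a "Did you mean" feature can work at all, here is a minimal Python sketch that suggests the closest known query by string similarity, using the standard library's `difflib`. This is not how Glent or the Elasticsearch suggester is implemented; the list of known queries and the cutoff value are invented for the example.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical list of queries that are known to return good results.
KNOWN_QUERIES = ["albert einstein", "alfred nobel", "marie curie"]


def did_you_mean(query: str, known=KNOWN_QUERIES) -> Optional[str]:
    """Return the closest known query, or None if nothing is close enough."""
    normalized = query.lower()
    matches = get_close_matches(normalized, known, n=1, cutoff=0.8)
    if matches and matches[0] != normalized:
        return matches[0]
    return None


print(did_you_mean("albert einstien"))
```

A high cutoff gives fewer but safer suggestions, which is the same quality-versus-coverage trade-off described above.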
We have some custom language detection available on some Wikipedias, so that if you don't get very many results and your query looks like it is in another language, we show results from that other language. For example, searching for Том Хэнкс on English Wikipedia https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1 will show results from Russian Wikipedia. (too many notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc. )
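As a very crude stand-in for language identification, the Python sketch below only guesses the dominant writing system of a query from Unicode character names. Real language ID has to tell apart languages that share a script, so treat this as an illustration of the idea of routing a query by what it looks like, not as our method. The function name is an assumption.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of the text."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # The first word of the Unicode name is usually the script,
            # for example "CYRILLIC" or "LATIN".
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"


print(dominant_script("Том Хэнкс"))
print(dominant_script("Tom Hanks"))
```

A Cyrillic query on English Wikipedia is a strong hint that results from another wiki might be more useful, but choosing between, say, Russian and Ukrainian needs an actual language model.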
Outside of our search work, there are lots more. Some that come to mind:
- Language Converter supports languages with multiple writing systems, which is sometimes easy and sometimes really hard. (blog post https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/ )
- There's a Wikidata gadget on French Wikipedia and others that appends results from Wikidata and generates descriptions in various languages based on the Wikidata information. For example, searching for Molenstraat Vught on French Wikipedia https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1 gives no local results, but shows two "Results from Wikidata" / "Résultats sur Wikidata" (if you are logged in you get results in your preferred language, if possible, otherwise the language of the project):
  - Molenstraat ; hameau de la commune de Vught (in French, when I'm not logged in)
  - Molenstraat ; street in Vught, the Netherlands (fallback to English for some reason)
- The whole giant Content Translation project that uses machine translation to assist translating articles across wikis. (blog post https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/ )
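The language fallback in the Wikidata example can be sketched in a few lines of Python. The data layout, the function name, and the fallback order (user language, then project language, then English) are assumptions based only on the behavior described above, not on the gadget's actual code.

```python
from typing import Optional


def pick_description(
    descriptions: dict,
    user_lang: Optional[str],
    project_lang: str,
    fallback: str = "en",
) -> Optional[str]:
    """Return the first available description in order of preference."""
    for lang in (user_lang, project_lang, fallback):
        if lang and lang in descriptions:
            return descriptions[lang]
    return None


# Hypothetical items: one has a French description, the other only English.
hamlet = {"fr": "hameau de la commune de Vught", "en": "hamlet in Vught"}
street = {"en": "street in Vught, the Netherlands"}

# Not logged in on French Wikipedia: no user language, project language "fr".
print(pick_description(hamlet, None, "fr"))
print(pick_description(street, None, "fr"))
```

Under these assumptions the second item has no French description, so the lookup falls through to English, which matches the mixed-language results in the example.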
There's lots more out there, I'm sure—but I gotta run! —Trey
Trey Jones
Staff Computational Linguist, Search Platform
Wikimedia Foundation
UTC–4 / EDT
Chiming in as a member of the Wikimedia Foundation Research team https://research.wikimedia.org/ (so you'll see that likely biases the examples I'm aware of). I'd say that the most common type of NLP that shows up in our applications is tokenization / language analysis -- i.e., splitting wikitext into words/sentences. As Trey said, this tokenization is non-trivial for English and gets much harder in other languages that have more complex constructions or don't use spaces to delimit words (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects ). These tokens often then become inputs into other types of models that aren't necessarily NLP. There are a number of more complex NLP technologies too that don't just identify words but try to identify similarities between them, translate them, etc.
Some examples below. Additionally, I indicated whether each application is rule-based (follows a series of deterministic heuristics) or ML (a learned, probabilistic model) in case that's of interest:
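A tiny Python illustration of why spaces are not enough: splitting on whitespace gives usable tokens for an English sentence but returns a Japanese sentence as a single token. The function name and sample sentences are made up for the example.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only, the simplest possible tokenizer."""
    return text.split()


english = "I live in Tokyo"
japanese = "東京都に住んでいます"  # contains no spaces between words

print(len(whitespace_tokens(english)))
print(len(whitespace_tokens(japanese)))
```

Handling the second case needs a dictionary-based or learned segmenter, which is where per-language work comes in.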
- Copyedit (https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task): identifying potential grammar/spelling issues in articles (rule-based). I believe there are a number of volunteer-run bots on the wikis as well as the under-development tool I linked to, which is a collaboration between the Wikimedia Foundation Research team https://research.wikimedia.org/ and Growth team https://www.mediawiki.org/wiki/Growth that builds on an open-source tool https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool .
- Link recommendation (https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm): detecting links that could be added to Wikipedia articles. The NLP aspect mainly involves accurately parsing wikitext into sentences/words (rule-based) and comparing the similarity of the source article and pages that are potential target links (ML). Also a collaboration between the Research team and Growth team.
- Content similarity: various tools such as SuggestBot https://en.wikipedia.org/wiki/User:SuggestBot, RelatedArticles Extension https://www.mediawiki.org/wiki/Extension:RelatedArticles, or GapFinder https://www.mediawiki.org/wiki/GapFinder use the morelike functionality of the CirrusSearch https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting backend maintained by the Search team to find Wikipedia articles with similar topics -- this is largely finding keyword overlap between content with clever pre-processing/weighting as described by Trey.
- Readability (https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research): scoring content based on its readability. Under development by the Research team.
- Topic classification: predicting what high-level topics are associated with Wikipedia articles. The current model for English Wikipedia https://www.mediawiki.org/wiki/ORES#Topic_routing uses word embeddings from the article to make predictions (ML), and a proposed model https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card from the Research team will use NLP models but with article links instead to support more (all) language editions.
- Citation needed (https://meta.wikimedia.org/wiki/Citation_Detective): detecting sentences in need of citations (ML). Prototype developed by the Research team.
- Edit Types (https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types): summarizing how much text changed between two revisions of a Wikipedia article -- e.g., how many words/sentences changed (rule-based). Prototype developed by the Research team.
- Vandalism detection: a number of different approaches in use on the wikis generally have some form of a "bad word" list (generally a mix of auto/manually-generated), extract words from new edits and compare these words to the bad word list, and then use this to help judge how likely the edit is to be vandalism. Examples include many filters in AbuseFilter https://www.mediawiki.org/wiki/Extension:AbuseFilter, volunteer-led efforts such as ClueBot NG https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers (English Wikipedia) and Salebot https://fr.wikipedia.org/wiki/Utilisateur:Salebot (French Wikipedia), as well as the Wikimedia Foundation ORES edit quality models https://www.mediawiki.org/wiki/ORES/BWDS_review (many wikis).
- Sockpuppet detection (https://www.mediawiki.org/wiki/User:Ladsgroup/masz): finding editors who have similar stylistic patterns in their comments (volunteer tool).
- Content Translation was mentioned -- there are numerous potential translation models available https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients, of which some are rule-based and some are ML. The tool is maintained by the Wikimedia Foundation Language team https://www.mediawiki.org/wiki/Wikimedia_Language_engineering but depends on several external APIs.
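The bad-word approach from the vandalism bullet can be sketched roughly as follows in Python: find the words an edit added, then measure what share of them appear on a bad-word list. The word list, function names, and the use of a simple share as the signal are all assumptions for illustration; the real tools combine many more features.

```python
import re
from collections import Counter

# Hypothetical bad-word list; real lists are much longer and per-language.
BAD_WORDS = {"stupid", "dumb"}


def words(text: str) -> list[str]:
    """Lowercased word tokens, using a simple regex tokenizer."""
    return re.findall(r"\w+", text.lower())


def added_words(old: str, new: str) -> Counter:
    """Words that occur more often in the new revision than in the old one."""
    return Counter(words(new)) - Counter(words(old))


def bad_word_share(old: str, new: str, bad_words=BAD_WORDS) -> float:
    """Fraction of added words that are on the bad-word list."""
    added = added_words(old, new)
    total = sum(added.values())
    if total == 0:
        return 0.0
    hits = sum(count for word, count in added.items() if word in bad_words)
    return hits / total


old = "Paris is the capital of France."
print(bad_word_share(old, "Paris is the stupid capital of France."))
print(bad_word_share(old, "Paris is the capital and largest city of France."))
```

The same `added_words` step is also a very reduced version of the word-level change counting mentioned under Edit Types.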
I've also done some thinking that might be of interest about what a natural language modeling strategy looks like for Wikimedia that balances effectiveness of models with equity/sustainability of supporting so many different language communities: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling
Hope that helps.
Best, Isaac
Hello Ilario,
You might find this blog post I wrote a while back interesting https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d11...
In it you can find a brief (and definitely not comprehensive) review of NLP with Wiki* along with links to an open source Kaggle dataset I built connecting the plain text of Wikipedia, the anchor links between pages, and the links to Wikidata. There are a few notebooks that demonstrate its use ... my favorites are probably:
- Pointwise Mutual Information embeddings https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors
- Analyzing the "subclass of" graph from Wikidata https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner
- Explicit topic modeling https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models
and if you are still looking for more after that, this is the query that gives more every time you use it :)
https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&ter...
best, -G
There's also Wikispeech (https://meta.wikimedia.org/wiki/Wikispeech), a TTS (text-to-speech) tool that we (Wikimedia Sverige) are developing. It's currently on the back burner, but hopefully we will have more resources for development soon.
*Sebastian Berlin* Utvecklare/*Developer* Wikimedia Sverige (WMSE)
E-post/*E-Mail*: sebastian.berlin@wikimedia.se Telefon/*Phone*: (+46) 0707 - 92 03 84
On Fri, 24 Jun 2022 at 03:06, Gabriel Altay gabriel.altay@gmail.com wrote:
Hello Ilario,
You might find this blog post I wrote a while back interesting
https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d11...
In it you can find a brief (and definitely not comprehensive) review of NLP with Wiki* along with links to an open source Kaggle dataset I built connecting the plain text of wikipedia, the anchor links between pages, and the links to wikidata. There are a few notebooks that demonstrate its use ... my favorite are probably,
- Pointwise Mutual Information embeddings
https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors
- Analyzing the "subclass of" graph from Wikidata
https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner
- Explicit topic modeling
https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models
and if you are still looking for more after that, this is the query that gives more every time you use it :)
https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&ter...
best, -G
On Thu, Jun 23, 2022 at 4:17 PM Isaac Johnson isaac@wikimedia.org wrote:
Chiming in as a member of the Wikimedia Foundation Research team https://research.wikimedia.org/ (which likely biases the examples I'm aware of). I'd say that the most common type of NLP that shows up in our applications is tokenization / language analysis -- i.e., splitting wikitext into words/sentences. As Trey said, this tokenization is non-trivial for English and gets much harder in other languages that have more complex constructions or don't use spaces to delimit words https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects. These tokens often then become inputs into other types of models that aren't necessarily NLP. There are also a number of more complex NLP technologies that don't just identify words but try to identify similarities between them, translate them, etc.
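To make "split text into words/sentences" concrete, here is a minimal sketch for space-delimited text using only the Python standard library. The regular expressions and function names are illustrative assumptions, not the tokenizer of any Wikimedia tool, and they would not work for the spaceless writing systems linked above.

```python
import re
from typing import List

# Naive sentence boundary: ., ! or ? followed by whitespace.
# Assumption for illustration; abbreviations such as "e.g. this" break it.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# \w+ matches runs of letters, digits and underscores.
WORD = re.compile(r"\w+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on terminal punctuation plus whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def split_words(sentence: str) -> List[str]:
    """Return lowercased word tokens, dropping punctuation."""
    return [w.lower() for w in WORD.findall(sentence)]


if __name__ == "__main__":
    text = "Wikipedia is a free encyclopedia. Anyone can edit it!"
    for sentence in split_sentences(text):
        print(split_words(sentence))
```

Even this toy version has to pick a policy for punctuation, case and sentence boundaries, which is part of why tokenization is described above as non-trivial.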
Some examples are below. I've also indicated whether each application is rule-based (follows a series of deterministic heuristics) or ML (a learned, probabilistic model), in case that's of interest:
- Copyedit
https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task: identifying potential grammar/spelling issues in articles (rule-based). I believe there are a number of volunteer-run bots on the wikis as well as the under-development tool I linked to, which is a collaboration between the Wikimedia Foundation Research team https://research.wikimedia.org/ and Growth team https://www.mediawiki.org/wiki/Growth that builds on an open-source tool https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool .
- Link recommendation
https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm: detecting links that could be added to Wikipedia articles. The NLP aspect mainly involves accurately parsing wikitext into sentences/words (rule-based) and comparing the similarity of the source article and pages that are potential target links (ML). Also a collaboration between the Research team and Growth team.
- Content similarity: various tools such as SuggestBot
https://en.wikipedia.org/wiki/User:SuggestBot, RelatedArticles Extension https://www.mediawiki.org/wiki/Extension:RelatedArticles, or GapFinder https://www.mediawiki.org/wiki/GapFinder use the morelike functionality of CirrusSearch https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting backend maintained by the Search team to find Wikipedia articles with similar topics -- this is largely finding keyword overlap between content with clever pre-processing/weighting as described by Trey.
- Readability
https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research: score content based on its readability. Under development by Research team.
- Topic classification: predict what high-level topics are associated
with Wikipedia articles. The current model for English Wikipedia https://www.mediawiki.org/wiki/ORES#Topic_routing uses word embeddings from the article to make predictions (ML) and a proposed model https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card from the Research team will use NLP models but with article links instead to support more (all) language editions.
- Citation needed https://meta.wikimedia.org/wiki/Citation_Detective:
detecting sentences in need of citations (ML). Prototype developed by Research team.
- Edit Types
https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types: summarizing how much text changed between two revisions of a Wikipedia article -- e.g., how many words/sentences changed (rule-based). Prototype developed by Research team.
- Vandalism detection: a number of different approaches in use on the
wikis generally have some form of a "bad word" list (generally a mix of auto/manually-generated), extract words from new edits and compare these words to the bad word list, and then use this to help judge how likely the edit is to be vandalism. Examples include many filters in AbuseFilter https://www.mediawiki.org/wiki/Extension:AbuseFilter, volunteer-led efforts such as ClueBot NG https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers (English Wikipedia) and Salebot https://fr.wikipedia.org/wiki/Utilisateur:Salebot (French Wikipedia) as well as the Wikimedia Foundation ORES edit quality models https://www.mediawiki.org/wiki/ORES/BWDS_review (many wikis).
- Sockpuppet detection
https://www.mediawiki.org/wiki/User:Ladsgroup/masz: finding editors who have similar stylistic patterns in their comments (volunteer tool).
- Content Translation was mentioned -- there are numerous potential
translation models available https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients, of which some are rule-based and some are ML. Tool maintained by Wikimedia Foundation Language team https://www.mediawiki.org/wiki/Wikimedia_Language_engineering but depends on several external APIs.
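The "bad word" list approach from the vandalism detection item above can be sketched roughly as follows. This is a toy under stated assumptions: the word list, the function names and the ratio used as a signal are all hypothetical, and none of it reproduces AbuseFilter, ClueBot NG, Salebot or ORES.

```python
import re

# Hypothetical, tiny bad-word list; real lists are per-language and far larger.
BAD_WORDS = {"stupid", "idiot", "poop"}
WORD = re.compile(r"\w+")


def added_words(old_text, new_text):
    """Words present in the new revision but absent from the old one."""
    old = {w.lower() for w in WORD.findall(old_text)}
    return [w.lower() for w in WORD.findall(new_text) if w.lower() not in old]


def bad_word_ratio(old_text, new_text):
    """Fraction of newly added words that appear on the bad-word list."""
    added = added_words(old_text, new_text)
    if not added:
        return 0.0
    return sum(w in BAD_WORDS for w in added) / len(added)


if __name__ == "__main__":
    old = "The cat is a small mammal."
    print(bad_word_ratio(old, "The cat is a stupid small mammal."))
    print(bad_word_ratio(old, "The cat is a small domesticated mammal."))
```

A ratio like this would only be one input among several; as described above, the tools use it to help judge how likely an edit is to be vandalism, not to decide on its own.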
I've also done some thinking that might be of interest about what a natural language modeling strategy looks like for Wikimedia that balances effectiveness of models with equity/sustainability of supporting so many different language communities: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling
Hope that helps.
Best, Isaac
On Wed, Jun 22, 2022, 10:43 Trey Jones tjones@wikimedia.org wrote:
Do you have examples of projects using NLP in Wikimedia communities.
I do! Defining NLP is something of a moving target, and the most common definition, which I learned when I worked in industry, is that "NLP" has often been used as a buzzword that means "any language processing you do that your competitors don't". Getting away from profit-driven buzzwords, I have a pretty generous definition of NLP, as any software that improves language-based interactions between people and computers.
Guillaume mentioned CirrusSearch in general, but there are lots of specific parts within search. I work on a lot of NLP-type stuff for search, and I write a lot of documentation on MediaWiki, so this is biased towards stuff I have worked on or know about.
Language analysis is the general process of converting text (say, of Wikipedia articles) into tokens (approximately "words" in English) to be stored in the search index. There are lots of different levels of complexity in the language analysis. We currently use Elasticsearch, and they provide a lot of language-specific analysis tools (link to Elastic language analyzers https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html), which we customize and build on.
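As a rough picture of "text in, tokens out", here is a toy pure-Python chain in the usual order of character filter, tokenizer, then token filters. Every function is a simplified stand-in and an assumption made for illustration: the stop list is tiny, normalization and folding are merged into one step, and the stemmer is a crude plural stripper rather than kstem. None of it reproduces what Elasticsearch or CirrusSearch actually do.

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}  # tiny illustrative list


def char_filter(text):
    """Character filter stand-in: turn some uncommon separators into spaces."""
    return re.sub(r"[_.:()]", " ", text)


def tokenize(text):
    """Tokenizer stand-in: runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text)


def strip_possessive(token):
    """English possessive stand-in: drop a trailing 's."""
    return token[:-2] if token.endswith("'s") else token


def fold(token):
    """Normalization stand-in: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Very crude stemmer stand-in: strip a plural "s" from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run the whole chain and return the tokens that would be indexed."""
    tokens = tokenize(char_filter(text))
    tokens = [fold(strip_possessive(t)) for t in tokens]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


if __name__ == "__main__":
    print(analyze("The café's menus"))
```

The order matters: the possessive is stripped before folding, and stop words are removed before stemming so that the stemmer never sees them.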
Here is part of the config for English, reordered to be chronological, rather than alphabetical, and annotated:

"text": {
    "type": "custom",
    "char_filter": [
        "word_break_helper",  // break_up.words:with(uncommon)separators
        "kana_map"            // map Japanese Hiragana to Katakana (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese )
    ],
    "tokenizer": "standard",  // break text into tokens/words; not trivial for English, very hard for other languages (blog post https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/ )
    "filter": [
        "aggressive_splitting",  // splitting of more likely *multi-part* *ComplexTokens*
        "homoglyph_norm",        // correct typos/vandalization which mix Latin and Cyrillic letters (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs )
        "possessive_english",    // special processing for *English's* possessive forms
        "icu_normalizer",        // normalization of text (blog post https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/ )
        "stop",                  // removal of stop words (blog post https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/, section "To be or not to be indexed")
        "icu_folding",           // more aggressive normalization
        "remove_empty",          // misc bookkeeping
        "kstem",                 // stemming (blog post https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/ )
        "custom_stem"            // more stemming
    ],
},
Tokenization, normalization, and stemming can vary wildly between languages. Some other elements (from Elasticsearch or custom-built by us):
- Stemmers and stop words for specific languages, including some
open-source ones that we ported, and some developed with community help.
- Elision processing (*l'homme* == *homme*)
- Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123)
- Custom lowercasing—Greek, Irish, and Turkish have special
processing (notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization )
- Normalization of written Khmer (blog post
https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/ )
- Notes on lots more
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis ...
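Three of the items in the list above (elision, digit normalization and language-specific lowercasing) are small enough to sketch with the Python standard library. These are illustrative approximations only: the elision prefix list is an incomplete assumption, only the Turkish case of custom lowercasing is shown, and the function names are hypothetical rather than anything from Elasticsearch or CirrusSearch.

```python
import re
import unicodedata

# French elision prefixes; an illustrative subset, not a complete list.
ELISION = re.compile(r"^(?:l|d|j|m|n|s|t|c|qu)['’]", re.IGNORECASE)


def strip_elision(token):
    """Drop an elided article or pronoun, so that l'homme matches homme."""
    return ELISION.sub("", token)


def normalize_digits(text):
    """Map every character that has a Unicode digit value to ASCII 0-9."""
    out = []
    for ch in text:
        value = unicodedata.digit(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def lowercase_turkish(text):
    """Turkish pairs dotted İ/i and dotless I/ı, unlike default lowercasing."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    print(strip_elision("l'homme"))
    print(normalize_digits("١٢٣ १२३ ①②③ 123"))
    print(lowercase_turkish("ISPARTA"), lowercase_turkish("İstanbul"))
```

The Turkish function replaces the two capital letters before calling `lower()` because the default lowercasing of İ would leave a combining dot behind and the default lowercasing of I would produce a dotted i.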
We also did some work improving "Did you mean" suggestions, which currently use both the built-in suggestions from Elasticsearch (not always great, but there are lots of them) and new suggestions from a module we called "Glent" https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions (much better, but not as many suggestions).
We have some custom language detection available on some Wikipedias, so that if you don't get very many results and your query looks like it is in another language, we show results from that other language. For example, searching for Том Хэнкс on English Wikipedia https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1 will show results from Russian Wikipedia. (too many notes https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc. )
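A much cruder signal than real language identification is the writing system of the query, and it is enough to show why the Том Хэнкс example stands out on English Wikipedia. The sketch below only guesses the dominant Unicode script from character names. It is a swapped-in simplification, not the TextCat-based detection described in the linked notes, and the function names and the "LATIN" default are assumptions.

```python
import unicodedata
from collections import Counter


def dominant_script(query):
    """Return the most common script name among the letters of the query."""
    scripts = Counter()
    for ch in query:
        if ch.isalpha():
            # Unicode character names start with the script, e.g.
            # "CYRILLIC CAPITAL LETTER TE" or "LATIN SMALL LETTER A".
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


def looks_foreign(query, wiki_script="LATIN"):
    """True if the query is mostly written in a script the wiki does not use."""
    script = dominant_script(query)
    return script is not None and script != wiki_script


if __name__ == "__main__":
    print(dominant_script("Том Хэнкс"), looks_foreign("Том Хэнкс"))
    print(dominant_script("Tom Hanks"), looks_foreign("Tom Hanks"))
```

Script alone cannot tell Russian from Ukrainian, or French from English, which is why an actual language identifier is needed for the feature described above.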
Outside of our search work, there are lots more. Some that come to mind:
- Language Converter supports languages with multiple writing
systems, which is sometimes easy and sometimes really hard. (blog post https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/ )
- There's a Wikidata gadget on French Wikipedia and others that
appends results from Wikidata and generates descriptions in various languages based on the Wikidata information. For example, searching for Molenstraat Vught on French Wikipedia https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1 gives no local results, but shows two "Results from Wikidata" / "Résultats sur Wikidata" (if you are logged in you get results in your preferred language, if possible, otherwise the language of the project):
   - Molenstraat ; hameau de la commune de Vught (in French, when I'm not logged in)
   - Molenstraat ; street in Vught, the Netherlands (fallback to English for some reason)
- The whole giant Content Translation project that uses machine translation to assist translating articles across wikis. (blog post https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/ )
There's lots more out there, I'm sure—but I gotta run! —Trey
Trey Jones Staff Computational Linguist, Search Platform Wikimedia Foundation UTC–4 / EDT