Hi everyone!
I have a question concerning the relevance search on Wikipedia articles, and Robert West from EPFL pointed me to this mailing list as the best place to ask it. I have been checking the Elasticsearch query that the Wikipedia API performs when it runs a basic search on the articles. More precisely, I am referring to the following API call:
https://en.wikipedia.org/w/api.php?action=query&list=search&format=j...
The actual elasticsearch query is available with the cirrusDumpQuery parameter:
https://en.wikipedia.org/w/api.php?action=query&list=search&format=j...
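For reference, a request of this kind could be assembled as in the sketch below. The URLs above are truncated, so everything past "format=" is an assumption made for illustration: the srsearch parameter, the search terms, and passing cirrusDumpQuery with an empty value. The helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(terms, dump_query=False):
    """Build a search URL for the MediaWiki API (hypothetical helper).

    With dump_query=True the cirrusDumpQuery parameter is appended, which
    asks CirrusSearch to return the Elasticsearch query instead of results.
    """
    params = {
        "action": "query",
        "list": "search",
        "format": "json",
        # Assumed: the truncated URLs above do not show this parameter.
        "srsearch": terms,
    }
    if dump_query:
        # Assumed: the parameter acts as a flag, so an empty value is sent.
        params["cirrusDumpQuery"] = ""
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("architecture mathematics", dump_query=True))
```

The function only builds the URL; fetching it (for example with urllib.request.urlopen) is left out so the sketch runs without network access.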
There are many things going on in that query, but my question concerns the rescoring of the results that produces the final score, in particular the clause:
{ "sltr": { "model": "enwiki-20220421-20180215-query_explorer", "params": { "query_string": "architecture mathematics" } } }
I understand that the results are passed, together with the keywords, to a stored machine learning model named enwiki-20220421-20180215-query_explorer. As far as I understand, this is done using the LTR plugin for Elasticsearch (https://github.com/o19s/elasticsearch-learning-to-rank). My question is the following: is this model openly available anywhere? If so, could you point me to it? If not, do you know why it is used by Wikipedia yet not openly available?
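To make the shape of such a rescore concrete, here is a minimal sketch of how an sltr clause is typically wrapped in an Elasticsearch rescore block, following the pattern in the LTR plugin documentation. The first-pass query and the window size are placeholders chosen for illustration; they are not taken from the actual CirrusSearch query.

```python
import json


def sltr_rescore(model_name, query_string, window_size):
    """Return a rescore clause that re-ranks the top hits with a stored LTR model."""
    return {
        # Number of top first-pass hits per shard that get re-scored.
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "sltr": {
                    "model": model_name,
                    # Parameters substituted into the model's feature templates.
                    "params": {"query_string": query_string},
                }
            }
        },
    }


body = {
    # Placeholder first-pass query, assumed for illustration only.
    "query": {"match": {"title": "architecture mathematics"}},
    "rescore": sltr_rescore(
        "enwiki-20220421-20180215-query_explorer",
        "architecture mathematics",
        window_size=100,  # hypothetical value
    ),
}

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```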
I posted this as part of a question on stackoverflow some days ago. Please check https://stackoverflow.com/questions/72213203/elasticsearch-query-for-wikiped... for more context and some more related questions.
I thank you all in advance, have a nice day!
Aitor Pérez
Machine Learning Engineer
EPFL Graph - CEDE - EPFL
aitor.perez@epfl.ch
Hi!
These models have never been published, but not for any particular reason. I suppose no-one had ever asked about them. I copied the current models out of elasticsearch into https://people.wikimedia.org/~ebernhardson/cirrus_models.20220518/ if looking them over might help you. They are in the format the sltr plugin stores them, which seems useful as it includes both the feature definitions and the xgboost model in JSON.
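As a rough illustration of how a tree model in this kind of JSON could be read, the sketch below walks trees laid out like xgboost's own JSON model dump (nodeid, split, split_condition, yes, no, missing, children, leaf). It assumes the model definitions in the files use that layout, which should be checked against the files themselves. The tiny tree and the feature name title_match are invented, and the summed value is only a raw score with no base score or objective transform applied.

```python
def score_tree(node, features):
    """Walk one dumped tree and return the leaf value reached for the given features."""
    while "leaf" not in node:
        value = features.get(node["split"])
        if value is None:
            next_id = node["missing"]   # feature absent: follow the default branch
        elif value < node["split_condition"]:
            next_id = node["yes"]       # xgboost dumps: "yes" means value < threshold
        else:
            next_id = node["no"]
        node = next(c for c in node["children"] if c["nodeid"] == next_id)
    return node["leaf"]


def score_model(trees, features):
    """Sum the leaf values of all trees (raw score only)."""
    return sum(score_tree(tree, features) for tree in trees)


# Invented two-tree model, for illustration only.
example_trees = [
    {
        "nodeid": 0, "split": "title_match", "split_condition": 0.5,
        "yes": 1, "no": 2, "missing": 1,
        "children": [{"nodeid": 1, "leaf": -0.25}, {"nodeid": 2, "leaf": 0.5}],
    },
    {"nodeid": 0, "leaf": 0.125},
]

if __name__ == "__main__":
    print(score_model(example_trees, {"title_match": 1.0}))
```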
Erik B.
Hi!
Oh, that was very kind of you, thanks a lot. The format is indeed self-explanatory, so it should not be a problem. However, I was looking at the feature set and would like to confirm: is each of these fields computed at query time with the provided query_string for each of the top 448 results after the first rescore? (In any case, this is what the Elasticsearch LTR plugin docs suggest.)
If so, it would be useful for me to have the actual mapping of a Wikipedia page in Elasticsearch (the definitions of fields such as “title” or “opening_text” are more or less evident, but those of “all_near_match” or “file_text.plain” are not). Is that available anywhere?
Thank you very much again and have a nice day!
Aitor
On 18 May 2022, at 23:17, ebernhardson@wikimedia.org wrote:
Hi!
These models have never been published, but not for any particular reason. I suppose no-one had ever asked about them. I copied the current models out of elasticsearch into https://people.wikimedia.org/~ebernhardson/cirrus_models.20220518/ if looking them over might help you. They are in the format the sltr plugin stores them, which seems useful as it includes both the feature definitions and the xgboost model in JSON.
Erik B.
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org
To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org