Hi!
I have been trying to gauge the speed/efficiency of a database I have set up. To test it, I have filled it with a lot of Wikipedia articles from a specific category (for example, history). The database handles multi-word queries and returns the articles that best match the query. For example, if I search for "history in Italy in the past 100 years", the best-matching articles should come up.
I was wondering if anyone has advice on how to form sample test queries that model realistic situations. I don't think it would be fair to use random phrases (such as "banana the string"), so I want to model queries on my data to test both performance and correctness of output. Does anyone have any advice? Is this done at Wikipedia, and if so, how?
I have looked here (http://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-...) but the data has been down for a while.
Cheers,
CCing the Search and Discovery list.
On Sun, May 8, 2016 at 12:24 PM, Stan Zonov stanzon@gmail.com wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi,
Unfortunately, search queries may contain private, sensitive information, so they cannot be disclosed automatically[1]. However, you can have a look at some notes from Trey[2], and more specifically this one[3]. Measuring an IR system's efficiency is a tough task and requires more than access to query logs. We are currently building a set of tools to help us with offline evaluation of the system[4]. While it may be difficult for you to run them on your own system, they can give you a rough idea of how we are trying to address this problem.
One of these tools has not yet been announced and will probably address your needs.
[1] https://meta.wikimedia.org/wiki/Discovery/Data_access_guidelines
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Why_People_Use_Search...
[4] https://github.com/wikimedia/wikimedia-discovery-relevanceForge
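For a system without query logs, one common offline approach is "known-item" testing: derive each test query from a single article, then check how highly the search ranks that article. The Python sketch below is only an illustration of that general idea and is not how the tools in [4] work. The function names, the stopword list, and the `search` callable (assumed to take a query string and return a ranked list of titles) are all hypothetical.

```python
import random
import re

# Small illustrative stopword list (an assumption; extend it for real use).
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "was",
             "for", "on", "by", "with", "as", "at"}


def make_known_item_queries(articles, words_per_query=3, seed=0):
    """Build (query, expected_title) pairs from a dict of title -> text.

    Each query is a random sample of distinct non-stopword terms taken
    from one article, so that article is a reasonable "right answer".
    """
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    queries = []
    for title, text in articles.items():
        words = re.findall(r"[a-z]+", text.lower())
        vocab = sorted({w for w in words if w not in STOPWORDS and len(w) > 2})
        if len(vocab) < words_per_query:
            continue  # article too short to build a query from
        queries.append((" ".join(rng.sample(vocab, words_per_query)), title))
    return queries


def mean_reciprocal_rank(search, queries, k=10):
    """Average of 1/rank of the expected title within the top k results.

    A query whose expected title is missing from the top k contributes 0.
    """
    if not queries:
        return 0.0
    total = 0.0
    for query, expected in queries:
        results = list(search(query))[:k]
        if expected in results:
            total += 1.0 / (results.index(expected) + 1)
    return total / len(queries)
```

Timing the same `search` calls over this query set gives a speed measurement on inputs that resemble the corpus. Randomly sampled terms are still not real user queries, so treat the score as a rough sanity check only.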
On 09/05/2016 10:15, Tilman Bayer wrote: