New subject: Etiquette: Consumption of the Wikipedia API from a backend (full-text search)

22 Dec 2015

Hello,

I am writing to ask about the usage limits and etiquette I should comply
with when consuming the full-text search API *server side*.

I am building a site for visualization and knowledge discovery of
Wikipedias.

It will be a personally funded project (at least initially!) for public
use: investing more in indexing with Elasticsearch would be beyond my
means and also beyond the scope of my project, whose focus is on
visualization and discovery. I also think there is no need to reinvent
the wheel :)

I want to figure out the best setup, in terms of usability and request
rate, for using the full-text search API while complying with your policy.

Would you please take a minute to read below?

***

Currently my setup uses my own database: for full-text search I use
Elasticsearch at a very basic level.
I then use the Wikipedia API to decorate my data, *client-side (AJAX).*

Although slower than what I have now, the Wikipedia full-text search API
is much more useful for a user.
It returns results for complex queries that I cannot serve, because I am
indexing only article titles.

I would like to add full-text search against the Wikimedia API from the
server side.
I want to make sure that I comply with the Wikimedia Foundation's policy if
I make concurrent requests on behalf of users.

   - *Is there any limit on the number of requests I can make from a web
   domain?*

I would like to use the wikitool Python library.
The query I need to run will use a *search generator* over the article
namespace only:

action=query&*generator=search*&gsrnamespace=0&gsrsearch='my query'&gsrlimit=20
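
For reference, the request above could be built as in the following sketch.
It only constructs the URL and does not contact the API. The English
Wikipedia endpoint, the format=json parameter, and the helper names
build_search_params and build_search_url are my own assumptions for
illustration, not something taken from the API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia API; other wikis use their own host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_params(query, limit=20):
    """Return the parameters for a search-generator query over namespace 0."""
    return {
        "action": "query",
        "generator": "search",
        "gsrnamespace": 0,       # article namespace only
        "gsrsearch": query,
        "gsrlimit": limit,
        "format": "json",        # assumption: JSON output, not stated above
    }


def build_search_url(query, limit=20):
    """Return the full GET URL for the query (illustrative helper)."""
    return API_ENDPOINT + "?" + urlencode(build_search_params(query, limit))


if __name__ == "__main__":
    print(build_search_url("biology technology"))
```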

I tested it from my laptop and found it quite slow; for example, it took:

~1.2 s for querying 'DNA'

~1.6 s for 'terroristi attacks'

~1.7 s for 'biology technology'

and I am currently on a very fast Wi-Fi network.

   - *How would it be possible to improve performance?*
   - *Is it possible to apply for a desired rate of requests?*

I also read that it is good etiquette to specify contact details in the
request *headers*, in case you need to get in touch with the owner of the
domain. It is not clear to me how I should do this.

   - *Could you please show how to do it with an example in Python
   (here using the Flask framework)?*
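
To make the question concrete, here is my tentative understanding as a
sketch: the contact details would go into the User-Agent header of each
request my server sends to the API. The application name, URL, and e-mail
address below are placeholders, and make_api_request is only an
illustrative helper. It uses just the Python standard library, and in Flask
I would call it from inside a view function. Please correct me if this is
not what the policy asks for.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder contact details: to be replaced with the real site and address.
USER_AGENT = "MyWikiViz/0.1 (https://example.org/; contact@example.org)"


def make_api_request(params, endpoint="https://en.wikipedia.org/w/api.php"):
    """Build (but do not send) a GET request with a descriptive User-Agent."""
    url = endpoint + "?" + urlencode(params)
    return Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    req = make_api_request({"action": "query", "format": "json"})
    # urllib stores header names capitalized, hence "User-agent" here.
    print(req.get_header("User-agent"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```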

Thank you very much for your help,
Luigi