Hello,
I am writing to ask about usage limits and the etiquette to comply with when consuming the full-text search API *server-side*.
I am building a site for visualization and knowledge discovery across the Wikipedias.
It will be a personally funded project (at least initially!), for public use: investing more in my own Elasticsearch indexing would be beyond my means and also beyond the scope of my project - the focus is on visualization and discovery. And I also think there is no need to reinvent the wheel :)
I want to figure out the best setup, in terms of usability and request rate, for the full-text search API while complying with your policy.
Would you please take a minute to read below?
***
Currently my setup makes use of my own DB: for full-text search I use Elasticsearch at a very basic level. I then use the Wikipedia API to decorate my data, *client-side (AJAX)*.
Despite being slower than what I have now, the Wikipedia full-text API is much more useful for a user: it offers results for complex queries that I cannot provide, since I am indexing only articles' titles.
I would like to include full-text search against the Wikimedia API from the server side. I want to ensure that I can meet the Wikimedia Foundation's policy if I make concurrent requests on behalf of users.
- *Are there any limits to the number of requests I can make from a web domain?*
I would like to use the wikitools Python library. The query I need to run will use a *search generator* over the article namespace only:
action=query&generator=search&gsrnamespace=0&gsrsearch='my query'&gsrlimit=20
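For reference, here is a minimal sketch of how I run this query (written with the Python requests library here instead of wikitools, and with format=json added so the response is machine-readable):

import requests

# Search-generator query against the article namespace (ns 0).
API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "generator": "search",
    "gsrnamespace": 0,          # articles only
    "gsrsearch": "my query",
    "gsrlimit": 20,
    "format": "json",
}

response = requests.get(API_URL, params=params)
response.raise_for_status()
pages = response.json().get("query", {}).get("pages", {})
for page_id, page in pages.items():
    print(page_id, page["title"])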
I tested it from my laptop, and I found it quite slow; as an example, it took:
~1.2 s for querying 'DNA'
~1.6 s for 'terrorist attacks'
~1.7 s for 'biology technology'
and I am currently on a very fast wifi network.
- *How would it be possible to improve performance?*
- *Is it possible to apply for a desired rate of requests?*
I also read that it would be good etiquette practice to specify contact information in the request *headers*, in case you need to communicate with the domain. It is not clear to me what exactly I should do.
- *Could you please indicate how to do it, with an example in Python (here using the Flask framework)?*
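From what I understand so far, the idea is something like the minimal sketch below (the app name, URL, and email in the User-Agent are placeholders) - please correct me if I got it wrong:

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
API_URL = "https://en.wikipedia.org/w/api.php"

# Per API:Etiquette, identify the client and give contact details in the
# User-Agent header (name, URL, and email below are placeholders).
HEADERS = {
    "User-Agent": "MyWikiVizBot/0.1 (http://example.org/about; admin@example.org)"
}

@app.route("/search")
def search():
    params = {
        "action": "query",
        "generator": "search",
        "gsrnamespace": 0,
        "gsrsearch": request.args.get("q", ""),
        "gsrlimit": 20,
        "format": "json",
    }
    r = requests.get(API_URL, params=params, headers=HEADERS)
    return jsonify(r.json())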
Thank you very much for your help,
Luigi
On 22/12/2015 18:28, Luigi Assom wrote:
I tested it from my laptop, and I found it quite slow; as an example, it took:
~1.2 s for querying 'DNA'
~1.6 s for 'terrorist attacks'
~1.7 s for 'biology technology'
For a single-word query on English Wikipedia this is more like 400 ms for me, so I'm not sure I understand why you experienced such response times. Response times may vary depending on server load, but I'm surprised you noticed more than 1 s for simple queries like that. Did you check that you are receiving the result type/format you expect (i.e. format=json)? Could you re-check at different times of the day? Servers may be busy around 8pm CET (the time when both Europe and America are active).
Your project seems to be very interesting, could you elaborate a bit more? Do you plan to use the API from a backend/automaton which will need to send a lot of queries? Do you have an estimate of your needs (number of queries and refresh rate)? If your process is something like refreshing a set of queries regularly, I'd suggest you build a daemon that sends a few queries (3 or 4) per minute rather than an aggressive batch of parallel processes run once a day/week/month. You should have a look at RCStream[1], which may be more appropriate to your needs (if you plan to track changes, it's definitely better than refreshing the same set of queries regularly).
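For example, such a daemon could look like the minimal sketch below (the query list and rate are placeholders, to be adapted to your needs):

import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
QUERIES = ["DNA", "terrorist attacks", "biology technology"]  # placeholder set

def run_daemon(queries, per_minute=3):
    # A polite polling daemon: a few queries per minute instead of an
    # aggressive parallel batch. A Session reuses the TCP connection.
    session = requests.Session()
    while True:
        for q in queries:
            params = {
                "action": "query",
                "generator": "search",
                "gsrnamespace": 0,
                "gsrsearch": q,
                "gsrlimit": 20,
                "format": "json",
            }
            resp = session.get(API_URL, params=params)
            # ... process resp.json() here ...
            time.sleep(60.0 / per_minute)  # throttle to ~3 queries/minute

run_daemon(QUERIES)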
Thank you!

[1] https://wikitech.wikimedia.org/wiki/RCStream
Hello David!
Your project seems to be very interesting, could you elaborate a bit more?
Thank you so much! I will definitely be happy to elaborate more on it via a Skype call: I could share my screen and show what I'm boiling in the pot :D
Back to your reply now:
Yes, I was mainly testing during hours when both Europe and the USA are connected. However, I am experiencing this type of delay from my laptop; maybe it will speed up once deployed, in case my home network is the culprit?
I am concerned because I need to first fetch results from Wikipedia, then combine them with my own data (that part is fast enough, <200 ms), and then push the result to the client. That is why I will put it server-side and not client-side.
I need the search generator only as a *first entry point*: imagine you need to search for a topic, but you don't know exactly which one. Imagine an input form: you type in some keywords, select one of the results, and then you start your session.
I cannot estimate exactly the number of full-text search queries I will need; let's say each user will need a search-generator query only once per session.
Maybe 30 concurrent users per second would be a good reference (it's the same number Parse by Facebook provides, and Firebase allows up to 100... so maybe I could rely on a similar order of magnitude...).
If I can provide people with a smooth user experience on search, that will be interesting because I could free up resources: I might extend a test of knowledge discovery to other languages, too. If the first user experience were too slow (~1.3 s + bandwidth transmission, ~1.5 s+ per query), that could become critical.
I don't need the search generator to operate in batch, or to track changes. It just serves the user to find a topic as an entry point for discovery. I cannot use 'opensearch' because it does not provide _IDs; also, it searches against titles only.
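(For reference, the opensearch request I mean is something like action=opensearch&search='my query'&limit=10 - it returns plain title suggestions rather than the page objects with IDs that the search generator gives me.)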
Would it be possible to somehow reserve bandwidth or a request rate for a domain?
On Wed, Dec 23, 2015 at 3:55 PM, David Causse dcausse@wikimedia.org wrote:
On 22/12/2015 18:28, Luigi Assom wrote:
I tested it from my laptop, and I found it quite slow; as an example, it took:
~1.2 s for querying 'DNA'
~1.6 s for 'terrorist attacks'
~1.7 s for 'biology technology'
For a single-word query on English Wikipedia this is more like 400 ms for me, so I'm not sure I understand why you experienced such response times. Response times may vary depending on server load, but I'm surprised you noticed more than 1 s for simple queries like that. Did you check that you are receiving the result type/format you expect (i.e. format=json)? Could you re-check at different times of the day? Servers may be busy around 8pm CET (the time when both Europe and America are active).
Your project seems to be very interesting, could you elaborate a bit more? Do you plan to use the API from a backend/automaton which will need to send a lot of queries? Do you have an estimate of your needs (number of queries and refresh rate)? If your process is something like refreshing a set of queries regularly, I'd suggest you build a daemon that sends a few queries (3 or 4) per minute rather than an aggressive batch of parallel processes run once a day/week/month. You should have a look at RCStream[1], which may be more appropriate to your needs (if you plan to track changes, it's definitely better than refreshing the same set of queries regularly).
Thank you!
[1] https://wikitech.wikimedia.org/wiki/RCStream
On 23/12/2015 18:16, Luigi Assom wrote:
I cannot estimate exactly the number of full-text search queries I will need; let's say each user will need a search-generator query only once per session.
Maybe 30 concurrent users per second would be a good reference (it's the same number Parse by Facebook provides, and Firebase allows up to 100... so maybe I could rely on a similar order of magnitude...).
You should have a look at https://www.mediawiki.org/wiki/API:Etiquette
If I can provide people with a smooth user experience on search, that will be interesting because I could free up resources: I might extend a test of knowledge discovery to other languages, too. If the first user experience were too slow (~1.3 s + bandwidth transmission, ~1.5 s+ per query), that could become critical.
As I said, I think you can expect 400 ms for this kind of API request. You may want to double-check that you have enabled compression with Accept-Encoding and that you use connection keep-alive. You can have a look at the search dashboard[1], which provides some metrics related to response times.
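For example, with the Python requests library a Session object gives you connection keep-alive and sends Accept-Encoding: gzip, deflate by default; making the headers explicit would look like this minimal sketch:

import requests

session = requests.Session()  # reuses TCP connections (keep-alive)
session.headers.update({
    "Accept-Encoding": "gzip, deflate",  # ask for a compressed response
    "Connection": "keep-alive",          # explicit; Session does this anyway
})

r = session.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "generator": "search", "gsrnamespace": 0,
            "gsrsearch": "DNA", "gsrlimit": 20, "format": "json"},
)
print(r.headers.get("Content-Encoding"), r.elapsed.total_seconds())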
Would it be possible to somehow reserve bandwidth or a request rate for a domain?
No, I don't think so. But you are free to contribute to MediaWiki[2] or suggest a project that could be hosted on the Wikitech Labs infrastructure[3][4].
David.
[1] http://searchdata.wmflabs.org/metrics/
[2] https://www.mediawiki.org/wiki/How_to_contribute
[3] https://wikitech.wikimedia.org/wiki/Help:FAQ
[4] https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
Hi David,
I read the etiquette page; that's why I posted additional questions.
Should my server expect 400 ms for each query of this type? How many concurrent requests of this type could be made on behalf of users? (How many users could search per minute, let's say?) Will such queries depend on the language platform? That is, if I hit the en.wikipedia API and the fr.wikipedia API from the same server, will they count towards the same Wikimedia "limit" or not? I need to understand whether I could handle a multilanguage test (two/three languages), always in terms of UX rather than technical limits.
What do you mean by "enabled compression with Accept-Encoding and use connection keep-alive"? I will make requests with wikitools, a Python library listed on mediawiki.org - where do I find such params?
# off-topic question below :D Thank you for the link about the Labs. I'm living in Europe and would love to move to San Francisco - that is, gain a working visa :) Would such a project grant one? - I've read you are from France and you may have some tips here, that's why I pose this question too :) and it's vacation time - or should be :)
thank you and merry vacations :DD
On Thu, Dec 24, 2015 at 5:13 PM, David Causse dcausse@wikimedia.org wrote:
On 23/12/2015 18:16, Luigi Assom wrote:
I cannot estimate exactly the number of full-text search queries I will need; let's say each user will need a search-generator query only once per session.
Maybe 30 concurrent users per second would be a good reference (it's the same number Parse by Facebook provides, and Firebase allows up to 100... so maybe I could rely on a similar order of magnitude...).
You should have a look at https://www.mediawiki.org/wiki/API:Etiquette
If I can provide people with a smooth user experience on search, that will be interesting because I could free up resources: I might extend a test of knowledge discovery to other languages, too. If the first user experience were too slow (~1.3 s + bandwidth transmission, ~1.5 s+ per query), that could become critical.
As I said, I think you can expect 400 ms for this kind of API request. You may want to double-check that you have enabled compression with Accept-Encoding and that you use connection keep-alive. You can have a look at the search dashboard[1], which provides some metrics related to response times.
Would it be possible to somehow reserve bandwidth or a request rate for a domain?
No, I don't think so. But you are free to contribute to MediaWiki[2] or suggest a project that could be hosted on the Wikitech Labs infrastructure[3][4].
David.
[1] http://searchdata.wmflabs.org/metrics/
[2] https://www.mediawiki.org/wiki/How_to_contribute
[3] https://wikitech.wikimedia.org/wiki/Help:FAQ
[4] https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use