Hi all
Every now and again, I get asked to run this or that query for someone on some Wikimedia project, to make a TODO list or gather some statistics. Usually the queries are easy, and it's a matter of a couple of minutes to run them. But it's tricky for people who need such a query run to find someone to do it.
So, I propose we run a "Query Service". There's a quick writeup at http://wiki.ts.wikimedia.org/view/Query_service; basically, the idea is to have a project for this in JIRA, and a couple of users handling the tickets people file there. Simple enough.
So, have a look, and if you would like to help out and run a query every now and then, get yourself added to the query-service group.
Not sure how much we should advertise this right now - if no one knows about it, it's pointless, but if we get too many requests and too little help, it's not going to work. Perhaps it'll work this way: only advertise it if you are willing to handle requests :)
-- Daniel
How about a web-based form that can do SELECTs?
Sean
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/toolserver-l
On 10/23/07, Sean Whitton sean@silentflame.com wrote:
How about a web-based form that can do SELECTs?
Sean
SELECT * FROM commonswiki_p.image UNION SELECT * FROM enwiki_p.image
No. You don't want that.
Bryan
Sean Whitton wrote:
How about a web-based form that can do SELECTs?
NO!
Either it would allow only very simple queries, and would thus be pointless, or it would frequently bog down the server, because people run insane queries.
Also, some legit queries may take so long that the http request would time out. And, last but not least: people filing such requests often don't know (enough) SQL.
You need some human judgment to see which requests to actually turn into queries, and how, and how often.
-- Daniel
Okay! Okay!
Not being much of a sysadmin, I never think of load :D
Sean
On Tuesday 23 October 2007 22:48, Daniel Kinzler wrote:
Either it would only allow very simple queries, and would thus be pointless. Or it would frequently bog down the server, because people run insane queries.
What if mysql.connect_timeout were set to a low value, say 30 seconds? People could use it for moderately simple queries (there are a lot of useful queries that can be executed in that time), or for preparing more complex queries (for example, preparing a simple query for en first).
Killing the connection does not stop the query. Besides, from a security perspective, letting people run unchecked queries is kinda... stupid. If you want to automatically check the query - then you've pretty much arrived at api.php. Letting toolserver users check & run is the safest and simplest way to realize a query service.
--valhallasw
On 10/25/07, Merlijn van Deen valhallasw@arctus.nl wrote:
Killing the connection does not stop the query.
Killing the query when it hits 30 seconds could be easily arranged. That would not, however, stop people from running one complicated query in five hundred parts all at once, for instance, which would kill the server regardless.
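A minimal sketch of that "kill it at 30 seconds" idea, purely illustrative (all names are made up): a cron job would read the output of MySQL's SHOW PROCESSLIST, pick the threads whose query has run too long, and issue KILL QUERY for each. The selection logic might look like this:

```python
# Hypothetical watchdog logic: given rows shaped like MySQL's
# SHOW PROCESSLIST output, pick the thread ids whose query has been
# running longer than the allowed limit.  A real deployment would feed
# the returned ids to KILL QUERY <id>.

def queries_to_kill(processlist, max_seconds=30, exempt_users=("root", "repl")):
    """Return thread ids that a watchdog would pass to KILL QUERY."""
    victims = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # idle and replication threads are left alone
        if row["User"] in exempt_users:
            continue  # never kill admin or replication activity
        if row["Time"] > max_seconds:
            victims.append(row["Id"])
    return victims
```

As the follow-ups note, this caps the cost of any single query but does nothing against many cheap queries submitted at once.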
In that case, you could do it with a queuing system. It wouldn't be hard to build something using a MySQL table and a per-IP limit so that only a few queries run at once: a separate backend process runs the queries, and a PHP frontend adds, manages, and displays the results.
-Matt
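The queuing scheme above could be sketched roughly as follows (an in-memory toy, with invented names; the proposal would keep this state in a MySQL table instead):

```python
from collections import deque

# Toy version of the proposed queue: requests wait in FIFO order, and
# at most `per_ip_limit` queries per IP address run at any one time.

class QueryQueue:
    def __init__(self, per_ip_limit=2):
        self.per_ip_limit = per_ip_limit
        self.waiting = deque()   # (ip, sql) pairs not yet running
        self.running = {}        # ip -> number of currently active queries

    def submit(self, ip, sql):
        self.waiting.append((ip, sql))

    def next_runnable(self):
        """Pop the first waiting query whose IP is under its limit."""
        for i, (ip, sql) in enumerate(self.waiting):
            if self.running.get(ip, 0) < self.per_ip_limit:
                del self.waiting[i]
                self.running[ip] = self.running.get(ip, 0) + 1
                return ip, sql
        return None              # everything waiting is throttled

    def finished(self, ip):
        self.running[ip] -= 1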
Just a point about implementation. As was already written, it's impossible to allow users to execute their own SELECTs, but we need to implement new queries from their requests quickly, without writing tons of the same code each time.
What we can do: create some kind of repository which can be modified by anyone who has an account on the Toolserver. It would contain a list of queries with placeholders ("?") for parameters. For example:

query_id | query_text
1 | SELECT user_id FROM user_ids WHERE user_name=?
2 | SELECT page_namespace FROM page WHERE page_title=? ORDER BY page_namespace

(This list could be stored in a MySQL table.)
A user invokes a query by accessing a single script common to all queries, which looks the query up by its number (query_id) and prints an HTML form on GET (with text fields in place of the placeholders) or the results on POST.
All Toolserver programmers should be able to add/modify/delete any query in another form and receive a query_id to send to users. All modifications must be written to a special log, and it should be possible to revert them, of course. Single login with other ts sites (like MediaWiki) for this script would be appreciated.
-- Ed
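The core of this repository idea fits in a few lines. Here is a toy version, using sqlite3 as a stand-in for MySQL (the table contents and the stored query are the examples from the proposal; everything else is invented):

```python
import sqlite3

# Stand-in database with a tiny `page` table, purely for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER, page_namespace INTEGER,"
           " page_title TEXT)")
db.executemany("INSERT INTO page VALUES (?, ?, ?)",
               [(1, 0, "Main_Page"), (2, 4, "Main_Page"), (3, 0, "Sandbox")])

# The repository: query_id -> vetted query text with '?' placeholders,
# exactly as in the proposal.  Only these statements can ever run;
# user input is bound as parameters, never interpolated into SQL.
REPOSITORY = {
    2: "SELECT page_namespace FROM page WHERE page_title=?"
       " ORDER BY page_namespace",
}

def run_stored_query(query_id, params):
    """Look up a vetted query by id and run it with user parameters."""
    sql = REPOSITORY[query_id]        # KeyError means: unknown query_id
    return db.execute(sql, params).fetchall()
```

The web script would call run_stored_query() with the form values; since users can only choose a query_id and fill in placeholders, they cannot submit arbitrary SQL.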
Or what if we made the tool available only to those who know what they're doing? That way, we know those people can go easy on the servers, and if they screw up, they lose access.
On 10/26/07, James Hare messedrocker@gmail.com wrote:
Or what if we only made the tool available only to those who know what they're doing? This way, we'll know that said people can go easy on the servers and if they screw up, they lose access.
Given a test database with about 1000 articles (900 of them in the article namespace) and similar numbers in the other tables, would it be possible to allow free testing against that test subset, and then let people queue up to run their SQL queries against the "real" dataset, possibly at a much lower priority, with the results saved and displayed for later use?
Mathias
We already do that, we call it the toolserver.
So let's see... my idea is a duplicate of the toolserver, and we want a public query service, but it would prove too tempting for people to use it for a denial of service attack, not to mention the fact that it would bog down the server.
Does this *have* to be on the toolserver, or can we take the databases and put it on its own server dedicated for querying? Maybe we could have a sexy *multi-server* layout featuring a distributed computing scheme.
On 10/26/07, James Hare messedrocker@gmail.com wrote:
Does this *have* to be on the toolserver, or can we take the databases and put it on its own server dedicated for querying? Maybe we could have a sexy *multi-server* layout featuring a distributed computing scheme.
We do have dedicated database servers. The toolserver is not one server.
Bryan
I'm not talking about a dedicated database server, but a dedicated *query* server where the only thing done on that server is handling queries, allowing maximum processor capacity to be dedicated to it.
Again, that is exactly what our database servers do; they only run queries for Toolserver users.
On 10/26/07, James Hare messedrocker@gmail.com wrote:
Does this *have* to be on the toolserver, or can we take the databases and put it on its own server dedicated for querying?
Of course you can. Then those dedicated servers get DoS'd, deliberately or otherwise, and you're back where you started.
My idea wasn't supposed to address the denial-of-service problem; it was supposed to solve the getting-bogged-down-in-requests problem.
Anyway, despite my ignorance, we have servers whose sole purpose is to process MySQL queries. Yippee. That leaves us with finding a feasible way to make this publicly accessible without it serving as the base of a denial-of-service attack. What if we forced queries to run at a really slow speed, and people had to put in their e-mail addresses so they could be e-mailed the result? And once the processor capacity limit has been reached, the public query terminal would read something like "Sorry, but the query server is too busy now. Try again later."
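The "accept until busy, e-mail the result later" front end described above could be sketched like this (all names invented; actual execution and mailing are left out):

```python
# Sketch of the proposed public query terminal: a request is either
# accepted for background execution, with the result to be e-mailed
# later, or rejected outright once the server is at capacity.

BUSY_MESSAGE = "Sorry, but the query server is too busy now. Try again later."

class QueryTerminal:
    def __init__(self, max_active=4):
        self.max_active = max_active
        self.active = []          # (email, sql) jobs currently in flight

    def submit(self, email, sql):
        if len(self.active) >= self.max_active:
            return BUSY_MESSAGE   # hard capacity gate, no queueing
        self.active.append((email, sql))
        return "Accepted: results will be e-mailed to %s." % email

    def job_done(self, email, sql):
        # called by the background worker after mailing the result
        self.active.remove((email, sql))
```

Unlike the earlier per-IP queue suggestion, this variant simply refuses work at capacity rather than holding it, which trades convenience for a bounded load.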
On 10/26/07, James Hare messedrocker@gmail.com wrote:
Anyways, despite my ignorance, we have servers whose sole purpose is to process MySQL queries. Yipee. Leaving us with finding a feasible means of
As far as I'm aware, the intensive part is done by the database server, not the server that passes through the query.
processor capability, the public query terminal would read something like
What I have been told is that disks are the bottleneck for a database server, but I don't know for sure.
The main problem I can think of is replication lag. Running very long queries will affect the replication lag. If we want to have an open-access query server, I would say that we need at least two extra, separate database servers.
Or we just keep it as it is and have people just post their queries to JIRA. I think we are trying to solve a non-existent problem.
Bryan
And of course, I am not a server admin, so what I wrote above may well have been entirely mistaken.
Or we just keep it as it is and have people just post their queries to JIRA. I think we are trying to solve a non-existent problem.
I think so too. When I needed a query run, I asked on the page at Meta and got it very fast. So either watch the page on Meta or use JIRA, but I think people won't like JIRA if it requires a login to add an issue (assuming that ts users may accept requests from IPs, and given that many people are already registered on Meta).
--user:alnokta
On 10/26/07, Mohamed Magdy mohamed.m.k@gmail.com wrote:
use jira but I think people won't like jira if it requires a login to add an issue (assuming that ts users may accept requests from ips and the fact that many people are already registered on meta).
We are providing a service to them, so if they want something they should sign up.
Bryan
On 10/26/07, Edward Chernenko edwardspec@gmail.com wrote:
2 SELECT page_namespace FROM page WHERE page_title=? ORDER BY page_namespace (this list may be written into mysql table).
User can inflict query by accessing some script common for all queries, which would find query by it's number (query_id) and print HTML form on GET (with textfields instead of placeholders) or results on POST.
Permitting anonymous users to scan the page table seems like a pretty good DoS vector for whatever server is being sacrificed for this.
Maybe joining and then sorting the rev tables of enwiki, dewiki, and frwiki. That would be a possibly useful query that someone might want to run, but we can't do it.
Can we limit the amount of CPU/processing a query can take? If not, that would be a very useful MySQL feature. We could then run those queries but expect them to take a lot longer.
-Matt
On 10/26/07, The Fearow fearow00@hotmail.com wrote:
Can we limit the amount of CPU/processing a query could take? If not, that would be a very useful MySQL function. We could then run those queries but expect them to take a lot longer.
It doesn't work that way, AFAIK. Using InnoDB, you would have to maintain a transaction for as long as the query runs. Many queries run simultaneously at low priority for a long time would therefore probably take a lot longer than if they were run quickly a few at a time, due to transactional overhead. InnoDB has to keep separate consistent copies for each query, more or less, which I suppose must involve copy-on-write -- needless to say, that slows down every write considerably if you have many transactions open. (In MyISAM, of course, it would be even worse: you would take out a read lock and prevent changes altogether.)
2007/10/26, Simetrical Simetrical+wikilist@gmail.com:
Permitting anonymous users to scan the page table seems like a pretty good DoS vector for whatever server is being sacrificed for this.
Okay, then we could set a limit on the maximum number of calls per hour for each query. For example, we could allow SELECT page_namespace FROM page WHERE page_title=? ORDER BY page_namespace to run about 100 times per hour, but SELECT page_title, page_namespace FROM page WHERE page_id=(SELECT rev_page FROM rev WHERE rev_summary = ?) only 10 times.
However, this solution has another side: someone could flood the server with requests to our query script, thus preventing normal users from accessing it.
About "anonymous users": we could also provide some kind of registration for trusted users, who would be unaffected by the limits I proposed above.
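The per-query hourly limit with a trusted-user bypass, sketched in a few lines (the limits and the trusted account are examples, not anything that exists):

```python
import time

# Sketch of the proposed per-query rate limit: each query_id gets its
# own calls-per-hour budget, and registered trusted users bypass the
# limits entirely.

HOURLY_LIMITS = {1: 100, 2: 10}   # query_id -> max calls per hour

class RateLimiter:
    def __init__(self, trusted=()):
        self.trusted = set(trusted)
        self.calls = {}           # query_id -> timestamps of recent calls

    def allow(self, query_id, user=None, now=None):
        if user in self.trusted:
            return True           # trusted accounts are never throttled
        now = time.time() if now is None else now
        # keep only calls from the last hour, then check the budget
        recent = [t for t in self.calls.get(query_id, []) if now - t < 3600]
        if len(recent) >= HOURLY_LIMITS[query_id]:   # KeyError: unknown id
            self.calls[query_id] = recent
            return False
        self.calls[query_id] = recent + [now]
        return True
```

Note that, as the message itself points out, this only limits database work; the web script handling the rejected requests can still be flooded.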
Why don't we all just watchlist [[m:Requests for queries]]? We can't do them automatically, but we can review them for sanity and push them through phpMyAdmin.
-Madman
Madman wrote:
Why don't we all just watchlist [[m:Requests for queries]]? We can't do them automatically, but we can review them for sanity and push them through phpMyAdmin.
We can, but I for one like a ticket system better for this kind of stuff: it's clear what is pending and what is not, you get notified about new requests (and not about comments, if you don't want to), etc. Also, I already have way too much stuff on my watchlist on way too many wiki projects.
In the long run, I'd suggest simply making [[m:Requests for queries]] a pointer to the Toolserver's query service. I would hope that requests on the Toolserver's JIRA would get processed more quickly, and get forgotten less often.
-- Daniel