Hello all,
there is a project in Russian Wikipedia to analyze all articles against strict quality requirements (e.g. at least 500 characters, or at least 1500 characters for {{stub}}s, or at least 3 internal links and one external, or at least one section subheader, etc.). The result should be:
- Total number of "normal articles"
- Lists of articles filtered by each requirement
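For illustration, a check along those lines might look like this in Python. The regexes, the thresholds, and the literal {{stub}} template name (ruwiki's real stub template differs) are only guesses at the final rules:

import re

def is_normal_article(text):
    # Sketch of the checks described above.  The regexes, the
    # thresholds, and the literal {{stub}} name are illustrative
    # guesses, not the project's final rules.
    internal_links = len(re.findall(r'\[\[[^\]|]+', text))
    external_links = len(re.findall(r'https?://\S+', text))
    has_subheader = re.search(r'^==.+==\s*$', text, re.M) is not None
    min_length = 1500 if '{{stub}}' in text else 500
    return (len(text) >= min_length and
            internal_links >= 3 and
            external_links >= 1 and
            has_subheader)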
To analyze every article, I should either run one long query that iterates through all of them, or run many small queries like 'SELECT page_title, page_latest FROM page WHERE page_title > ? ORDER BY page_title LIMIT 1' (substituting the previously fetched page_title each time).
The problem is that the first way is much more efficient, but I'm not sure that someone won't kill this query.
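A minimal sketch of the small-query approach, assuming Python with MySQLdb, credentials in ~/.my.cnf, and that only namespace 0 matters (the namespace filter is my addition, not in the query above):

import os
import MySQLdb

db = MySQLdb.connect(db='ruwiki_p',
                     read_default_file=os.path.expanduser('~/.my.cnf'))
cur = db.cursor()

last_title = ''
while True:
    # Keyset pagination: each query is tiny and resumes after the
    # last title seen, so no single query runs for long.
    cur.execute("SELECT page_title, page_latest FROM page"
                " WHERE page_namespace = 0 AND page_title > %s"
                " ORDER BY page_title LIMIT 1",
                (last_title,))
    row = cur.fetchone()
    if row is None:
        break  # no more articles
    last_title, page_latest = row
    # ... fetch and analyze the latest revision here ...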
From: "Edward Chernenko" Sent: Thursday, September 07, 2006 1:56 PM Subject: [Toolserver-l] About long query/queries
Hello all,
there is a project in Russian Wikipedia to analyze all articles against strict quality requirements (e.g. at least 500 characters, or at least 1500 characters for {{stub}}s, or at least 3 internal links and one external, or at least one section subheader, etc.). The result should be:
- Total number of "normal articles"
- Lists of articles filtered by each requirement
[......]
What about doing it locally with a dump? It seems much more efficient to me. Especially as the toolserver doesn't have direct access to the articles' text...
2006/9/7, Platonides platonides@gmail.com:
What about doing it locally with a dump? It seems much more efficient to me.
Good idea, but I think the dump should be placed outside my account:
1. other users can use it for tasks which don't require making complex SQL queries;
2. I have a 256 MB disk quota, while the ruwiki dump is about 400 MB.
2006/9/7, Gregory Maxwell gmaxwell@gmail.com
Are you talking about a query that will be run once, or a query that will be executed from a CGI script?
No, it will be run manually (or via cron, once per day).
select page_namespace, page_title from page; on ruwiki_p takes under a second... I wouldn't call that a long query.
Not all rows of the result are fetched immediately after the query is executed. The normal 'mysql' client receives all rows, prints them, and exits. My application needs, after getting each row of the result, to:
1. make one more SQL query to fetch the page text: SELECT old_text, old_flags FROM text WHERE old_id = (SELECT rev_text_id FROM revision WHERE rev_id = ?) (where '?' is page_latest from the first query);
2. uncompress the text if there is 'gzip' in old_flags;
3. analyze the text (that's fast, we can ignore this step).
As you can see, there is a small pause between fetching successive rows of the first query's result. If this pause is only 0.05 seconds, the first query will only finish after ~83 minutes (100,000 articles in ruwiki x 0.05 s = 5,000 s). For all this time the first query will show as being in progress (while not consuming real resources).
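Steps 1 and 2 might look roughly like this, reusing the cursor from the page loop above; this assumes the rev_text_id column of the revision table and that MediaWiki's 'gzip' flag marks a raw deflate stream (what PHP's gzdeflate produces):

import zlib

def fetch_latest_text(cur, page_latest):
    # Step 1: one extra query per page, reusing the cursor from
    # the page loop; rev_text_id links a revision to its stored text.
    cur.execute("SELECT old_text, old_flags FROM text"
                " WHERE old_id = (SELECT rev_text_id FROM revision"
                " WHERE rev_id = %s)",
                (page_latest,))
    row = cur.fetchone()
    if row is None:
        return None
    old_text, old_flags = row
    # Step 2: the 'gzip' flag marks a raw deflate stream, hence
    # the negative window size.
    if 'gzip' in old_flags.split(','):
        old_text = zlib.decompress(old_text, -zlib.MAX_WBITS)
    return old_text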
Text is not in the database on the toolserver, so you can't grab the text.
From: "Edward Chernenko" edwardspec@gmail.com Sent: Thursday, September 07, 2006 6:28 PM Subject: Re: [Toolserver-l] About long query/queries
2006/9/7, Platonides
What about doing it locally with a dump? It seems much more efficient to me.
Good idea, but I think the dump should be placed outside my account:
1. other users can use it for tasks which don't require making complex SQL queries;
2. I have a 256 MB disk quota, while the ruwiki dump is about 400 MB.
Uh? I was thinking of doing it on your computer, not necessarily on the toolserver. That way you can control everything. As for placing it, well, it's a public download :P
2006/9/7, Platonides platonides@gmail.com:
Uh? I was thinking of doing it on your computer, not necessarily on the toolserver. That way you can control everything. As for placing it, well, it's a public download :P
If I could afford 400 MB of traffic per day, I'd never use the Toolserver at all...
If I understood your first email correctly, you need to check the current version of all ruwiki pages, not necessarily the up-to-the-minute one, as a one-off check. You can download the ruwiki dump (the latest is a month old, but surely a new one will be done shortly), which is 87.2 MB (102.8 MB if you also want discussion and user pages).
So you download it, leave your computer running for two centuries ;-) analyzing it, and finally upload the results. Even if you redo it once a month, it doesn't seem so exhausting...
Are you connecting by 56k?
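If you go the dump route, a streaming parse keeps memory flat; a sketch assuming Python's ElementTree and a pages-articles dump (the export namespace URI and the filename vary by dump version):

import bz2
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the version suffix varies by dump.
NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def iter_pages(path):
    # Stream page by page so the whole dump never sits in memory.
    with bz2.open(path) as f:
        for event, elem in ET.iterparse(f):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
                yield title, text
                elem.clear()  # free the parsed subtree

for title, text in iter_pages('ruwiki-latest-pages-articles.xml.bz2'):
    pass  # run the quality checks on each article here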