Hello all,
I need help. I have a Perl program (for Check Wikipedia) that can scan a dump of a language very quickly; 2200 pages per minute is no problem.
I want to scan the page text of the live Wikipedia daily with the same script. Not all pages, but maybe 20000 per day per language. With the dump this normally takes only 10 minutes, but with the live Wikipedia it takes much longer. I use the Wikipedia API to get the text of an article, so my script can only scan 120 pages per minute. At the moment this scan takes 300 minutes on enwiki and 134 minutes on dewiki. Most of the time my script is waiting. This is a problem because the CPU usage is high.
I need a faster way to get the text from the live Wikipedia, so I can reduce the CPU usage.
Maybe someone knows a faster way, or has another idea.
Thanks for any help, Stefan (sk)
More Info: http://de.wikipedia.org/wiki/Benutzer:Stefan_K%C3%BChn/Check_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia http://toolserver.org/~sk/checkwiki/checkwiki.pl
Stefan Kühn:
Most of the time my script is waiting. This is a problem because the CPU usage is high.
That doesn't make sense. If it was waiting, it wouldn't be using any CPU. If it's using CPU, it's not waiting; it's doing something.
- river.
Maybe a Perl guru can take a look at the script. I hope someone has an idea.
Stefan
River Tarnell wrote:
Stefan Kühn:
Most of the time my script is waiting. This is a problem because the CPU usage is high.
That doesn't make sense. If it was waiting, it wouldn't be using any CPU. If it's using CPU, it's not waiting; it's doing something.
- river.
On Thu, Jun 11, 2009 at 6:04 PM, Platonides <platonides@gmail.com> wrote:
Then you should provide the script.
He did:
You can try running "/usr/bin/time -v perl checkwiki.pl". This command will show you how much CPU time you are spending in system calls. It looks like your script is using a simple HTTP fetch to grab the article text. Your process should block and not use any CPU while it's waiting for the article:
sub raw_text {
    ...
    my $ua2 = LWP::UserAgent->new;
    $response2 = $ua2->get( $url2 );
}
-Darren
On 11 Jun 2009, at 1:56 PM, Stefan Kühn wrote:
Maybe a Perl guru can take a look at the script. I hope someone has an idea.
Stefan
-- Darren Hardy Ph.D. Candidate Bren School of Environmental Science & Management University of California, Santa Barbara http://www.bren.ucsb.edu/~dhardy dhardy@bren.ucsb.edu
Stefan Kühn wrote:
Most of the time my script is waiting. This is a problem because the CPU usage is high.
I need a faster way to get the text from the live Wikipedia, so I can reduce the CPU usage.
If waiting causes CPU usage, you are doing something VERY wrong. Waiting should mean NO CPU usage.
-- daniel
On 11 Jun 2009, at 22:49, Stefan Kühn wrote:
Hello all,
I need a faster way to get the text from the live Wikipedia, so I can reduce the CPU usage.
The CPU usage shouldn't be high while waiting.
Maybe someone knows a faster way, or has another idea.
Sure. Try WikiProxy: http://meta.wikimedia.org/wiki/User:Duesentrieb/WikiProxy
Thanks for any help, Stefan (sk)
"I'm Outlaw Pete, I'm Outlaw Pete, Can you hear me?" Pietrodn powerpdn@gmail.com
On Fri, Jun 12, 2009 at 04:49, Stefan Kühn <kuehn-s@gmx.net> wrote:
I need a faster way to get the text from the live Wikipedia, so I can reduce the CPU usage.
With the MediaWiki API you can retrieve 50 pages per request; it should make things much smoother.
(Choose &format= as appropriate)
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop...
— Kalan
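For illustration, a batched fetch along those lines might look roughly like this in Perl. This is only a sketch: the get_article_batch name, the fixed enwiki URL and the XML format choice are assumptions, not code from checkwiki.pl.

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape_utf8);

# Fetch the current wikitext of up to 50 pages with a single API request.
sub get_article_batch {
    my @titles = @_;    # at most 50 titles per call
    my $url = 'http://en.wikipedia.org/w/api.php'
            . '?action=query&prop=revisions&rvprop=content&format=xml'
            . '&titles=' . uri_escape_utf8( join( '|', @titles ) );

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;
    return $response->decoded_content;    # one XML document with all revisions
}

Parsing the returned XML is a separate step, but the point is that one HTTP round trip now covers fifty pages instead of one, so the script spends far less wall-clock time per page.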
On Sat, Jun 13, 2009 at 20:08, Kalan <kalan.001@gmail.com> wrote:
With the MediaWiki API you can retrieve 50 pages per request; it should make things much smoother.
Oops, I forgot one important thing: URL length is not unlimited, but you should be safe under 4096 characters. Urlencode the titles properly and append them until they fit.
— Kalan
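To make the "append them until they fit" idea concrete, the grouping logic could be sketched like this; the 50-title and 4096-character limits come from the messages above, and the batch_titles helper name is made up for the example.

use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

# Split a title list into batches of at most 50 titles whose encoded
# length keeps the request URL under roughly 4096 characters.
sub batch_titles {
    my ( $base_url_length, @titles ) = @_;
    my ( @batches, @current );
    my $length = $base_url_length;
    for my $title (@titles) {
        my $extra = length( uri_escape_utf8($title) ) + 3;    # +3 for the "%7C" separator
        if ( @current and ( @current >= 50 or $length + $extra > 4096 ) ) {
            push @batches, [@current];
            @current = ();
            $length  = $base_url_length;
        }
        push @current, $title;
        $length += $extra;
    }
    push @batches, [@current] if @current;
    return @batches;
}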
On Sat, Jun 13, 2009 at 2:11 PM, Kalan <kalan.001@gmail.com> wrote:
On Sat, Jun 13, 2009 at 20:08, Kalan <kalan.001@gmail.com> wrote:
With the MediaWiki API you can retrieve 50 pages per request; it should make things much smoother.
Oops, I forgot one important thing: URL length is not unlimited, but you should be safe under 4096 characters. Urlencode the titles properly and append them until they fit.
— Kalan
Or use a POST request.
Bryan
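With LWP that might look roughly like the following sketch; the parameters mirror Kalan's example URL, and the titles are placeholders.

use strict;
use warnings;
use LWP::UserAgent;

my @titles = ( 'Example', 'Main Page' );    # placeholder titles

# A POST puts the parameters in the request body, so URL length is no issue.
my $ua       = LWP::UserAgent->new;
my $response = $ua->post(
    'http://en.wikipedia.org/w/api.php',
    {
        action => 'query',
        prop   => 'revisions',
        rvprop => 'content',
        format => 'xml',
        titles => join( '|', @titles ),
    },
);
die $response->status_line unless $response->is_success;
print $response->decoded_content;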
With the MediaWiki API you can retrieve 50 pages per request; it should make things much smoother.
(Choose &format= as appropriate)
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop...
Oh, wonderful. I didn't know about this. I think this will increase the speed of my script, so the CPU will only be used for a short time. I will try this.
Many thanks to Kalan and all the others.
Stefan
Stefan Kühn wrote:
Oh, wonderful. I didn't know about this. I think this will increase the speed of my script, so the CPU will only be used for a short time. I will try this.
Many thanks to Kalan and all the others.
If your script uses dumps, I don't understand why you want to use another way to download wiki articles. It just wastes bandwidth and uses the same CPU time.
If I have understood the features your script provides, you should probably schedule it better (every 7 days?) and add a delay to process the data more slowly.
But maybe I have misunderstood, so forget it.