Hi,
I want to list all pages in our wiki that use tables. This seems like it should be simple, but I'm not sure how to do it. Any ideas?
I know that "{|" (the beginning of a table) works as a search term, as I tried it with replace.py. However, I don't want to replace anything, and I don't want to sit there pressing "n" for each result.
Ideally I could capture just the names of the pages, without extended details (such as a proposed diff given by replace.py).
Any help much appreciated. Cheers
Hi Chris,
I'm not sure if it's an efficient way to search those pages, but what you need is a new script that does something like this (not tested, but to give an idea):
import wikipedia, pagegenerators

for page in pagegenerators.allpagesgenerator():
    if '{|' in page.get():
        print page.title
Please remember this is very slow and bandwidth-consuming. Running this on an XML dump seems like a better way. Would it be problematic to have slightly outdated information?
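Thinking about it, the generator name above is from memory and may well be wrong: if I recall correctly it is spelled AllpagesPageGenerator in pagegenerators.py, and title is a method on Page objects. A slightly more defensive sketch, still untested:

import wikipedia, pagegenerators

# Walk every page on the wiki; this does one fetch per page, hence slow.
for page in pagegenerators.AllpagesPageGenerator():
    try:
        text = page.get()
    except wikipedia.IsRedirectPage:
        # redirects have no table markup of their own
        continue
    if '{|' in text:
        print page.title()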
Best regards, Merlijn van Deen
On Fri, July 24, 2009 9:31 pm, Chris Watkins wrote:
[..snip..]
Thanks Merlijn,
I'm not sure if it's an efficient way to search those pages, but what you need is a new script that does something like this (not tested, but to give an idea):
import wikipedia, pagegenerators

for page in pagegenerators.allpagesgenerator():
    if '{|' in page.get():
        print page.title
Please remember this is very slow and bandwidth-consuming. Running this on an XML dump seems like a better way. Would it be problematic to have slightly outdated information?
That's fine - actually, I'd definitely do it on a dump.
I'm not a coder, and I'm not sure how I'd adapt this code to use the XML dump. But I copied it into a file called "searchtest.py" to try it out, and got this error message:
$ python searchtest.py
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "searchtest.py", line 2, in <module>
    for page in pagegenerators.allpagesgenerator():
AttributeError: 'module' object has no attribute 'allpagesgenerator'
Any suggestions? Sorry I'm not doing much to help - I only ever learnt a little Fortran and C, 18+ years ago, and hacked a little Visual Basic a few years back.
Thanks again, Chris
On 24 Jul 2009, at 21:31, Chris Watkins wrote:
Hi,
I want to list all pages in our wiki that use tables. This seems like it should be simple, but I'm not sure how to do it. Any ideas?
You can file a bug on https://jira.toolserver.org/browse/DBQ (Toolserver's Database Query Service): a user will query the replicated MySQL database for you. You can also download the XML dump from http://download.wikimedia.org and:
- use the bot on it, or
- use MWDumper to import it into your local MySQL and run a query.
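If you go for the MySQL alternative, the query would be something along these lines. This is only a sketch: it assumes the default MediaWiki schema that MWDumper populates (the page, revision and text tables), and the connection details are made up, so adjust them to your setup:

import MySQLdb

# Made-up connection details: substitute your own database, user and password.
conn = MySQLdb.connect(db='wikidb', user='wiki', passwd='secret', charset='utf8')
cur = conn.cursor()
# Current text of each page: page.page_latest -> revision.rev_id,
# revision.rev_text_id -> text.old_id.
cur.execute("""
    SELECT p.page_title
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest
    JOIN `text` t ON t.old_id = r.rev_text_id
    WHERE t.old_text LIKE '%{|%'
""")
for (title,) in cur.fetchall():
    print title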
Cheers,
"I'm Outlaw Pete, I'm Outlaw Pete, Can you hear me?" Pietrodn powerpdn@gmail.com
I think that cannot be done on the Toolserver with a query, because tables are part of the page text, and that is not available in the Toolserver's db.
On Thu, Jul 30, 2009 at 12:17 PM, Pietrodn powerpdn@gmail.com wrote:
[..snip..]
On 30 Jul 2009, at 20:19, Matias wrote:
[..snip..]
I think that cannot be done on the Toolserver with a query, because tables are part of the page text, and that is not available in the Toolserver's db.
You're right, sorry for my error. So you've got only two alternatives (bot and MySQL), both involving downloading the XML dump. I think it's not a good idea to query the live site: it would be too slow (there are throttle limits imposed by the MediaWiki software itself) and it would take too many resources. Which wiki are you going to operate on?
"I'm Outlaw Pete, I'm Outlaw Pete, Can you hear me?" Pietrodn powerpdn@gmail.com
I hacked a little on some old Perl code I had lying around, which I once wrote for making fake Apache logs from a full XML history dump (for analysing with Analog). It ain't pretty, but it works, at least for me. I have only tried it on the full "pages-meta-history" files, but it should work on any similar XML dump; it only searches the last revision of a page. Here it is: http://toolserver.no/~stigmj/tools/src/xml-search.pl.txt
/Stigmj
On Fri, Jul 24, 2009 at 4:31 PM, Chris Watkins <chriswaterguy@appropedia.org> wrote:
[..snip..]
On Thu, July 30, 2009 9:11 pm, Stig Meireles Johansen wrote:
I hacked a little on some old Perl code I had lying around [..snip..] here it is: http://toolserver.no/~stigmj/tools/src/xml-search.pl.txt
Suggestion: pywikipediabot has good built-in support. My attempt at building a simple parser (http://arctus.nl/~valhallasw/pulldom.py) is about 10 times slower than just using four (much more readable) lines of code:
import xmlreader

for page in xmlreader.XmlDump('/home/valhallasw/download/nlwikiquote-20090730-pages-articles.xml').parse():
    if '{|' in page.text:
        print page.title
I sometimes am surprised by pywikipediabot myself :)
-Merlijn
2009/7/31 Merlijn van Deen valhallasw@arctus.nl:
Suggestion: pywikipediabot has good built-in support. [..snip..]
And... xmlreader is the only unit-tested part of pywikipediabot :)
On Thu, Jul 30, 2009 at 6:30 PM, Merlijn van Deen <valhallasw@arctus.nl> wrote:
On Thu, July 30, 2009 9:11 pm, Stig Meireles Johansen wrote:
I hacked a little on some old Perl code I had lying around, which I once
[..snip..]
here it is: http://toolserver.no/~stigmj/tools/src/xml-search.pl.txt
/Stigmj
Suggestion: pywikipediabot has good built-in support. My attempt at building a simple parser (http://arctus.nl/~valhallasw/pulldom.py) is about 10 times slower than just using four (much more readable) lines of code:
That may be, but when I tried your code on http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles.xml.bz2 (after unpacking, of course) I got this:

Traceback (most recent call last):
  File "search.py", line 5, in <module>
    print page.title
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)
While my code ran like this:

$ time ./xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t

real    1m16.511s
user    1m15.657s
sys     0m0.856s

$ grep ^Searched t.t
Searched through 407565 articles and found 20889 matches
Give me some working code and I'll do a comparison.. :)
/Stig
On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
is about 10 times slower than just using four (much more readable) lines of code:
(..snip..)
That may be, but when I tried your code [..snip..] UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)
Yes, it breaks. To mimic the behaviour of your script (which blindly ignores the encoding and as such works), use page.title.encode('utf-8'), which should work fine.
Additionally, xmlreader actually supports reading bzip2-ed xml (which is probably faster than unzipping and running, and possibly even faster than running it on the plain xml, depending on processor speed and disk speed):
import xmlreader

for page in xmlreader.XmlDump('/home/valhallasw/download/nowiki-20090729-pages-articles.xml.bz2').parse():
    if '{|' in page.text:
        print page.title.encode('utf-8')
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ python stig.py > results
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ wc -l results
20890 results
(which includes one line 'Reading XML dump...', so that is the same result).
-Merlijn van Deen
On Sun, Aug 2, 2009 at 8:19 AM, Merlijn van Deen <valhallasw@arctus.nl> wrote:
On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
is about 10 times slower than just using four (much more readable) lines of code:
<snip />
Additionally, xmlreader actually supports reading bzip2-ed xml (which is probably faster than unzipping and running, and possibly even faster than running it on the plain xml, depending on processor speed and disk speed):
Just for the fun of it, here are some "benchmarks" running on the XML-file:
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real    1m40.745s
user    1m36.138s
sys     0m1.472s

stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t

real    1m22.145s
user    1m20.453s
sys     0m1.204s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real    1m38.219s
user    1m35.490s
sys     0m1.800s

stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t

real    1m24.474s
user    1m22.897s
sys     0m1.236s
*Running with the bzip2'ed XML file:*
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real    3m59.687s
user    3m53.591s
sys     0m0.640s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real    2m42.841s
user    4m8.804s
sys     0m2.388s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real    3m53.044s
user    3m48.510s
sys     0m0.620s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real    2m49.320s
user    4m10.772s
sys     0m2.448s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real    3m46.337s
user    3m44.318s
sys     0m0.644s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real    2m54.216s
user    4m11.724s
sys     0m2.568s
When piping from bunzip2 I get to use both processors (dual Xeon, 3 GHz), so it goes a little bit faster... :)
Well, this was a fun waste of time.. I believe the OP now has a solution either way.... ;)
/Stigmj