Looking for utility to perform text search in dump


Danny B.

25 Jul 2009

12:21 p.m.

Hello,

I'm looking for any kind of tool which would take the XML dump (most probably pages-meta-current.xml.bz2, or at least pages-articles.xml.bz2) and return the list of page titles (or alternatively/configurably page ids) of pages containing a given string.

Does anybody have such a tool and is willing to share it? Either a command-line or a web interface is OK.

Thank you.

Danny B.


Simon Walker

25 Jul

2:46 p.m.


Does AWB not do something along those lines? 2009/7/25 Danny B. <Wikipedia.Danny.B(a)email.cz>

...

-- Regards, Simon Walker User:Stwalkerster on all public Wikimedia Foundation wikis Administrator on the English Wikipedia Developer of Helpmebot, the ACC tool, and Nubio 2 FAQ repository

John Doe

2:50 p.m.


AWB does, but it won't work on the Toolserver.

On Sat, Jul 25, 2009 at 8:46 AM, Simon Walker <stwalkerster(a)googlemail.com> wrote:

Does AWB not do something along those lines?

2009/7/25 Danny B. <Wikipedia.Danny.B(a)email.cz>:

...

_______________________________________________
Toolserver-l mailing list
Toolserver-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/toolserver-l

Osama KM

4:48 p.m.


I have the same problem as Danny. My problem with AWB (which is itself free) is that it depends on the non-free .NET library and works only on Windows. A Python command-line tool would be perfect.

2009/7/25 John Doe <phoenixoverride(a)gmail.com>:

...


-- OsamaKhalid

River Tarnell

5:13 p.m.


Osama KM:

...

My problem with AWB (which is free in itself) is that it depends on the non-free library .NET

.NET is not, itself, non-free. Microsoft's implementation (the most common one) is, but Mono (http://mono-project.com/Main_Page) is not. Perhaps the AWB developers could make whatever changes are needed to run it on a free implementation.

- river.


Christian Thiele

5:27 p.m.


Hi,

On 25.07.2009 at 17:13, River Tarnell <river(a)loreley.flyingparchment.org.uk> wrote:

...

Mono works great; I'm running bots built on the DotNetWikiBot framework on the toolserver.

For simple parsing of a pages-articles.xml file, you may test a script I used some time ago. It is a very simple XML parser (for the pages-articles.xml structure) that calls a function named "test" with the article title and the text of the article. It's not the perfect solution, but it's the solution implemented in five minutes ;)

    <?php
    // Called once per page with the page title and the text of its revision.
    function test($title, $text) {
        // do something here
    }

    $filename = "enwiki-200XXXXX-pages-articles.xml";
    $dataFile = fopen($filename, "r");
    if ($dataFile) {
        // $status: 0 = outside <page>, 1 = inside <page>,
        //          2 = inside <revision>, 3 = inside a multi-line <text>
        $status = 0;
        $title = "";
        $text = "";
        // Read whole lines so that a tag is never split across two reads.
        while (($buffer = fgets($dataFile)) !== false) {
            if (($status == 0) && (stripos($buffer, "<page>") !== false))
                $status = 1;
            elseif (($status == 1) && (stripos($buffer, "<title>") !== false))
                $title = strip_tags($buffer);
            elseif (($status == 1) && (stripos($buffer, "<revision>") !== false))
                $status = 2;
            elseif (($status == 2) && (stripos($buffer, "<text ") !== false)) {
                $status = 3;
                $text = strip_tags($buffer);
                // <text> opened and closed on the same line
                if (stripos($buffer, "</text>") !== false) {
                    $status = 2;
                }
            }
            elseif (($status == 3) && (stripos($buffer, "</text>") === false))
                $text .= strip_tags($buffer);
            elseif ($status == 3) {
                // last line of a multi-line <text>
                $text .= strip_tags($buffer);
                $status = 2;
            }
            elseif (($status == 2) && (stripos($buffer, "</revision>") !== false))
                $status = 1;
            elseif (($status == 1) && (stripos($buffer, "</page>") !== false)) {
                // page complete: hand it to the callback and reset
                test(trim($title), trim($text));
                $title = "";
                $text = "";
                $status = 0;
            }
        }
        fclose($dataFile);
    } else {
        die("File not found: $filename");
    }
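A comparable streaming search can be sketched in Python, the language asked for earlier in the thread. This is only a sketch under stated assumptions, not anyone's actual tool: the names `find_pages` and `search_dump` are hypothetical, the dump is assumed to use the usual page/title/id/revision/text element layout, and matching is a plain case-sensitive substring test.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_pages(stream, needle, want_ids=False):
    """Yield the title (or page id) of every page whose text contains needle."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        text = ""
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        if needle in text:
            yield page_id if want_ids else title
        # Drop the page's text and children once it has been checked.
        elem.clear()


def search_dump(path, needle, want_ids=False):
    """Search a .bz2-compressed dump without unpacking it to disk first."""
    with bz2.open(path, "rb") as dump:
        yield from find_pages(dump, needle, want_ids)
```

Looping over `search_dump("pages-articles.xml.bz2", "some string")` (a placeholder file name) and printing each result would give one title per line; passing `want_ids=True` would give page ids instead.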

Osama KM

6:20 p.m.


2009/7/25 River Tarnell <river(a)loreley.flyingparchment.org.uk>:

...

Well, C# and .NET in general are patented, so they aren't free and they're bad choices.[1] AWB depends specifically on Microsoft's non-free implementation of .NET, which makes it even worse. Any other program would be nice, but a Python script would be best because I'd be able to adapt it on my own.

[1]: <http://www.fsf.org/news/dont-depend-on-mono>

-- OsamaKhalid

Christian Thiele

6:41 p.m.


Hi,

On 25.07.2009 at 18:20, Osama KM <osamak.wfm(a)gmail.com> wrote:

...

Well, C# and .NET in general are patented so they aren't free and they're bad choices.[1]

[1]: <http://www.fsf.org/news/dont-depend-on-mono>

This post is outdated, in my opinion. Microsoft reacted to it and is now applying the Microsoft Community Promise to the ECMA 334 and ECMA 335 specs of .NET [2]. And this is not only a "promise" but legally binding. There will be a Mono version containing only the core parts [3]. So I don't see problems using Mono on the toolserver. But of course I understand if you don't want to use it ;).

[2] http://port25.technet.com/archive/2009/07/06/the-ecma-c-and-cli-standards.a…
[3] http://tirania.org/blog/archive/2009/Jul-06.html

Sincerely, Christian Thiele

Osama KM

8:05 p.m.


OK. I don't want to bring my deep faith in GNU into this post. ;) But Microsoft has never been kind to us free software users. There is a very nice article about this "promise" at <http://www.fsf.org/news/2009-07-mscp-mono>.

2009/7/25 Christian Thiele <apper(a)apper.de>:

...


-- OsamaKhalid

Carl (CBM)

5:36 p.m.


On Sat, Jul 25, 2009 at 6:21 AM, Danny B.<Wikipedia.Danny.B(a)email.cz> wrote:

...

I have had good luck in the past simply writing little C programs to use libexpat for this purpose. - Carl
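The same callback style is available without writing C, because Python's standard `xml.parsers.expat` module is a binding to the same expat library. The following is a rough sketch of the idea, not Carl's program: the function name `search_with_expat` is hypothetical, it assumes unprefixed `page`, `title` and `text` element names, and it assumes one revision per page, as in the "current" dumps.

```python
import xml.parsers.expat


def search_with_expat(stream, needle):
    """Return titles of pages whose <text> contains needle, via expat callbacks."""
    hits = []
    # "tag" names the element whose character data is currently being collected.
    state = {"tag": None, "title": [], "text": []}

    def start(name, attrs):
        if name == "page":
            state["title"], state["text"] = [], []
        if name in ("title", "text"):
            state["tag"] = name

    def chars(data):
        # expat may deliver one text node in several pieces, so collect them.
        if state["tag"] is not None:
            state[state["tag"]].append(data)

    def end(name):
        if name in ("title", "text"):
            state["tag"] = None
        elif name == "page" and needle in "".join(state["text"]):
            hits.append("".join(state["title"]))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.ParseFile(stream)  # stream must be opened in binary mode
    return hits
```

The stream can be any binary file object, so a handle from `bz2.open(path, "rb")` should work as well as an uncompressed file.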

Platonides

9:12 p.m.

Danny B. wrote:

...

I have one program to do so, dated 3 years ago. Many of us have probably reinvented that wheel. I'll send it to you privately.

There's also http://meta.wikimedia.org/wiki/User:Micke/WikiFind

Not the ultimate solution, but good enough. I have some changes to unify both versions if you're interested.

Ilmari Karonen

1 Aug

12:31 p.m.

Danny B. wrote:

...

If you're only interested in page titles, why not just download all-titles-in-ns0.gz and grep it? Alternatively, if you want titles in other namespaces too, I have a small perl script I once wrote that can extract such a list from the page.sql.gz dump -- I can clean it up and put it online somewhere if you're interested. -- Ilmari Karonen
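For the titles-only case, the grep step can also be done from Python without unpacking the file. This is a minimal sketch assuming the title list holds one title per line in UTF-8; the function name `grep_titles` is hypothetical, and the match is a plain case-sensitive substring test.

```python
import gzip


def grep_titles(path, needle):
    """Yield every line of a gzip-compressed title list that contains needle."""
    with gzip.open(path, "rt", encoding="utf-8") as titles:
        for line in titles:
            title = line.rstrip("\n")
            if needle in title:
                yield title
```

Note that this searches the titles themselves, not the page text, so it only helps when the string being looked for is expected to appear in the title.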


toolserver-l@lists.wikimedia.org
