Hello,
I'm looking for any kind of tool which would take an XML dump (most likely pages-meta-current.xml.bz2, or at least pages-articles.xml.bz2) and return the list of page titles (or, alternatively/configurably, page IDs) of pages containing a given string.
Does anybody have such a tool and be willing to share it? Either a command-line or a web interface is fine.
Thank you.
Danny B.
Does AWB not do something along those lines?
2009/7/25 Danny B. Wikipedia.Danny.B@email.cz
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-l
AWB does, but it won't work on the Toolserver.
On Sat, Jul 25, 2009 at 8:46 AM, Simon Walker stwalkerster@googlemail.com wrote:
Does AWB not do something along those lines?
--
Regards,
Simon Walker
User:Stwalkerster on all public Wikimedia Foundation wikis
Administrator on the English Wikipedia
Developer of Helpmebot, the ACC tool, and the Nubio 2 FAQ repository
I have the same problem as Danny. My problem with AWB (which is free in itself) is that it depends on the non-free .NET library, and it works on Windows only. A Python command-line tool would be perfect.
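For the record, here is a minimal sketch of the kind of Python tool I have in mind (the function name and structure are just an illustration, assuming a bz2-compressed pages-articles dump; the scan is line-based, so the dump is never decompressed into memory at once):

```python
import bz2

def grep_dump(lines, needle):
    """Yield the title of each page whose dump lines contain `needle`.

    `lines` is any iterable of text lines from a pages-articles XML dump.
    Each title is reported at most once per page.
    """
    title = None
    matched = False
    for line in lines:
        if "<title>" in line:
            # A new page starts: remember its title and reset the match flag.
            title = line.split("<title>", 1)[1].split("</title>", 1)[0]
            matched = False
        elif needle in line and title is not None and not matched:
            matched = True
            yield title

# Typical use on a compressed dump:
# with bz2.open("cswiki-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
#     for t in grep_dump(f, "some string"):
#         print(t)
```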
2009/7/25 John Doe phoenixoverride@gmail.com:
awb does but wont work on the ts
Osama KM:
My problem with AWB (which is free in itself) is that it depends on the non-free library .NET
.NET is not, itself, non-free. Microsoft's implementation (the most common one) is, but Mono (http://mono-project.com/Main_Page) is not. Perhaps the AWB developers could make whatever changes are needed to run it on a free implementation.
- river.
Hi,
Am 25.07.2009, 17:13 Uhr, schrieb River Tarnell river@loreley.flyingparchment.org.uk:
.NET is not, itself, non-free. Microsoft's implementation (the most common one) is, but Mono (http://mono-project.com/Main_Page) is not. perhaps the AWB developers could make whatever changes are needed to run it on a free implementation.
Mono works great; I'm running bots that use the DotNetWikiBot framework on the Toolserver.
For simple parsing of a pages-articles.xml file, you may try a script I used some time ago. It is a very simple XML parser (for the pages-articles.xml structure) that calls a function named "test" with the title and text of each article. It's not the perfect solution, but it's the solution implemented in five minutes ;)
function test($title, $text) {
    // do something with each article here
}

$filename = "enwiki-200XXXXX-pages-articles.xml";
$dataFile = fopen($filename, "r");
if ($dataFile) {
    // Simple line-based state machine:
    // 0 = outside a page, 1 = inside <page>, 2 = inside <revision>, 3 = inside <text>
    $status = 0;
    while (!feof($dataFile)) {
        $buffer = fgets($dataFile, 4096);
        if (($status == 0) && (stripos($buffer, "<page>") !== false)) {
            $status = 1;
        } elseif (($status == 1) && (stripos($buffer, "<title>") !== false)) {
            $title = strip_tags($buffer);
        } elseif (($status == 1) && (stripos($buffer, "<revision>") !== false)) {
            $status = 2;
        } elseif (($status == 2) && (stripos($buffer, "<text ") !== false)) {
            $status = 3;
            $text = strip_tags($buffer);
            // A single-line <text ...>...</text> element closes immediately.
            if (stripos($buffer, "</text>") !== false) {
                $status = 2;
            }
        } elseif (($status == 3) && (stripos($buffer, "</text>") === false)) {
            $text .= strip_tags($buffer);
        } elseif ($status == 3) {
            // This line contains the closing </text> tag.
            $text .= strip_tags($buffer);
            $status = 2;
        } elseif (($status == 2) && (stripos($buffer, "</revision>") !== false)) {
            $status = 1;
        } elseif (($status == 1) && (stripos($buffer, "</page>") !== false)) {
            test(trim($title), trim($text));
            $title = "";
            $text = "";
            $status = 0;
        }
    }
    fclose($dataFile);
} else {
    die("File not found: $filename");
}
On Sat, Jul 25, 2009 at 6:21 AM, Danny B. Wikipedia.Danny.B@email.cz wrote:
I'm looking for any kind of tool which would take the XML dump (most probably the pages-meta-current.xml.bz2, at least the pages-articles.xml.bz2) and would return the list of page titles (or alternatively/configurably page ids) of pages containing given string.
I have had good luck in the past simply writing little C programs to use libexpat for this purpose.
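(Python ships bindings to the same library as xml.parsers.expat, so a similar approach works without writing C. A rough sketch, with a hypothetical find_titles helper; for brevity it parses the document from one string, whereas a real dump would be fed to Parse in chunks:)

```python
import xml.parsers.expat

def find_titles(data, needle):
    """Collect titles of <page> elements whose <text> content contains `needle`."""
    titles = []
    state = {"tag": None, "title": "", "text": ""}

    def start(name, attrs):
        state["tag"] = name
        if name == "page":
            state["title"] = ""
            state["text"] = ""

    def chars(content):
        # Character data may arrive in several pieces, so always append.
        if state["tag"] == "title":
            state["title"] += content
        elif state["tag"] == "text":
            state["text"] += content

    def end(name):
        state["tag"] = None
        if name == "page" and needle in state["text"]:
            titles.append(state["title"])

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.CharacterDataHandler = chars
    p.EndElementHandler = end
    p.Parse(data, True)
    return titles
```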
- Carl
2009/7/25 River Tarnell river@loreley.flyingparchment.org.uk:
.NET is not, itself, non-free. Microsoft's implementation (the most common one) is, but Mono (http://mono-project.com/Main_Page) is not. perhaps the AWB developers could make whatever changes are needed to run it on a free implementation.
Well, C# and .NET in general are patented, so they aren't free, and they're bad choices.[1] AWB depends on exactly the non-free implementation of .NET, Microsoft's, which makes it even worse. Any other program would be nice, but a Python script would be best, because I'd be able to adapt it on my own.
[1]: http://www.fsf.org/news/dont-depend-on-mono
Hi,
Am 25.07.2009, 18:20 Uhr, schrieb Osama KM osamak.wfm@gmail.com:
Well, C# and .NET in general are patented so they aren't free and they're bad choices.[1].
That post is outdated, in my opinion. Microsoft reacted to it and is now applying the Microsoft Community Promise to the ECMA 334 and ECMA 335 specifications of .NET [2], and this is not merely a "promise" but legally binding. There will be a Mono version containing only the core parts [3]. So I see no problem with using Mono on the Toolserver, though of course I understand if you don't want to use it ;).
[2] http://port25.technet.com/archive/2009/07/06/the-ecma-c-and-cli-standards.as... [3] http://tirania.org/blog/archive/2009/Jul-06.html
Sincerely, Christian Thiele
OK. I don't want to bring my deep faith in GNU into this post. ;) But Microsoft has never been kind to us free-software users. There is a very nice article about this "promise" at http://www.fsf.org/news/2009-07-mscp-mono.
2009/7/25 Christian Thiele apper@apper.de:
Danny B. wrote:
I'm looking for any kind of tool which would take the XML dump (most probably the pages-meta-current.xml.bz2, at least the pages-articles.xml.bz2) and would return the list of page titles (or alternatively/configurably page ids) of pages containing given string.
I have a program to do this, dated three years ago; many of us have probably rewritten that wheel. I'll send it to you privately. There's also http://meta.wikimedia.org/wiki/User:Micke/WikiFind. Not the ultimate solution, but good enough. I have some changes to unify both versions, if you're interested.
Danny B. wrote:
I'm looking for any kind of tool which would take the XML dump (most probably the pages-meta-current.xml.bz2, at least the pages-articles.xml.bz2) and would return the list of page titles (or alternatively/configurably page ids) of pages containing given string.
Does anybody have such (kind of) tool and is willing to share? Both command line or webpage interface are OK.
If you're only interested in page titles, why not just download all-titles-in-ns0.gz and grep it?
Alternatively, if you want titles in other namespaces too, I have a small Perl script I once wrote that can extract such a list from the page.sql.gz dump; I can clean it up and put it online somewhere if you're interested.