Hi,
I am using the following Perl modules to extract data from Wikipedia and Wikitravel respectively:
- WWW::Wikipedia
- MediaWiki::API
From both of these modules, and also by looking at the MediaWiki API itself, I seem to get the entire page text in the web service response. To extract individual sections of a wiki entry, I have to rely on pattern matching and regular expressions.
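For illustration, the regex approach described above might look roughly like this (Python rather than Perl, purely for brevity; the wikitext here is made up, and real pages have more heading variants, templates, and edge cases to handle):

```python
import re

# Hypothetical wikitext, standing in for the full page text the API returns.
wikitext = """Intro paragraph.

== History ==
Some history.

== Geography ==
Some geography.
"""

def split_sections(text):
    """Split raw wikitext into {heading: body} using == Heading == markers."""
    sections = {"_lead": []}
    current = "_lead"
    for line in text.splitlines():
        m = re.match(r"^==+\s*(.*?)\s*==+\s*$", line)
        if m:
            current = m.group(1)
            sections[current] = []
        else:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}

parts = split_sections(wikitext)
print(parts["History"])  # -> Some history.
```

This works for simple pages, but it is exactly the kind of hand-rolled parsing the question is trying to avoid.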
Is there a better way to achieve this? Is there some sample code in any language (preferably Perl) that anyone can share, or some tool which does this out of the box?
Any help would be appreciated.
Regards, Ashish
MediaWiki::DumpFile has some facilities for this, although they were very basic the last time I checked.
Its developer is active and responsive to bug reports, enhancement requests and patches.
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
Thanks, Amir.
Do the dumps provide data at a granular level for a wiki entry?
- Ashish
The XML dumps give the complete text of every page, in the same wikitext format that you see when you edit it. They also include metadata, such as the title, authors, timestamp and namespace.
The MediaWiki::DumpFile module also provides some functions that let you analyze page info even if it doesn't necessarily come from a dump, but these functions are relatively basic. See the module's documentation and check whether it has the particular thing that you need.
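To give a feel for the dump format: each page sits in a `<page>` element with `<title>`, `<ns>` and a `<revision>` containing the wikitext. A minimal streaming sketch in Python (the inline XML is a hand-made miniature; real dumps are huge and carry an XML namespace on the root element, which this sketch ignores):

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a real pages dump.
dump = io.BytesIO(b"""<mediawiki>
  <page>
    <title>Pittsburgh</title>
    <ns>0</ns>
    <revision>
      <timestamp>2012-03-01T00:00:00Z</timestamp>
      <text>'''Pittsburgh''' is a city ...</text>
    </revision>
  </page>
</mediawiki>""")

def iter_pages(stream):
    """Stream (title, wikitext) pairs from a dump without loading it all into memory."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text")
            yield title, text
            elem.clear()  # free memory for pages already processed

pages = list(iter_pages(dump))
print(pages[0][0])  # -> Pittsburgh
```

The streaming `iterparse` plus `elem.clear()` pattern matters in practice, since the English Wikipedia dump is far too large to parse as one tree.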
I used this module quite a lot; the biggest thing I did with it is here: http://perlwikibot.svn.sourceforge.net/viewvc/perlwikibot/trunk/no-interwiki...
I haven't maintained it in a long while, but it should still be functional, and you are welcome to recycle the functions and the regular expressions there.
If there's any particular kind of data that you need, let me know; maybe I already have code that can extract it.
-- Amir
MaxSem on IRC gave a solution that may help you.
Using the following call, you can get section titles, numbers and offsets from the beginning of the page: https://en.wikipedia.org/w/api.php?action=parse&page=Pittsburgh&prop...
Using the following call, you can get a section's text by its number: https://en.wikipedia.org/w/api.php?action=parse&page=Pittsburgh&prop...
You can tweak your calls using the API sandbox: https://en.wikipedia.org/wiki/Special:ApiSandbox
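A sketch of how those two calls might be built and consumed (Python; the `prop=sections` and `prop=wikitext` parameter names and the shape of the sample JSON below are from memory, so verify them in the API sandbox before relying on them):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def section_list_url(page):
    # action=parse with prop=sections returns the page's section index as JSON
    return API + "?" + urlencode({
        "action": "parse", "page": page,
        "prop": "sections", "format": "json",
    })

def section_text_url(page, number):
    # action=parse with prop=wikitext and a section number returns just that section
    return API + "?" + urlencode({
        "action": "parse", "page": page, "section": number,
        "prop": "wikitext", "format": "json",
    })

# A trimmed, hand-written example of what prop=sections returns, so the
# lookup below runs without a network round trip:
sample = {"parse": {"sections": [
    {"index": "1", "line": "History", "byteoffset": 1234},
    {"index": "2", "line": "Geography", "byteoffset": 5678},
]}}

def find_section_number(response, heading):
    """Map a section heading to its number for use in the second call."""
    for s in response["parse"]["sections"]:
        if s["line"] == heading:
            return s["index"]
    return None

print(find_section_number(sample, "Geography"))  # -> 2
```

Chaining the two calls (first look up the section number by heading, then fetch that section's wikitext) avoids the regex-over-the-whole-page approach entirely.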
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore