Hi folks,
Any special:export experts out there?
I'm trying to download the complete revision history for just a few pages. The options, as I see it, are using the API or special:export. The API returns XML that is formatted differently from special:export, and I already have a set of parsers that work with special:export data, so I'm inclined to go with the latter.
The problem I'm running into: when I try to use POST so that I can iteratively grab revisions in increments of 1000, the request is denied (I get a "WMF servers down" error). If I use GET, it works, but then I can't use the parameters that let me iterate through all the revisions.
Code pasted below. Any suggestions as to why the server won't accept POST?
Better yet, does anyone already have a working script/tool handy that grabs all the revisions of a page? :)
Thanks, all! (Excuse the cross posting, I usually hang out on research, but thought perhaps folks on the developers list would have insight.) Andrea
class Wikipedia {

    public function __construct() {
    }

    public function searchResults( $pageTitle = null, $initialRevision = null ) {
        $url = "http://en.wikipedia.org/w/index.php?title=Special:Export&pages=" . $pageTitle . "&offset=1&limit=1000&action=submit";

        $curl = curl_init();
        curl_setopt( $curl, CURLOPT_URL, $url );
        curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt( $curl, CURLOPT_POST, true );
        curl_setopt( $curl, CURLOPT_USERAGENT, "Page Revisions Retrieval Script - Andrea Forte - aforte@drexel.edu" );

        $result = curl_exec( $curl );
        curl_close( $curl );

        return $result;
    }
}
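One thing I haven't been able to rule out: with the setup above, curl sends a POST with an empty body, because all of the export parameters are sitting in the query string. A variant that moves them into CURLOPT_POSTFIELDS instead (just a sketch, I haven't confirmed this is what Special:Export wants) would look like:

$postFields = http_build_query( array(
    'pages'  => $pageTitle,
    'offset' => 1,
    'limit'  => 1000,
    'action' => 'submit',
) );

$curl = curl_init();
curl_setopt( $curl, CURLOPT_URL, "http://en.wikipedia.org/w/index.php?title=Special:Export" );
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt( $curl, CURLOPT_POST, true );
// Parameters go in the request body rather than the URL.
curl_setopt( $curl, CURLOPT_POSTFIELDS, $postFields );
curl_setopt( $curl, CURLOPT_USERAGENT, "Page Revisions Retrieval Script - Andrea Forte - aforte@drexel.edu" );
$result = curl_exec( $curl );
curl_close( $curl );

If that's on the right track, later batches would presumably replace offset=1 with the timestamp of the last revision returned by the previous request, since offset seems to take a timestamp.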
Andrea Forte, 14/07/2011 18:58:
Better yet, does anyone already have a working script/tool handy that grabs all the revisions of a page? :)
There's https://code.google.com/p/wikiteam/
Its purpose is to download whole wikis, but you can always edit the titles list.
Nemo
Il 14/07/2011 18:58, Andrea Forte ha scritto:
I'm trying to download the complete revision history for just a few pages.
I wrote a script that does exactly that, but with api.php: https://github.com/volpino/wiki-network/blob/master/download_page.py
It's part of wiki-network, so you have to download the whole package because of the dependencies (unless you modify it slightly). It's Python and it's really simple. Note: we needed the diff between revisions and the stripping of templates and similar stuff, so the script does that too; you should edit the lines around line 50.
Thanks! This is far more elegant than my PHP script that calls the API, but since it uses the same parameters I assume it's going to return the same XML format.
It just seems so strange that the API and special:export (which is also the format of the full XML dumps) return two different XML structures. I figured there was some obvious fix I was missing. :)
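In case it's useful for comparison, a stripped-down sketch of the api.php approach (not my actual script, and the continuation handling is simplified) looks something like this:

// Fetch one batch of revisions for a page from api.php as XML.
// $rvstartid is the revision id to continue from; the value to feed back in
// comes out of the <query-continue> element of the previous response
// (rvstartid or rvcontinue, depending on the MediaWiki version).
function fetchRevisionBatch( $pageTitle, $rvstartid = null ) {
    $params = array(
        'action'  => 'query',
        'prop'    => 'revisions',
        'titles'  => $pageTitle,
        'rvprop'  => 'ids|timestamp|user|comment|content',
        'rvdir'   => 'newer',
        'rvlimit' => 50,   // the API caps the limit lower when content is requested
        'format'  => 'xml',
    );
    if ( $rvstartid !== null ) {
        $params['rvstartid'] = $rvstartid;
    }

    $curl = curl_init();
    curl_setopt( $curl, CURLOPT_URL, 'http://en.wikipedia.org/w/api.php?' . http_build_query( $params ) );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $curl, CURLOPT_USERAGENT, 'Page Revisions Retrieval Script - Andrea Forte - aforte@drexel.edu' );
    $result = curl_exec( $curl );
    curl_close( $curl );

    return $result;
}

The mismatch I mean is that this comes back wrapped in <api><query><pages><page><revisions><rev> elements, whereas special:export (and the dumps) wrap everything in <mediawiki><page><revision>, so my export parsers don't apply as-is.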
-Andrea