Hi all,
We have a beta version of the code that reads the XML dump and extracts the article names with their associated images. It is in the "WikiXMLArticleIndexer" folder of the files section of the wikishare Yahoo group, and is also uploaded to: http://nekrom.com/red79/WikiXMLArticleIndexer.zip
It uses a zip reader library (http://www.icsharpcode.net/opensource/sharpziplib/) so that it can stream the data from the compressed file without having to decompress it first. I have tested it on these two files so far:
"enwiki-20100622-pages-articles.xml.bz2"
"simplewiki-20100902-pages-articles.xml.bz2"
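For anyone who wants to try the same idea without the C# code, here is a rough Python sketch of streaming a .bz2 dump and collecting image links per article. It is not a port of WikiXMLArticleIndexer: the function names, the regular expression, and the choice to match only [[Image:...]] and [[File:...]] links are my own assumptions, and the real tool may extract images differently.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Assumption: images are referenced in wikitext as [[Image:Name|...]] or [[File:Name|...]].
IMAGE_LINK = re.compile(r"\[\[(?:Image|File):([^|\]]+)", re.IGNORECASE)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_article_images(stream):
    """Yield (title, [image names]) for each <page>, reading the XML incrementally."""
    title, images = None, []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            found = IMAGE_LINK.findall(elem.text or "")
            # Underscores replace spaces, as in the file names on the upload server.
            images = [f.strip().replace(" ", "_") for f in found]
        elif name == "page":
            yield title, images
            title, images = None, []
            elem.clear()  # release the page's text once it has been handled
```

Usage would be along the lines of opening the dump with bz2.open("simplewiki-20100902-pages-articles.xml.bz2", "rb") and looping over iter_article_images(f); bz2.open decompresses as it reads, so the dump never has to be unpacked on disk.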
The output file has one article per line: the article name, followed by the images in that article (including the full download URL), each in quotation marks.
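Reading that output back in should be simple. A minimal sketch, assuming the format is exactly as described (article name first, then each image in double quotes) and that article names never contain a double quote themselves; parse_index_line is a hypothetical helper, not part of the tool:

```python
import re

# Everything between a pair of double quotes counts as one image entry.
QUOTED = re.compile(r'"([^"]*)"')


def parse_index_line(line):
    """Split one output line into (article_name, [image entries])."""
    line = line.rstrip("\n")
    first_quote = line.find('"')
    if first_quote == -1:
        # An article with no images: the whole line is the name.
        return line.strip(), []
    name = line[:first_quote].strip()
    return name, QUOTED.findall(line[first_quote:])
```

Splitting at the first quote rather than at the first space keeps names such as "Abraham Lincoln" or "Animalia (book)" intact.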
One cool thing we came across was how the image download URLs are built. For "2/28/Bakuninfull.jpg", the "2/28" folder is derived from the file name "Bakuninfull.jpg" using an MD5 hash: the first hex digit of the hash, then the first two hex digits (neato!)
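The folder scheme can be sketched in a few lines of Python. This assumes the hash is taken over the UTF-8 file name with spaces replaced by underscores; hashed_path and image_url are illustrative names, and the sketch does not handle other normalisation such as capitalising the first letter of the name.

```python
import hashlib

# Commons upload prefix, as it appears in the image URLs quoted in this message.
UPLOAD_BASE = "http://upload.wikimedia.org/wikipedia/commons"


def hashed_path(file_name):
    """Return 'x/xy/Name', where x and xy are the leading hex digits of the MD5 of Name."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"


def image_url(file_name):
    """Full download URL for a file hosted under UPLOAD_BASE."""
    return f"{UPLOAD_BASE}/{hashed_path(file_name)}"
```

If the scheme is as described, hashed_path("Bakuninfull.jpg") should reproduce the "2/28/Bakuninfull.jpg" path from the example.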
The full URL of an image looks like this: "http://upload.wikimedia.org/wikipedia/commons/2/28/Bakuninfull.jpg"
When we write the download script, we can add the desired thumbnail scaling to the image URL, e.g. like this:
"http://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Bakuninfull.jpg/220..."
Only bad news is it's in C# :P
cheers, Jamie
Here are the top few lines of output from running it on enwiki-20100622-pages-articles.xml.bz2:
Anarchism "2/28/Bakuninfull.jpg" "3/36/German_anti-communist_poster_1918.jpg" "8/84/Members_of_the_Maquis_in_La_Tresorerie.jpg" "0/0f/ParcGuellOkupas.jpg" "b/b5/Max_stirner.jpg" "a/a6/WilliamGodwin.jpg" "e/ea/Portrait_of_Pierre_Joseph_Proudhon_1865.jpg" "0/00/Kropotkin2.jpg" "e/ed/Jarach_and_Zerzan.JPG" "2/23/Emilearmand01.jpg" "b/b8/Fransisco_Ferrer_Guardia.jpg" "7/7d/Gadewar.jpg"
Autism "0/0d/Autistic-sweetiepie-boy-with-ducksinarow.jpg" "8/83/Autismbrain.jpg" "7/72/Opening_a_window_to_the_autistic_brain.jpg"
Albedo "b/ba/water_reflectivity.jpg"
Alabama "b/bb/Alabama.JPG" "4/48/AlabamaWelcome.JPG" "8/87/Map_of_Alabama_terrain_NA.jpg" "6/6e/Birmingham_panorama.jpg" "3/39/Downtown_Mobile_2008_01.jpg" "b/b2/100_1830.JPG" "b/be/Montgomery_Alabama_panorama.jpg" "e/e1/Alabama_winter_2008.jpg" "c/cd/Alabama_quarter,_reverse_side,_2003.jpg" "7/7f/Mobile_Alabama_harbor_aerial_view.jpg" "b/b1/Alabama_state_capitol,_Montgomery.jpg" "1/19/Bob_Riley_greeting_soldiers_in_Birmingham,_19_Jan,_2004.jpg" "d/d4/Harrison-plaza2.jpg"
Achilles "c/cf/Leon_Benouville_The_Wrath_of_Achilles.jpg" "1/11/The_Education_of_Achilles,_by_James_Barry.jpg" "d/dd/AmbrosianIliadPict47Achilles.jpg" "5/58/Triumph_of_Achilles_in_Corfu_Achilleion.jpg" "1/1c/Achilles_thniskon_in_Corfu.jpg" "a/a0/Aias_body_Akhilleus_Staatliche_Antikensammlungen_1884.jpg" "c/c4/Achilles_in_Corfu.JPG" "0/01/Wenceslas_Hollar_-_Briseis_and_Achilles.jpg"
Abraham Lincoln "3/38/Abe-Lincoln-Birthplace-2.jpg" "f/f6/A&TLincoln.jpg" "4/4f/Abe_Lincoln_young.jpg" "4/4b/Young_Lincoln-1c.jpg" "2/27/Abraham_Lincoln_by_Alexander_Helser,_1860-crop.jpg" "c/cf/Lincoln_Douglas_Debates_1958_issue-4c.jpg" "1/13/The_Rail_Candidate.jpg" "4/41/Lincoln_1896_issue-4c.jpg" "6/60/Abraham_lincoln_inauguration_1861.jpg" "6/64/RunningtheMachine-LincAdmin.jpg" "6/67/PinkertonLincolnMcClernand.jpg" "b/bb/Lincoln_second.jpg" "5/52/Abraham_Lincoln_1866_Issue-15c.jpg" "a/a2/Al16.jpg" "a/ab/TheApotheosisLincolnAndWashington1860s.jpg" "8/84/Abraham_Lincoln_Airmail_1960_Issue-25c.jpg"
Aristotle "e/e7/Arabic_aristotle.jpg" "a/ae/Aristotle_in_Nuremberg_Chronicle.jpg" "9/98/Sanzio_01_Plato_Aristotle.jpg" "7/77/Uni_Freiburg_-_Philosophen_4.jpg" "3/33/Octopus3.jpg" "1/13/Torpedo_fuscomaculata2.jpg" "c/cd/Triakis_semifasciata.jpg" "6/63/161Theophrastus_161_frontespizio.jpg" "a/a4/Aristoteles_Louvre.jpg"
Academy Award "d/d6/31st_Acad_Awards.jpg" "d/d9/81st_Academy_Awards_Ceremony.JPG"
Animalia (book) "f/f2/Animalia.jpg"
Altruism "a/a9/Belisaire_demandant_l'aumone_Jacques-Louis_David.jpg"
On Tue, Sep 14, 2010 at 1:04 AM, Jamie Morken jmorken@shaw.ca wrote:
Hi all,
We have a beta version of the code for reading the XML dump and extracting the article names with their associated images.
It is easier to download the imagelinks.sql and page.sql dumps. Imagelinks already contains a mapping of the images used on each page, and page can be used to map page_id to page_namespace and page_title.
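As a rough illustration of that join, here is a Python sketch. It assumes the rows have already been parsed out of the two SQL dumps into tuples (the parsing itself is not shown), that imagelinks rows are (il_from, il_to) pairs where il_from is the page_id of the page using the image, and that only namespace 0 (articles) is wanted; images_by_title is a hypothetical helper.

```python
from collections import defaultdict


def images_by_title(page_rows, imagelink_rows, namespace=0):
    """Map page_title -> list of image names for pages in the given namespace.

    page_rows:      iterable of (page_id, page_namespace, page_title)
    imagelink_rows: iterable of (il_from, il_to)
    """
    titles = {
        page_id: title
        for page_id, ns, title in page_rows
        if ns == namespace
    }
    result = defaultdict(list)
    for il_from, il_to in imagelink_rows:
        title = titles.get(il_from)
        if title is not None:  # skip links from other namespaces or unknown pages
            result[title].append(il_to)
    return dict(result)
```

For the full enwiki dump the page_id lookup would need to fit in memory or be replaced by a database join, which is probably the simpler route once the dumps are imported.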
Bryan
wikitech-l@lists.wikimedia.org