WikiXMLArticleIndexer - Wikitech-l

14 Sep 2010


Hi all,
We have a beta version of the code for reading the XML dump and 
extracting the article names with their associated images.  It is in 
the "WikiXMLArticleIndexer" folder of the files section of the
wikishare Yahoo group.  It is also uploaded to:
http://nekrom.com/red79/WikiXMLArticleIndexer.zip
It uses the SharpZipLib library ("http://www.icsharpcode.net/opensource/sharpziplib/")
so that it can stream the data from the compressed file without having
to decompress it first.  I have tested it on these two files so far:
"enwiki-20100622-pages-articles.xml.bz2" 
"simplewiki-20100902-pages-articles.xml.bz2"
The output file has one article per line: the article name, followed by
each image in that article (including the full download URL) in 
quotation marks.
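
A rough Python sketch of the same streaming idea, for illustration only. This is not the C# code from the zip; it assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing `<title>` and `<text>`), and the `IMAGE_LINK` pattern and the function names are made up here. It only catches `[[File:...]]` / `[[Image:...]]` links, so images given as bare infobox parameters would be missed.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: captures the file name in [[File:...]] / [[Image:...]] links.
IMAGE_LINK = re.compile(r"\[\[(?:File|Image):([^|\]]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, [image names]) for each <page> read from a dump stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            images = [m.strip() for m in IMAGE_LINK.findall(text)]
            yield title, images
            title, text = None, ""
            elem.clear()  # drop the page's children to keep memory use down


def iter_dump(path):
    """Read a .xml.bz2 dump incrementally, without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)
```

Something like `for title, images in iter_dump("simplewiki-20100902-pages-articles.xml.bz2"): ...` would then walk the dump one page at a time.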
One cool thing we came across was the image download URLs.
For "2/28/Bakuninfull.jpg", the "2/28" folder is derived from the
file name "Bakuninfull.jpg" using an MD5 hash: the first hex digit of
the hash, then the first two hex digits (neato!)
The full URL of an image looks like this:
"http://upload.wikimedia.org/wikipedia/commons/2/28/Bakuninfull.jpg"
and when we make the download script we can add the desired thumbnail
scaling to the image URL, like this:
"http://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Bakuninfull.jpg/220..."
Only bad news is it's in C# :P
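
The hashed folder scheme mentioned above can be sketched in Python like this. It is only a sketch under a few assumptions: the file name is hashed as UTF-8 with spaces replaced by underscores, the name is already capitalised the way the wiki stores it, and the function names are invented for illustration. The base URL is the one quoted in this post.

```python
import hashlib
from urllib.parse import quote

# Base URL as quoted in the post.
COMMONS_BASE = "http://upload.wikimedia.org/wikipedia/commons"


def hashed_path(file_name):
    """Return 'x/xy/File_name.ext', where x and xy come from the MD5 of the name."""
    name = file_name.strip().replace(" ", "_")  # assumption: underscores, not spaces
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"


def image_url(file_name, base=COMMONS_BASE):
    """Build the full download URL for a file name."""
    folder, sub, name = hashed_path(file_name).split("/", 2)
    return f"{base}/{folder}/{sub}/{quote(name)}"
```

The sample output in this post gives an easy way to check it: "Bakuninfull.jpg" is listed there under "2/28".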
cheers,
Jamie
Here are the top few lines of output from running on
enwiki-20100622-pages-articles.xml.bz2
Anarchism "2/28/Bakuninfull.jpg" "3/36/German_anti-communist_poster_1918.jpg" "8/84/Members_of_the_Maquis_in_La_Tresorerie.jpg" "0/0f/ParcGuellOkupas.jpg" "b/b5/Max_stirner.jpg" "a/a6/WilliamGodwin.jpg" "e/ea/Portrait_of_Pierre_Joseph_Proudhon_1865.jpg" "0/00/Kropotkin2.jpg" "e/ed/Jarach_and_Zerzan.JPG" "2/23/Emilearmand01.jpg" "b/b8/Fransisco_Ferrer_Guardia.jpg" "7/7d/Gadewar.jpg"
Autism "0/0d/Autistic-sweetiepie-boy-with-ducksinarow.jpg" "8/83/Autismbrain.jpg" "7/72/Opening_a_window_to_the_autistic_brain.jpg"
Albedo "b/ba/water_reflectivity.jpg"
Alabama "b/bb/Alabama.JPG" "4/48/AlabamaWelcome.JPG" "8/87/Map_of_Alabama_terrain_NA.jpg" "6/6e/Birmingham_panorama.jpg" "3/39/Downtown_Mobile_2008_01.jpg" "b/b2/100_1830.JPG" "b/be/Montgomery_Alabama_panorama.jpg" "e/e1/Alabama_winter_2008.jpg" "c/cd/Alabama_quarter,_reverse_side,_2003.jpg" "7/7f/Mobile_Alabama_harbor_aerial_view.jpg" "b/b1/Alabama_state_capitol,_Montgomery.jpg" "1/19/Bob_Riley_greeting_soldiers_in_Birmingham,_19_Jan,_2004.jpg" "d/d4/Harrison-plaza2.jpg"
Achilles "c/cf/Leon_Benouville_The_Wrath_of_Achilles.jpg" "1/11/The_Education_of_Achilles,_by_James_Barry.jpg" "d/dd/AmbrosianIliadPict47Achilles.jpg" "5/58/Triumph_of_Achilles_in_Corfu_Achilleion.jpg" "1/1c/Achilles_thniskon_in_Corfu.jpg" "a/a0/Aias_body_Akhilleus_Staatliche_Antikensammlungen_1884.jpg" "c/c4/Achilles_in_Corfu.JPG" "0/01/Wenceslas_Hollar_-_Briseis_and_Achilles.jpg"
Abraham Lincoln "3/38/Abe-Lincoln-Birthplace-2.jpg" "f/f6/A&amp;TLincoln.jpg" "4/4f/Abe_Lincoln_young.jpg" "4/4b/Young_Lincoln-1c.jpg" "2/27/Abraham_Lincoln_by_Alexander_Helser,_1860-crop.jpg" "c/cf/Lincoln_Douglas_Debates_1958_issue-4c.jpg" "1/13/The_Rail_Candidate.jpg" "4/41/Lincoln_1896_issue-4c.jpg" "6/60/Abraham_lincoln_inauguration_1861.jpg" "6/64/RunningtheMachine-LincAdmin.jpg" "6/67/PinkertonLincolnMcClernand.jpg" "b/bb/Lincoln_second.jpg" "5/52/Abraham_Lincoln_1866_Issue-15c.jpg" "a/a2/Al16.jpg" "a/ab/TheApotheosisLincolnAndWashington1860s.jpg" "8/84/Abraham_Lincoln_Airmail_1960_Issue-25c.jpg"
Aristotle "e/e7/Arabic_aristotle.jpg" "a/ae/Aristotle_in_Nuremberg_Chronicle.jpg" "9/98/Sanzio_01_Plato_Aristotle.jpg" "7/77/Uni_Freiburg_-_Philosophen_4.jpg" "3/33/Octopus3.jpg" "1/13/Torpedo_fuscomaculata2.jpg" "c/cd/Triakis_semifasciata.jpg" "6/63/161Theophrastus_161_frontespizio.jpg" "a/a4/Aristoteles_Louvre.jpg"
Academy Award "d/d6/31st_Acad_Awards.jpg" "d/d9/81st_Academy_Awards_Ceremony.JPG"
Animalia (book) "f/f2/Animalia.jpg"
Altruism "a/a9/Belisaire_demandant_l'aumone_Jacques-Louis_David.jpg"
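
For anyone wanting to consume this output, here is a minimal Python sketch of reading one line back into an article name and its image paths. It assumes the format shown above (name first, then each image in double quotes) and that article names themselves contain no double quotes; `parse_line` is a made-up helper, not part of the tool.

```python
import re

# Each image entry is a double-quoted string.
QUOTED = re.compile(r'"([^"]*)"')


def parse_line(line):
    """Split one output line into (article name, [image paths])."""
    name, sep, rest = line.rstrip("\n").partition(' "')
    # partition() consumed the first opening quote, so put it back before matching.
    images = QUOTED.findall('"' + rest) if sep else []
    return name.strip(), images
```

For example, `parse_line('Albedo "b/ba/water_reflectivity.jpg"')` gives `("Albedo", ["b/ba/water_reflectivity.jpg"])`.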