On 05/04/13 23:21, Keith Schacht wrote:
Hi, I've downloaded the latest set of Wikimedia dumps. I'm trying to understand where to find images within these dumps. I've studied the database schema and it seems to make sense, but then I take a single example such as:
http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG
And I grep the dumps 'image', 'imagelinks', and 'page' looking for 'Carrizo_2a.JPG' and it's not found. I tried this on both the SQL and XML dumps.
Are these dumps not complete? Am I misunderstanding the structure?
Thanks in advance, Keith
Carrizo_2a.JPG will be linked in http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-categorylinks.sql... You will also need http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-page.sql.gz to figure out the name of the page that includes it, as the former only gives you the page_id.
Carrizo_2a.JPG appears in line 155.
$ grep -n --color Carrizo_2a.JPG enwiki-20130403-imagelinks.sql
... (1673384,'Aerial-SanAndreas-CarrizoPlain.jpg'),(1673384,'Aerial-SodaLakePond.jpg'),(1673384,'Carrizo_2a.JPG'),(1673384,'Carrizo_soda_lake_rd_from_south.jpg'), ...
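If you would rather get the page_id values out directly than read them off the grep output, something along these lines might help. It is only a rough sketch: it assumes each row looks like (page_id,'File_name') as in the output above, the names ROW and pages_linking_to are made up for illustration, and the sample line (including the INSERT INTO prefix) is hand-built from the grep output rather than copied from the dump.

```python
import re

# Matches the start of one row tuple: (page_id,'File_name'
# Assumes the first field is numeric and the second is a quoted string,
# with quotes inside the name escaped by a backslash.
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")


def pages_linking_to(lines, file_name):
    """Yield the page_id of every row whose second field equals file_name."""
    for line in lines:
        # Cheap substring check first, since dump lines are very long.
        if file_name not in line:
            continue
        for page_id, target in ROW.findall(line):
            # Note: target is compared as written in the dump (not unescaped).
            if target == file_name:
                yield int(page_id)


# Hand-built sample line based on the grep output above (an assumption,
# not a verbatim line from the dump).
sample = [
    "INSERT INTO `imagelinks` VALUES "
    "(1673384,'Aerial-SodaLakePond.jpg'),"
    "(1673384,'Carrizo_2a.JPG'),"
    "(1673384,'Carrizo_soda_lake_rd_from_south.jpg');"
]

print(list(pages_linking_to(sample, "Carrizo_2a.JPG")))  # prints [1673384]
```

To run it against the real file, pass an open file handle instead of the sample list, e.g. open("enwiki-20130403-imagelinks.sql", encoding="utf-8", errors="replace"). The resulting page_id values can then be looked up in the page dump to get the page names.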