On 05/04/13 23:21, Keith Schacht wrote:
Hi, I've downloaded the latest set of Wikimedia dumps. I'm trying to
understand where to find images within these dumps. I've studied the
database schema and it seems to make sense, but then I take a single
example such as:
http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG
And I grep the dumps 'image', 'imagelinks', and 'page' looking for
'Carrizo_2a.JPG' and it's not found. I tried this on both the SQL and
XML dumps.
Are these dumps not complete? Am I misunderstanding the structure?
Thanks in advance,
Keith
Carrizo_2a.JPG will be linked in
http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-categorylinks.sq…
You will need
http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-page.sql.gz
for figuring out the name of the page including it (as the links dump
will only give you the page_id).
Carrizo_2a.JPG appears in line 155.
$ grep -n --color Carrizo_2a.JPG enwiki-20130403-imagelinks.sql
...
(1673384,'Aerial-SanAndreas-CarrizoPlain.jpg'),(1673384,'Aerial-SodaLakePond.jpg'),(1673384,'Carrizo_2a.JPG'),
(1673384,'Carrizo_soda_lake_rd_from_south.jpg'), ...
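
The two-step lookup described above (find the page_id in the links dump, then resolve it in the page dump) could be sketched roughly as follows. The function names linking_page_ids and page_title are made up for illustration. The sketch assumes the links rows look like (page_id,'File_name') as in the grep output above, and that rows in the page dump start with (page_id,namespace,'title',...), which is worth checking against the CREATE TABLE statement at the top of each file.

```shell
# Hypothetical helpers; both read uncompressed SQL dump text on stdin.

# Print the page_id of every row that links to the file named in $1.
# Assumes rows of the form (1673384,'Carrizo_2a.JPG').
# Only dots in the name are escaped; other regex metacharacters are not handled.
linking_page_ids() {
    pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
    grep -o "([0-9]*,'${pattern}')" | sed 's/^(\([0-9]*\),.*/\1/' | sort -u
}

# Print the title stored for page_id $1.
# Assumes rows start with (page_id,namespace,'title',...).
# Titles containing an escaped quote (\') will come out truncated.
page_title() {
    grep -o "(${1},[0-9]*,'[^']*'" | sed "s/^([0-9]*,[0-9]*,'\(.*\)'\$/\1/"
}

# Usage, with the dumps already gunzipped:
#   linking_page_ids Carrizo_2a.JPG < enwiki-20130403-imagelinks.sql
#   page_title 1673384 < enwiki-20130403-page.sql
```

The page dump is large, so if you have many ids to resolve it is probably better to load both tables into MySQL and join them than to grep once per id.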