Hi,
I'm doing a research project on Wikipedia, so I need the Wikipedia data. I decided to use the Wikipedia database dumps for this purpose, but there are so many files that I don't know which file populates which table. Could you please provide some information that maps each dump file to the exact DB table?
Your prompt response is much appreciated.
Regards, Imran
The .sql.gz files are named exactly after the DB table they contain; the list of tables is at https://www.mediawiki.org/wiki/Manual:Database_layout Example: http://dumps.wikimedia.org/fiwiki/20130410/fiwiki-20130410-iwlinks.sql.gz -> https://www.mediawiki.org/wiki/Manual:Iwlinks_table
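If it helps to script that mapping, here is a minimal Python sketch; the database name "wikidb" and the plain "mysql" invocation are assumptions, adjust them to your own setup:

    import gzip
    import re
    import shutil
    import subprocess

    def table_name(dump_filename):
        # e.g. "fiwiki-20130410-iwlinks.sql.gz" -> "iwlinks"
        m = re.match(r"\w+-\d{8}-(.+)\.sql\.gz$", dump_filename)
        if not m:
            raise ValueError("not a per-table SQL dump: " + dump_filename)
        return m.group(1)

    def import_sql_gz(path, database="wikidb"):
        # Decompress on the fly and feed the statements to the mysql client.
        proc = subprocess.Popen(["mysql", database], stdin=subprocess.PIPE)
        with gzip.open(path, "rb") as f:
            shutil.copyfileobj(f, proc.stdin)
        proc.stdin.close()
        proc.wait()

    print(table_name("fiwiki-20130410-iwlinks.sql.gz"))  # -> iwlinks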
Nemo
Thanks for replying; your reply makes sense. I just need to confirm: if I use the following dump, http://dumps.wikimedia.org/fiwiki/20130323/
and download all the SQL and XML files and populate my tables using some utility, is the whole Wikipedia data then configured? I mean to say: does this dump provide me the whole data of Wikipedia, including content, revision history, etc., or do I need something more?
Regards, Imran
Well, importDump.php is horrible, and we have a thread called "making imports suck less" started 20 minutes before yours. :) But Ariel made some nice pages: https://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing#Converting_to_SQL_first is the way to go, and https://meta.wikimedia.org/wiki/Data_dumps/Import_examples has a viable tutorial. In some cases you may prefer to skip some data you don't need, or even to populate some tables on your own... add your examples, as the page says. ;-)
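If you decide to process the XML dumps directly rather than importing everything into MySQL, a rough sketch using only the Python standard library streams pages without unpacking the whole archive (the file name below is just the fiwiki example from this thread, adjust it to your download):

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "fiwiki-20130323-pages-articles.xml.bz2"  # example name, adjust

    title = None
    with bz2.open(DUMP, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                # one <text> per revision; pages-articles has one revision per page
                print(title, len(elem.text or ""))
            elif tag == "page":
                elem.clear()  # keep memory bounded on large dumps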
Nemo
Uh oh, I gotta update those now that the new tools went out.
Thanks for reminding me :-D
Ariel
Thanks all, but my question is still there :) Let me rephrase: suppose we have the following dump list, http://dumps.wikimedia.org/enwiki/20130403/
There are many files there with names like enwiki-20130403-pages-meta-current27.xml-p029625001p039009132.bz2 for the same data, as well as different SQL files for different tables. Q1: The SQL files clearly tell the mapping to a database table, but the ".bz2" files don't say anything about a database mapping. Q2: There are multiple ".bz2" files with the same name; should we take the largest file?
Please let me know about that.
Regards, Imran
On 15/04/13 15:03, Imran Latif wrote:
And download all the SQL and XML files and populate my tables using some utility, is the whole Wikipedia data then configured? I mean to say: does this dump provide me the whole data of Wikipedia, including content, revision history, etc., or do I need something more?
Yes. Installation of MediaWiki and its extensions to match what is installed at Wikipedia is a separate matter, of course. What you won't get is:
- Deleted content
- User information (user list, preferences, watchlists, passwords, IPs...)
Thanks all, but my question is still there :) Let me rephrase: suppose we have the following dump list, http://dumps.wikimedia.org/enwiki/20130403/
There are many files there with names like enwiki-20130403-pages-meta-current27.xml-p029625001p039009132.bz2 for the same data, as well as different SQL files for different tables. Q1: The SQL files clearly tell the mapping to a database table, but the ".bz2" files don't say anything about a database mapping.
.bz2 just means the file is compressed with bzip2; look at the extension before it to see what the file contains.
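For instance, you can peek inside one of these files with Python's bz2 module and see that the content is the XML page dump, not SQL (the file name is the one from your message):

    import bz2

    path = "enwiki-20130403-pages-meta-current27.xml-p029625001p039009132.bz2"
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for _ in range(5):
            print(f.readline().rstrip())  # first lines of the <mediawiki ...> XML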
Q2: There are multiple ".bz2" files with the same name; should we take the largest file?
No, there are no multiple .bz2 files with the same name. You have, for instance, enwiki-20130403-pages-meta-history27.xml-p038204154p039009132.bz2 and enwiki-20130403-pages-meta-history27.xml-p038204154p039009132.7z, which are the same content compressed in two different ways (bzip2 and 7-Zip), whereas a file whose name has a different page range (the p...p part of the name) is a different piece of the content history. In summary, you need every piece, in one compression format or the other.
On the other hand, there is some redundancy between the file sets: pages-meta-history contains everything in pages-meta-current, which itself contains everything in pages-articles. And the stub-* dumps contain less than the corresponding full versions. You are unlikely to need the abstract.xml files, etc.
In fact, you probably don't even need the meta files; pages-articles is probably enough for you.
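To make the naming scheme concrete, here is a small sketch that keeps exactly one compressed copy of every piece, preferring .bz2 over .7z (an arbitrary choice for illustration; both hold the same XML). The file names are the ones mentioned earlier in the thread:

    import os
    from collections import defaultdict

    PREFERENCE = [".bz2", ".7z"]

    def pick_one_per_piece(filenames):
        pieces = defaultdict(list)
        for name in filenames:
            base, compression = os.path.splitext(name)  # base identifies the piece
            pieces[base].append(compression)
        chosen = []
        for base, compressions in pieces.items():
            for ext in PREFERENCE:
                if ext in compressions:
                    chosen.append(base + ext)
                    break
        return sorted(chosen)

    files = [
        "enwiki-20130403-pages-meta-current27.xml-p029625001p039009132.bz2",
        "enwiki-20130403-pages-meta-history27.xml-p038204154p039009132.bz2",
        "enwiki-20130403-pages-meta-history27.xml-p038204154p039009132.7z",
    ]
    print(pick_one_per_piece(files))
    # -> the -current27 piece once and the -history27 piece once, both as .bz2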