Hi there,
This is Nikhil, an undergraduate student from India. I'm trying to understand the Wikipedia data dumps provided by Wikimedia.
I'm working with the 20180520 dumps. They contain many sections, each with different data, and I would like to know what each section's data represents. Although each section is described briefly, the descriptions are not clear to me.
For example, take the section "All pages, current versions only." Is the current version of each and every article present in this data? I ask because I just downloaded "enwiki-20180520-pages-meta-current1.xml-p10p30303.bz2", and the first page's information is for "AccessibleComputing", but it does not seem to contain the complete article.
Hoping to get a quick reply.
Thanks. Nikhil
(Fwding to Xmldatadumps-l@ list - https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l , CCing wikitech-ambassadors@ just for resolution here)
On Mon, Jun 4, 2018 at 10:55 PM, Nikhil Prakash nikhil07prakash@gmail.com wrote:
> [...]
Wikitech-ambassadors mailing list Wikitech-ambassadors@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
It should have the wikitext of the most recent version of all pages.
Note, however, that the dump is split into multiple parts. You only downloaded part 1 (page id 10 to page id 30303; not all page id numbers are used, which is why it starts at 10). Other pages are in other parts. Additionally, wikitext is not HTML but a custom markup syntax, which is later rendered into HTML by the MediaWiki parser. This markup may include other pages (via {{pagename}}).
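To check which pages a given part actually contains, one option is to stream the compressed file and list each page's id, title, and wikitext. The following is only a minimal sketch using the Python standard library. The element names (page, title, id, revision, text) are assumptions about the dump's XML layout that were not spelled out in this thread, and iter_pages and local_name are hypothetical helper names.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (page_id, title, wikitext) for each <page> element in the stream.

    Assumes each <page> has direct <id> and <title> children and a
    <revision> child holding the <text> element (assumed layout).
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id = title = text = None
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                title = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        text = sub.text or ""
        yield page_id, title, text
        elem.clear()  # drop this page's children so memory use stays low


def iter_dump_file(path):
    """Open one .bz2 part and stream its pages without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Calling iter_dump_file("enwiki-20180520-pages-meta-current1.xml-p10p30303.bz2") would then walk only the pages in that part. The text yielded is raw wikitext, so a short body or a {{pagename}} inclusion is what the dump stores, not the rendered article.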
-- brian
On Tuesday, June 5, 2018, Nikhil Prakash nikhil07prakash@gmail.com wrote:
> [...]