FYI, a lot of the database tables are archived on http://dumps.wikimedia.org -- see http://dumps.wikimedia.org/enwiki/20111007/ for the latest dump. The user table is private, but it doesn't seem like you need that. You're looking for what people have publicly posted on their own user pages, which MediaWiki understands as a page in a specific namespace, barely connected to a user at all from a database standpoint.
So if you're looking for category members (like the Babel template you linked), they can be found in enwiki-DATE-categorylinks.sql.gz. Import that into MySQL -- it is about 10 GB uncompressed, with the indexes making up another 25 GB. However, that will only give you the page_ids of the user pages containing the template. You also have to download and import the page table (also public and archived) and join to it in MySQL if you want to get the usernames of everyone who has put themselves in those categories. The page table is much more manageable -- uncompressed, it is about 3 GB of data and 2.5 GB of indexes.
Stuart
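For what it's worth, here is a minimal sketch of the categorylinks-to-page join described above. It uses an in-memory SQLite database as a stand-in for the imported MySQL tables, so the sample rows and the helper name `users_in_category` are assumptions for illustration only. The column names (cl_from, cl_to, page_id, page_namespace, page_title) and the User namespace number (2) follow the MediaWiki schema; the real tables have more columns than shown here.

```python
import sqlite3

# Stand-in for the imported MySQL tables: an in-memory SQLite database with
# only the columns the join needs. The sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id        INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title     TEXT NOT NULL
    );
    CREATE TABLE categorylinks (
        cl_from INTEGER NOT NULL,   -- page_id of the page in the category
        cl_to   TEXT NOT NULL       -- category title, without "Category:"
    );
    INSERT INTO page VALUES
        (1, 2, 'ExampleUserA'),
        (2, 2, 'ExampleUserB'),
        (3, 0, 'Some_article');
    INSERT INTO categorylinks VALUES
        (1, 'User_zh-N'),
        (2, 'User_en-3'),
        (3, 'User_zh-N');
""")

USER_NAMESPACE = 2  # namespace number of "User:" pages in MediaWiki


def users_in_category(conn, category):
    """Return titles of User-namespace pages that belong to `category`."""
    rows = conn.execute(
        """
        SELECT p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_namespace = ?
        ORDER BY p.page_title
        """,
        (category, USER_NAMESPACE),
    )
    return [title for (title,) in rows]


zh_native = users_in_category(conn, "User_zh-N")
print(zh_native)
```

The namespace filter drops anything in the category that is not a user page. Note that user subpages also live in namespace 2, so titles containing a "/" would likely need extra handling to map back to a single username.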
--
R. Stuart Geiger
UC-Berkeley School of Information
User:Staeiou / @staeiou
On Mon, Oct 24, 2011 at 1:38 PM, Jon Davis wiki@konsoletek.com wrote:
That example you posted isn't a list of all users, just the ones who have added a "Babel" template to their user pages [1]. That data is stored in the database, in the category and categorylinks tables (possibly elsewhere, I can't remember offhand). I don't think they are sorted by anything more than the current row order in the database (so in the order of creation). The user pages will be included in the "All pages" dumps (as opposed to the "Articles, templates, image descriptions, and primary meta-pages" dumps).
As for your original sets of questions:
1. IIRC, no user data is included in any dumps. This is to protect user privacy.
2. No on all counts; the only related thing is the interface language. If you click "My Preferences" on any wiki, the options you see there are what is stored in the user table (more or less).
3. All edits are "modifications", technically. You'd have to programmatically figure out which ones are _just_ adding content.
4. Yes, that "tool" would be called MediaWiki, if you want the most accurate parser of MediaWiki markup [2]. There are some alternative parsers [3], but their output can be of variable quality.
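On working out which edits only add content: one possible approach, sketched here with Python's standard difflib, is to diff two revision texts word by word and treat an edit as a pure addition only if the diff contains insertions and nothing else. The function name `added_text` and the whitespace tokenization are assumptions for illustration; real wikitext would probably need smarter tokenization.

```python
import difflib


def added_text(old, new):
    """Compare two revision texts word by word.

    Returns (is_pure_addition, inserted_words). The edit counts as a pure
    addition only if words were inserted and nothing was replaced or deleted.
    """
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    inserted = []
    pure = True
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            inserted.extend(new_words[j1:j2])
        elif tag in ("replace", "delete"):
            pure = False
    return pure and bool(inserted), inserted


print(added_text("The cat sat on the mat.", "The black cat sat on the mat."))
print(added_text("The cat sat.", "The dog sat."))
```

The first call reports a pure addition of the word "black"; the second is a replacement, so it is not counted as an addition.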
-Jon

[1] http://meta.wikimedia.org/wiki/Meta:Babel_templates
[2] http://www.mediawiki.org/wiki/Markup_spec
[3] http://www.mediawiki.org/wiki/Alternative_parsers

On Mon, Oct 24, 2011 at 13:11, Rami Al-Rfou' rmyeid@gmail.com wrote:
Hi All,
So with more investigation I discovered that I can get a list of users by their skill in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
It seems that such a list is populated from a database. Does anyone know where I can find such a database? My other questions are about the partial dumps of Wikipedia. Are the dumps sorted by any field? How can I get all the user pages? Are they stored in a specific dump? Or are the dumps organized by page titles or categories?
Regards.
On Tue, Oct 18, 2011 at 15:29, Rami Al-Rfou' rmyeid@gmail.com wrote:
Hi,
I am planning to study differences in users' editing styles and their spelling errors in English Wikipedia as part of a research project I am involved in. So I downloaded some of the Wikipedia partial XML dumps and converted them to SQL. My understanding is that Wikipedia stores every copy of each page in the database.
1. I cannot see the users table! Is the users table stored in a special partial dump?
2. Does the user table contain any properties related to the user's country, preferred Wikipedias, or their skill in different languages?
3. I am interested in the user modifications that add content to the articles, not those that modify or delete it. I am now planning to diff between revisions to get such data. Are you aware of any tool or effort that can help?
4. Are you aware of any tools that extract the text from Wikipedia markup language?
Regards.
Rami Al-Rfou' PhD student at Stony Brook University
-- Rami Al-Rfou'
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
-- Jon [[User:ShakataGaNai]] / KJ6FNQ http://snowulf.com/ http://ipv6wiki.net/