Hi,
I am planning to study differences in users' editing styles and spelling errors on English Wikipedia as part of a research project I am involved in.
So I downloaded some of the partial Wikipedia XML dumps and converted them to SQL. My understanding is that Wikipedia stores every revision of every page in the database.
- I cannot see the users table! Is the users table stored in a special partial dump?
- Does the user table contain any properties related to the user's country, preferred Wikipedias, or skill in different languages?
- I am interested in user edits that only add content to articles, not ones that modify or delete it. I am planning to diff revisions to get this data. Are you aware of any tool or effort that can help?
- Are you aware of any tools that extract the plain text from Wikipedia markup?
Regards.
Hi,
You might want to have a look at the JWPL Revision Toolkit: http://code.google.com/p/jwpl/ It should provide most of the information you are looking for, especially access to all the modifications and a parser to extract the plain text from Wikipedia.
The UIMA toolkit http://code.google.com/p/dkpro-core-asl/ also contains a component that gets you all pairs of adjacent revisions, which makes it quite easy to spot the ones that are additions only.
-Torsten
From: wiki-research-l-bounces@lists.wikimedia.org [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Rami Al-Rfou'
Sent: Tuesday, October 18, 2011 9:29 PM
To: Research into Wikimedia content and communities
Cc: Yanqing Chen
Subject: [Wiki-research-l] Wikipedia Users database
-- Rami Al-Rfou' PhD student at Stony Brook University
Hi All,
After more investigation, I discovered that I can get a list of users by their skill level in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
It seems that such a list is populated from a database. Does anyone know where I can find this database?
My other questions are about the partial dumps of Wikipedia. Are the dumps sorted by any field? How can I get all the user pages? Are they stored in a specific dump? Or are the dumps organized by page title or category?
Regards.
On Mon, Oct 24, 2011 at 04:11:22PM -0400, Rami Al-Rfou' wrote:
Hi All,
After more investigation, I discovered that I can get a list of users by their skill level in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
It seems that such a list is populated from a database. Does anyone know where I can find this database?
Everything is eventually filled from a database of course :-P , but specifically, this is a category. http://www.mediawiki.org/wiki/Help:Categories
sincerely, Kim Bruning
On Tue, Oct 25, 2011 at 7:11 AM, Rami Al-Rfou' rmyeid@gmail.com wrote:
Hi All,
After more investigation, I discovered that I can get a list of users by their skill level in a specific language. For example: http://en.wikipedia.org/w/index.php?title=Category:User_zh-N
It seems that such a list is populated from a database. Does anyone know where I can find this database?
My other questions are about the partial dumps of Wikipedia. Are the dumps sorted by any field? How can I get all the user pages? Are they stored in a specific dump? Or are the dumps organized by page title or category?
http://csv.ozziesport.com/October%209%20-%20Wikipedia%20English%20Data.csv is a file I have related to that. It is about a year old and the result of manual data mining, where I looked for userboxes and which users had transcluded them into their user space. My file only covers English Wikipedia and doesn't include every userbox around. It might be a good place to start. I don't think that userbox information is stored in a separate user table, so I doubt that you would be able to get access to it through that route. :/
- That example you posted isn't a list of all users, just the ones who have added a "Babel" template to their user pages [1].
- That data is stored in the database, in the category and categorylinks tables (possibly elsewhere, I can't remember offhand).
- I don't think they are sorted in anything more than the current row order in the database (so in the order of creation).
- The user pages will be included in the "All pages" dumps (as opposed to the "Articles, templates, image descriptions, and primary meta-pages" dumps).
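As a rough illustration of the last point, here is a minimal Python sketch that streams an "All pages" XML dump and keeps only pages whose title starts with "User:". It assumes the usual dump layout of <page> elements containing a <title> and one or more <revision>/<text> elements; the sample document, the function name iter_user_pages, and the user name ExampleUser are made up for the example, and a real dump would be opened as a (decompressed) file instead.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (hypothetical content, not real data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page><title>Example article</title><revision><text>Body</text></revision></page>
  <page><title>User:ExampleUser</title><revision><text>{{Babel|zh|en-2}}</text></revision></page>
  <page><title>User talk:ExampleUser</title><revision><text>Hello</text></revision></page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_user_pages(xml_file):
    """Yield (title, text of the last revision) for every page titled 'User:...'."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""  # the last revision seen wins
        if title and title.startswith("User:"):
            yield title, text
        elem.clear()  # drop the page's children so memory use stays modest


if __name__ == "__main__":
    for title, text in iter_user_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(title, "->", text)
```

Note that the title test also matches user subpages such as "User:Name/Sandbox", while "User talk:" pages are skipped.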
As for your original sets of questions:
- IIRC, no user data is included in any dumps. This is to protect user privacy.
- No on all counts; the only related thing is the interface language. If you click "My Preferences" on any wiki, the options you see there are what is stored in the user table (more or less).
- All edits are "modifications", technically. You'd have to programmatically figure out what is _just_ adding content.
- Yes, that "tool" would be called MediaWiki, if you want the most accurate parser of MediaWiki markup [2]. There are some alternative parsers [3], but their output can be of variable quality.
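One possible way to figure out "just adding content" programmatically is to diff two adjacent revisions word by word and accept the pair only if the diff contains insertions and nothing else. The sketch below uses Python's standard difflib for this; the function name added_text and the sample sentences are invented for illustration. SequenceMatcher does not guarantee a minimal diff, so this is a heuristic that may occasionally reject an edit that really was a pure addition.

```python
import difflib


def added_text(old, new):
    """Return the inserted word runs if `new` only adds to `old`, else None."""
    old_tokens = old.split()
    new_tokens = new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    additions = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            additions.append(" ".join(new_tokens[j1:j2]))
        elif tag != "equal":
            # A "delete" or "replace" opcode means existing text was changed.
            return None
    # No insertions at all means the revisions have the same words.
    return additions or None


if __name__ == "__main__":
    before = "Paris is the capital of France."
    after = "Paris is the capital and largest city of France."
    print(added_text(before, after))
    print(added_text("The cat sat.", "The dog sat."))
```

Because the text is split on whitespace, edits that only change spacing or line breaks are treated as no change at all.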
-Jon
[1] http://meta.wikimedia.org/wiki/Meta:Babel_templates
[2] http://www.mediawiki.org/wiki/Markup_spec
[3] http://www.mediawiki.org/wiki/Alternative_parsers
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
FYI, a lot of the database tables are archived on http://dumps.wikimedia.org -- see http://dumps.wikimedia.org/enwiki/20111007/ for the latest dump. The user table is private, but it doesn't seem like you need that. You're looking for what people have publicly posted on their own user pages, which MediaWiki treats as pages in a specific namespace, barely connected to a user at all from a database standpoint.
So if you're looking for category members (like the Babel template you linked), they can be found in enwiki-DATE-categorylinks.sql.gz. Import that into MySQL -- it is about 10 GB uncompressed, with the indexes making up another 25 GB. However, that'll just give you the page_ids of the user pages containing the template. You also have to download and import the page table (also public and archived) and join to it in MySQL if you want to get the usernames of everyone who has put themselves in those categories. The page table is much more manageable -- uncompressed, it is about 3 GB of data and 2.5 GB of indexes.
Stuart
-- R. Stuart Geiger UC-Berkeley School of Information User:Staeiou / @staeiou
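The join described above could look roughly like the following sketch. To keep it self-contained it uses an in-memory SQLite database from Python's standard library as a stand-in for MySQL, with cut-down page and categorylinks tables holding only the columns the query needs (page_id, page_namespace, page_title, cl_from, cl_to). The rows and user names are invented, and the function users_in_category is a hypothetical helper; the assumption is that the same SELECT would work against the imported MySQL tables, where namespace 2 is the User namespace and titles use underscores instead of spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
""")
# Toy rows (hypothetical): two user pages in User_zh-N, one article, one other user.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 2, "Example_User_A"),
    (2, 2, "ExampleUserB/Userboxes"),
    (3, 0, "Chinese_language"),
    (4, 2, "ExampleUserC"),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "User_zh-N"),
    (2, "User_zh-N"),
    (3, "Languages_of_China"),
    (4, "User_en-N"),
])


def users_in_category(conn, category):
    """Return sorted usernames whose user pages (or subpages) are in `category`."""
    rows = conn.execute(
        """
        SELECT p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_namespace = 2
        """,
        (category,),
    )
    # A subpage like "Name/Userboxes" belongs to user "Name".
    names = {title.split("/", 1)[0].replace("_", " ") for (title,) in rows}
    return sorted(names)


if __name__ == "__main__":
    print(users_in_category(conn, "User_zh-N"))
```

Filtering on page_namespace = 2 drops any non-user pages that happen to be in the category, and collapsing subpages avoids counting the same user twice.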