Hi,
For a dissertation study, I am trying to find a reliable data source from which I can extract user account events, specifically the creation and blocking of user accounts, with usernames, the event name, and timestamps.
Is such data available? If so, could anybody point me to where I can get it?
Thanks a lot
Hi,
This information is in the log. The dump of the log is in the pages-logging.xml.gz file. For example, the latest dump of the log for the English Wikipedia is: http://dumps.wikimedia.org/enwiki/20120601/enwiki-20120601-pages-logging.xml...
The log also contains items unrelated to accounts. If you only want account creations and blocks, look for the types "newusers" and "block".
Petr Onderka [[en:User:Svick]]
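The filtering suggested above (keep only the "newusers" and "block" types) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a tested importer: the child element names (type, action, timestamp, contributor/username) are taken from the discussion in this thread, the XML namespace in the sample is a placeholder, and the helper names are made up for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET

WANTED_TYPES = {"newusers", "block"}


def local(tag):
    """Drop an XML namespace prefix: '{http://...}logitem' -> 'logitem'."""
    return tag.rsplit("}", 1)[-1]


def account_events(stream):
    """Yield one dict per <logitem> whose <type> is in WANTED_TYPES."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "logitem":
            continue
        item = {}
        for child in elem:
            name = local(child.tag)
            if name == "contributor":
                # Only the performer's name is kept here (assumed layout).
                for sub in child:
                    if local(sub.tag) == "username":
                        item["performer"] = sub.text
            else:
                item[name] = child.text
        elem.clear()  # release the children of this item
        if item.get("type") in WANTED_TYPES:
            yield item


def account_events_from_dump(path):
    """Stream the events from a gzip-compressed dump file."""
    with gzip.open(path, "rb") as stream:
        yield from account_events(stream)


# Hand-written sample, not copied from a real dump; the namespace is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://example.org/export/">
  <logitem>
    <id>1</id>
    <timestamp>2012-05-01T10:00:00Z</timestamp>
    <contributor><username>Alice</username><id>7</id></contributor>
    <type>newusers</type>
    <action>create</action>
    <logtitle>User:Alice</logtitle>
  </logitem>
  <logitem>
    <id>2</id>
    <timestamp>2012-05-02T11:00:00Z</timestamp>
    <contributor><username>Bob</username><id>8</id></contributor>
    <type>delete</type>
    <action>delete</action>
    <logtitle>Some page</logtitle>
  </logitem>
</mediawiki>"""

events = list(account_events(io.BytesIO(SAMPLE)))
for event in events:
    print(event.get("timestamp"), event.get("type"),
          event.get("action"), event.get("performer"))
```

For a real dump, account_events_from_dump('/path/to/pages-logging.xml.gz') would stream the file without decompressing it to disk first.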
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 09/06/12 17:23, Gregor Martynus wrote:
Hi,
For a dissertation study, I am trying to find a reliable data source from which I can extract user account events, specifically the creation and blocking of user accounts, with usernames, the event name, and timestamps.
Is such data available? If so, could anybody point me to where I can get it?
Thanks a lot
-- Gregor
Yes. Go to http://dumps.wikimedia.org/ and grab the pages-logging.xml.gz file.
Despite its name, it does contain the account creation logs (except for very old accounts) and the blocking logs.
Thank you all!
I downloaded the pages-logging.xml.gz and the data looks good!
May I ask another question? In order to analyze the data, I'd like to transform it to SQL. I've found a Java tool and a Perl tool, but neither is made to transform the pages-logging.xml file to SQL. Are you aware of another tool that I can use, or of instructions I could follow?
Or can you think of another tool that would let me analyze the information efficiently?
Once again, thanks a lot for your help, really appreciate it.
I think I found a way.
1. From the command line, I started a connection to MySQL (user: root, no password) and opened the pages_logging database:
    $ mysql --local-infile -uroot pages_logging
2. I executed the following query:
    mysql> LOAD DATA LOCAL INFILE '/path/to/pages-logging.xml'
        -> INTO TABLE test
        -> CHARACTER SET binary
        -> LINES STARTING BY '<logitem>' TERMINATED BY '</logitem>' (@logitem)
        -> SET
        -> id = ExtractValue(@logitem:=CONVERT(@logitem using utf8), 'id'),
        -> timestamp = ExtractValue(@logitem, 'timestamp'),
        -> type = ExtractValue(@logitem, 'type'),
        -> action = ExtractValue(@logitem, 'action'),
        -> logtitle = ExtractValue(@logitem, 'logtitle'),
        -> user_id = ExtractValue(@logitem, 'contributor/id'),
        -> user_name = ExtractValue(@logitem, 'contributor/username');
That worked so far. But I see some rows being empty, so I guess that the <logitem> nodes have a different syntax depending on the type and action? I don't care too much, as I just need the newusers and block actions; I just want to make sure that my assumption is correct, so that the study is not based on faulty data.
Is there a description of the pages-logging.xml syntax available somewhere, so I can double check my import script?
Thanks again for your help
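One way to cross-check an import like the one above is to parse the XML with a real XML parser and load the same columns into a throwaway database. The following is a minimal sketch using Python's standard library and SQLite instead of MySQL; the table layout mirrors the columns of the query above, while the element names, the sample data, and the helper names (find_text, to_int, load_logitems) are assumptions made for illustration. Items that lack a child element simply produce NULL in the corresponding column, which makes incomplete rows easy to count.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def local(tag):
    """Drop an XML namespace prefix: '{http://...}logitem' -> 'logitem'."""
    return tag.rsplit("}", 1)[-1]


def find_text(elem, *names):
    """Follow child elements by local name; return the text or None."""
    for name in names:
        elem = next((c for c in elem if local(c.tag) == name), None)
        if elem is None:
            return None
    return elem.text


def to_int(text):
    """Convert a numeric string to int; anything else becomes None."""
    return int(text) if text and text.strip().isdigit() else None


def load_logitems(stream, conn):
    """Insert every <logitem> of the stream into the logitems table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logitems ("
        "id INTEGER PRIMARY KEY, timestamp TEXT, type TEXT, action TEXT, "
        "logtitle TEXT, user_id INTEGER, user_name TEXT)"
    )
    count = 0
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "logitem":
            continue
        conn.execute(
            "INSERT INTO logitems VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                to_int(find_text(elem, "id")),
                find_text(elem, "timestamp"),
                find_text(elem, "type"),
                find_text(elem, "action"),
                find_text(elem, "logtitle"),
                to_int(find_text(elem, "contributor", "id")),
                find_text(elem, "contributor", "username"),
            ),
        )
        count += 1
        elem.clear()  # keep memory use flat on large dumps
    conn.commit()
    return count


# Hand-written sample; the second item deliberately has no <contributor>.
SAMPLE = b"""<mediawiki>
  <logitem>
    <id>1</id><timestamp>2012-05-01T10:00:00Z</timestamp>
    <contributor><username>Alice</username><id>7</id></contributor>
    <type>newusers</type><action>create</action>
    <logtitle>User:Alice</logtitle>
  </logitem>
  <logitem>
    <id>2</id><timestamp>2012-05-02T11:00:00Z</timestamp>
    <type>block</type><action>block</action>
    <logtitle>User:Mallory</logtitle>
  </logitem>
</mediawiki>"""

conn = sqlite3.connect(":memory:")
inserted = load_logitems(io.BytesIO(SAMPLE), conn)
missing = conn.execute(
    "SELECT COUNT(*) FROM logitems WHERE user_name IS NULL"
).fetchone()[0]
print(inserted, "rows loaded,", missing, "without a performer name")
```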
The XML file is a dump of the logging table; see http://www.mediawiki.org/wiki/Manual:Logging_table for a description of its columns.
The dump is not an exact copy of the table, though. For example, deleted log entries are still present in the table, but I believe they are not in the dump.
Petr Onderka [[en:User:Svick]]
Thanks Petr!
Besides the XML dumps, are there also direct SQL dumps available? It does indeed look like some things are missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?
Another question regarding the log actions: I made a list of all actions grouped by type: https://gist.github.com/2906718
I wonder if there is a page describing the meanings of the actions? I searched for such a help page, but couldn't find anything useful. If there is no such page yet, I'd love to contribute one. I'd just need a hint on how and where to do it, as I'm not familiar with Wikipedia myself; I'm just helping out a friend with her dissertation.
-- Gregor Martynus
On 10/06/12 19:38, Gregor Martynus wrote:
Thanks Petr!
Besides the XML dumps, are there also direct SQL dumps available?
No. The dumps are in XML precisely to avoid publishing db-specific schemas in SQL files... and filtering the output in the process. I could make you an SQL dump, but it'd be preferable if we fixed the XML dumps.
It does indeed look like some things are missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?
It may be a bug in the log creation code.
Another question regarding the log actions: I made a list of all actions grouped by type: https://gist.github.com/2906718
I wonder if there is a page describing the meanings of the actions? I searched for such a help page, but couldn't find anything useful. If there is no such page yet, I'd love to contribute one. I'd just need a hint on how and where to do it, as I'm not familiar with Wikipedia myself; I'm just helping out a friend with her dissertation.
I don't think so; you'd need to browse the code, although the actions have fairly self-explanatory names.
I'll try to provide some info for them:
abusefilter
  modify: Modify an abusefilter entry. Made by the AbuseFilter extension.
block
  block, unblock, reblock: User blocking actions. Self-explanatory :)
delete
  delete: Delete a page.
  event: Hide a log entry.
  restore: Restore one or more revisions.
  revision: Hide a revision.
gblblock
  whitelist: Probably provided by the GlobalBlocking extension. I don't think this should appear on this wiki. I thought GlobalBlocking was only managed at metawiki.
import
  interwiki, upload: Usage of Special:Import. Importing from a different wiki, or by uploading a file.
move
  move, move_redir: Move a page to a new title, or move it overwriting a redirect at the new title.
newusers
  autocreate: The account was automatically created because the user was logged in at another wiki.
  create: The user registered an account.
  create2: Someone (log_user_text) created an account for someone else.
  newusers: Seems to be the log action used by old entries. Seems to have been last used in 2006.
patrol
  patrol: To patrol a revision.
protect
  modify, move_prot, protect, unprotect
renameuser
  renameuser: Renameuser extension, rename a user account.
review
  approve, approve-a, approve-i, approve-ia, unapprove: Provided by the FlaggedRevs extension.
rights
  autopromote, rights: Changes to the user groups.
stable
  config, reset: Provided by the FlaggedRevs extension.
upload
  overwrite, upload: Upload a new file, or upload a new version of a file.
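A type/action overview like the one above can be derived from the log items themselves. The sketch below assumes the (type, action) pairs have already been extracted from the dump by some other means; the function name action_tree and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def action_tree(pairs):
    """Group (type, action) pairs into {type: sorted list of distinct actions}."""
    tree = defaultdict(set)
    for log_type, action in pairs:
        tree[log_type].add(action)
    return {log_type: sorted(actions) for log_type, actions in sorted(tree.items())}


# Hypothetical pairs, e.g. read from the <type> and <action> elements of each log item.
pairs = [
    ("block", "block"),
    ("block", "unblock"),
    ("newusers", "create"),
    ("block", "block"),
    ("newusers", "autocreate"),
]

tree = action_tree(pairs)
for log_type, actions in tree.items():
    print(log_type, " ".join(actions))
```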
Thanks a lot for the explanations, that helps a lot!
I think one thing missing in the pages-logging.xml dump is the target of the action (page column?). It could be a child node with an id and a name, just like <contributor>. Right now, I can filter for events that have been performed by a specific user, but not for actions that have been applied to a specific user.
Does that make sense?
For example, I guess that this page, http://www.mediawiki.org/wiki/Special:Log, is using the logging table, and it provides a "target" field. But unless I'm missing something, I can't do that with the current XML dump.
-- Gregor Martynus
On 10/06/12 23:17, Gregor Martynus wrote:
Thanks a lot for the explanations, that helps a lot!
I think one thing missing in the pages-logging.xml dump is the target of the action (page column?). It could be a child node with an id and a name, just like <contributor>. Right now, I can filter for events that have been performed by a specific user, but not for actions that have been applied to a specific user.
Does that make sense?
For example, I guess that this page, http://www.mediawiki.org/wiki/Special:Log, is using the logging table, and it provides a "target" field. But unless I'm missing something, I can't do that with the current XML dump.
-- Gregor Martynus
Yes, Special:Log is the front-end to query the logging table.
The target of the action is provided inside the <logtitle> tag. Note that for actions on users, the name is logged with the User: namespace prefix.
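Filtering by target rather than by performer could then look roughly like this. It is a small sketch that assumes each log item has already been turned into a dict with a "logtitle" key; the function names and sample items are hypothetical, and the "User:" prefix is kept as a parameter because it presumably differs on non-English wikis.

```python
def target_user(logtitle, prefix="User:"):
    """Return the user name a log item applies to, or None for other titles."""
    if logtitle and logtitle.startswith(prefix):
        return logtitle[len(prefix):]
    return None


def actions_on(user, items, prefix="User:"):
    """Keep the log items whose <logtitle> points at the given user."""
    return [item for item in items
            if target_user(item.get("logtitle"), prefix) == user]


# Hypothetical items, shaped like the fields discussed in this thread.
items = [
    {"type": "block", "action": "block", "performer": "Admin", "logtitle": "User:Mallory"},
    {"type": "delete", "action": "delete", "performer": "Admin", "logtitle": "Some page"},
    {"type": "block", "action": "unblock", "performer": "Admin", "logtitle": "User:Mallory"},
]

on_mallory = actions_on("Mallory", items)
print([item["action"] for item in on_mallory])
```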
From: Gregor Martynus gregor@martynus.net To: Petr Onderka gsvick@gmail.com CC: xmldatadumps-l@lists.wikimedia.org Sent: Sunday, June 10, 2012 19:38 Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
Thanks Petr!
Besides the XML dumps, are there also direct SQL dumps available? It does indeed look like some things are missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?
Hi, Gregor.
I created a parser for the XML dumps of the logging table. It's written in Python, using SAX:
http://git.libresoft.es/WikixRay/tree/WikiXRay/parsers/dump_sax_logging.py
You also need to create a table in a local DB, and use a small connector to insert directly into the DB:
http://git.libresoft.es/WikixRay/tree/db/table_logging.sql
http://git.libresoft.es/WikixRay/tree/WikiXRay/db/dbaccess.py
Very soon, an improved version using LXML and a more efficient organization will be available. This will be part of my new tool: WikiDAT (Wikipedia Data Analysis Toolkit). I expect to publish the first release very soon (before the end of this month).
Another question regarding the log actions: I made a list of all actions grouped by type: https://gist.github.com/2906718
I wonder if there is a page describing the meanings of the actions? I searched for such a help page, but couldn't find anything useful. If there is no such page yet, I'd love to contribute one. I'd just need a hint on how and where to do it, as I'm not familiar with Wikipedia myself; I'm just helping out a friend with her dissertation.
In general, documentation about the structure and content of the XML dumps and other data sources is either not available or still incomplete.
I'm also preparing a (brief) companion document for this new tool, explaining in some detail the different XML dump files, their structure, the meaning of fields and example queries to retrieve interesting data. I will be happy to include your map of log actions, if you agree. I'm afraid that not many people have used this (and other dumps) in the past simply because they don't have info about them.
Best, Felipe.
Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
I've updated the type/action tree with the input by Platonides, feel free to use / extend it: https://gist.github.com/2906718
I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for
- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created
and some more detailed things, but that would be the start.
For example, there is the special page "User Contributions" (http://en.wikipedia.org/wiki/Special:Contributions). Can you point me to the dump(s) I need to get this data (namespace, page title, user, diff, comment, datetime)? The pages-logging.xml is already great to find out about created / blocked user accounts, what we are missing are the actual contributions.
Does that make sense to you?
-- Gregor Martynus
On 11/06/12 21:53, Gregor Martynus wrote:
Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
I've updated the type/action tree with the input by Platonides, feel free to use / extend it: https://gist.github.com/2906718
I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for
- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created
and some more detailed things, but that would be the start.
For example, there is the special page "User Contributions" (http://en.wikipedia.org/wiki/Special:Contributions). Can you point me to the dump(s) I need to get this data (namespace, page title, user, diff, comment, datetime)? The pages-logging.xml is already great to find out about created / blocked user accounts, what we are missing are the actual contributions.
Does that make sense to you?
-- Gregor Martynus
Page edits appear in the article XML dumps. Special:Contributions is just a query against the revision table. The information you want is in pages-meta-history, but if it is enough for you (i.e. you don't need the actual page content), stub-meta-history is a much smaller file.
From: Platonides platonides@gmail.com To: Gregor Martynus gregor@martynus.net CC: Felipe Ortega glimmer_phoenix@yahoo.es; "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org Sent: Monday, June 11, 2012 22:38 Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
On 11/06/12 21:53, Gregor Martynus wrote:
Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
Oops. That may be left over from copy-pasting the parser skeleton for the revision table dumps. The new version definitely won't have that. I'll fix that source file.
I've updated the type/action tree with the input by Platonides, feel free to use / extend it: https://gist.github.com/2906718
Great, thanks.
I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for
- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created
and some more detailed things, but that would be the start.
Indeed. For that purpose, you must join information from revision, page and logging tables. As Platonides has suggested, you have 2 options:
- Stub-meta-history: All meta information about revision and page tables, but no text.
- Pages-meta-history: Same as before, plus the complete text of every revision of each wiki page (all namespaces). Keep in mind that you get the whole text, not diffs, for every change. That's why these files can be huge once you decompress them (100 times larger, sometimes even more).
Please, also be careful with the 'diff' tool, as sometimes it cannot track changes between different versions accurately (it depends on which granularity you demand).
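As a rough illustration of the stub-meta-history route, the sketch below streams a dump and pulls out namespace, page title, user, timestamp and comment per revision. It is not the parser discussed in this thread: `iter_revisions` and `SAMPLE` are hypothetical names, the sample XML is made up, and the element names (`title`, `ns`, `revision`, `timestamp`, `contributor`, `username`, `ip`, `comment`) are assumed from the export format, so check them against the dump you actually download.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature dump, only to exercise the function below.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>User talk:FooBar</title>
    <ns>3</ns>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2012-05-02T10:00:00Z</timestamp>
      <contributor><username>FooBar</username><id>7</id></contributor>
      <comment>hello</comment>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (ns, title, user, timestamp, comment) for every <revision>."""
    title, ns = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = int(elem.text)
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            user = None
            contributor = fields.get("contributor")
            if contributor is not None:
                # Registered users carry <username>, anonymous edits <ip>.
                for child in contributor:
                    if local(child.tag) in ("username", "ip"):
                        user = child.text
            timestamp = fields["timestamp"].text if "timestamp" in fields else None
            comment = fields["comment"].text if "comment" in fields else None
            yield ns, title, user, timestamp, comment
            elem.clear()  # free memory, the real dumps are large
        elif name == "page":
            elem.clear()


rows = list(iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))))

# First contribution per user: ISO timestamps sort correctly as strings.
first_edit = {}
for ns, title, user, timestamp, comment in rows:
    if user is not None and (user not in first_edit or timestamp < first_edit[user]):
        first_edit[user] = timestamp
```

The `first_edit` dictionary could then be matched against the account creation dates taken from pages-logging.xml.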
Cheers, Felipe.
Thanks again for your input, sounds like the Stub-meta-history dump is exactly what we need. I'm already downloading it.
I'm not sure if this is the place for such a suggestion, but it would be great to have example versions of the real dumps, with only a few hundred entries each, just to find out whether they fit specific requirements without having to download several GB of data. Just a thought.
You can use a small wiki for that. A common choice is simplewiki, because
a) it's smaller than enwiki
b) it's written in English (simple English)
But if you don't have a language barrier, you can go for other wikis. For example, the Ligurian-language Wikipedia has just a few thousand pages: http://dumps.wikimedia.org/lijwiki/20120611/
excellent idea, thanks Platonides.
From: Gregor Martynus gregor@martynus.net
To: Platonides platonides@gmail.com CC: "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org Sent: Tuesday, 12 June 2012 9:13 Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
excellent idea, thanks Platonides.
In fact, this is a very common procedure, with just one caveat: dumps of big languages sometimes have problems in their content that do not appear in very small languages. In general, dump parsers must be written to be robust against missing fields (especially missing author information or a missing text field). HTH. Felipe.
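One way to get that robustness is to never assume a child element exists. The following is only a sketch: `child_text` and `contributor_name` are hypothetical helpers, the two sample revisions are made up, and the `deleted="deleted"` form of a hidden contributor is an assumption about how such entries look in the dumps. For brevity the samples carry no XML namespace, which real dump files do.

```python
import xml.etree.ElementTree as ET


def child_text(elem, name, default=None):
    """Text of a direct child, or `default` if the child is absent or empty."""
    child = elem.find(name)
    if child is None or child.text is None:
        return default
    return child.text


def contributor_name(revision):
    """User name or IP of a revision, or None when author info is missing."""
    contributor = revision.find("contributor")
    if contributor is None:
        return None
    return child_text(contributor, "username") or child_text(contributor, "ip")


# Made-up samples: one complete revision, one with hidden author and no text.
complete = ET.fromstring(
    "<revision><timestamp>2012-06-12T09:13:00Z</timestamp>"
    "<contributor><username>FooBar</username></contributor>"
    "<text>abc</text></revision>"
)
hidden = ET.fromstring(
    "<revision><timestamp>2012-06-12T09:14:00Z</timestamp>"
    '<contributor deleted="deleted" /></revision>'
)

names = [contributor_name(complete), contributor_name(hidden)]
texts = [child_text(complete, "text", ""), child_text(hidden, "text", "")]
```

Returning `None` (or an explicit default) keeps the decision about how to store a missing author or text in one place, for example as NULL in the database.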
From: Gregor Martynus gregor@martynus.net To: Felipe Ortega glimmer_phoenix@yahoo.es CC: "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org Sent: Monday, 11 June 2012 21:53 Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
Hi again, Gregor.
I've just checked with an excerpt from simplewiki-pages-logging.xml and it's not an error. Namespace info is also included, since it is part of the <siteinfo> item in XML dumps:
<siteinfo>
  <sitename>Wikipedia</sitename>
  <base>http://simple.wikipedia.org/wiki/Main_Page</base>
  <generator>MediaWiki 1.20wmf4</generator>
  <case>first-letter</case>
  <namespaces>
    <namespace key="-2" case="first-letter">Media</namespace>
    <namespace key="-1" case="first-letter">Special</namespace>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
    <namespace key="2" case="first-letter">User</namespace>
    <namespace key="3" case="first-letter">User talk</namespace>
    <namespace key="4" case="first-letter">Wikipedia</namespace>
    <namespace key="5" case="first-letter">Wikipedia talk</namespace>
    <namespace key="6" case="first-letter">File</namespace>
    <namespace key="7" case="first-letter">File talk</namespace>
    <namespace key="8" case="first-letter">MediaWiki</namespace>
    <namespace key="9" case="first-letter">MediaWiki talk</namespace>
    <namespace key="10" case="first-letter">Template</namespace>
    <namespace key="11" case="first-letter">Template talk</namespace>
    <namespace key="12" case="first-letter">Help</namespace>
    <namespace key="13" case="first-letter">Help talk</namespace>
    <namespace key="14" case="first-letter">Category</namespace>
    <namespace key="15" case="first-letter">Category talk</namespace>
  </namespaces>
</siteinfo>
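For what it's worth, a <siteinfo> header like the one above can be turned into a lookup table from namespace name to numeric key. This is a minimal sketch and not code from the thread: `namespace_map` is a hypothetical helper, `SITEINFO` is a shortened copy of the excerpt, and a real dump would additionally carry an XML namespace on every tag.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <siteinfo> excerpt, just enough for a demonstration.
SITEINFO = """<siteinfo>
  <namespaces>
    <namespace key="-1" case="first-letter">Special</namespace>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
    <namespace key="2" case="first-letter">User</namespace>
    <namespace key="3" case="first-letter">User talk</namespace>
  </namespaces>
</siteinfo>"""


def namespace_map(siteinfo_xml):
    """Map each namespace name to its numeric key ('' is the main namespace)."""
    root = ET.fromstring(siteinfo_xml)
    return {(ns.text or ""): int(ns.get("key")) for ns in root.iter("namespace")}


ns_by_name = namespace_map(SITEINFO)
```

Reading the table from the dump itself, rather than hard-coding it, matters because namespace names differ between language editions.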
Best, Felipe.
Thanks Felipe, I've seen that too in the header of the pages-logging.xml dump, but I don't see a reference within the <logitem> entries. Or am I missing something?
For example, the stub-meta-history.xml dump has these: "<ns>0</ns>". There is no such thing in pages-logging.xml, is there? It's not a big problem though, I think you can get the namespace out of the logtitle if you have to.
From: Gregor Martynus gregor@martynus.net To: Felipe Ortega glimmer_phoenix@yahoo.es CC: "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org Sent: Tuesday, 12 June 2012 19:24 Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
Thanks Felipe, I've seen that too in the header of the pages-logging.xml dump, but I don't see a reference within the <logitem> entries. Or am I missing something?
For example, the stub-meta-history.xml dump has these: "<ns>0</ns>". There is no such thing in pages-logging.xml, is there? It's not a big problem though, I think you can get the namespace out of the logtitle if you have to.
Actually, there is a section in my code in which I use the namespace name to create a similar numerical identifier for each log item. Thus, it can reproduce the same approach used in the revision table (though this field is not originally present in the logging table).
For this, you can use the text in the <logtitle> item. This is the title of the page affected by the log action. If there is a prefix such as:
Talk:Something
User:FooBar
You can match the prefix with the namespace string and insert the code in the DB. If the title doesn't have any prefix, the page belongs to the main namespace (articles, so ns = 0).
This extra info can be very useful when filtering log actions by type of article in which they were applied. You can see in my customized definition of the "logging" table that I have added some extra fields, including this <ns> id.
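The prefix matching described above might look roughly like this. It is a sketch under assumptions, not the code from Felipe's tool: `namespace_of` is a hypothetical name and `NAMESPACES` is a hand-copied subset of the simplewiki <siteinfo> list shown earlier in the thread.

```python
# Subset of the simplewiki namespace list; a real run would use the full
# list from the <siteinfo> header of the dump being parsed.
NAMESPACES = {
    "Talk": 1,
    "User": 2,
    "User talk": 3,
    "Wikipedia": 4,
    "Wikipedia talk": 5,
}


def namespace_of(logtitle, namespaces=NAMESPACES):
    """Return the numeric namespace for a <logtitle> value.

    Titles without a known prefix fall back to the main namespace (0),
    which also covers article titles that merely contain a colon.
    """
    prefix, sep, _rest = logtitle.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix]
    return 0


examples = ["Talk:Something", "User:FooBar", "Something", "Star Trek: Voyager"]
codes = [namespace_of(title) for title in examples]
```

Only a prefix that exactly matches a known namespace name is treated as a namespace, so an article such as "Star Trek: Voyager" stays in ns 0.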
Cheers,
Felipe.
Yeah, that makes sense, thanks Felipe!