Thanks Felipe, I'll definitely give it a try next time. One thing puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
I've updated the type/action tree with the input by Platonides, feel free to use / extend it: https://gist.github.com/2906718
I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for
- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created
and some more detailed things, but that would be the start.
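The parts of that list that the logging data can answer (manual sign-ups and account creation dates) could be sketched as a simple filter over imported rows. This is only an illustration under assumptions: the rows are made-up sample data, the helper name `manual_signups` is hypothetical, and treating type "newusers" with action "create" as a manual sign-up (as opposed to "autocreate") should be checked against the type/action list in the gist. First-time contributions are not covered here, since the logging dump does not contain them.

```python
# Made-up sample rows, shaped like an imported logging table.
# Assumption: type "newusers" + action "create" marks a manual sign-up,
# while "autocreate" marks an automatically created account.
rows = [
    {"type": "newusers", "action": "create", "user_name": "Alice",
     "timestamp": "2012-05-03T09:15:00Z"},
    {"type": "newusers", "action": "autocreate", "user_name": "Bob",
     "timestamp": "2012-05-04T11:00:00Z"},
    {"type": "newusers", "action": "create", "user_name": "Carol",
     "timestamp": "2012-06-01T08:30:00Z"},
    {"type": "block", "action": "block", "user_name": "Admin",
     "timestamp": "2012-05-10T12:00:00Z"},
]


def manual_signups(rows, month_prefix):
    """Return (user_name, timestamp) for manual sign-ups in the given month.

    month_prefix is an ISO-style prefix such as "2012-05".
    """
    return [
        (row["user_name"], row["timestamp"])
        for row in rows
        if row["type"] == "newusers"
        and row["action"] == "create"
        and row["timestamp"].startswith(month_prefix)
    ]


may_signups = manual_signups(rows, "2012-05")
for user_name, created in may_signups:
    print(user_name, created)
```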
For example, there is the special page "User Contributions" (http://en.wikipedia.org/wiki/Special:Contributions). Can you point me to the dump(s) I need to get this data (namespace, page title, user, diff, comment, datetime)? The pages-logging.xml is already great to find out about created / blocked user accounts, what we are missing are the actual contributions.
Does that make sense to you?
-- Gregor Martynus
On Monday, 11. June 2012 at 14:51, Felipe Ortega wrote:
From: Gregor Martynus <gregor@martynus.net>
To: Petr Onderka <gsvick@gmail.com>
CC: xmldatadumps-l@lists.wikimedia.org
Sent: Sunday, 10 June 2012 19:38
Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
Thanks Petr!
Besides the XML dumps, are there also direct SQL dumps available? It does look like some things are missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?
Hi, Gregor.
I created a parser for the XML dumps of the logging table. It's written in Python, using SAX:
http://git.libresoft.es/WikixRay/tree/WikiXRay/parsers/dump_sax_logging.py
You also need to create a table in a local DB, and use a small connector to insert directly in the DB:
http://git.libresoft.es/WikixRay/tree/db/table_logging.sql
http://git.libresoft.es/WikixRay/tree/WikiXRay/db/dbaccess.py
Very soon, an improved version using LXML and a more efficient organization will be available. This will be part of my new tool: WikiDAT (Wikipedia Data Analysis Toolkit). I expect to publish the first release before the end of this month.
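To give an idea of the SAX approach, here is a minimal sketch in Python. It is not the WikiXRay parser linked above, only an illustration: the element names (logitem, id, timestamp, type, action, logtitle, contributor/id, contributor/username) are assumed from the import query quoted later in this thread, and the class name and sample document are made up.

```python
import xml.sax


class LogItemHandler(xml.sax.ContentHandler):
    """Collect one dict per <logitem> element (element names are assumed)."""

    FIELDS = ("id", "timestamp", "type", "action", "logtitle")

    def __init__(self):
        super().__init__()
        self.items = []   # finished log items
        self._item = None  # item currently being built
        self._path = []    # stack of open element names
        self._text = []    # character data of the current element

    def startElement(self, name, attrs):
        self._path.append(name)
        self._text = []
        if name == "logitem":
            self._item = {}

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._path.pop()
        parent = self._path[-1] if self._path else ""
        if name == "logitem":
            self.items.append(self._item)
            self._item = None
        elif self._item is not None:
            value = "".join(self._text).strip()
            if parent == "contributor" and name == "id":
                self._item["user_id"] = value
            elif parent == "contributor" and name == "username":
                self._item["user_name"] = value
            elif parent == "logitem" and name in self.FIELDS:
                self._item[name] = value
        self._text = []


# Made-up sample in the assumed dump layout.
SAMPLE = """<mediawiki>
  <logitem>
    <id>1</id>
    <timestamp>2012-05-01T10:00:00Z</timestamp>
    <contributor><username>Alice</username><id>42</id></contributor>
    <type>newusers</type>
    <action>create</action>
    <logtitle>User:Alice</logtitle>
  </logitem>
</mediawiki>"""

handler = LogItemHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.items)
```

For a real dump file, `xml.sax.parse(file_object, handler)` can be used instead of `parseString`, so the file is streamed rather than loaded into memory at once.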
Another question regarding the log actions. I made a list of all actions grouped by types: https://gist.github.com/2906718
I wonder if there is a page describing the meanings of the actions? I searched for such a help page, but couldn't find anything useful yet. If there is no such thing yet, I'd love to contribute it. I'd just need a hint on how and where to do it, as I'm not familiar with Wikipedia myself; I'm just helping out a friend with her dissertation.
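A list of actions grouped by type, like the one in the gist, could be produced along these lines. This is only a sketch: the function name `count_actions` is hypothetical and the sample rows are made up, standing in for rows read from the imported table.

```python
from collections import Counter


def count_actions(rows):
    """Count how often each (type, action) pair occurs."""
    return Counter((row["type"], row["action"]) for row in rows)


# Made-up sample rows; real rows would come from the imported logging table.
sample = [
    {"type": "newusers", "action": "create"},
    {"type": "newusers", "action": "create"},
    {"type": "block", "action": "block"},
    {"type": "block", "action": "unblock"},
]

counts = count_actions(sample)
for (log_type, action), n in sorted(counts.items()):
    print(f"{log_type}/{action}: {n}")
```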
In general, documentation about the structure and content of the XML dumps and other data sources is either not available or still incomplete.
I'm also preparing a (brief) companion document for this new tool, explaining in some detail the different XML dump files, their structure, the meaning of fields and example queries to retrieve interesting data. I will be happy to include your map of log actions, if you agree. I'm afraid that not many people have used this (and other dumps) in the past simply because they don't have info about them.
Best, Felipe.
-- Gregor Martynus
On Sunday, 10. June 2012 at 19:00, Petr Onderka wrote:
The XML file is a dump of the logging table; see http://www.mediawiki.org/wiki/Manual:Logging_table for a description of its columns.
The dump is not an exact copy of the table, though. For example, deleted log entries are still present in the table, but I believe they are not in the dump.
Petr Onderka [[en:User:Svick]]
On Sun, Jun 10, 2012 at 5:59 PM, Gregor Martynus <gregor@martynus.net> wrote:
I think I found a way:
- From the command line, I started a connection to MySQL (user: root, no password) and opened the pages_logging database:
  $ mysql --local-infile -uroot pages_logging
- I executed the following query:
  mysql> LOAD DATA LOCAL INFILE '/path/to/pages-logging.xml'
      -> INTO TABLE test
      -> CHARACTER SET binary
      -> LINES STARTING BY '<logitem>' TERMINATED BY '</logitem>' (@logitem)
      -> SET
      -> id = ExtractValue(@logitem:=CONVERT(@logitem using utf8), 'id'),
      -> timestamp = ExtractValue(@logitem, 'timestamp'),
      -> type = ExtractValue(@logitem, 'type'),
      -> action = ExtractValue(@logitem, 'action'),
      -> logtitle = ExtractValue(@logitem, 'logtitle'),
      -> user_id = ExtractValue(@logitem, 'contributor/id'),
      -> user_name = ExtractValue(@logitem, 'contributor/username');
That worked so far. But I see some rows being empty, so I guess that the <logitem> nodes have a different syntax depending on the type and action? I don't care too much, as I just need the newuser and block actions. I just want to make sure that my assumption is correct, so that the study is not based on faulty data.
Is there a description of the pages-logging.xml syntax available somewhere, so I can double check my import script?
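One way to double check an import like this is to parse the XML with a streaming parser and store absent elements explicitly as NULL, so that incomplete entries can be counted afterwards. The sketch below is an alternative under assumptions, not the MySQL approach described above: it uses Python's `xml.etree.ElementTree.iterparse` together with SQLite, takes the element names from the query above, and uses a made-up sample document and a hypothetical table named `logging`.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

COLUMNS = ("id", "timestamp", "type", "action", "logtitle",
           "user_id", "user_name")


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}logitem'."""
    return tag.rsplit("}", 1)[-1]


def iter_logitems(stream):
    """Yield one dict per <logitem>; elements that are absent stay None."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "logitem":
            continue
        row = dict.fromkeys(COLUMNS)
        for child in elem:
            name = local(child.tag)
            if name == "contributor":
                for sub in child:
                    if local(sub.tag) == "id":
                        row["user_id"] = sub.text
                    elif local(sub.tag) == "username":
                        row["user_name"] = sub.text
            elif name in ("id", "timestamp", "type", "action", "logtitle"):
                row[name] = child.text
        yield row
        elem.clear()  # release the processed item; real dumps are large


# Made-up sample: the second item has no contributor details.
SAMPLE = b"""<mediawiki>
  <logitem>
    <id>1</id>
    <timestamp>2012-05-01T10:00:00Z</timestamp>
    <contributor><username>Alice</username><id>42</id></contributor>
    <type>newusers</type>
    <action>create</action>
    <logtitle>User:Alice</logtitle>
  </logitem>
  <logitem>
    <id>2</id>
    <timestamp>2012-05-02T12:30:00Z</timestamp>
    <contributor />
    <type>block</type>
    <action>block</action>
    <logtitle>User:Bob</logtitle>
  </logitem>
</mediawiki>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logging (id INTEGER, timestamp TEXT, type TEXT, "
    "action TEXT, logtitle TEXT, user_id INTEGER, user_name TEXT)"
)
conn.executemany(
    "INSERT INTO logging VALUES "
    "(:id, :timestamp, :type, :action, :logtitle, :user_id, :user_name)",
    iter_logitems(io.BytesIO(SAMPLE)),
)
conn.commit()

# Count entries without a user name to see how many rows are incomplete.
missing = conn.execute(
    "SELECT COUNT(*) FROM logging WHERE user_name IS NULL"
).fetchone()[0]
print("rows without user_name:", missing)
```

Counting NULLs per column and per type/action pair would show whether the empty rows are limited to log types that are not needed for the study.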
Thanks again for your help
-- Gregor Martynus
On Sunday, 10. June 2012 at 14:30, Gregor Martynus wrote:
Thank you all!
I downloaded the pages-logging.xml.gz and the data looks good!
May I ask another question? In order to analyze the data, I'd like to transform it to SQL. I've found a Java and a Perl tool, but neither is made to transform the logging.xml to SQL. Are you aware of another tool that I can use, or maybe instructions to follow?
Or maybe there is another tool you can think of that I can use to analyze the information in a performant way?
Once again, thanks a lot for your help, really appreciate it.
-- Gregor Martynus
On Saturday, 9. June 2012 at 19:38, Platonides wrote:
On 09/06/12 17:23, Gregor Martynus wrote:
Hi,
for a dissertation study, I am trying to find a reliable data source from which I can extract user account events, specifically creation and blocking of user accounts, with usernames, the event name and timestamps.
Is such data available? If yes, could anybody point me to where I can get it from?
Thanks a lot
-- Gregor
Yes. Go to http://dumps.wikimedia.org/ and grab the pages-logging.xml.gz file.
Despite its name, it does contain creation logs (though not for very old accounts) and blocking logs for accounts.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org (mailto:Xmldatadumps-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l