Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: 
From your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.

I've updated the type/action tree with the input by Platonides, feel free to use / extend it:
https://gist.github.com/2906718

I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for

- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created

and some more detailed things, but that would be the start.

For example, there is the special page "User Contributions" (http://en.wikipedia.org/wiki/Special:Contributions). Can you point me to the dump(s) I need to get this data (namespace, page title, user, diff, comment, datetime)? The pages-logging.xml dump is already great for finding out about created / blocked user accounts; what we're missing is the actual contributions.
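To make the question concrete, here is a minimal Python sketch of how I imagine extracting such records from a revision-history dump (the tag names are my guess from the MediaWiki export format; real dumps also declare an XML namespace that would need stripping, and a diff is not stored per revision, so it would have to be computed from consecutive revision texts):

```python
import io
import xml.etree.ElementTree as ET

def iter_revisions(source):
    """Stream per-revision records from a pages-meta-history style dump
    without loading the whole file into memory.
    NOTE: real dumps declare an xmlns; tags would need to be stripped
    or prefixed accordingly."""
    title = None
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            yield {
                "title": title,
                "timestamp": elem.findtext("timestamp"),
                "user": elem.findtext("contributor/username"),
                "comment": elem.findtext("comment"),
            }
            elem.clear()  # free memory as we stream

# made-up sample fragment, just to show the shape of the output
sample = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2012-05-01T12:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <comment>first edit</comment>
    </revision>
  </page>
</mediawiki>"""

revs = list(iter_revisions(io.StringIO(sample)))
```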

Does that make sense to you?

-- 
Gregor Martynus

On Monday, 11. June 2012 at 14:51, Felipe Ortega wrote:

________________________________
From: Gregor Martynus <gregor@martynus.net>
To: Petr Onderka <gsvick@gmail.com>
Sent: Sunday, June 10, 2012 19:38
Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)


Thanks Petr!


Besides the XML dumps, are there also direct SQL dumps available? It looks like some things are indeed missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?

Hi, Gregor.


I created a parser for the XML dumps in the logging table. It's written in Python, using SAX:

http://git.libresoft.es/WikixRay/tree/WikiXRay/parsers/dump_sax_logging.py
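The core idea, in miniature (this is only an illustrative sketch, not the actual WikiXRay code; it also ignores nested elements such as contributor/id, which a real parser has to disambiguate from the top-level log id):

```python
import xml.sax

class LogItemHandler(xml.sax.ContentHandler):
    """Collect the flat fields of each <logitem> as SAX streams the dump."""
    FIELDS = {"id", "timestamp", "type", "action", "logtitle"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # dict for the <logitem> being read
        self._field = None    # name of the field whose text we collect
        self._buf = []

    def startElement(self, name, attrs):
        if name == "logitem":
            self._current = {}
        elif self._current is not None and name in self.FIELDS:
            self._field = name
            self._buf = []

    def characters(self, content):
        if self._field:
            self._buf.append(content)

    def endElement(self, name):
        if name == "logitem":
            self.items.append(self._current)
            self._current = None
        elif name == self._field:
            self._current[name] = "".join(self._buf)
            self._field = None

# made-up single-entry sample, just to exercise the handler
sample = ("<mediawiki><logitem><id>1</id>"
          "<timestamp>2012-05-01T00:00:00Z</timestamp>"
          "<type>newusers</type><action>create</action>"
          "<logtitle>User:Alice</logtitle></logitem></mediawiki>")
handler = LogItemHandler()
xml.sax.parseString(sample.encode(), handler)
```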

You also need to create a table in a local DB, and use a small connector to insert directly in the DB:

http://git.libresoft.es/WikixRay/tree/db/table_logging.sql

http://git.libresoft.es/WikixRay/tree/WikiXRay/db/dbaccess.py
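As an illustration of the table-plus-connector step (using sqlite3 here only so the sketch is self-contained; the column names are invented for the example and do not match WikiXRay's actual schema):

```python
import sqlite3

# In-memory stand-in for the local MySQL table; column names are
# hypothetical, mirroring the fields found in the logging dump.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE logging (
    log_id        INTEGER PRIMARY KEY,
    log_timestamp TEXT,
    log_type      TEXT,
    log_action    TEXT,
    log_title     TEXT,
    log_user      TEXT)""")

def insert_logitem(conn, item):
    """Tiny connector: insert one parsed <logitem> dict into the table."""
    conn.execute(
        "INSERT INTO logging VALUES (?, ?, ?, ?, ?, ?)",
        (item["id"], item["timestamp"], item["type"],
         item["action"], item["logtitle"], item.get("user")))

insert_logitem(conn, {"id": 1, "timestamp": "2012-05-01T00:00:00Z",
                      "type": "newusers", "action": "create",
                      "logtitle": "User:Alice", "user": "Alice"})
row = conn.execute("SELECT log_type, log_action FROM logging").fetchone()
```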

An improved version using lxml and a more efficient organization will be available soon. It will be part of my new tool, WikiDAT (Wikipedia Data Analysis Toolkit). I expect to publish the first release before the end of this month.



Another question regarding the log actions. I made a list of all actions grouped by types:


I wonder if there is a page describing the meanings of these actions? I searched for such a help page, but couldn't find anything useful yet. If there is no such page, I'd love to contribute one; I'd just need a hint about how and where to do it, as I'm not familiar with Wikipedia myself. I'm just helping a friend with her dissertation.
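For reference, this is roughly how I built the list: stream the dump and group the distinct action values under each type (element names assumed from the dump; the sample data below is made up):

```python
import io
from collections import defaultdict
import xml.etree.ElementTree as ET

def actions_by_type(source):
    """Group the distinct <action> values under each <type> found in a
    pages-logging style dump (streamed, so it scales to the full file)."""
    tree = defaultdict(set)
    log_type = None
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "type":
            log_type = elem.text
        elif elem.tag == "action":
            tree[log_type].add(elem.text)
        elif elem.tag == "logitem":
            elem.clear()  # free memory as we stream
    return {t: sorted(a) for t, a in tree.items()}

sample = """<mediawiki>
  <logitem><type>block</type><action>block</action></logitem>
  <logitem><type>block</type><action>unblock</action></logitem>
  <logitem><type>newusers</type><action>create</action></logitem>
</mediawiki>"""

mapping = actions_by_type(io.StringIO(sample))
```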

In general, documentation about the structure and content of the XML dumps and other data sources is either unavailable or still incomplete.

I'm also preparing a (brief) companion document for this new tool, explaining in some detail the different XML dump files, their structure, the meaning of their fields, and example queries to retrieve interesting data. I will be happy to include your map of log actions, if you agree. I'm afraid not many people have used this dump (and others) in the past, simply because they don't have information about them.


Best,
Felipe.




-- 
Gregor Martynus



On Sunday, 10. June 2012 at 19:00, Petr Onderka wrote:
The XML file is a dump of the logging table, i.e. of its columns.


The dump is not an exact copy of the table, though.
For example, deleted log entries are still present in the table,
but I believe they are not in the dump.


Petr Onderka
[[en:User:Svick]]


On Sun, Jun 10, 2012 at 5:59 PM, Gregor Martynus <gregor@martynus.net> wrote:
I think I found a way,


1. From command line, I started a connection to mysql (user:root, no
password) and opened the pages_logging database
$ mysql --local-infile -uroot pages_logging


2. I executed the following query:


mysql> LOAD DATA LOCAL INFILE '/path/to/pages-logging.xml'
    -> INTO TABLE test
    -> CHARACTER SET binary
    -> LINES STARTING BY '<logitem>' TERMINATED BY '</logitem>' (@logitem)
    -> SET
    ->   id        = ExtractValue(@logitem:=CONVERT(@logitem using utf8), 'id'),
    ->   timestamp = ExtractValue(@logitem, 'timestamp'),
    ->   type      = ExtractValue(@logitem, 'type'),
    ->   action    = ExtractValue(@logitem, 'action'),
    ->   logtitle  = ExtractValue(@logitem, 'logtitle'),
    ->   user_id   = ExtractValue(@logitem, 'contributor/id'),
    ->   user_name = ExtractValue(@logitem, 'contributor/username');


That worked so far. But I see some rows that are empty, so I guess the
<logitem> nodes have a different syntax depending on the type and action? I
don't care too much, as I just need the newuser and block actions; I just want
to make sure my assumption is correct, so that the study isn't based
on faulty data.


Is there a description of the pages-logging.xml syntax available somewhere,
so I can double check my import script?
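To check that assumption myself, I could imagine scanning the dump and recording which child elements each (type, action) pair actually carries, e.g. with a sketch like this (the sample data is made up):

```python
import io
import xml.etree.ElementTree as ET

def tags_per_action(source):
    """Report which child elements each (type, action) pair actually
    carries, to verify the guess that <logitem> syntax varies by action."""
    seen = {}
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "logitem":
            key = (elem.findtext("type"), elem.findtext("action"))
            seen.setdefault(key, set()).update(c.tag for c in elem)
            elem.clear()  # free memory as we stream
    return seen

sample = """<mediawiki>
  <logitem><type>block</type><action>block</action>
    <params>24 hours</params></logitem>
  <logitem><type>newusers</type><action>create</action>
    <logtitle>User:Alice</logtitle></logitem>
</mediawiki>"""

report = tags_per_action(io.StringIO(sample))
```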


Thanks again for your help


--
Gregor Martynus


On Sunday, 10. June 2012 at 14:30, Gregor Martynus wrote:


Thank you all!


I downloaded the pages-logging.xml.gz and the data looks good!


May I ask another question? In order to analyze the data, I'd like to
transform it to SQL. I've found a Java and a Perl tool, but neither is
made to transform the logging XML to SQL. Are you aware of another tool that
I can use, or maybe an instruction to follow?


Or maybe there is another tool you can think of that I can use to analyze
the information in a performant way?


Once again, thanks a lot for your help, really appreciate it.


--
Gregor Martynus


On Saturday, 9. June 2012 at 19:38, Platonides wrote:


On 09/06/12 17:23, Gregor Martynus wrote:


Hi,


For a dissertation study, I'm trying to find a reliable data source from which
I can extract user account events, specifically the creation and blocking of
user accounts, with usernames, event names, and timestamps.


Is such data available? If yes, could anybody point me to where I can
get it from?


Thanks a lot


--
Gregor




You want to grab the pages-logging.xml.gz file.


Despite its name, it does contain account creation logs (though not for the
very oldest accounts) and blocking logs for accounts.








_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l