Thanks Petr!

Besides the XML dumps, are there also direct SQL dumps available? It does look like some things are missing, e.g. namespaces. It also seems that Wikipedia has information not only about the performer, but also about the target. I don't see such a thing in the XML dump … am I missing something?

Another question regarding the log actions. I made a list of all actions grouped by types:
https://gist.github.com/2906718

Is there a page describing the meanings of the actions? I searched for such a help page, but haven't found anything useful yet. If there is none, I'd love to contribute one; I'd just need a hint on how and where to do it. I'm not familiar with Wikipedia myself, I'm just helping a friend with her dissertation.

-- 
Gregor Martynus

On Sunday, 10. June 2012 at 19:00, Petr Onderka wrote:

The XML file is a dump of the logging table,
see http://www.mediawiki.org/wiki/Manual:Logging_table for description
of its columns.

The dump is not an exact copy of the table, though.
For example, deleted log entries are still present in the table,
but I believe they are not in the dump.

Petr Onderka
[[en:User:Svick]]

On Sun, Jun 10, 2012 at 5:59 PM, Gregor Martynus <gregor@martynus.net> wrote:
I think I found a way,

1. From command line, I started a connection to mysql (user:root, no
password) and opened the pages_logging database
$ mysql --local-infile -uroot pages_logging

2. I executed the following query:

mysql> LOAD DATA LOCAL INFILE '/path/to/pages-logging.xml'
    -> INTO TABLE test
    -> CHARACTER SET binary
    -> LINES STARTING BY '<logitem>' TERMINATED BY '</logitem>' (@logitem)
    -> SET
    ->   id        = ExtractValue(@logitem:=CONVERT(@logitem using utf8), 'id'),
    ->   timestamp = ExtractValue(@logitem, 'timestamp'),
    ->   type      = ExtractValue(@logitem, 'type'),
    ->   action    = ExtractValue(@logitem, 'action'),
    ->   logtitle  = ExtractValue(@logitem, 'logtitle'),
    ->   user_id   = ExtractValue(@logitem, 'contributor/id'),
    ->   user_name = ExtractValue(@logitem, 'contributor/username');

That worked so far. But I see that some rows are empty, so I guess that the
<logitem> nodes have a different syntax depending on the type and action? I
don't care too much, as I just need the newuser and block actions; I just
want to make sure that my assumption is correct, so that the study is not
based on faulty data.

Is there a description of the pages-logging.xml syntax available somewhere,
so I can double check my import script?
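
One way to double check such an import is to walk the <logitem> elements
with a streaming XML parser and record which fields are actually present.
The following is only a minimal sketch in Python: the sample data is
hypothetical, the set of child elements is assumed from the query above
rather than taken from a schema, and the helper names (local,
iter_logitems) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data. The real dump is much larger; the namespaced
# <mediawiki> root used here is an assumption about its layout.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <logitem>
    <id>1</id>
    <timestamp>2012-06-09T17:23:00Z</timestamp>
    <contributor><username>ExampleAdmin</username><id>42</id></contributor>
    <type>block</type>
    <action>block</action>
    <logtitle>User:ExampleVandal</logtitle>
  </logitem>
  <logitem>
    <id>2</id>
    <timestamp>2012-06-10T14:30:00Z</timestamp>
    <type>newusers</type>
    <action>create</action>
  </logitem>
</mediawiki>"""

FIELDS = ("id", "timestamp", "type", "action", "logtitle")


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def iter_logitems(source):
    """Yield one dict per <logitem>; fields missing in the XML stay None."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "logitem":
            continue
        row = dict.fromkeys(FIELDS + ("user_id", "user_name"))
        for child in elem:
            name = local(child.tag)
            if name == "contributor":
                # The performer's id and username are nested one level down.
                for sub in child:
                    if local(sub.tag) == "id":
                        row["user_id"] = sub.text
                    elif local(sub.tag) == "username":
                        row["user_name"] = sub.text
            elif name in FIELDS:
                row[name] = child.text
        yield row
        elem.clear()  # release the children once the row has been consumed


rows = list(iter_logitems(io.BytesIO(SAMPLE)))
for row in rows:
    missing = [key for key, value in row.items() if value is None]
    print(row["id"], row["type"], row["action"], "missing:", missing)
```

For the real file, a handle from gzip.open('/path/to/pages-logging.xml.gz',
'rb') could be passed to iter_logitems in place of the in-memory sample.
Grouping the "missing" lists by type and action would show whether the
empty rows are confined to particular log types.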

Thanks again for your help

--
Gregor Martynus

On Sunday, 10. June 2012 at 14:30, Gregor Martynus wrote:

Thank you all!

I downloaded the pages-logging.xml.gz and the data looks good!

May I ask another question? In order to analyze the data, I'd like to
transform it to SQL. I've found a Java and a Perl tool, but neither is
made to transform the logging.xml to SQL. Are you aware of another tool that
I can use, or maybe instructions to follow?

Or maybe there is another tool you can think of that I can use to analyze
the information in a performant way?
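
As a rough illustration of the "transform it to SQL" idea, the sketch below
loads a few log rows into an in-memory SQLite database and counts them per
type and action. Everything in it is an assumption made for the example:
the rows are hypothetical, the table and column names are invented, and
SQLite merely stands in for whichever database is actually used.

```python
import sqlite3

# Hypothetical rows standing in for entries parsed from pages-logging.xml:
# (id, timestamp, type, action, logtitle, user_id, user_name)
ROWS = [
    (1, "2012-06-09T17:23:00Z", "newusers", "create", "User:ExampleA", 101, "ExampleA"),
    (2, "2012-06-09T18:00:00Z", "block", "block", "User:ExampleB", 42, "ExampleAdmin"),
    (3, "2012-06-10T09:15:00Z", "newusers", "create", "User:ExampleC", 103, "ExampleC"),
]


def load(conn, rows):
    """Create the (assumed) table layout and insert all rows in one batch."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS logging (
               id INTEGER PRIMARY KEY,
               timestamp TEXT,
               type TEXT,
               action TEXT,
               logtitle TEXT,
               user_id INTEGER,
               user_name TEXT)"""
    )
    conn.executemany("INSERT INTO logging VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def count_by_type_action(conn):
    """Return (type, action, count) tuples, sorted by type and action."""
    return conn.execute(
        "SELECT type, action, COUNT(*) FROM logging "
        "GROUP BY type, action ORDER BY type, action"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, ROWS)
counts = count_by_type_action(conn)
for log_type, action, n in counts:
    print(log_type, action, n)
```

Replacing ":memory:" with a file path would keep the table on disk between
runs, and an index on (type, action) may help once the table holds the full
log.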

Once again, thanks a lot for your help, really appreciate it.

--
Gregor Martynus

On Saturday, 9. June 2012 at 19:38, Platonides wrote:

On 09/06/12 17:23, Gregor Martynus wrote:

Hi,

for a dissertation study, I am trying to find a reliable data source from
which I can extract user account events, specifically creation and blocking
of user accounts, with usernames, the event name and timestamps.

Is such data available? If yes, could anybody point me to where I can
get it from?

Thanks a lot

--
Gregor


You want to grab the pages-logging.xml.gz file.

Despite its name, it does contain creation logs (though not for very old
accounts) and blocking logs for accounts.




_______________________________________________
Xmldatadumps-l mailing list