From: Platonides platonides@gmail.com
To: Gregor Martynus gregor@martynus.net
CC: Felipe Ortega glimmer_phoenix@yahoo.es; "xmldatadumps-l@lists.wikimedia.org" xmldatadumps-l@lists.wikimedia.org
Sent: Monday, 11 June 2012 22:38
Subject: Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)
On 11/06/12 21:53, Gregor Martynus wrote:
Thanks Felipe, I'll definitely give it a try next time. One thing that puzzles me: from your code it seems there would be <namespace> tags in the pages-logging.xml dump. Is this the case? I didn't see these myself.
Oops. That is probably left over from copy-pasting the parser skeleton for the revision table dumps. The new version definitely won't have that. I'll fix that source file.
I've updated the type/action tree with the input by Platonides, feel free to use / extend it: https://gist.github.com/2906718
Great, thanks.
I was surprised that the pages-logging.xml dump does not contain events about user contributions. My friend is searching for
- users with first time contributions in May
- only manual sign ups
- dates when accounts have been created
and some more detailed things, but that would be the start.
For example, there is the special page "User Contributions" (http://en.wikipedia.org/wiki/Special:Contributions). Can you point me to the dump(s) I need to get this data (namespace, page title, user, diff, comment, datetime)? The pages-logging.xml is already great to find out about created / blocked user accounts, what we are missing are the actual contributions.
Does that make sense to you?
-- Gregor Martynus
Page edits appear in the article XML dumps. Special:Contributions is just a query against the revision table. The information you want is in pages-meta-history, but if stub-meta-history is enough for you (i.e. you don't need the actual page content), it is a much smaller file.
Indeed. For that purpose, you must join information from the revision, page and logging tables. As Platonides has suggested, you have two options:
- Stub-meta-history: All meta information from the revision and page tables, but no text.
- Pages-meta-history: Same as before, plus the complete text of every revision of each wiki page (all namespaces).
Keep in mind that you get the whole text, not diffs, for every change. That's why these files can be huge once you decompress them (100 times larger, sometimes even more).
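As a rough illustration of the stub-meta-history option, here is a minimal Python sketch that streams a MediaWiki export file and pulls out namespace, page title, user, timestamp and comment for each revision. It assumes the usual export layout (<page> containing <title>, <ns> and <revision> elements, each revision with <timestamp>, <contributor> and an optional <comment>); the helper names `iter_revisions` and `first_edits` and the inline sample data are made up for the example, and the export schema version in the xmlns may differ between dumps, which is why the tag namespace is stripped rather than hard-coded.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield one dict per <revision> found in a MediaWiki export stream."""
    title = ns = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            user = None
            contributor = fields.get("contributor")
            if contributor is not None:
                for child in contributor:
                    # Registered users have <username>, anonymous ones <ip>.
                    if local(child.tag) in ("username", "ip"):
                        user = child.text
            timestamp = fields.get("timestamp")
            comment = fields.get("comment")
            yield {
                "namespace": ns,
                "title": title,
                "user": user,
                "timestamp": timestamp.text if timestamp is not None else None,
                "comment": comment.text if comment is not None else None,
            }
            elem.clear()  # free the revision's children once processed
        elif name == "page":
            elem.clear()
            title = ns = None


def first_edits(revisions):
    """Map each user to the timestamp of their earliest revision seen."""
    first = {}
    for rev in revisions:
        user, ts = rev["user"], rev["timestamp"]
        if user is None or ts is None:
            continue
        # Dump timestamps are ISO 8601 in UTC, so string order is time order.
        if user not in first or ts < first[user]:
            first[user] = ts
    return first


# Tiny hand-written sample standing in for a real (much larger) dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Talk:Example</title>
    <ns>1</ns>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2012-05-03T10:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <comment>first post</comment>
      <text id="10" bytes="42" />
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2012-04-01T09:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text id="11" bytes="50" />
    </revision>
  </page>
</mediawiki>"""

revisions = list(iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))))
first = first_edits(revisions)
may_newcomers = sorted(u for u, ts in first.items() if ts.startswith("2012-05"))
print(may_newcomers)
```

For a real dump you would pass a decompressing file object (for instance from `gzip.open(path, "rb")`, with `path` being wherever you saved the file) instead of the in-memory sample. Note that "first edit in May" is only reliable if the whole history has been scanned, and that account creation dates and the manual/automatic sign-up distinction still have to come from the logging dump.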
Please also be careful with the 'diff' tool, as it sometimes cannot track changes between different versions accurately (this depends on the granularity you need).
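To make the granularity point concrete, here is a small sketch using Python's standard `difflib` module as a stand-in (it is not necessarily the 'diff' tool meant above). The two revision texts and the helper `changed_units` are invented for the example. The same one-word edit is reported as a whole changed line when comparing line by line, and as a single changed word when comparing word by word.

```python
import difflib

# Two hypothetical consecutive revisions of the same page.
old = "The quick brown fox.\nIt jumps over the dog.\n"
new = "The quick red fox.\nIt jumps over the dog.\n"


def changed_units(a, b):
    """Return (removed, added) items between two sequences of units."""
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(a[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(b[j1:j2])
    return removed, added


# Line granularity: the whole first line counts as changed.
line_removed, line_added = changed_units(old.splitlines(), new.splitlines())

# Word granularity: only the edited word counts as changed.
word_removed, word_added = changed_units(old.split(), new.split())

print(line_removed, line_added)
print(word_removed, word_added)
```

Which of the two views is "accurate" depends on the question being asked, so it is worth fixing the unit of comparison before measuring how much each contribution changed.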
Cheers, Felipe.
----- Original message -----