Hi,
I don't know if this issue has come up already; in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each can create archives the other can read. For
decompression, however, pbunzip2 only reliably handles archives that
were compressed with pbzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2, and that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
We got an automated notice from one of the big search engines that both http://ftpmirror.your.org and http://dumps.wikimedia.your.org were hosting some unspecified malware. As best I can tell, nothing on the mirror box itself is compromised, which leaves them being unhappy with something that we're mirroring.
I've started a ClamAV scan of the whole public volume, but that's going to take quite a while (20+ million files, 80 TB of data). The only thing it's complained about so far is:
http://ftpmirror.your.org/pub/wikimedia/images/wiktionary/fj/c/c4/citibank-…
which was making the scanner crash. I don't see anything wrong with the file itself though.
Is it possible someone uploaded something malicious at one point that's still floating around in the archives that got pushed to us?
-- Kevin
On 02/07/12 18:45, Gregor Martynus wrote:
> Hey there,
>
> I really appreciate your help and that of the others on the mailing
> list; thanks to you, I was able to extract the data my friend needs
> for her Wikipedia study.
>
> While it worked as a proof of concept that the data she needs is
> available, it took too much time to process all of it, and she will
> need the data another 3 to 4 times.
>
> I'm contacting you because you offered to send me an SQL dump back
> then; does that mean you have access to the Wikipedia database?
>
> I'm not an SQL expert, but I think that, based on the MediaWiki
> database schema, I could build the SQL queries for the 3 things she
> needs:
>
> 1. user_accounts(name, created_at, blocked_at) created manually in 2012
> 2. pages (namespace_id, title, is_redirect)
> 3. revisions (page_id, user_id, created_at) created by users from 1.
> bytes_diff would be nice, if available
>
> The goal is to find users who manually created an account, have not
> been blocked, and made their first contribution in June 2012.
>
> Any chance you could run the queries and send me the results as SQL
> dumps? Or if not, do you know somebody I could ask?
>
> Thanks again for your help!
>
> --
> Gregor Martynus
Given that your requirements here are more specific than those of the
bug, I have made the following query for you, taking advantage of the
fact that the recentchanges table keeps the data from the last month:
SELECT rc_timestamp, rc_user_text, user_registration, rc_namespace,
       rc_title, rc_old_len, rc_new_len, rc_new, rc_minor, rc_type,
       page_is_redirect
FROM recentchanges
JOIN user ON (user_id = rc_user)
LEFT JOIN page ON (page_id = rc_cur_id)
WHERE rc_timestamp > '20120600000000'
  AND user_registration LIKE '2012%'
ORDER BY rc_timestamp ASC
Some notes:
- This covers June 2 through July 2 (recentchanges only holds about a
month of data).
- No bytes_diff, but you have the old and new byte lengths :) (see the
sketch after these notes)
- It may have some non-edit log entries.
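If she does end up needing bytes_diff, it can be derived from those two
columns. A minimal, untested sketch (rc_old_len/rc_new_len may be NULL
for non-edit rows, so this filters to edits and page creations):

SELECT rc_timestamp, rc_title,
       (rc_new_len - rc_old_len) AS bytes_diff
FROM recentchanges
WHERE rc_type IN (0, 1)  -- 0 = edit, 1 = new page
  AND rc_old_len IS NOT NULL
  AND rc_new_len IS NOT NULL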
File lives at http://toolserver.org/~platonides/sandbox/martynus.txt
I also made a list of the accounts created manually in 2012:
SELECT user_id, user_name, log_timestamp AS signed_up_at
FROM user
LEFT JOIN logging ON (user_id = log_user)
WHERE log_type = 'newusers'
  AND log_action = 'create'
  AND log_timestamp > '20120000000000'
File lives at http://toolserver.org/~platonides/sandbox/martynus2.txt
And a list of block actions against users registered in 2012:
SELECT log_action, log_timestamp, log_title, user_registration
FROM logging
JOIN user ON (log_title = user_name)
WHERE log_type = 'block'
  AND log_timestamp > '20120000000000'
  AND user_registration > '20120000000000'
File lives at http://toolserver.org/~platonides/sandbox/martynus3.txt
I guess that will work for your needs.
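If you would rather have a single list matching your stated goal
(accounts created manually in 2012, never blocked, first contribution
in June 2012), something along these lines might work. It is an
untested sketch; the "creations" and "blocks" aliases are just names I
picked, and since recentchanges only holds about a month of data, the
earliest edit recorded there is merely a proxy for a user's first
contribution:

SELECT user_id, user_name, MIN(rc_timestamp) AS first_edit
FROM user
JOIN logging AS creations ON (creations.log_user = user_id
    AND creations.log_type = 'newusers'
    AND creations.log_action = 'create')
LEFT JOIN logging AS blocks ON (blocks.log_title = user_name
    AND blocks.log_type = 'block')
JOIN recentchanges ON (rc_user = user_id)
WHERE user_registration LIKE '2012%'
  AND blocks.log_id IS NULL  -- anti-join: keep only never-blocked users
GROUP BY user_id, user_name
HAVING first_edit LIKE '201206%'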
Regards
PS: Rows per file:
73426 martynus.txt
54429 martynus2.txt
2893 martynus3.txt
Sorry if this is the wrong place to ask; please let me know if there is a better place, or if you know somebody I could contact.
2 questions:
1. If I prepared 3 SQL SELECT queries to run against DEWIKI, could anybody with access actually run them and send me the results as SQL dumps?
2. Could I have an SQL dump of http://simple.wikipedia.org/ or another small Wikipedia to prepare the queries?
The results will be ~20MB.
I was able – with the help of this mailing list – to extract the needed information out of the XML dumps last month, but it took me days to download and process the files, and my friend will need the data another 3-4 times for her study.
I'd really appreciate your help.
--
Gregor