Xmldatadumps-l July 2012

xmldatadumps-l@lists.wikimedia.org

8 participants
5 discussions

by Richard Jelinek

Hi, don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't... I hereby propose, that pbzip2 (https://launchpad.net/pbzip2) is used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on. A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results done even some few hours before that. The results indicate that: bzip2 and pbzip2 are vice-versa compatible each one can create archives, the other one can read. But if it is for uncomressing, only pbzip2 compressed archives are good for pbunzip2. I propose compressing the archives with pbzip2 for the following reasons: 1) If your archiving machines are SMP systems this could lead to a better usage of system ressources (i.e. faster compression). 2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual. 3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host. So to sum up: It's a no loose and two win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-) cheers, -- Dipl.-Inf. Univ. Richard C. Jelinek PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek Human Language Technology Experts Sitz der Gesellschaft: Fürth 69216618 Mind Units Registergericht: AG Fürth, HRB-9201

8 years, 3 months

Malware reported in mirror

by Kevin Day

We got an automated notice from one of the big search engines that both http://ftpmirror.your.org and http://dumps.wikimedia.your.org were hosting some unspecified malware. I've verified nothing on the mirror box itself is compromised from the best I can tell, which leaves them being unhappy with something that we're mirroring. I've started ClamAV scanning the whole public volume, but that's going to take quite a while (+20 million files, 80TB of data). The only thing it's complained about so far is: http://ftpmirror.your.org/pub/wikimedia/images/wiktionary/fj/c/c4/citibank-… which was making the scanner crash. I don't see anything wrong with the file itself though. Is it possible someone could have uploaded something at one point that was malicious and it's still floating around in the archives that got pushed to us? -- Kevin

11 years, 9 months

Re: [Xmldatadumps-l] anonymous user account logs (account created / account blocked)

by Platonides

On 02/07/12 18:45, Gregor Martynus wrote: > Hey there, > > I really appreciate the help of you and the others in the mailing list, > thanks to your help I was able to extract the data my friend needs for > her wikipedia study. > > While it worked as a proof of concept, that the data she needs is > available, it took to much time to process all needed data and she needs > it another 3 to 4 times. > > I contact you because you offered me to send an SQL dump back then, does > that mean you've access to the Wikipedia database? > > I'm not an SQL expert, but I think based on the MediaWiki database > schema, I could build the SQL queries for the 3 things she needs > > 1. user_accounts(name, created_at, blocked_at) created manually in 2012 > 2. pages (namespace_id, title, is_redirect) > 3. revisions (page_id, user_id, created_at) created by users from 1. > bytes_diff would be nice, if available > > The goal is to find users that manually created an account, have not > been blocked and that made their first contribution in June 2012. > > Any chance you could run the queries and send me the results as sql > dumps? Or if not, do you know somebody I could ask? > > Thanks again for your help! > > -- > Gregor Martynus Given that your requisites here are more specific than those of the bug, I have made the following query for you, taking advantage that the recentchanges talbe keeps the data from last month: select rc_timestamp, rc_user_text, user_registration, rc_namespace, rc_title, rc_old_len, rc_new_len, rc_new, rc_minor, rc_type, page_is_redirect from recentchanges JOIN user on (user_id=rc_user) LEFT JOIN page ON (page_id=rc_cur_id) where rc_timestamp > '201206000000' and user_registration LIKE '2012%' order by rc_timestamp asc Some notes: - This will contain from June 2 to July 2. - No bytes_diff, but you have old and new byte len :) - It may have some non-edit log entries. File lives at http://toolserver.org/~platonides/sandbox/martynus.txt I also made a list of the accounts created in 2012 which were made manually: SELECT user_id, user_name, log_timestamp AS signed_up_at FROM user LEFT JOIN logging ON user_id = log_user WHERE log_type = 'newusers' AND log_action = 'create' AND log_timestamp > '20120000000000 File lives at http://toolserver.org/~platonides/sandbox/martynus2.txt And of block actions done to users registered in 2012: SELECT log_action, log_timestamp, log_title, user_registration FROM logging JOIN user on (log_title=user_name) WHERE log_type = 'block'AND log_timestamp > '20120000000000' and user_registration > '20120000000000' File lives at http://toolserver.org/~platonides/sandbox/martynus3.txt I guess that will work for your needs. Regards PS: Rows per file: 73426 martynus.txt 54429 martynus2.txt 2893 martynus3.txt

11 years, 9 months

en wiki dumps halted while...

by Ariel T. Glenn

db lag on one of the en wiki slaves is investigated; don't know how long it will take to find the cause or fix it (not days, but surely hours). Ariel

11 years, 9 months

custom SQL dump?

by Gregor Martynus

Sorry if this is the wrong place to ask, please let me know if there is a better place or if you know somebody whom I could contact. 2 questions: 1. If I'd prepare 3 SQL SELECT queries to run against DEWIKI, could anybody with access actually run them and send me the results as sql dumps? 2. Could I have an SQL dump of http://simple.wikipedia.org/ or another small Wikipedia to prepare the queries? The results will be ~20MB. I was able – with the help of this mailing list – to extract the needed information out of the XML dumps last month, but it took me days to download and process the files and my friend will need the data another 3 - 4 times for her study. I'd really appreciate your help -- Gregor

11 years, 9 months

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Xmldatadumps-l July 2012