Hi,
There is a way to download wiki dumps for any project / language; the data is from early 2009. I will detail the steps as a note for future reference. The data is made available as part of the Amazon AWS Public Datasets (http://aws.amazon.com/publicdatasets/). This method will cost a small amount of money, as you have to pay for an Amazon EC2 instance.
1) Create an AWS account.
2) Log in to AWS and select the EC2 tab.
3) Click 'Launch Instance'.
4) Select a configuration; the Basic 64-bit Amazon Linux AMI 1.0 is fine.
5) Instance Type: select Micro (this is the cheapest) and press continue.
6) Instance Details: keep the defaults and press continue.
7) Enter key/value pairs, such as key=name, value=WikiMirror (this is not strictly required), and press continue.
8) Create a new key pair, give it a name, and press 'Create & Download your Key Pair' (this is your private EC2 key and you need to store it somewhere safe).
9) Create a security group; the default settings are fine. Press continue.
10) Review your settings and press Launch.
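For reference, the console clicks above can also be done from the command line. This is only a sketch using the AWS CLI, which postdates this email; the AMI ID, key name, and security group are placeholders you would substitute with your own values:

```shell
# Launch one Micro instance from a chosen Amazon Linux AMI
# (ami-xxxxxxxx is a placeholder; use the real AMI ID from the console)
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t1.micro \
    --key-name WikiMirror-key \
    --security-groups default
```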
This will start an Amazon EC2 instance. The next step is to make the Wikimedia public dataset accessible to our instance.
1) Click 'EBS Volumes' on the upper right of the window.
2) Select 'Create Volume' from the different tabs.
3) In the Snapshot pulldown, scroll down and search for 'Wikipedia XML Backups (XML)', and enter 500 GB in the size input field. Make sure that the zone matches the zone of your primary volume, and press Create.
4) Click 'Attach Volume' and enter /dev/sdf or similar.
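The same volume steps as a command-line sketch, again using the AWS CLI with placeholder snapshot, volume, and instance identifiers (look the real ones up in the console):

```shell
# Create a 500 GB volume from the public Wikipedia XML Backups snapshot
# (snap-xxxxxxxx is a placeholder for the real snapshot ID)
aws ec2 create-volume \
    --snapshot-id snap-xxxxxxxx \
    --size 500 \
    --availability-zone us-east-1a   # must match your instance's zone

# Attach the new volume to your running instance as /dev/sdf
aws ec2 attach-volume \
    --volume-id vol-xxxxxxxx \
    --instance-id i-xxxxxxxx \
    --device /dev/sdf
```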
We have now made the dataset available to our EC2 instance. Let's get the data. (I'll assume a Windows environment; I think the Putty steps are not necessary for Linux / OS X users.)
1) We have a .PEM certificate, but if we want to use Putty on Windows, we need to convert the .PEM certificate to a format that Putty can use.
2) Download puttygen from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
3) Start puttygen and select 'Load Private Key'. Select the key that we downloaded at step 8 of creating an EC2 instance.
4) Press 'Save private key' and save the new .ppk key.
5) Close puttygen and start Putty (you can download it from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html as well).
6) Create a new session: the EC2 server name can be found by going to the EC2 dashboard on the left and selecting 'Running Instances'. At the bottom of the page you will find 'Public DNS' followed by a long name; copy this name and enter it in the Putty session.
7) In Putty, click SSH on the left and then Auth. There is a field for the private key used for authentication, with a browse button. Press Browse and select the key from step 4.
8) Start the session and enter ec2-user as the username; the key will authorize you. We are now logged on to our EC2 instance.
9) Enter on the command line: sudo mkdir /mnt/data-store
10) Enter on the command line: sudo mount /dev/sdf /mnt/data-store (the sdf depends on what you entered at step 4 of attaching the dataset).
11) cd /mnt/data-store and enter ls -al; you will see all the files that are available.
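For Linux / OS X users, the Putty steps collapse to plain ssh. A sketch, where the .pem file name and the Public DNS hostname are placeholders for your own:

```shell
# The key must not be world-readable, or ssh will refuse to use it
chmod 400 WikiMirror-key.pem

# Log in with the key; the hostname is the Public DNS value from the EC2 dashboard
ssh -i WikiMirror-key.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```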
The next step is either to copy a file using SCP or to start your own FTP server on the EC2 instance and download the files that you need. You will have to pay a small bandwidth fee, but this is in the range of cents.
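A sketch of the SCP variant, run from your local machine; the key file, hostname, and dump file name are placeholders (run ls -al in /mnt/data-store first to see the real file names):

```shell
# Copy one dump file from the mounted dataset to the current local directory
scp -i WikiMirror-key.pem \
    ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/mnt/data-store/some-dump.xml.bz2 .
```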
Best,
Diederik
On Sun, Nov 28, 2010 at 8:20 AM, Daniel Kinzler daniel@brightbyte.de wrote:
It's not that the toolserver admins are eccentric in adding such a rule; it's an issue of WM-DE liability if such information is published.
However, I think that providing such file to just a few selected people would be acceptable.
I think so too, but I have to ask our ED. Will send an email now.
-- daniel
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- <a href="http://about.me/diederik">Check out my about.me profile!</a>