Hello everyone,
Is there any alternative way to get hold of a Wikipedia dump? Preferably the last complete one, which was supposed to be available at this address: http://download.wikimedia.org/enwiki/20100130/
I would need that dump as soon as possible for my research. Thank you for any help!
Best regards
—
Oliver Schmidt PhD student Nano Systems Biology Research Group
University of Ulster, School of Biomedical Sciences Cromore Road, Coleraine BT52 1SA, Northern Ireland
T: +44 / (0)28 / 7032 3367 F: +44 / (0)28 / 7032 4375 E: schmidt-o1@email.ulster.ac.uk
—
Crossposting.
This dump is in /mnt/user-store/dump (or dumps) on the Toolserver. If the admins don't see any problem with it, it could be made available for download (~30 GB).
Regards, emijrp
Thank you, Emijrp!
That sounds great! If the admins agree, how can I access the toolserver?
Shall I request an account to access the toolserver, as described here: https://wiki.toolserver.org/view/Account_approval_process, or shall I wait until the admins agree?
Regards, Olli
On Fri, Nov 26, 2010 at 1:51 PM, emijrp emijrp@gmail.com wrote:
This dump is in /mnt/user-store/dump (or dumps) on the Toolserver. If the admins don't see any problem with it, it could be made available for download (~30 GB).
Somehow I think that publishing an entire dump violates the "do not publish significant parts of an article" rule.
2010/11/26 Bryan Tong Minh bryan.tongminh@gmail.com:
Somehow I think that publishing an entire dump violates the "do not publish significant parts of an article" rule.
Surely the toolserver admins could be asked to consider waiving that in this case, given the public nature of the dumps and the downtime situation with download.wm.o.
Roan Kattouw (Catrope)
I really don't want to bug you all, but I'd like to prepare everything needed to get the dump. Is it necessary to request an account to access the toolserver?
Sorry for the naive question.
Thanks for any help
Olli
Roan Kattouw wrote:
2010/11/26 Bryan Tong Minh bryan.tongminh@gmail.com:
Somehow I think that publishing an entire dump violates the "do not publish significant parts of an article" rule.
Surely the toolserver admins could be asked to consider waiving that in this case, given the public nature of the dumps and the downtime situation with download.wm.o.
It's not that the toolserver admins are eccentric in adding such a rule; it is an issue of WM-DE liability if such information is published.
However, I think that providing the file to just a few selected people would be acceptable.
Also, as discussed with Ariel, I will gladly mirror such dumps on the wm-es web space.
On Fri, Nov 26, 2010 at 6:43 PM, Platonides platonides@gmail.com wrote:
Also, as discussed with Ariel, I will gladly mirror such dumps on the wm-es web space.
You do have a toolserver account, right? I think it would be a good idea if you could copy the dumps from the Toolserver to your web space.
It's not that the toolserver admins are eccentric in adding such a rule; it is an issue of WM-DE liability if such information is published.
However, I think that providing the file to just a few selected people would be acceptable.
I think so too, but I have to ask our ED. Will send an email now.
-- daniel
Hi,
There is a way to download wiki dumps for any project/language; the data is from early 2009. I will detail the steps as a note for future reference. The data is made available as part of the Amazon AWS Public Datasets (http://aws.amazon.com/publicdatasets/). This method will cost a tiny amount of money, as you will have to pay for an Amazon EC2 instance.
1) Create an AWS account.
2) Log in to AWS and select the EC2 tab.
3) Click 'Launch Instance'.
4) Select a configuration; the Basic 64-bit Amazon Linux AMI 1.0 is fine.
5) Instance type: select Micro (this is the cheapest) and press Continue.
6) Instance defaults: keep the defaults and press Continue.
7) Enter key/value pairs, such as key=name, value=WikiMirror (this is not really required), and press Continue.
8) Create a new key pair, give it a name, and press 'Create & Download your Key Pair' (this is your private EC2 key; store it somewhere safe).
9) Create a security group; the default settings are fine. Press Continue.
10) Review your settings and press Launch.
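If you would rather script the launch than click through the console, here is a minimal sketch using the boto Python library. It assumes your AWS credentials are already configured for boto, and the AMI ID is a placeholder you would look up in the console, not a real value:

    # Minimal sketch: launch a Micro instance with boto (easy_install boto).
    # Assumes AWS credentials are available to boto (environment variables
    # or ~/.boto). The AMI ID is a placeholder, not a real image ID.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances(
        'ami-xxxxxxxx',            # placeholder: Basic 64-bit Amazon Linux AMI
        key_name='WikiMirror',     # the key pair created in step 8
        instance_type='t1.micro',  # the Micro type from step 5
    )
    instance = reservation.instances[0]
    print(instance.id)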
This will start an Amazon EC2 instance. The next step is to make the Wikimedia public dataset accessible to our instance.
1) Click 'EBS Volumes' on the upper right of the window.
2) Select 'Create Volume' from the tabs.
3) In the Snapshot pulldown, scroll down and look for 'Wikipedia XML Backups (XML)', and enter 500 GB in the size input field. Make sure that the zone matches the zone of your primary volume, then press Create.
4) Click 'Attach Volume' and enter /dev/sdf or similar.
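The same step as a boto sketch, continuing from the one above. The snapshot ID, zone, and instance ID are placeholders; you would look up the 'Wikipedia XML Backups' snapshot under Snapshots in the console:

    # Minimal sketch: create a 500 GB volume from the public snapshot and
    # attach it to the instance launched above. All IDs are placeholders.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    volume = conn.create_volume(500, 'us-east-1a',         # zone must match the instance
                                snapshot='snap-xxxxxxxx')  # placeholder snapshot ID
    conn.attach_volume(volume.id, 'i-xxxxxxxx', '/dev/sdf')  # placeholder instance ID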
We have now made the dataset available to our EC2 instance. Let's get the data. (I'll assume a Windows environment; I think the PuTTY steps are not necessary for Linux/OS X users.)
1) We have a .PEM certificate, but if we want to use PuTTY on Windows we need to convert it into a key that PuTTY can use.
2) Download puttygen from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
3) Start puttygen and select 'Load Private Key'. Select the key that we downloaded at step 8 of creating the EC2 instance.
4) Press 'Save private key' and save the new .ppk key.
5) Close puttygen and start PuTTY (you can download it from the same page).
6) Create a new session: the EC2 server name can be found by opening the EC2 dashboard and selecting Running Instances. At the bottom of the page you will find 'Public DNS' followed by a long name; copy this name and enter it in the PuTTY session.
7) In PuTTY, click SSH on the left and then Auth. There is a field for the key used for authentication and a Browse button; press Browse and select the .ppk key from step 4.
8) Start the session and log in with the username ec2-user; the key will authorize you. We are now logged on to our EC2 instance.
9) On the command line, enter: sudo mkdir /mnt/data-store
10) On the command line, enter: sudo mount /dev/sdf /mnt/data-store (the sdf depends on what you entered when attaching the volume).
11) cd to /mnt/data-store and enter ls -al; you will see all the files available.
The next step is either to copy a file using SCP, or to start your own FTP server on the EC2 instance and download the files that you need. You will have to pay a small fee, but it is in the range of cents.
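For the copy route, here is a minimal Python sketch using the paramiko library over SFTP; the hostname, key file, and dump file name below are placeholders, not actual values:

    # Minimal sketch: fetch one dump file from the instance over SFTP.
    # Hostname, key file, and file name are placeholders.
    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('ec2-xx-xx-xx-xx.compute-1.amazonaws.com',  # your Public DNS
                username='ec2-user',
                key_filename='WikiMirror.pem')              # the key from step 8
    sftp = ssh.open_sftp()
    sftp.get('/mnt/data-store/enwiki-dump.xml.bz2',  # pick a file from ls -al
             'enwiki-dump.xml.bz2')
    sftp.close()
    ssh.close()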
Best,
Diederik
-- Check out my about.me profile! http://about.me/diederik
I have a copy of the 20091009 enwiki dumps if that would do:
http://jeffkubina.org/data/download.wikimedia.org/enwiki/20091009/
Jeff
On 28 November 2010 02:42, Jeff Kubina jeff.kubina@gmail.com wrote:
I have a copy of the 20091009 enwiki dumps if that would do:
http://jeffkubina.org/data/download.wikimedia.org/enwiki/20091009/
Jeff
I don't suppose anybody has a copy of any Romanian or Georgian Wiktionary from any time? (-:
Andrew Dunbar (hippietrail)
What are the ISO codes? ro and ka?
I have kawiktionary-20100807-pages-meta-history.xml.7z (1.3 MB) and rowiktionary-20100810-pages-meta-history.xml.7z (10.1 MB). Very tiny.
2010/11/28 Andrew Dunbar hippytrail@gmail.com:
I don't suppose anybody has a copy of any Romanian or Georgian Wiktionary from any time? (-: