Dear Skip,
You can always use the dump files to host a local version of Wikipedia. These dump files are available at download.wikimedia.org. However, at the moment there are some hardware issues and the site is not available. Given the task, I think that the [language-code][wikiproject]-pages-meta-current.xml.bz2 files are the most interesting ones. You can find a complete dump from August 2009 as part of Amazon's AWS public datasets at http://aws.amazon.com/publicdatasets/.
I have posted a step-by-step tutorial on the Wiki research mailing list explaining how to get access to those files.
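In case it helps with indexing, here is a rough Python sketch of how one of the pages-meta-current.xml.bz2 files could be read page by page without unpacking it to disk first. It is only an illustration under some assumptions: the function names are my own invention, I am assuming the usual MediaWiki export layout (page elements containing a title and a revision with a text element), and the XML namespace is stripped rather than hard-coded because its version string differs between dumps.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from an open MediaWiki XML export stream.

    Assumes the usual layout: <page><title>..</title><revision><text>..</text>.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # An empty <text/> element has no text node; treat it as "".
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Free the children of the finished page. The root element still
            # keeps the emptied <page> nodes, so memory grows slowly.
            elem.clear()


def iter_dump(path):
    """Stream pages from a bz2-compressed dump file (path is hypothetical)."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

You would call it along the lines of `for title, text in iter_dump("enwiki-pages-meta-current.xml.bz2"): ...`, where the file name is a placeholder for whichever dump you download, and hand each page to your own indexer from there.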
Best,
Diederik
On Wed, Dec 1, 2010 at 11:35 AM, Skip Garner garner@vbi.vt.edu wrote:
Dear Wikiteam, Guy Chapman requested that I post to the mailing list to ask how we can proceed to getting a copy of Wikipedia so that we can offer it as a database in our free search service, in response to the request in the following paragraph. He made me aware of its size, but that is not an issue. I would like to obtain a copy and then establish a routine for automated synced downloads like we do for the other databases we have in our system. I have had several requests to add Wikipedia to our eTBLAST text similarity search engine. This is to improve reference finding as well as novelty assessment. Our search tool is widely used, widely published and is free. Please see etblast.org or http://en.wikipedia.org/wiki/ETBLAST. I would like to create a searchable copy of Wikipedia locally with links back to Wikipedia for hits, and of course acknowledge Wikimedia. We do this for several open text datasets and are prepared to keep a local, synced copy of Wikipedia, if you are interested. I am certain that our mutual users would like and benefit from our working together.
Cheers, and thank you, Skip
----- Original Message ----- From: "Wikipedia information team" info-en@wikimedia.org To: "Skip Garner" garner@vbi.vt.edu Cc: "Dominik L. Borkowski" dom@vbi.vt.edu, "Johnny Sun" szhaohui@vbi.vt.edu Sent: Wednesday, December 1, 2010 9:43:25 AM Subject: Re: [Ticket#2010112810016598] I would like to provide a different search engine for Wikimedia
Dear Skip Garner,
Thank you for your email. Our response follows your message.
11/29/2010 16:23 - Skip Garner wrote:
Guy, Thank you for the information. I would like to move forward on this, for I think it will be of mutual value. The size of the database is not an issue, and we are always expanding our storage and serving capabilities. We regularly work with data in the hundreds of terabytes. One issue would be getting the first copy, but we could probably handle that by FedEx.
Can you tell me how we can proceed?
Cheers, Skip
The best bet is probably to email the wikitech mailing list, which is where the devs hang out.
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
They will have the best idea of the practicalities.
Yours sincerely, Guy Chapman
-- Wikipedia - http://en.wikipedia.org
Disclaimer: all mail to this address is answered by volunteers, and responses are not to be considered an official statement of the Wikimedia Foundation. For official correspondence, please contact the Wikimedia Foundation by certified mail at the address listed on http://www.wikimediafoundation.org
-- Harold "Skip" Garner Executive Director Virginia Bioinformatics Institute Virginia Tech Washington Street (0477) Blacksburg, VA 24061 http://www.vbi.vt.edu
Phone: 540.231.2582 Fax: 540.231.1388
Assistant: Renee Nester renee@vbi.vt.edu 540.231.2582
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l