[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Mike Dupont jamesmikedupont at googlemail.com
Fri May 18 10:12:13 UTC 2012


There is no 10 GB limit, but it is the recommended bucket size if you
want to split up the file, according to my recent discussion with the
archive.org team, who have been helping me optimize the storage.
My idea is to make smaller blocks that can be fetched quickly, so that
someone reading an article, for example, loads only the data needed to
display it. That data would be available via JSON(P) or XML/text
from a file.
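
A rough sketch of the small-block idea, in Python. This is only an
illustration under my own assumptions: the function name
split_into_blocks, the max_bytes parameter and the record fields are
made up for the example and are not taken from the fosm upload scripts.
Each record goes on one line as JSON, and a block is closed as soon as
the next record would push it over the size limit.

```python
import json

def split_into_blocks(records, max_bytes):
    """Group records into newline-delimited JSON blocks of at most max_bytes.

    A single record larger than max_bytes still gets a block of its own.
    """
    blocks, current, size = [], [], 0
    for rec in records:
        line = json.dumps(rec, sort_keys=True) + "\n"
        n = len(line.encode("utf-8"))
        # Close the current block if this record would not fit into it.
        if current and size + n > max_bytes:
            blocks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        blocks.append("".join(current))
    return blocks

# Hypothetical sample data: five tiny "articles".
records = [{"title": t, "text": "x" * 10} for t in "ABCDE"]
blocks = split_into_blocks(records, max_bytes=80)
```

A client that wants one article then only has to fetch and parse the
block that contains it, not the whole dump.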
We can run Wikipedia in a read-only mode hosted totally on
archive.org, without a database server, by encoding the binary search
trees as JSON data that is also stored on archive.org; the clients can
then perform the searches themselves.
That is my current research on fosm.org, and I hope it can apply to
Wikipedia as well.
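
The client-side search could look roughly like the following Python
sketch. It is an assumption-laden illustration, not the fosm.org code:
build_tree, lookup, the node fields ("key", "block", "left", "right")
and the block file names are all hypothetical. A balanced binary search
tree over the article titles is serialized as plain JSON, and the
client walks it to find which block file holds a title, with no
database server involved.

```python
import json

def build_tree(entries):
    """Build a balanced binary search tree from sorted (key, block) pairs."""
    if not entries:
        return None
    mid = len(entries) // 2
    key, block = entries[mid]
    return {
        "key": key,
        "block": block,
        "left": build_tree(entries[:mid]),
        "right": build_tree(entries[mid + 1:]),
    }

def lookup(node, key):
    """Walk the decoded JSON tree; return the block name or None."""
    while node is not None:
        if key == node["key"]:
            return node["block"]
        node = node["left"] if key < node["key"] else node["right"]
    return None

# Hypothetical index: article title -> block file that contains it.
entries = sorted([
    ("Amsterdam", "block-0000.json"),
    ("Berlin", "block-0001.json"),
    ("Cadiz", "block-0001.json"),
    ("Pristina", "block-0002.json"),
])
index_json = json.dumps(build_tree(entries))  # what would be uploaded
tree = json.loads(index_json)                 # what a client would download
```

For a real dump the tree would not fit in one file, so the nodes
themselves would have to be split across several JSON files and fetched
level by level; the sketch keeps everything in one document for
brevity.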
mike

On Fri, May 18, 2012 at 9:41 AM, emijrp <emijrp at gmail.com> wrote:
> There is no such 10GB limit,
> http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (238 GB example)
>
> ArchiveTeam/WikiTeam is uploading some dumps to Internet Archive, if you
> want to join the effort use the mailing list
> https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.
>
> 2012/5/18 Mike Dupont <jamesmikedupont at googlemail.com>
>
>> Hello People,
>> I have completed my first set in uploading the osm/fosm dataset (350gb
>> unpacked) to archive.org
>> http://osmopenlayers.blogspot.de/2012/05/upload-finished.html
>>
>> We can do something similar with wikipedia, the bucket size of
>> archive.org is 10gb, we need to split up the data in a way that it is
>> useful. I have done this by putting each object on one line and each
>> file contains the full data records and the parts that belong to the
>> previous block and next block, so you are able to process the blocks
>> almost stand alone.
>>
>> mike
>>
>> _______________________________________________
>> Wikimedia-l mailing list
>> Wikimedia-l at lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>>
>
>
>
> --
> Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
> Pre-doctoral student at the University of Cádiz (Spain)
> Projects: AVBOT <http://code.google.com/p/avbot/> |
> StatMediaWiki<http://statmediawiki.forja.rediris.es>
> | WikiEvidens <http://code.google.com/p/wikievidens/> |
> WikiPapers<http://wikipapers.referata.com>
> | WikiTeam <http://code.google.com/p/wikiteam/>
> Personal website: https://sites.google.com/site/emijrp/



-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3


