Hi all;
Just like the scripts to preserve wikis,[1] I'm working on a new script to
download all Wikimedia Commons images, packed by day, but I have limited
spare time. It is sad that volunteers have to do this without any help
from the Wikimedia Foundation.
I have also started an effort on Meta (with little activity so far) to mirror
the XML dumps.[2]
If you know of universities or research groups that work with Wiki[pm]edia
XML dumps, they would be good candidates for hosting mirrors.
If you want to download the texts to your own PC, you only need 100 GB of
free space and this Python script.[3]
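The core of it is simple; a stripped-down sketch follows, where the wiki
list is only an example and the URL pattern is what I assume
dumps.wikimedia.org uses (the script in [3] is the one to actually run):

    import os
    import urllib.request

    WIKIS = ["enwiki", "eswiki", "dewiki"]   # example subset; [3] covers every wiki
    BASE = "https://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"

    for wiki in WIKIS:
        url = BASE.format(wiki=wiki)
        target = os.path.basename(url)
        if os.path.exists(target):
            print("already have", target)    # skip files fetched on a previous run
            continue
        print("downloading", url)
        urllib.request.urlretrieve(url, target)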
I have heard that the Internet Archive saves the XML dumps quarterly or so,
but there has been no official announcement. I have also heard that the
Library of Congress wants to mirror the dumps, but there has been no news
for a long time.
L'Encyclopédie has an "uptime"[4] of 260 years[5] and growing. Will
Wiki[pm]edia projects reach that?
Regards,
emijrp
[1] http://code.google.com/p/wikiteam/
[2] http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
[3] http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py
[4] http://en.wikipedia.org/wiki/Uptime
[5] http://en.wikipedia.org/wiki/Encyclop%C3%A9die
2011/6/2 Fae <faenwp(a)gmail.com>
> Hi,
>
> I'm taking part in an images discussion workshop with a number of
> academics tomorrow and could do with a statement about the WMF's long
> term commitment to supporting Wikimedia Commons (and other projects)
> in terms of the public availability of media. Is there an official
> published policy I can point to that includes, say, a 10 year or 100 year
> commitment?
>
> If it exists, this would be a key factor for researchers choosing
> where to share their images with the public.
>
> Thanks,
> Fae
> --
> http://enwp.org/user_talk:fae
> Guide to email tags: http://j.mp/faetags
>
The September en wikipedia dumps are done. Folks who use them, note
that this is the first run to produce a pile of smaller files. As you
will have noticed, the naming scheme has an additional string:
-p<first-page-id-contained>p<last-page-id-contained>. Expect the
specific groupings to change from one run to the next; the split is
time-based rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate
that those pages were deleted and not included in the dump at all.
Since there were no issues with the network, database servers, broken MW
deployments etc., the run finished without any need for restarts of a
particular step; this is probably the fastest we'll ever see it run, in
a little under 8 days.
Any issues, please let me know. I expect people will need a script to
download these files easily; didn't someone on this list have a tool in
the works?
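Something rough along these lines ought to work as a stopgap; the run
date, file extension and href-scraping below are only illustrations of
the idea, not a spec, and it's untested:

    import re
    import urllib.request

    INDEX = "https://dumps.wikimedia.org/enwiki/20110901/"   # example run date
    PATTERN = re.compile(r'href="([^"]*pages-meta-history[^"]*-p(\d+)p(\d+)\.bz2)"')

    # scrape the directory listing for the split history files
    html = urllib.request.urlopen(INDEX).read().decode("utf-8", "replace")
    pieces = sorted((int(first), int(last), name)
                    for name, first, last in PATTERN.findall(html))

    prev_last = 0
    for first, last, name in pieces:
        if first > prev_last + 1:
            # as noted above, a gap only means those page ids were deleted
            print("pages %d-%d not in this dump" % (prev_last + 1, first - 1))
        prev_last = last
        urllib.request.urlretrieve(INDEX + name, name)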
Ariel
...another dump. August is done, July 7z are done, the last of the May
history and 7z are done. That brings us up to date.
I expect to test new code with production of many small files, as
previously discussed on this list, starting within the next few days.
This test will be for en wikipedia only, as that's the dump that's
hardest to run to completion. The results might be a perfectly good
dump, or not. Even if they are, I do not plan to try running en
wikipedia dumps twice a month, so don't get your hopes up. (Who would
process all that data every two weeks anyways?)
Ariel
Hello everyone,
I hope I'm not disturbing you too much; I have the following question:
I'm considering downloading enwiki-latest-pages-articles.xml, but I need to know whether it contains enough information to rebuild the category structure (parent categories, subcategories, including Category:Contents, etc.). Does the dump include the category pages, or only the articles?
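What I have in mind is roughly the following, assuming the dump carries the
wikitext of every page (Category: pages included) and that the
[[Category:...]] links in that wikitext are enough to rebuild the tree;
please correct me if that assumption is wrong:

    import re
    from xml.etree import ElementTree

    CATLINK = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

    def category_parents(dump_path):
        """Map each page title to the categories declared in its wikitext."""
        parents = {}
        title = None
        for _event, elem in ElementTree.iterparse(dump_path):
            tag = elem.tag.rsplit("}", 1)[-1]    # drop the export xmlns prefix
            if tag == "title":
                title = elem.text
            elif tag == "text" and title is not None:
                cats = {c.strip() for c in CATLINK.findall(elem.text or "")}
                if cats:
                    parents[title] = cats
            elif tag == "page":
                elem.clear()                     # keep memory bounded
                title = None
        return parents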
Thank you very much,
Imre
Hello, are there any plans to combine all of the pages-meta-history XML dumps from the 7/22 dump into one file? This is useful for importing into JWPL.
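In the meantime I have been sketching a workaround along these lines, on the
(unverified) assumption that each piece is a complete <mediawiki> document,
so the pieces can be spliced by keeping the first header and emitting the
closing tag once at the end:

    import bz2
    import sys

    def stitch(piece_paths, out):
        """Splice split pages-meta-history pieces into one XML stream."""
        for i, path in enumerate(piece_paths):
            skipping_header = (i > 0)    # later pieces: skip down to the first <page>
            with bz2.open(path, "rt", encoding="utf-8") as piece:
                for line in piece:
                    if skipping_header:
                        if "<page>" not in line:
                            continue
                        skipping_header = False
                    if "</mediawiki>" in line:
                        continue         # written once, below
                    out.write(line)
        out.write("</mediawiki>\n")

    if __name__ == "__main__":
        # pass the pieces on the command line in page-id order
        stitch(sys.argv[1:], sys.stdout)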
Thanks,
Diane M. Napolitano
Associate Research Engineer
Educational Testing Service
Turnbull Hall R-239
Princeton, New Jersey 08540
A new month, another couple of en wikipedia dumps...
It looks like the various upgrade issues are all straightened out. The
June files that were truncated have all been rerun and are ready for
download. In the meantime the July dumps are ready, for those willing
to grab the bz2 files. 7z files should be available in another couple
of days, barring any site issues. If the July files look ok to folks,
I'll do the last of our OS upgrades so that all our dump servers will be
up to date.
Ariel
I have been tasked with building an offline copy of the Wikipedia website. The main goal is to have the database and images stored locally so that we can run a Wikipedia website on a local server. Ultimately we want MediaWiki, the Wikipedia database, and the images stored on a single hard drive in a server at a location with no Internet access.
I have already been in contact with the Wikimedia Offline list about this issue but as yet have received no feedback about how to go about solving this problem.
I've made good progress configuring MediaWiki and importing the articles and page links, but when I look at some pages I can see that some templates are missing from my offline Wikipedia. The missing templates are easy to spot because the text "Template:abcdef" is displayed in red with the alt text "Template:abcdef (page does not exist)". A couple of specific examples of missing templates are 'Template:Citation/make link' and 'Template:Gaps'.
My offline Wikipedia data was imported into the MySQL database using the mwdumper program. The source data came from the enwiki-latest-pages-articles.xml file. I imported the page links into the database using the enwiki-latest-pagelinks.sql file.
Can anyone give me some guidance on how to fix this problem?
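As a stopgap I have been experimenting with pulling the missing templates
from the live site via Special:Export and loading the result with
MediaWiki's maintenance/importDump.php; the title list below is just an
example, and I would generate the real list from the red links:

    import urllib.parse
    import urllib.request

    MISSING = ["Template:Citation/make link", "Template:Gaps"]   # example titles
    EXPORT = "https://en.wikipedia.org/wiki/Special:Export/"

    for title in MISSING:
        url = EXPORT + urllib.parse.quote(title.replace(" ", "_"), safe=":/")
        req = urllib.request.Request(url, headers={"User-Agent": "offline-wiki-fixup/0.1"})
        xml = urllib.request.urlopen(req).read()
        filename = title.replace(":", "_").replace("/", "_") + ".xml"
        with open(filename, "wb") as f:
            f.write(xml)
        # then: php maintenance/importDump.php <filename>
        print("saved", filename)

This feels like treating the symptom rather than the cause, though, so any
pointers to a proper fix would be appreciated.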
My sincere apologies if this is not the appropriate place to ask for assistance with this problem.
Kevin
I've finally moved past manual testing to running a worker process on
one of our upgraded hosts. This means new versions of various pieces of
software as well as the OS. I hope I've found all the issues, but I
would feel a whole lot better about it if folks would take a look at the
jobs running now. Small wikis only, the large ones are still running on
the old setup. Anyone notice anything weird or is it all ok?
Ariel