Tomasz, can you grep the old logging dumps for an upload entry for File:Olympic
Highway - Moorong.jpg (uploaded 16 jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Olympic_Highway_-_Mooro…>)
or File:Renoir, Pierre-Auguste - The Two Sisters, On the Terrace.jpg (14
jul 2009
<http://commons.wikimedia.org/w/index.php?title=File:Renoir,_Pierre-Auguste_…>,
not the one for 15 jul 2009) ?
Dumps prior to 20090804 are not publicly available. The objective is to
look for evidence about the vanished upload log entries for those files (bug
20744).
It'd be something like:
gzip -dc commonswiki-200907*-pages-logging.xml.gz | grep -A 10 -B 10 Moorong.jpg
The presence of Olympic Highway - Moorong.jpg in image.sql would also be
interesting.
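A quick check against the image table dump would also do it; something like this
(the exact image.sql filename and date are a guess on my part):

zcat commonswiki-20090715-image.sql.gz | grep -c 'Olympic_Highway_-_Moorong.jpg'   # 0 hits means no row for the file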
> For example, I had to manually increase the number of
> threads for 7ZIP to speed it up, as you can see. It will
Sorry, I meant PIGZ :-). Fire fingers.
F.
Greetings,
I am trying to import the French wiki (full-history XML) on an Ubuntu machine
with a modern quad-core CPU and 16 GB RAM. The import command is the following:
java -Xmn256M -Xms396M -Xmx512M -XX:+DisableExplicitGC -verbose:gc \
  -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC \
  -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 \
  -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 \
  | mysql -u wiki -p frwikiLatest
I have disabled autocommit for MySQL, and disabled foreign key checks and
unique checks. I have set the buffer pool size, log buffer size, and the other
buffer sizes to large values, as recommended for good MySQL performance.
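In case the details matter, this is roughly how the checks are disabled for the
piped session and what the buffer settings look like (values are just what I
chose for this 16 GB machine, not recommendations; JVM flags omitted here):

( echo 'SET autocommit=0; SET foreign_key_checks=0; SET unique_checks=0;';
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2;
  echo 'COMMIT;' ) | mysql -u wiki -p frwikiLatest

# my.cnf (restart mysqld after changing):
#   innodb_buffer_pool_size        = 8G
#   innodb_log_buffer_size         = 64M
#   innodb_flush_log_at_trx_commit = 0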
After around 3 minutes of running the import, I got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk and the MySQL data folder is on
another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?
Another issue is that the InnoDB tables (page, revision, text) do not show the
number of records, although the table sizes are non-zero. I think this might be
related to the disable-keys query.
Is that correct?
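For what it's worth, this is how I am checking the counts; as far as I
understand, SHOW TABLE STATUS only reports an estimated row count for InnoDB
tables, so exact numbers need COUNT(*):

mysql -u wiki -p frwikiLatest -e 'SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;'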
bilal
We've started marking redirects within each of the archive snapshots.
Starting on 7/28/09, each history and article snapshot will contain a
<page>
..
<redirect />
<revision>
..
</revision>
</page>
entry, so that everyone can easily identify which articles are in fact
simply redirects.
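As a rough illustration (untested; it assumes the <title> line sits a couple of
lines above <redirect /> inside each <page> block, and the dump filename is
made up), pulling out redirect titles could look like:

bzcat enwiki-20090728-pages-articles.xml.bz2 \
  | grep -B 3 '<redirect />' \
  | grep '<title>' > redirect-titles.txt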
This came as a request from Erik Zachte to further improve our stats
collection, and it has allowed us to surface user contribution stats that are
not inflated by articles lacking significant content.
--tomasz
Tomasz Finc wrote:
> Tomasz Finc wrote:
>> Brion Vibber wrote:
>>> Tomasz Finc wrote:
>>>> Looks like we aren't getting the replacement drives in until Mon/Tue
>>>> of next week, so the array will continue to be in a degraded state until
>>>> then. Thankfully it's still under warranty, so the turnaround won't be
>>>> too bad. Tentatively putting the work to happen on Tuesday now.
>>> We were able to put in new disks today, but the raid array didn't fully
>>> recover. We got lots of I/O errors, and have been unable to run JFS
>>> recovery successfully so far.
>>>
>>> In the meantime we're running http://download.wikimedia.org/ off the
>>> copy of the last couple of dumps that had been copied to another server.
>>> The dump _files_ are there but currently the index is not.
>>>
>>> We're not 100% sure whether we'll be able to recover the earlier dumps
>>> or not, but of course more will be made soon enough. :)
>>>
>>> Some additional files such as the MediaWiki release download and DVD ISO
>>> downloads are still in process of being restored.
>>>
>>> -- brion
>> Thanks for the update Brion. I'll be checking in with Rob tomorrow to
>> see how ready the new set of drives are and if we are set to start
>> generating the snapshots anew.
>>
>> --tomasz
>>
>
> Sadly Fred and Rob took a look at the JFS storage and were not able to
> salvage any of the existing file system. We've gone ahead and started
> clean with the archives I made last week as the seeds.
>
> There will be one more day of testing tomorrow for drive removal and I
> expect to have the system back up and running by the end of the week. It
> should take about a week after to get a full cycle of all wikis.
>
Everything has been looking really good so far, and I'm finally
comfortable starting the snapshots back up. The only bit left to do
is to test by pulling a drive, but that will have to wait till we have
RobH on site again.
We're currently running five snapshot processes, and if nothing weird
happens I'll dial it up to eight.
--tomasz
cc'ing xmldatadumps-l on this.
Phil Adams wrote:
> hi tomasz,
>
> phil (philadams) here from #wikimedia-tech earlier today.
>
> i'm interested in looking at user behaviour on wikipedia, so i figured
> that the en wiki stub-meta-history would be a good place to start. i
> grabbed and uncompressed the 2009 07/02 version, and started just
> exploring it a little. i had a few questions:
>
> * is this dump supposed to contain ALL revisions to each en wiki page
> (articles and user pages in particular)? i ask b/c when i look at the
> revision history for (say) AmericanSamoa, the meta dump shows only 5
> or 6 revisions for that page, spread across time from 2001 to 2007.
> the en wiki history page online
> (http://en.wikipedia.org/w/index.php?title=American_Samoa&action=history)
> shows far more edits. what am i missing?
The XML files available for download are point-in-time snapshots of our data
set. When each snapshot runs, the stub step gets a consistent view of
our database at that exact moment. Any new revisions will only be
available in the next run.
AmericanSamoa is showing up just as it should in the snapshot because
it's a redirect. If you take a look at
http://en.wikipedia.org/w/index.php?title=AmericanSamoa&action=history
you will notice that it has only had a handful of edits compared to
http://en.wikipedia.org/w/index.php?title=American%20Samoa&action=history
(note the space between the two words).
>
> * is there any sort of ordering to the history dump? it appears
> nominally alphabetic, although isn't strictly alphabetic.
The ordering is by page id.
>
> * if i have misunderstood the purpose of the meta dumps, but still
> wanted the same information, is my best recourse simply to d/l the
> entire en wiki dump? does that contain complete revision histories for
> all pages?
The only difference between a stub and the full-history dump is the full page
content. If you don't need the content, then they are effectively the same.
--tomasz
Why was the newest copy of enwiki with the full history removed from
the downloads site? I checked around and was only able to find one
place with it:
http://www.archive.org/details/enwiki-20080103
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file,
which is about 17 GB. There is another file that is 130 GB, but that is
the SAME thing, just compressed with bz2 instead of 7z, which makes it
larger, so don't get that one.
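If you haven't worked with 7z dumps before, p7zip can stream the XML straight
to stdout, so you never need the disk space for the fully unpacked file; a
quick peek would be something like:

7z e -so enwiki-20080103-pages-meta-history.xml.7z | head -n 40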
Tomasz, I am willing to volunteer my services as a programmer to help
with this problem of producing full-history enwiki dumps, if that is
possible (I can't donate hardware or money). What are the issues that
are causing it to be so slow, and what methods are you employing to
improve it?
I know that LiveJournal has some sort of live backup system using
MySQL and Perl, but I couldn't find any details in their presentations.
You might be able to ask one of their developers for help on their LJ
blog. Can Wikimedia afford a snapshot server? It doesn't need to be as
fast as the others.
In the long run, whatever this system is, it will probably need to be
integrated into some sort of backup, because it would be a huge pain
if something happened at the data center and you needed to restore
from the partial quasi-backups in the current systems.
How does the current dump method work? Are the dumps incremental, in the
sense that they build on previous dumps instead of re-dumping all of
the data?
For future dumps, we might have to resort to some form of snapshot
server that is fed all updates from either memcached or MySQL. That
would allow a live backup to be performed, so it would be useful for
more than just dumps.
Is it possible to suspend individual slaves temporarily during off-peak
hours to flush the database to disk and then copy the database files to
another computer? If not, we may still be able to use "stale" database
files copied to another computer, as long as we only use data from them
that is at least a few days old, so we know it has been flushed to disk
(I'm not sure how MySQL flushes the data...).
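To make that concrete, here is the sort of thing I was imagining (just a
sketch; the hostname and paths are invented, and I don't know whether the
current setup would allow it):

# on a slave that can be spared during off-peak hours
mysql -e 'STOP SLAVE;'
mysqladmin shutdown        # clean shutdown so InnoDB has flushed everything to disk
rsync -a /var/lib/mysql/ backuphost:/srv/db-snapshots/$(date +%Y%m%d)/
mysqld_safe &              # bring it back up; replication then catches up on its own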
Of course, this may all be totally off, since I don't know a lot about
the current configuration and issues, so I'll take whatever input you
have to help work on something better.
Sebastian Graf wrote:
> Hello Tomasz,
>
> thanks for your quick response.
>
> Unfortunately I am in need not only of *text* but of *English text*,
> since we are currently working on a revisioned indexer.
>
> Are there any english dumps available except the enwiki?
Yup, you can grab
enwikisource
enwikiversity
enwikinews
enwiktionary
enwikiquote
metawiki
commonswiki
--tomasz
Hello everybody,
I work in the computer science department at the University of
Konstanz in Germany. We are working on a revisioned native XML
database. Wikipedia is the optimal playground when it comes
to huge amounts of data, since the XML dump is perfect for our
application.
At the moment I am looking for a recent dump of the enwiki which
contains all revisions. I know that this XML has to be really huge,
but that's exactly why we want to use it. Unfortunately I couldn't find
any file called "pages-meta-history" in the enwiki download section. Can
you help me with a dump, or an idea of how to get the data?
greetings
sebastian
--------------------------------------------------
Sebastian Graf
Distributed Systems Lab
University of Konstanz
Phone: +49 7531 88 4319
Mail: sebastian.graf(a)uni-konstanz.de
Hi,
I'm trying to get hold of the Wikipedia dump, in particular
enwiki-latest-pages-meta-history.xml.bz2.
It seems that on the page where it's supposed to be
(http://download.wikipedia.org/enwiki/latest/) it weighs in at 0.6 KB,
whereas it used to be 147 GB.
What happened to the data and where did it go?
Also, on the Wikipedia database page
(http://en.wikipedia.org/wiki/Wikipedia_database) I read:
"As of January 17, 2009, it seems that
all snapshots of pages-meta-history.xml.7z hosted
at http://download.wikipedia.org/enwiki/ are missing. The developers at
Wikimedia Foundation are working to address this issue
(http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html).
There are other ways to obtain this file."
I checked the other ways of obtaining the file that they describe; none of
them worked.
Why did the dumps vanish, and how can I download a copy of them?
Thank you