Hello folks,
I had some questions about the order of pages and revisions in the dump.
As I understand, the order is according to the respective IDs. But
where do these IDs come from? Are they the keys of the corresponding
table in the database? So then they are more or less in order of
creation? If that's the case, why does the dump begin with pages with
titles mostly beginning with "A"?
Thank you,
Chrisil
Hello everybody,
I was doing a bit of analysis of the dump
enwiki-20100130-pages-meta-history.xml.7z. To my surprise, I found that
there are (at least) 7 million pages in the main namespace. I got this
figure by grepping for page titles that do not contain a ":" character.
Is this really the case, or am I missing something? I'd seen
some Wikimedia stats that said the number of articles currently is about
3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
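For concreteness, here is a rough sketch of the kind of count I mean, in
Python rather than the grep one-liner, assuming the dump has been
decompressed (the filename below is just my local copy). Note it is
approximate: a few legitimate main-namespace titles do contain a ":".

import re

# Stream the decompressed dump and count <title> values that have no
# ":" in them, i.e. no namespace prefix.
TITLE_RE = re.compile(r"<title>([^<]*)</title>")

count = 0
with open("enwiki-20100130-pages-meta-history.xml", encoding="utf-8") as dump:
    for line in dump:
        m = TITLE_RE.search(line)
        if m and ":" not in m.group(1):
            count += 1

print(count)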
Thank you,
Chrisil
This went off-list for some reason...
---------- Forwarded message ----------
From: Thomas Dalton <thomas.dalton(a)gmail.com>
Date: 29 June 2010 01:18
Subject: Re: [Xmldatadumps-l] Number of pages on Wikipedia
To: "Chrisil J. Arackaparambil" <chrisil(a)lanl.gov>
On 29 June 2010 01:06, Chrisil J. Arackaparambil <chrisil(a)lanl.gov> wrote:
> Hello everybody,
>
> I was doing a bit of analysis of the dump
> enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise
> is that there are (at least) 7 million pages in the main namespace. I
> got this figure by grepping for page titles that do not contain a ":"
> character. Is this really the case or am I missing something? I'd seen
> some Wikimedia stats that said the number of articles currently is about
> 3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
The 3.2 million figure does not include redirects.
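Roughly, you could exclude them like this (a sketch, assuming the
decompressed dump and the export-0.4 schema namespace; it treats any
page whose last revision text starts with #REDIRECT as a redirect,
which is a heuristic):

import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # schema version assumed

articles = 0
for _event, elem in ET.iterparse("enwiki-20100130-pages-meta-history.xml"):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title", "")
        texts = elem.findall(NS + "revision/" + NS + "text")
        last = (texts[-1].text or "") if texts else ""
        # Skip namespaced pages and redirects.
        if ":" not in title and not last.lstrip().upper().startswith("#REDIRECT"):
            articles += 1
        elem.clear()  # free memory as we go; the full-history dump is huge

print(articles)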
Hello everybody,
I'd downloaded the enwiki-20100312-pages-meta-history.xml.7z dump a
couple of days ago, but now it seems the link has been taken down:
http://download.wikimedia.org/enwiki/20100312/
On browsing the archives of this list I also found some comments
indicating that this dump had some problems.
I'd appreciate it if someone could please tell me what the status of
this dump is, and which would be the most recent dump in good shape.
Also, if the 20100312 dump is bad, perhaps there should be a note on
that page saying so, so that people don't mistakenly take it for a
clean one?
Thank you,
Chrisil
For all those watching our snapshots: there will be some downtime today as we transition to a new storage node.
We've been testing on the node for quite a while and are finally ready to migrate over.
Today's schedule includes:
- final syncs from storage2 -> dataset1
- stopping current system
- changing configs
- starting snapshots system
- migrating download.
We'll keep you updated as the transition progresses.
--tomasz
I have restarted the frwiki run again and the dump for plwiki which I
inadvertently interrupted; I apologize for the inconvenience. These are
now both running so that they can be left unattended.
Ariel
Does anyone here have one of the dumps of test.wikipedia from Jan or Feb
of this year? If so, please let me know off-list; I'd like to do some
comparisons with the current run. Thanks.
Also, in reply to Jamie Morken: the missing text for the 2005 revisions
in the Jan 2010 en wikipedia dump was caused by a bug that was fixed
midway through that run (see the techblog post for specifics).
Ariel Glenn
Hi,
Nice graphs! Since pages in the dump files are in order of page ID (more or less in order of article creation date), but most of the missing data is from revisions that occurred in the timeframe 2005-01-14 to 2005-05-14, the data was probably lost during the SQL-database-to-XML conversion step and not in the bzip2 or 7z compression step. My guess is an intermittent SQL database read timeout/error.
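As a quick check of that window, one could bin the suspect revisions by
month using the filtered .csv files Dmitry links below (columns:
pageid, revisionid, unixtime, pagetitle; the local filename here is
assumed):

import csv
from collections import Counter
from datetime import datetime, timezone

months = Counter()
with open("missing.revisions.enwiki-20100130.csv", newline="") as f:
    for pageid, revid, unixtime, title in csv.reader(f):
        ts = datetime.fromtimestamp(int(unixtime), tz=timezone.utc)
        months[ts.strftime("%Y-%m")] += 1

for month, n in sorted(months.items()):
    print(month, n)

If the read-timeout guess is right, the counts should spike between
2005-01 and 2005-05.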
cheers,
Jamie
----- Original Message -----
From: Dmitry Chichkov <dchichkov(a)gmail.com>
Date: Monday, May 17, 2010 11:22 pm
Subject: Re: [Xmldatadumps-admin-l] FYI: comparison between enwiki-20100130-pages-meta-history.xml.7z and enwiki-20100312-pages-meta-history.xml.7z
To: Jamie Morken <jmorken(a)shaw.ca>, xmldatadumps-admin-l(a)lists.wikimedia.org
> I've tried filtering and plotting empty-text revisions using the
> following criteria: the comment starts with '/*' (a section edit)
> AND it is not an IP edit. The idea is that section edits generally
> do not delete the complete article text, and registered users tend
> to vandalize less. Consequently we can get some idea of which
> revisions' text was lost during the backup.
>
> Resulting plots are attached for both [enwiki-20100130 31.9 GB] and
> [enwiki-20100312 15.8 GB] files.
>
> If anybody is interested in the raw filtered data, here's a link to
> the zipped .csv(s):
> http://76.126.237.67/tmp/missing.revisions.enwiki-20100xx.7z
> The .csv files have the following format: 'pageid, revisionid,
> unixtime, pagetitle'.
>
> -- Cheers, Dmitry
>
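For anyone wanting to reproduce the filter Dmitry describes above, a
minimal sketch (the function names are mine; IP detection uses Python's
stdlib ipaddress module):

import ipaddress

def looks_like_ip(username):
    # Anonymous edits are attributed to the editor's IP address, so a
    # contributor name that parses as an IPv4/IPv6 address is an IP edit.
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False

def is_suspect(text, comment, contributor):
    # Empty text on a section edit ('/*' comment) by a registered user
    # is unlikely to be a genuine blanking, so flag it as possibly lost.
    return (text == ""
            and comment.startswith("/*")
            and not looks_like_ip(contributor))

Each revision flagged this way would become one 'pageid, revisionid,
unixtime, pagetitle' row in the .csv.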
I notice the dumps currently seem frozen. Is this the best place to ask
for information, or is that information publicly available somewhere
else? (In which case, sorry for pestering.)
Conrad