As I assume most people here know, each revision in the full history
dumps for MediaWiki reports the complete page text. So even though an
edit may have changed only a few characters, the entire page is
repeated for each revision. This is one of the reasons that full
history dumps are very large.
Recently I've written some Python code to re-express the revision
history in an "edit syntax", using an XML-compatible notation for
changes, with expressions like:
<replace>, <delete>, <insert>, etc.
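To give a flavor of the idea, here is a greatly simplified sketch of
how such a delta can be computed with Python's difflib (the tag names
match the ones above, but the at/len attributes are placeholders for
illustration, not the actual format my code emits):

import difflib
from xml.sax.saxutils import escape

def revision_delta(old, new):
    # Express the new revision as edits against the old one.
    # Offsets refer to positions in the old text; unchanged
    # ('equal') spans are implied and not emitted.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, old, new).get_opcodes():
        if tag == 'replace':
            ops.append('<replace at="%d" len="%d">%s</replace>'
                       % (i1, i2 - i1, escape(new[j1:j2])))
        elif tag == 'delete':
            ops.append('<delete at="%d" len="%d"/>' % (i1, i2 - i1))
        elif tag == 'insert':
            ops.append('<insert at="%d">%s</insert>'
                       % (i1, escape(new[j1:j2])))
    return '\n'.join(ops)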
Since many revisions really consist only of small changes to the text,
using the notation I've been developing can greatly reduce the size of
the dump while still maintaining a human-readable syntax. For example,
I recently ran it against the full history dump of ruwiki (179 GB
uncompressed, 1.2 M pages, 11.2 M revisions) and got a 94% reduction
in size (11.1 GB). Because the result is still a text-based format, it
stacks well with traditional file compressors (bz2: 89% reduction,
1.24 GB; 7z: 91% reduction, 1.07 GB).
It could also serve as a precursor to analysis designed to work out
"primary" authors, and to other tasks where one wants to know who is
making large edits and who is making small, housekeeping edits.
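As a toy illustration of that kind of analysis (again leaning on
difflib rather than my actual code), one could credit each editor
with the number of characters they insert or replace:

import difflib
from collections import defaultdict

def author_weights(history):
    # history: iterable of (author, old_text, new_text) triples,
    # one per revision, in chronological order.
    chars_changed = defaultdict(int)
    for author, old, new in history:
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
                None, old, new).get_opcodes():
            if tag in ('replace', 'insert'):
                chars_changed[author] += j2 - j1
    # Largest contributors first.
    return sorted(chars_changed.items(),
                  key=lambda kv: kv[1], reverse=True)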
Obviously, as a compressor it is most successful with large pages that
have a large number of relatively minor revisions. For example, the
enwiki history of [[Saturn]] (current size 57 kB, 4741 revisions) sees
a 99.1% size reduction. I suspect that the size reduction on large
wikis, like en or de, would be even greater than the 94% for ruwiki,
since larger wikis tend to have larger pages and more revisions per
page.
The current version of my compressor averaged a little better than 250
revisions per second on ruwiki (about 12 hours total) on an
18-month-old desktop. However, as the CPU utilization was only 50-70%
of a full processing core most of the time, I suspect that my choice
to read from and write to an external hard drive may have been the
limiting factor. On a good machine, 400+ rev/s might be a plausible
number for the current code. In short, the overhead of computing my
edit syntax is relatively small compared to the generation time for
the current dumps (which I'm guessing is limited by communication with
the text data store).
My code has some quirks and known bugs, and I'd describe it as a
late-stage alpha version at the moment. It still needs considerable
work (not to mention documentation) before I would consider it ready
for general use.
However, I wanted to know whether this is a project of interest to
MediaWiki developers or other people. Placed in the dump chain, it
could substantially reduce the size of the human-readable dumps, at
the expense that one would need to play back a series of edits to see
the full text of any specific revision.
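That playback is conceptually simple. Assuming each revision's
operations have been parsed into (kind, position, length, text)
tuples sorted by position, applying them back-to-front keeps the
earlier offsets valid:

def apply_delta(old_text, ops):
    # ops: list of (kind, pos, length, payload) tuples whose
    # offsets refer to old_text, sorted by position.  Applying
    # them in reverse means earlier offsets are not disturbed.
    text = old_text
    for kind, pos, length, payload in reversed(ops):
        if kind == 'replace':
            text = text[:pos] + payload + text[pos + length:]
        elif kind == 'delete':
            text = text[:pos] + text[pos + length:]
        elif kind == 'insert':
            text = text[:pos] + payload + text[pos:]
    return text

def reconstruct(base_text, deltas):
    # Recover a specific revision by replaying every delta
    # from the base text up to the revision of interest.
    text = base_text
    for ops in deltas:
        text = apply_delta(text, ops)
    return text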
Used for other purposes, it could help distinguish major from minor
editors, etc. If this project is mostly just a curiosity for my own
use, then I will probably keep the code pretty crude. However, if
other people are interested in using something like this, then I am
willing to put more effort into developing something that is cleaner
and more generally usable.
So, I'd like to know whether there are people (besides myself) who
are interested in seeing the full history dumps expressed in an edit syntax
rather than the full-text syntax currently used.
-Robert Rohde
So you want to run a top-10 web site? Now's your chance...
We're now hiring for a full-time system administrator to help monitor,
maintain, and document the 400+ Linux/Unix servers that operate
Wikipedia and its sister projects. This position will be based at our
San Francisco headquarters, but will work closely with our remote staff
and volunteers.
Currently, system administration tasks are spread over our other tech
staff and volunteers, who have to split their time with software
development, data center management, and network planning. A full-time
system administrator will let us be more responsive to site issues when
they happen, and more importantly be more proactive about planning for
and averting problems before they affect the folks back home.
We've got operating systems to upgrade, configurations to document,
software installations to automate, and a lot of service data that needs
to be monitored and digested... if you think you've got the chops for
it, send us your CV by the end of January!
http://wikimediafoundation.org/wiki/Job_openings/System_Administrator
-- brion vibber (brion @ wikimedia.org)
CTO, Wikimedia Foundation
Hello,
I have done a Special:Export of the latest revision of
http://en.wikipedia.org/w/index.php?title=Diabetes_mellitus
(including templates), and copied:
{{Infobox Disease
| Name = TestSMW
| Image =
| Caption =
| DiseasesDB =
| ICD10 = {{ICD10|Group|Major|minor|LinkGroup|LinkMajor}}
| ICD9 = 000000
| ICDO =
| OMIM =
| MedlinePlus =
| eMedicineSubj =
| eMedicineTopic =
| MeshID =
}}
into my test page http://wiki.medicalstudentblog.co.uk/index.php/TestSMW
-- however, as you can see, it comes out all garbled. Can anyone
advise? I should now have all the templates from the export/import;
perhaps I'm missing some other extension(s)?
Thanks, Dawson
At 16.15 15/01/2009 -0500, you wrote:
>On Thu, Jan 15, 2009 at 11:54 AM, Eugenio Tacchini
><eugenio(a)favoriti.it> wrote:
>> Thanks for your reply.
>>
>> I don't need a general measure but I need the status of each single
>> page; as far as I have seen, probably the only solution is to look at
>> the corresponding templates, maybe via the table marco suggested to me.
>
>Yes, the way you want to do this is checking templatelinks. This is
>how disambiguations are checked in the software, and it could be used
>for stubs and so on too.
Ok, thanks again.
Eugenio
On Thu, Jan 15, 2009 at 11:54 AM, Eugenio Tacchini <eugenio(a)favoriti.it> wrote:
> Thanks for your reply.
>
> I don't need a general measure but I need the status of each single
> page; as far as I have seen, probably the only solution is to look at
> the corresponding templates, maybe via the table marco suggested to me.
Yes, the way you want to do this is checking templatelinks. This is
how disambiguations are checked in the software, and it could be used
for stubs and so on too.
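For example (untested, and assuming direct access to a database with
the standard schema), listing the articles that transclude
Template:Stub would look roughly like:

import MySQLdb  # connection details below are placeholders

conn = MySQLdb.connect(host='localhost', user='wiki',
                       passwd='secret', db='wikidb')
cur = conn.cursor()
cur.execute("""
    SELECT page_title
      FROM page
      JOIN templatelinks ON tl_from = page_id
     WHERE page_namespace = 0   -- articles only
       AND tl_namespace = 10    -- Template: namespace
       AND tl_title = 'Stub'
""")
for (title,) in cur.fetchall():
    print(title)

On English Wikipedia you would probably want to match the whole family
of stub templates (most of their names end in "-stub") rather than a
single title.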
On Thu, Jan 15, 2009 at 11:00 AM, Eugenio Tacchini <eugenio(a)favoriti.it> wrote:
> Hello everybody,
> I'm looking, for academic research purposes, for the "status" of
> Wikipedia pages. By "status" I mean:
> - stub
> - normal
> - good article
> - featured
>
> Is there any column in the MediaWiki database schema that can give
> me this information?
>
> Thanks in advance.
>
> Cheers,
>
> Eugenio
>
None of this information is really stored in the database. A count of
real articles vs. stubs is sort of stored in site_stats: the content
pages that aren't stubs are counted in ss_good_articles. However,
ss_total_pages counts all pages, content or not, making this a bad metric
for your purposes. An individual wiki's concept of stub/normal/good/
featured is a completely arbitrary system not based in the actual
software in any way. The only idea I'd have (for English Wikipedia)
would be cross-referencing to see which articles contain the {{stub}}
(or similar) template, as that should give you a good idea of stubs.
Perhaps similar things could be done with {{featured}}?
The only other idea would be to check the FlaggedRevs tables to see
how they describe individual articles. However, the English Wikipedia
doesn't use that extension yet, and IIRC this information isn't
included in the dumps anyway.
-Chad
On Thu, Jan 15, 2009 at 5:00 PM, Eugenio Tacchini <eugenio(a)favoriti.it> wrote:
> Hello everybody,
> I'm looking, for academic research purposes, for the "status" of
> Wikipedia pages. By "status" I mean:
> - stub
revision.rev_len
> - normal
pretty obvious - everything above stub level
> - good article
> - featured
templatelinks (maybe, not sure!!)
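e.g. for the stub cut-off, a rough, untested sketch built on
revision.rev_len (the 2000-byte threshold is a completely arbitrary
guess, and the connection details are placeholders):

import MySQLdb

STUB_THRESHOLD = 2000  # bytes; arbitrary, tune per wiki

conn = MySQLdb.connect(host='localhost', user='wiki',
                       passwd='secret', db='wikidb')
cur = conn.cursor()
cur.execute("""
    SELECT page_title, rev_len
      FROM page
      JOIN revision ON rev_id = page_latest
     WHERE page_namespace = 0
""")
for title, length in cur.fetchall():
    # rev_len can be NULL for very old revisions.
    status = 'stub' if (length or 0) < STUB_THRESHOLD else 'normal'
    print('%s\t%s' % (title, status))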
marco
ps: you might want to apply for a toolserver account or ask someone
with ts access to execute queries for you
Hello everybody,
I'm looking, for academic research purposes, for the "status" of
Wikipedia pages. By "status" I mean:
- stub
- normal
- good article
- featured
Is there any column in the MediaWiki database schema that can give
me this information?
Thanks in advance.
Cheers,
Eugenio