Hi all
When trying to process the current enwiki dump (specifically,
enwiki-20080312-pages-articles.xml.bz2) using mwdumper, it crashed on me with an
UTF-8-related I/O Error in Xerces. The problem occurs (with slightly different
symptoms) with Xerces 2.7.1 and 2.9.1; it is described in detail at [1].
Basically, Xerces' UTF-8 decoder is broken for the case where a surrogate pair
is split across buffer reads. The problem was reported by Robert Stojnic (aka
rainman) last year, but apparently the attempt to fix it only changed the way
it is broken.
Anyway, the current dump isn't usable with mwdumper. This is not good. And it is
likely to happen again.
I see two ways to solve it:
1) Ship a patched version of Xerces with mwdumper, with Robert's patch applied
(see bug report at [1]). But there seems to be some problem with that patch
(discussed in the bug report), and relying on a patched version of a (supposedly)
standard lib feels a bit dirty.
2) Don't use Xerces' UTF-8 decoder; use the JRE's built-in one. I have prepared
a patch for this (see [2]), but I have not tested it extensively (only so far
as verifying that it doesn't blow up in my face). I haven't even verified that
it actually fixes the problem (it takes half a day to get that far in
processing the dump - that thing is huge; a small test case would be great for
this). It would also be good to know how using the JRE's decoder impacts
performance. Because of these questions, I haven't committed the patch yet.
Please play with it if you have a couple of minutes for this kind of thing.
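For anyone who wants to build that small test case: below is a minimal sketch
of the failure scenario, using the JRE's incremental CharsetDecoder to feed the
UTF-8 bytes of a supplementary character (which becomes a surrogate pair in
Java) in deliberately tiny chunks, so the four-byte sequence is split across
"reads". The class and method names are my own, not from the patch:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class SplitDecode {
    // Decode UTF-8 bytes in chunks of chunkSize, carrying any incomplete
    // multi-byte sequence over to the next read via compact().
    static String decodeInChunks(byte[] bytes, int chunkSize) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
        CharBuffer out = CharBuffer.allocate(bytes.length + 4);
        // +4 leaves room for up to 3 carried-over bytes plus the new chunk
        ByteBuffer buf = ByteBuffer.allocate(chunkSize + 4);
        int pos = 0;
        while (pos < bytes.length) {
            int n = Math.min(chunkSize, bytes.length - pos);
            buf.put(bytes, pos, n);
            pos += n;
            buf.flip();
            dec.decode(buf, out, pos >= bytes.length);
            buf.compact(); // leftover bytes of a split sequence stay here
        }
        dec.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D49C is 4 bytes in UTF-8 and a surrogate pair in Java;
        // chunk sizes 1..3 force a split in the middle of the sequence.
        String s = "x" + new String(Character.toChars(0x1D49C)) + "y";
        byte[] utf8 = s.getBytes("UTF-8");
        for (int chunk = 1; chunk <= utf8.length; chunk++) {
            if (!s.equals(decodeInChunks(utf8, chunk)))
                System.out.println("broken at chunk size " + chunk);
        }
    }
}
```

The JRE decoder passes this for every chunk size; a decoder with the bug
described in [1] would mangle the character for the sizes that split the
sequence.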
Regards,
Daniel
[1] Xerces bug report <https://issues.apache.org/jira/browse/XERCESJ-1257>
[2] mwdumper patch: <http://rafb.net/p/7c5bkg52.html>
Hi,
I have written some extensions for forms that use template parameters.
I have found that it appears to be impossible to pass a template
parameter to an extension. I think this may be an upshot of the
design, but before I totally give up I thought I'd see if anyone here
had tried to do the same thing.
Ideally my extension on a template page would look like this:
<textbox>{{{name|Not set}}}</textbox>
then used like this:
{{form:dynamicform|name=Alex}}
I am embedding them into the webpage as divs, and then using
JavaScript to mark them up.
However, I cannot get the parser to expand the "name" parameter for my
extension. I get passed {{{name|Not set}}} literally. OK, so I'll call
parsetags, but that seems to only pick up "Not set" when called from
my extension, not "Alex".
My eventual solution was partially formed HTML:
<textbox id=name />{{{name}}} <close textbox/>
Note the closed extension tags. The extension hacks around with the
parser, adding tokens and replacing them after strip tags. It creates
an open div tag, and then a close div tag using the <close> extension.
This works and, used properly, produces well-formed HTML, but it is
not elegant and is prone to error. I was hoping the new parser
ordering might help my plight, but it didn't. Am I going about this
the wrong way?
Kind regards,
Alex
--
Alex Powell
Exscien Training Ltd
Tel: +44 (0) 1865 920024
Direct: +44 (0) 1865 920032
Mob: +44 (0) 7717 765210
skype: alexp700
mailto:alexp@exscien.com
http://www.exscien.com
Registered in England and Wales 05927635, Unit 10 Wheatley Business
Centre, Old London Road, Wheatley, OX33 1XW, England
Hello Wikimedia community,
I just wanted to send an announcement to this list regarding a GSoC
project proposal I am planning to submit with the Wikimedia foundation
as the mentoring organization. I've been discussing the project with
the MetaVidWiki community, but in case there is some broader interest
I wanted to run it by here as well.
Excerpt from the proposal:
I propose to do a number of modifications to MetaVidWiki [1] (an
existing MediaWiki extension for the annotation of video content),
with the common goal of improving the ease with which users can add
their own video content to the wiki.
The specific tasks are:
* making the process of video uploading to a MetaVidWiki system an
easy one-step process (currently, uploading a video file to the system
requires several non-trivial steps)
* adding the ability to use / annotate externally hosted video
(from sites such as archive.org and YouTube)
The full project proposal is at:
http://urbanstew.org/metavidwiki/proposal.html
And the beginnings of some supplemental material here:
http://urbanstew.org/metavidwiki/
Please let me know if you have any feedback regarding the project / proposal.
Kind regards,
Stjepan
[1] http://metavid.ucsc.edu/wiki/
When I grep for "<contributor>" or "<revision>" in
svwiki-20080310-pages-meta-history.xml I find 5,822,491
occurrences. But [[sv:Special:Statistics]] says there have been
6,246,812 edits. What are the 424,321 edits in between? Deleted
pages?
According to [[sv:Special:Statistics]] there are 58,087 user
accounts, but <contributor><username> has 28,416 distinct values.
Is it realistic that half of all registered usernames have never
contributed a single edit (to non-deleted pages)? Can we find out
what happened to them? Did they write spam that was deleted and
the username permanently blocked? Did they just register their
name to stop others from doing so? Or did something go wrong
during the registration?
Of those who did contribute something, most usernames of course made
only very few contributions. This is a long tail. So how do we
separate the regular/serious/active contributors from the occasional
ones? In [[m:board elections]] to the WMF, a limit of 400 edits is
used, and this threshold is as good as any.
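A threshold count like this can be pulled straight out of the dump with a
streaming parser rather than grep. A rough sketch using the StAX API (the
class name and helper are my own; it only counts <contributor><username>
entries and ignores <ip> contributors):

```java
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class EditCounter {
    // Count revisions per <contributor><username> in a
    // pages-meta-history stream, then keep only the names with at
    // least `threshold` edits.
    static Map<String, Integer> activeUsers(Reader xml, int threshold)
            throws Exception {
        XMLStreamReader r =
            XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Map<String, Integer> counts = new HashMap<String, Integer>();
        boolean inContributor = false;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("contributor")) {
                    inContributor = true;
                } else if (inContributor && name.equals("username")) {
                    String user = r.getElementText();
                    Integer c = counts.get(user);
                    counts.put(user, c == null ? 1 : c + 1);
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("contributor")) {
                inContributor = false;
            }
        }
        // Filter down to the "active" names.
        Map<String, Integer> active = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) active.put(e.getKey(), e.getValue());
        }
        return active;
    }
}
```

Feeding it the decompressed svwiki dump with a threshold of 400 should
reproduce the 900-name figure below, minus whatever discrepancies deleted
pages introduce.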
In <contributor><username> of the sv.wp dump there are 900 names
(and 104 addresses in <contributor><ip>) that have contributed 400
revisions or more (to non-deleted pages). Of these 900, some 80
have names containing "bot" and some are sock puppets, but I guess
that 800 could be eligible to vote. There are 81 admins on sv.wp.
Is one admin per ten eligible voter volunteers a "normal" ratio? It
also means we have one eligible voter per 12,500 speakers of the
Swedish language (800 out of 10 million).
I think 800 is the number of volunteers that should be mentioned
rather than the 58,087 mostly inactive usernames.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
I'm working on a similar feature, but it's not completed yet. It's based on a somewhat different idea, but I have some experience with this, so if you want some help... -MGrabovsky
Hello,
I recently made an extension that I uploaded to
http://svn.wikimedia.org/svnroot/trunk/extensions/Click/Click.php and
it works perfectly. Unfortunately it won't work inline; no matter what
I do it always starts a new paragraph or adds an opening <br> - does
anyone know how I can fix this? Maybe it's because it returns HTML,
but I think there should still be a way to make it inline.
MinuteElectron.
Hello,
today we have our full database migration to Oracle 10g underway. The
code has been in development and testing for quite a long time
already, and we are getting lots of nifty features immediately
available afterwards. Stay tuned!
Though our database consultants prepared a real-time migration plan,
if you notice any problems, do tell.
BR,
Domas