I recently set up a MediaWiki (http://server.bluewatersys.com/w90n740/)
and I need to extract the content from it and convert it into LaTeX
syntax for printed documentation. I have googled for a suitable OSS
solution, but nothing obvious turned up.
I would prefer a script written in Python, but any recommendations
would be very welcome.
Do you know of anything suitable?
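For what it's worth, the starting point I have in mind is just pulling the
raw wikitext over HTTP and feeding it into whatever converter turns up.
A rough Python sketch (assuming the wiki's index.php lives at the obvious
path and supports the standard action=raw interface; the page title is
just an example):

import urllib.parse
import urllib.request

BASE = "http://server.bluewatersys.com/w90n740/index.php"

def fetch_wikitext(title):
    # action=raw returns the page source as plain wikitext
    url = BASE + "?" + urllib.parse.urlencode({"title": title, "action": "raw"})
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# The returned wikitext would then go through a wikitext -> LaTeX step.
print(fetch_wikitext("Main Page")[:200])
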
I've been putting placeholder images on a lot of articles on en:wp.
e.g. [[Image:Replace this image male.svg]], which goes to
[[Wikipedia:Fromowner]], which asks people to upload an image if they
have one. I know it's inspired people to add free content images to articles in
several cases. What I'm interested in is numbers. So what I'd need is
a list of edits where one of the SVGs that redirects to
[[Wikipedia:Fromowner]] is replaced with an image. (Checking which of
those are actually free images can come next.)
Is there a tolerably easy way to get this info from a dump? Any
Wikipedia statistics fans who think this'd be easy?
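(To be concrete, something like the following rough, untested Python sketch
is what I have in mind. It assumes a pages-meta-history dump and that all
the placeholder SVG names start with "Replace this image"; the dump filename
is made up:)

import bz2
import xml.etree.ElementTree as ET

PLACEHOLDER = "Replace this image"        # assumed common prefix of the placeholder SVGs
DUMP = "enwiki-pages-meta-history.xml.bz2"

def localname(tag):
    return tag.rsplit("}", 1)[-1]         # strip the XML namespace

title, timestamp, text, had_placeholder = None, None, "", False
with bz2.open(DUMP, "rb") as dump:
    for event, elem in ET.iterparse(dump):
        name = localname(elem.tag)
        if name == "title":
            title, had_placeholder = elem.text, False
        elif name == "timestamp":
            timestamp = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "revision":
            has_placeholder = PLACEHOLDER in text
            if had_placeholder and not has_placeholder:
                # the placeholder disappeared in this revision
                print(title, timestamp)
            had_placeholder = has_placeholder
            elem.clear()                  # keep memory usage bounded
        elif name == "page":
            elem.clear()
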
(If the placeholders do work, then it'd also be useful for convincing some
wikiprojects to encourage the things. Not that there's ownership of
articles on en:wp, of *course* ...)
Sorry for my English :)
What I need is case-insensitive titles. My solution for the problem was to
change the collation in MySQL from <utf8_bin> to <utf8_general_ci> in table
<page>, for field <page_title>.
But a bigger problem with links persists. In my case, if there is an article
<Frank Dreben>, the link [[Frank Dreben]] is treated as a link to an existing
article (GoodLink), but the link [[frank dreben]] is treated as a link to a
non-existent article, so this link opens editing of the existing article <Frank
Dreben>. What can be fixed so that the link [[frank dreben]] is treated as a
link to the existing article?
I've spent some time in Parser.php, LinkCache.php, Title.php, Linker.php, and
LinkBatch.php but found nothing useful. The last thing I tried was to do
strtoupper on the title every time the link cache array is filled, in
LinkCache.php. I also tried doing strtoupper on the title every time data is
fetched from the array.
I've tried to make the titles in the cache case insensitive, but it didn't work
out, and I'm not sure why - it seems like when links are constructed (parser,
title, linker, etc.) only LinkCache methods are used.
Could anybody point me in a direction to dig? :)
I'm currently working on the Scott Foresman image donation, cutting
large scanned images into smaller, manually optimized ones.
The category containing the unprocessed images is
It's shameful. Honestly. Look at it. We're the world's #9 top web
site, and this is the best we can do?
Yes, I know that the images are large, both in dimensions
(~5000x5000px) and size (5-15MB each).
Yes, I know that ImageMagick has problems with such images.
But honestly, is there no open source software that can generate a
thumbnail from a 15MB PNG without nuking our servers?
In case it's not possible (which I doubt, since I can generate
thumbnails with ImageMagick from these on my laptop, one at a time;
maybe a slow-running thumbnail generator, at least for "usual" sizes,
on a dedicated server?), it's no use cluttering the entire page with
broken thumbnails.
Where's the option for a list view? You know, a table with linked
title, size, uploader, date, no thumbnails? They're files, so why
don't we use things that have proven useful in a file system?
And then, of course:
"There are 200 files in this category."
That's two lines below the "(next 200)" link. At that point, we know
there are more than 200 images, but we forget about that two lines
further down.
Yes, I know that some categories are huge, and that it would take too
long to get the exact number.
But would the exact number for large categories be useful? 500,000 or
500,001 entries, who cares? How many categories are that large anyway?
200 or 582 entries, now /that/ people might care about.
Why not at least try to get a number, set a limit to, say, 5001, and
* give the exact number if it's fewer than 5001 entries,
* say "over 5000 entries" if the query returns 5001? (Rough sketch below.)
Yes, everyone's busy.
Yes, there are more pressing issues (SUL, stable versions, you name it).
Yes, MediaWiki wasn't developed as a media repository (tell me about it;-)
Yes, "sofixit" myself.
Still, I ask: is this the best we can do?
The most recent enwiki dump seems corrupt (CRC failure when bunzipping).
Another person (Nessus) has also noticed this, so it's not just me:
Steps to reproduce:
lsb32@cmt:~/enwiki> md5sum enwiki-20080103-pages-meta-current.xml.bz2
lsb32@cmt:~/enwiki> bzip2 -tvv
enwiki-20080103-pages-meta-current.xml.bz2 &> bunzip.log
lsb32@cmt:~/enwiki> tail bunzip.log
[3490: huff+mtf rt+rld]
[3491: huff+mtf rt+rld]
[3492: huff+mtf rt+rld]
[3493: huff+mtf rt+rld]
[3494: huff+mtf rt+rld]
[3495: huff+mtf data integrity (CRC) error in data
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
lsb32@cmt:~/enwiki> bzip2 -V
bzip2, a block-sorting file compressor. Version 1.0.3, 15-Feb-2005.
Copyright (C) 1996-2005 by Julian Seward.
This program is free software; you can redistribute it and/or modify
it under the terms set out in the LICENSE file, which is included
in the bzip2-1.0 source distribution.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
LICENSE file for more details.
bzip2: I won't write compressed data to a terminal.
bzip2: For help, type: `bzip2 --help'.
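(For anyone who wants to check their copy without shelling out to bzip2,
here is a rough Python sketch that streams the file and reports whether
decompression gets all the way through:)

import bz2

PATH = "enwiki-20080103-pages-meta-current.xml.bz2"

good = 0
try:
    with bz2.open(PATH, "rb") as f:
        while True:
            chunk = f.read(16 * 1024 * 1024)
            if not chunk:
                break
            good += len(chunk)
    print("OK: %d bytes decompressed" % good)
except (OSError, EOFError) as err:
    # a CRC failure or a truncated file shows up as an exception here
    print("Failed after %d good bytes: %s" % (good, err))
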
> Message: 8
> Date: Fri, 12 Oct 2007 17:59:22 +0200
> From: GerardM <gerard.meijssen(a)gmail.com>
> Subject: Re: [Wikitech-l] Primary account for single user login
> This issue has been decided. Seniority is not fair either; there are
> hundreds if not thousands of users who have done no or only a few edits, and
> I would not consider it fair when a person with, say, over 10,000 edits should
> have to defer to these typically inactive users.
1. Yes, it's not fair, but this is a truth about the Wikimedia projects that one
has to admit. Imagine if all Wikimedia sites had had single user login
since they were first established: the one who registered first would own that
username on all Wikimedia sites.
2. Having fewer edits doesn't mean that a person is less active than
one with more edits. And as it has been said,
``Edit counts do not necessarily reflect the value of a user's contributions
to the Wikipedia project.''
What if some users have a lower edit count
* because they deliberately edit, preview, edit, and preview an article, over
and over, before submitting the considered version to the Wikimedia sites?
* because they edit, edit, and edit articles in their offline storage, over
and over, before submitting only the final versions to the Wikimedia sites?
While some users have a higher edit count
* because they often submit many changes without previewing them first, and
then have to correct the hasty edits, over and over.
* Some users often submit many minor changes, over and over, rather than
accumulating the changes, which would result in a lower edit count.
* Some users do many robot-like routines by themselves, rather than letting
a real robot do those tasks.
* Some users often take part in many edit wars.
* Some users often take part in many arguments on many talk pages.
What if the users with a lower edit count try to increase their edit count
to take back the status of primary account?
What if they decide to change their editing habits, to increase their edit count,
* by submitting many edits without a deliberate preview,
* by splitting the accumulated changes into many minor edits, and submitting
them one by one,
* by stopping their robots, and doing those robot routines by themselves,
* by joining edit wars?
3. Given 2) above, I think a better measure of activeness is to
measure the time between the first edit and the last edit of that username.
The formula would look like this:
activeness = last edit time - first edit time
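For example (a rough Python illustration with made-up timestamps):

from datetime import datetime

first_edit = datetime(2004, 3, 1, 12, 0)     # made-up timestamps
last_edit = datetime(2007, 10, 12, 18, 30)

activeness = last_edit - first_edit
print(activeness)                            # 1320 days, 6:30:00
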
> A choice has been made and, as always, there will be people who will see an
> injustice. There were many discussions and a choice was made. It is not
> good to revisit things continuously; it is good to finish things so that
> there is no point to it any more.
> On 10/12/07, Anon Sricharoenchai <anon.hui(a)gmail.com> wrote:
> > According to the conflict resolution process, where the account with the
> > most edits is selected as the primary account for that username, this may
> > sound reasonable when the username is owned by the same person on all
> > wikimedia sites.
> > But the problem comes when the same username on those wikimedia sites is
> > owned by different people and the accounts are actively in use.
> > The active account that was registered first (seniority rule) should
> > rather be considered the primary account.
> > I think the person who registered first should own that username on the
> > unified wikimedia sites.
> > Imagine if the wikimedia sites had been unified ever since they were first
> > established long ago (so that their accounts had never been separated):
> > the person who registered first would own that username on all of the
> > wikimedia sites.
> > Whoever came after would be unable to use the registered username, and
> > would have to choose an alternate username.
> > This logic should also apply to the current wikimedia sites, after they
> > have been unified.
We're planning to set up 4 data displays in the Wikimedia Foundation
office - I'm thinking at least 19" screens, maybe larger. The intent
here is not to appear "hip", but to make the office environment more
interesting for visitors, such as potential donors. This creates
conversation pieces and memorable moments - which is important for
I'd like to request your comments on what kinds of displays we could
set up. Some initial ideas:
- Real-time recent changes. This should be relatively straightforward
using the IRC feeds (a rough sketch of reading the feed is at the end of
this mail). Most effort here will go into prettification, I think. What
would be a good IRC client to show multiple channels at once?
- Show random articles. Not particularly creative, but should also be
fairly easy to do using some scripting. Would be nice to show stuff
from projects beyond WP.
- Show articles matching to current searches. How difficult would it
be to capture search data for this?
- Show the actual search strings. I don't love this one, because
Google already does this, but it might be interesting content-wise.
- Show traffic data. What would be interesting displays here? Can we
show bandwidth usage in real-time?
- Show images as they are being uploaded. Do we have anything like
that already? If not, how hard would it be to implement?
- Data displays of developmental indicators - e.g. Gapminder data on
Internet access, literacy, etc. Is there anything like this that we
could do with relatively little effort? Any volunteers to put
something together?
- Geomapping of access - some visualization of the primary clusters
where traffic is coming from, based on sampling. I imagine this could
be quite tricky - but might be a cool long-term project for a
volunteer.
- Visualization of edit patterns, similar to:
Other ideas / comments?
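PS, re the recent changes idea: the reading side is pretty trivial. A rough,
untested Python sketch, assuming the public recent-changes feed on
irc.wikimedia.org with per-wiki channels like #en.wikipedia (the nick is made
up for the example):

import socket

HOST, PORT = "irc.wikimedia.org", 6667
CHANNELS = ["#en.wikipedia", "#commons.wikimedia"]
NICK = "officedemo"                            # made-up nick for the example

sock = socket.create_connection((HOST, PORT))

def send(line):
    sock.sendall((line + "\r\n").encode("utf-8"))

send("NICK " + NICK)
send("USER %s 0 * :%s" % (NICK, NICK))
for channel in CHANNELS:
    send("JOIN " + channel)

for raw in sock.makefile("rb"):
    line = raw.decode("utf-8", "replace").rstrip("\r\n")
    if line.startswith("PING"):
        send("PONG " + line.split(" ", 1)[1])  # keep the connection alive
    elif " PRIVMSG " in line:
        target, text = line.split(" PRIVMSG ", 1)[1].split(" :", 1)
        print(target, text)                    # hand this to whatever renders the display
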
Deputy Director, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
I just had the following thought: for a tag intersection system, if
* we limit queries to two intersections (show all pages with categories A and B), and
* we assume on average 5 categories per page (can someone check that?),
then we have 5*4 = 20 intersections per page (counting each pair in both orders).
Now, for each intersection, we calculate MD5("A|B") to get an integer
hash, and store that in a new table (page_id INTEGER, intersection_hash
INTEGER).
That table would be 4 times as long as the categorylinks table.
* Memory usage: Acceptable (?)
* Update: Fast, on page edit only
* Works for non-existing categories
On a query, we look for the (indexed) hash in a subquery, then check
those against the actual categorylinks.
Looking up an integer in the subquery should be fast enough ;-)
Given the number of categories and INTEGER >4bn, that would make the
hash unique for all combinations of 65K categories (if the hash were
truly randomly distributed, which it isn't), which should mean that
the number of false positives (to be filtered by the main query)
should be rather low.
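(Rough illustrative Python sketch of the hashing side - not actual MediaWiki
code, and the example page/categories are made up:)

import hashlib
from itertools import permutations

def intersection_hash(cat_a, cat_b):
    digest = hashlib.md5((cat_a + "|" + cat_b).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")   # truncated to fit a 32-bit INTEGER

# On page save: one row per ordered pair of the page's categories,
# which is where the 5*4 = 20 rows per page comes from.
def rows_for_page(page_id, categories):
    return [(page_id, intersection_hash(a, b))
            for a, b in permutations(categories, 2)]

# On a query for "A and B": look up intersection_hash("A", "B") in the
# (indexed) new table, then verify the candidate page_ids against
# categorylinks to filter out hash collisions.
print(rows_for_page(42, ["Physics", "History", "1930s"]))
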
If that's fast enough, we could even expand to three intersections (A,
B, and C), querying "A|B", "A|C", and "B|C", and let PHP find the ones
common to all three sets.
Summary: Fixing slow MySQL lookup by throwing memory at it...
Feasible? Or pipe dream?
This may be the dumbest question I've ever asked, so go easy on me,
please! In SkinTemplate.php we have this:
$sitecss .= '@import "' . self::makeUrl( '-',
"action=raw&gen=css$siteargs$skinquery" ) . '";' . "\n";
I understand that it helps when a useskin parameter is passed; what I don't
understand is, why does it have to return a value when no such parameter
is passed? It can simply return "nothing", can't it?
> * Adding SimpleCaptcha::addCaptchaAPI() method that adds CAPTCHA information to an API result array. Other CAPTCHA implementations should override this method with a function that does the same (did this for FancyCaptcha and MathCaptcha)
> + $resultArr['captcha']['type'] = 'simple';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['question'] = $captcha['question'];
> + $resultArr['captcha']['type'] = 'image';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['url'] = $title->getLocalUrl( 'wpCaptchaId=' . urlencode( $index ) );
> + $resultArr['captcha']['type'] = 'math';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['sum'] = $sum;
Hmmm... How is an API client meant to figure out what to do with this
captcha data, when there seems to be nothing consistent in how it's
presented? How is it meant to display a new type of challenge which
might be added in the future?
If you're going to have an API for this, I think it needs at least some
minimum future-proofing; otherwise all the clients will break when
something is tweaked in the captcha.
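Concretely, a client ends up having to special-case every type - something
like this rough Python sketch against the fields shown in the diff above
(everything else is made up for illustration):

def describe_captcha(captcha):
    ctype = captcha.get("type")
    if ctype == "simple":
        return "Answer this question: " + captcha["question"]
    elif ctype == "math":
        return "Solve: " + str(captcha["sum"])
    elif ctype == "image":
        return "Transcribe the image at: " + captcha["url"]
    # A captcha type added later means every client hits this branch and breaks.
    raise ValueError("Unknown captcha type: %r" % ctype)

print(describe_captcha({"type": "math", "id": "123", "sum": "2 + 3"}))
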
-- brion vibber (brion @ wikimedia.org)