I've consolidated some of the watchlist access code into
WatchedItem.php and added memcached support for it. This should reduce
database hits on logged-in page views; previously the code checked the
watchlist table (twice!) on every logged-in page render.
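For the curious, the caching pattern is roughly the following (a
minimal Python sketch of the idea, not the actual PHP in
WatchedItem.php; db_is_watched() is a hypothetical stand-in for the
real watchlist table query):

    # Assumes a memcached daemon on localhost and the python-memcached
    # client library. Illustrative only.
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def db_is_watched(user_id, title):
        # Hypothetical placeholder for the SQL lookup against the
        # watchlist table.
        return False

    def is_watched(user_id, title):
        # memcached keys may not contain spaces, hence the underscores
        key = 'watched:%d:%s' % (user_id, title.replace(' ', '_'))
        cached = mc.get(key)
        if cached is not None:
            return cached == '1'   # cache hit: no db round trip
        watched = db_is_watched(user_id, title)
        mc.set(key, '1' if watched else '0', time=3600)
        return watched

A warm cache entry turns the old two-checks-per-render into a single
memcached get; the entry just has to be invalidated (or left to
expire) when the user edits their watchlist.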
(Also a note: main development is currently going on in the stable
branch and is focused on speed, security, and bug fixes, not features.
A lot of fixes will need to be forward-ported to the dev branch at some
point.)
-- brion vibber (brion @ pobox.com)
Some of you might have seen that I wrote a bot to edit Wikipedia.
Actually it is not a bot per se, but a framework for writing wikipedia
bots. I have used this framework to give all the nl: year pages a new
and consistent layout, and currently its two greatest hits are
automatically adding consistent interwiki links and semi-automatically
disambiguating links to disambiguation pages.
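To make the interwiki part concrete: the heart of such a bot is just
comparing the language links a page already carries against those
found on the corresponding pages in the other languages. A minimal
Python sketch of that comparison (illustrative only, not the
framework's actual code):

    # Extract interwiki links like [[de:Titel]] from wikitext and
    # report which languages are missing relative to the links
    # gathered from the other language versions of the same article.
    import re

    INTERWIKI = re.compile(r'\[\[([a-z]{2,3}):([^\]]+)\]\]')

    def interwiki_links(wikitext):
        # language code -> target title found in the page text
        return dict(INTERWIKI.findall(wikitext))

    def missing_languages(wikitext, known):
        # 'known' maps language codes to titles collected elsewhere
        have = interwiki_links(wikitext)
        return dict((lang, title) for lang, title in known.items()
                    if lang not in have)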
I am running these robots on the nl: wikipedia only, and I think that
is the way it should stay. My idea is that if a language wikipedia
agrees to have a robot sniff around, it should be handled and monitored
by someone who is a very regular visitor of that language's wikipedia.
Now of course I am getting into trouble with this: other people want
to run this software. The first was Andre Engels; I just gave him a
copy of the code. Then there was Christian List on da:, and he got a
copy too. But then more people working on wikipedia started asking for
the code, and recently also people running other instances of the
wikipedia software for other projects. I am quite fond of free
software, but here I am hesitating a bit: if I give out the code to a
growing select few, maintenance is going to be a nightmare for me.
Now, I could just start an open source project at sourceforge to
develop it further collaboratively. BUT: I am afraid of the power of
this software, and the damage it could do if it ends up in the wrong
hands (vandals). On the other hand, it is not very difficult to write a
vandalizing robot for wikipedia, and I haven't heard of anyone doing it
so far; why would that start just because a decent piece of robot
software became freely available?
I'd like to start a discussion on this list to help me find a solution
to this. What do wikipedia developers and admins think about it?
Regards,
Rob Hooft
--
Rob W.W. Hooft || rob(a)hooft.net || http://www.hooft.net/people/rob/
Hi
I want to raise the issue that the inter-language link situation at the moment
is a big mess:
Please have a look at Rob Hooft's five pages listing articles on the
English wikipedia that do not have inter-language links to
corresponding articles in other languages:
http://en.wikipedia.org/wiki/User_talk:Rob_Hooft/page1 (pages 1 through 5).
There are literally thousands of such links missing. Multiply this by
roughly two to factor in the missing links in all the other languages
(the English wikipedia has roughly half the total number of articles),
and you can see the magnitude of the problem.
The correct alphabetical ordering of links is also necessary, and adds more
work.
Now, there are a couple of solutions to this problem (please add your own):
1) Just continue as we have been doing, while asking people to make an
effort to add inter-language links.
Pro: doesn't require any mediawiki changes
Con: wastes a _massive_ amount of man-hours that, needless to say,
could be better employed
2) Deploy the robot that Rob wrote (and is using on the Dutch wikipedia) on
all wikipedias. To my knowledge this robot searches for missing links
and adds them automatically.
Pro: no mediawiki changes necessary
Con: at least one person per language is needed to run/administer the robot
3) Create a unified inter-language link field (similar to what I have proposed
for pictures/images). This would have a list of languages in which a
particular article is available.
Pro: No need for robot traffic that eats up man-hours, bandwidth, CPU,
RAM, and disk. No massive waste of man-hours from just continuing as we
are.
Con: mediawiki development of such a feature.
Another obvious problem is that the record for each article will need
a unique identifier which is referenced from each language version of
that article. Two possible approaches:
a) just settle on the English title. I know this is language-ism and
will piss a bunch of people off.
b) the unique identifier is a number. The problem here is obvious: if
you write a new article, you will 'have' to check whether the article
has already been written in another language and therefore already has
a unique number.
Actually, now that I think about it, option (b) sounds pretty stupid.
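Whatever identifier we settle on, the data model behind option 3 is
simple. A rough Python sketch, purely illustrative, with a made-up
numeric identifier standing in until (a) or (b) is decided:

    # One record per concept, mapping language codes to local titles.
    # Not working MediaWiki code.
    article_links = {
        4711: {                  # hypothetical unique identifier
            'en': 'Netherlands',
            'nl': 'Nederland',
            'da': 'Holland',
        },
    }

    def languages_for(concept_id):
        # All languages the article exists in, sorted alphabetically,
        # which also takes care of the link-ordering problem for free.
        return sorted(article_links.get(concept_id, {}))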
Thoughts?
Best,
Sascha Noyes
aka snoyes
--
Please encrypt all email. Public key available from
www.pantropy.net/snoyes.asc
Eureka! I've got it!!
Give each page a "links complete" bit.
The first time the page is loaded, we check all the links, then set the
bit ON.
After that, any page change (creation, deletion, etc.) WHICH AFFECTS
THIS PAGE would then turn the bit OFF.
For example, a page with 20 links. Half of them are to existing (blue)
articles. Half are to missing (red) articles. We load the page, figure
out the colors, cache the page, and set the bit ON.
Next time someone loads the page, we check the bit. If it's still on,
we serve the cached copy. (Cold cash! Cash and carry! We get a
substantial discount!!!)
But --
If anyone creates a new page, we take some extra time to trace all the
"pages that link here" and turn off each affected page's "links
complete" bit. (We do not have to re-render the page, though. That can
wait.)
Likewise, if anyone deletes a page.
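In other words (a minimal Python sketch of the idea, nothing like the
real MediaWiki internals):

    # Each cached page carries a "links complete" bit. Creating or
    # deleting an article clears the bit on every page that links to
    # it; those pages re-render (and re-color their links) on their
    # next view.
    rendered_cache = {}   # title -> (html, links_complete_bit)
    backlinks = {}        # title -> set of titles linking to it

    def serve(title, render):
        entry = rendered_cache.get(title)
        if entry is not None and entry[1]:
            return entry[0]         # bit still ON: serve cached copy
        html = render(title)        # re-check all links, re-render
        rendered_cache[title] = (html, True)
        return html

    def page_created_or_deleted(title):
        # Turn the bit OFF on each affected page. No re-render yet;
        # that waits until somebody actually loads the page.
        for t in backlinks.get(title, ()):
            if t in rendered_cache:
                html, _ = rendered_cache[t]
                rendered_cache[t] = (html, False)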
(patting self on back)
Uncle Ed Poor
Resident genius, tame creationist, and brainwashed cult member ^_^
Just thinking in terms of minimizing the number of directions we've got
to go when mirroring and synchronizing things across servers... We
currently keep page data in a database, but also cache some rendered
pages on the local filesystem. We keep information about TeX fragments
in a database, but cache the rasterized images in the local filesystem.
Uploaded files are a little different: we keep track of them in a
database, but the essential files themselves cannot be reconstructed
from the database alone. Thus if we run multiple web servers from one
database, they fight over the files, needing either a shared networked
filesystem -- inaccessible if the server dies -- or for all but one to
be read-only and periodically update from the live site.
Transparent load-balancing will require that any web server that you
stumble upon should be as good as any other and provide the same stuff,
so locking off all but one (as we've done from time to time when
experimenting with a second 'en2' access point) is less than ideal.
Hypothetically we could have a notification system where the server
that makes a change sends updates to all other servers, but this raises
a question of resynchronization if a member of the cluster is taken out
of service and brought back in -- or if two make conflicting changes
simultaneously.
It might be simplest to just store uploads directly as blobs in the
database. When a request comes in to serve an uploaded file, we can
cache it in the local filesystem (or memory) if we like, to speed
access and allow read-only access with the db server down. This allows
resynchronization by checking cache-time validity, and lets images be
backed up along with the rest of the db.
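The serving path would look something like this (a minimal Python
sketch of the pattern, not real MediaWiki code; the uploads table and
its lookup are hypothetical placeholders):

    # Serve an upload from the local filesystem cache when possible;
    # on a miss, pull the blob from the database and populate the
    # cache for next time.
    import os

    CACHE_DIR = '/tmp/upload-cache'   # hypothetical path
    UPLOADS = {}                      # stands in for the db blob table

    def get_blob(name):
        # Placeholder for something like
        # "SELECT data FROM uploads WHERE name = ...".
        return UPLOADS[name]

    def serve_upload(name):
        # Real code would have to sanitize 'name' against path
        # traversal before joining it onto the cache directory.
        path = os.path.join(CACHE_DIR, name)
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return f.read()       # cache hit: db never touched
        data = get_blob(name)         # cache miss: one db round trip
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, 'wb') as f:
            f.write(data)
        return data

A web server with a warm cache keeps serving files even with the db
down, and resynchronization reduces to comparing cache timestamps
against the database.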
Terrible idea? Good idea? Anyone have experience doing this kind of
thing?
Typical files are in the tens or low hundreds of kilobytes, with some
ranging up to the maximum allowed 2MB. The total size of uploads we
have is comparable to the total size of the current-version page texts,
and much smaller than the total old-revision page texts we've already
got in the db.
-- brion vibber (brion @ pobox.com)
PRIVATE - OFFLIST
Brion,
I worry that some vandal might learn from the Porsche incident and
replace some innocent picture with goatse.cx -- on some prominent page
like Mother Theresa or George Bush.
I hope we can close this vulnerability quickly (I'm more concerned about
this, than about my fanciful scheme for speeding up the database).
Ed Poor (PRIVATE - OFFLIST)
I would like to have my watchlist publicly viewable, and I would also
like to maintain (discreetly) a private, non-viewable watchlist.
Additionally, I would like to have a publicly EDITABLE watchlist:
like, "hey Ed, please watch this page!"
I'd even like to have multiple watchlists, each with:
* a name
* a setting for publicly viewable, or hidden
* a setting for publicly editable, or personally controlled
I need an NPOV watchlist, a "users fighting" watchlist, and a "global
warming" watchlist. Something like the sketch below would cover it.
How soon can we get this stuff, Tim/Brion/Magnus and crew? ^_^
Ed Poor
Um, Brion, I plead insanity. I didn't understand your summary of what
the software does.
I *think* you're saying the software already does what I, in my
Archimedean joy, had suggested.
Ed "dripping wet" Poor