Timwi wrote:
>> * The other thing I like to mention is LiveJournal. Their database backend is pretty impressive and handles the load of almost a million active users. They have never even dreamt of placing journal entries, audio posts or user pictures into files in a file system. The way they have it now they can easily create more database clusters and move users (and their data) around between clusters using a little Perl script. With a file system, that would be quite a bit more difficult.<<
Not quite. LiveJournal identified blob transfer from their databases as a bottleneck and moved the blobs out of the databases to eliminate it. This was part of their broader work on taking the databases out of bottleneck status, which in general meant shifting database work into memcached caches whenever possible. The move started in earnest in early November.
LJ blobs (images and audio) are now on a NetApp box filesystem, completely apart from their user database cluster machines. You can see the overall architecture here:
http://www.livejournal.com/community/lj_backend/974.html
A description of the move out of the database machines is here:
http://www.livejournal.com/community/lj_backend/502.html
Without the third-party Akamai servers, they would be serving blobs via their blob component, which would cache them in memcached and load them from the NetApp filesystem whenever they are not already in the cache.
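That read path is an ordinary read-through cache. A minimal Python sketch, assuming a plain dict as a stand-in for memcached and a local directory as a stand-in for the NetApp filesystem; `fetch_blob` and its parameters are hypothetical names, not LiveJournal's actual blob component:

```python
import os

def fetch_blob(key, cache, blob_root):
    """Return (data, source) for a blob, trying the cache before the filesystem.

    cache     -- dict standing in for memcached (assumption)
    blob_root -- directory standing in for the NetApp filesystem (assumption)
    """
    data = cache.get(key)
    if data is not None:
        return data, "cache"
    # Cache miss: read from the filesystem and populate the cache.
    with open(os.path.join(blob_root, key), "rb") as fh:
        data = fh.read()
    cache[key] = data
    return data, "filesystem"
```

The first request for a blob goes to the filesystem; later requests for the same key are served from memory until the entry is evicted.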
Addressing your concerns about integrity: the way to do that is to back up the images, then the database, then any new images. With all image names incorporating timestamps, this ensures that the images matching the database are available, at the cost of keeping some extra images - those deleted before the backup and new ones created during it.
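The ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `list_images` and `dump_database` are hypothetical callables standing in for a directory listing and a database dump, and the sketch ignores images deleted while the backup itself is running.

```python
def backup(list_images, dump_database):
    """Back up images, then the database, then images added in the meantime.

    list_images   -- callable returning the current image names (hypothetical)
    dump_database -- callable returning a database snapshot (hypothetical)
    """
    first_pass = set(list_images())                 # 1) images present now
    db_snapshot = dump_database()                   # 2) database dump
    second_pass = set(list_images()) - first_pass   # 3) images added during 1-2
    return first_pass | second_pass, db_snapshot
```

Because the names carry timestamps and are never reused, an image picked up in either pass is the same image the database snapshot refers to.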
LJ caches data in memcached, with the memcached servers (sometimes several of them on one machine) residing on the web servers (page builders), because the web servers are CPU-bound and have RAM to spare. This lets the machines do double duty.
Like memcached, the blob component is open source.
If we treated a section's path as its ID, we could get rid of most conflicts
when editing sections.
''Disclaimer: this should work fairly well for two users in conflict. I'm
not sure how scalable it is or needs to be.''
Construct paths from heading texts, separated by #. Where needed, add
disambiguation characters. Use the path to identify sections when editing.
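A rough Python sketch of the path construction, assuming headings use the usual "== Title ==" wiki syntax. The "~2" suffix for duplicate paths is a made-up disambiguation scheme, and `section_paths` is a hypothetical name; heading texts that themselves contain "#" would need escaping, which is not handled here.

```python
import re

# Matches lines such as "== Title ==" and captures both runs of "=".
HEADING = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")

def section_paths(wikitext):
    """Return one '#'-separated path per heading, in document order."""
    stack = []   # (level, title) of the enclosing headings
    seen = {}    # undisambiguated path -> number of occurrences so far
    paths = []
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if not m or len(m.group(1)) != len(m.group(3)):
            continue
        level, title = len(m.group(1)), m.group(2)
        # Leave any sections at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        base = "#".join([t for _, t in stack] + [title])
        seen[base] = seen.get(base, 0) + 1
        if seen[base] > 1:
            # Hypothetical disambiguation: append the occurrence count.
            title = f"{title}~{seen[base]}"
        stack.append((level, title))
        paths.append("#".join(t for _, t in stack))
    return paths
```

For a page with "== History ==" followed by "=== Early years ===", this yields the paths "History" and "History#Early years".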
Provide users with an option to edit either the section with its subsections
(so the smallest needed part is edited when restructuring) or just the text
of the section (to edit even less when appropriate).
So, users U1 and U2 are concurrently editing sections S1 and S2, and then U1
saves the page with his version of S1. U2 tries to save the page with his
edit of S2, but gets an edit conflict. What's to be done?
1) If S1 and S2 are the same, show the diff and let U2 handle it.
2) If S1 and S2 are disjoint (i.e. neither is included in the other), there
is no conflict. Overwrite S2 with U2's version and save.
3) If S1 and S2 are not disjunct, we call the larger one (S) and the smaller
one (s). Check if (s) exists in the newest version of (S).
3.1) If (s) doesn't exist, present U2 with the diff page for (S).
3.2) If (s) exists, and U1 changed its contents, present U2 with the diff.
3.3) If (s) exists, and the user who edited (S) did not change its contents,
replace it with the version of the user who edited (s). (This is quite
automagical, so to be conservative, present U2 with the diff instead.)
3.4) When U2 finally saves the page, record the version as an edit of (S) to
ensure correct behaviour.
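The decision logic in cases 1 to 3.3 could be sketched like this in Python. It assumes sections are identified by '#'-separated paths and that "included" means one path is a prefix of the other; it also reads 3.2 as referring to the user who edited (S), in line with 3.3. The function names, parameters and returned strings are all hypothetical.

```python
def contains(outer, inner):
    """True if section path `inner` is `outer` itself or nested inside it."""
    return inner == outer or inner.startswith(outer + "#")

def resolve(s1, s2, small_exists=True, small_changed_by_large_editor=False):
    """Decide what to do when U2 saves section s2 after U1 has saved s1.

    small_exists                  -- (s) still exists in the newest (S)
    small_changed_by_large_editor -- the editor of (S) changed the text of (s)
    """
    if s1 == s2:
        return "show diff"                       # case 1
    if not contains(s1, s2) and not contains(s2, s1):
        return "save automatically"              # case 2
    # Case 3: one section is nested inside the other.
    if not small_exists:
        return "show diff of larger section"     # 3.1
    if small_changed_by_large_editor:
        return "show diff"                       # 3.2
    return "merge automatically"                 # 3.3
```

Step 3.4 (recording the save as an edit of the larger section) is bookkeeping on top of this and is left out of the sketch.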
Additional tuning may be needed:
Maybe show a separate diff for (s), and show the source of (s) in the version
where it was directly chosen for editing.
What happens when U1 changed the title of the section (or the ordering of
sections with the same names) but left the contents alone? Should we bother
to check for it? Maybe keep CRCs of section contents? (They can be kept in
the source: filter them out on display and edit, calculate new ones and store
them on save.)
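As an illustration of the CRC idea, here is a small Python sketch that uses `zlib.crc32` to spot sections whose path changed while their contents stayed the same. The dict-of-sections representation and the names `section_crc` and `find_renamed` are assumptions made for the example, not existing MediaWiki code.

```python
import zlib

def section_crc(text):
    """CRC-32 of a section's contents, as an unsigned integer."""
    return zlib.crc32(text.encode("utf-8"))

def find_renamed(old, new):
    """Map old paths to new paths for sections renamed without content changes.

    old, new -- dicts mapping section path -> section text (assumed layout)
    """
    # Index sections that only exist in the new version by their CRC.
    by_crc = {}
    for path, text in new.items():
        if path not in old:
            by_crc.setdefault(section_crc(text), []).append(path)
    renamed = {}
    for path, text in old.items():
        if path not in new:
            candidates = by_crc.get(section_crc(text), [])
            # Only report unambiguous matches.
            if len(candidates) == 1:
                renamed[path] = candidates[0]
    return renamed
```

A CRC match is only a hint that the contents are identical; a cautious implementation would compare the texts as well before merging anything automatically.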
~~~~ zocky
Some projects, like Wikibooks and Wikisource (not to mention Meta), don't have
major naming conflict problems. But it would help a great deal if it were
possible for users to set their preferred interface language in their
preferences and for language category tags to set the interface language as
well (so anons visiting a page with a French language tag will have their
interface switch to French; &lang=fr in the URL would trigger the same thing
as well).
-- mav
Hello,
I'm currently playing around with the Wikipedia software to get a version with
German and English entries running on my Zaurus. As far as I currently
understand the code, having entries from two languages would require two MySQL
databases, two HTTP directories on the server, and cross-references between
these wikis via the "en" and "de" prefixes in the interwiki database table.
Is this correct, or is there a way to have two languages with only one
database and one copy of the MediaWiki code?
BTW, http://wikipedia.sourceforge.net/ still names mediawiki-20031118.tar.gz
as the most recent version, which got me wondering about the speed of
rebuildlinks.php ;-)
Kind regards
Markus
Just out of curiosity, what is the current thinking among developers? Is it
more:
A) MediaWiki is a general purpose wiki infrastructure, which also runs
Wikipedia. Developers of MediaWiki should concentrate on making the best
possible software and editors of Wikipedia should take advantage of new
features as they're provided.
B) MediaWiki is support infrastructure for Wikipedia. Editors of Wikipedia
should decide what kind of behaviour and which features Wikipedia needs and
developers of MediaWiki should implement them.
~~~~ zocky
Ken, I'm forwarding your inquiry to wikitech-l.
----- Forwarded message from Ken Dobruskin <ken(a)dobruskin.com> -----
From: Ken Dobruskin <ken(a)dobruskin.com>
Date: Fri, 19 Dec 2003 12:00:02 +0100 (CET)
To: Jimbo Wales <jwales(a)bomis.com>
Subject: Forbidden access to wikipedia server
Greetings!
After reading the Wikipedia GFDL and policy on robots, I thought I'd try an experiment, absolutely in line with said policies.
However when I tried to access the site from my server it failed with a 403.
User-agent: Python-urllib/1.15
IP address: 216.28.158.40
Sorry if I should be asking this to the mailing list, but I did see a word about access issues to be addressed to a site admin.
Would appreciate your advice.
Best 'net regards,
Ken
----- End forwarded message -----
I topped off the rpm-based security updates on pliny and larousse, I
*think* I got the updates installed on geoffrin via yast2, and I gave
them more current kernels which shouldn't have the recently publicized
kernel vulnerability (the latest Red Hat-provided one for larousse, and
a compiled 2.4.23 for pliny; I assume the SuSE updates should also be
current).
They'll need to be rebooted for the kernel upgrades to take effect. Larousse
and pliny are theoretically both set to boot to the new kernels on the next
boot only; so if they won't come up, a second reboot should bring back the
old kernel.
Since this involves conservatively some 30 minutes of downtime
(possibly rather more if there are problems) I'd rather not do it in
the middle of the day. Note also that geoffrin cannot be remotely
rebooted yet if it can't be logged into.
-- brion vibber (brion @ pobox.com)