>Of course we can split en: (by namespace, by first letter, by whatever
>you want).
Sure. But such splitting assumes we'd be able to do joins across multiple servers.
The wiki has quite a lot of relations inside it, and even interwiki is already quite a PITA.
On the other hand, if we 'split' and move to other database engines, we might not use
relational databases at all: those were designed for relations, and what we need is storage.
> What was the purpose of this dismissive remark? It is clearly better
> than Wikipedia's current state, judging from the server speed and the
> frequency of database error messages.
It has a different task to deal with. If we had to handle wiki pages as separate
entities without too many connections to the outside world, with more asynchronous
information paths, etc., we could get by with far fewer servers. But this is a collaboration
platform, and it requires what collaboration needs: real-time information
management. Sure, more asynchronous handling should come in the future, but that
could mean moving lots of tasks from relational databases to various event brokers, etc.
And yes, we could serve an HTML dump with extreme performance.
There'd be no database errors, and you'd love the speed: a single P4
might handle 4000 req/s :) That far surpasses our current cluster speed ;-)
Domas
Hey,
> How hard would it be to come up with these word-stem normalizers for other
> languages (i.e. did you base Esperanto off of another similar language or
> did you come up with it yourself relatively easily)? Is there a good
> description somewhere on how to come up with them?
That may require some linguistic ability as well as some coding ability :)
There are several stemmers floating around; the one used by
various open-source software is Snowball (http://snowball.tartarus.org/).
It supports English, French, Spanish, Portuguese, Italian, German, Dutch,
Swedish, Norwegian, Danish, Russian and Finnish. Maybe its rules
could be adapted from there.
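Just to illustrate the kind of rules a stemmer applies (a toy sketch only, nothing like the real Snowball algorithms, which use ordered rule sets with conditions on the word's structure):

<?php
// Toy suffix stripper, only to show the general idea of stemming.
function toy_stem( $word ) {
    $suffixes = array( 'ingly', 'edly', 'ing', 'ed', 'ly', 'es', 's' );
    foreach ( $suffixes as $suffix ) {
        $len = strlen( $suffix );
        // Keep at least three characters of stem.
        if ( strlen( $word ) >= $len + 3 && substr( $word, -$len ) === $suffix ) {
            return substr( $word, 0, -$len );
        }
    }
    return $word;
}

echo toy_stem( 'searching' ) . "\n"; // search
echo toy_stem( 'indexes' ) . "\n";   // index
?>

A real stemmer for a language like Esperanto would need its own rule set along these lines, which is where the linguistic part comes in.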
Cheers,
Domas
Kate's Lucene-based search server is now up and running experimentally
to cover searches on en.wikipedia.org. It's compiled with GCJ, so it's
not polluted by any of that dirty icky not-quite-free Sun Java VM stuff. ;)
For those of you new to the game, Lucene is a text search engine written
in Java, sponsored by the Apache project: http://lucene.apache.org/
Using a separate search server like this instead of MySQL's fulltext
index lets us take some load off the main databases.
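Roughly speaking, the web servers hand the query to the daemon over a socket and only touch the database as a fallback. The sketch below is just to illustrate that flow; the host, port and line-based protocol are made up here and are not MWDaemon's actual interface.

<?php
// Illustrative client for an external search daemon. Host, port and
// the one-line-per-hit protocol are invented for this sketch.
function daemon_search( $query, $limit = 20 ) {
    $fp = @fsockopen( 'search.internal.example', 8123, $errno, $errstr, 1 );
    if ( !$fp ) {
        // Caller could fall back to MySQL's MATCH ... AGAINST here.
        return false;
    }
    fwrite( $fp, 'SEARCH ' . rawurlencode( $query ) . "\n" );
    $hits = array();
    while ( !feof( $fp ) && count( $hits ) < $limit ) {
        $line = rtrim( (string)fgets( $fp ) );
        if ( $line !== '' ) {
            $hits[] = $line; // e.g. one matching page title per line
        }
    }
    fclose( $fp );
    return $hits;
}
?>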
To compare our options I did an experimental port to C# using dotlucene;
some benchmarking showed that while the C# version running on Mono
outpaced the Java version on GCJ for building the index, Java+GCJ did
better on actual searches (even surpassing Sun's Java in some tests).
Since searches are more time-critical (as long as updates can keep up
with the rate of edits), we'll probably stick with Java.
More info at:
* http://www.livejournal.com/community/wikitech/9608.html
* http://meta.wikimedia.org/wiki/User:Brion_VIBBER/MWDaemon
At the moment the drop-down suggest-while-you-type box is disabled as
GCJ and BerkeleyDB Java Edition really don't get along. I'll either hack
it to use the native library version of BDB or just rewrite the title
prefix matcher to use a different backend.
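For instance, a "different backend" could be as dumb as an indexed LIKE query against the database; a sketch only, assuming the old cur table layout, which may not match what we actually run:

<?php
// Sketch of a title prefix matcher backed by MySQL instead of
// BerkeleyDB; assumes a cur table with an index covering
// (cur_namespace, cur_title).
function prefix_match( $db, $prefix, $limit = 10 ) {
    $limit  = intval( $limit );
    // Escape LIKE wildcards first, then quote for SQL.
    $prefix = mysql_real_escape_string( addcslashes( $prefix, '%_' ), $db );
    $res = mysql_query(
        "SELECT cur_title FROM cur
          WHERE cur_namespace = 0 AND cur_title LIKE '$prefix%'
          ORDER BY cur_title LIMIT $limit", $db );
    $titles = array();
    while ( $row = mysql_fetch_row( $res ) ) {
        $titles[] = $row[0];
    }
    return $titles;
}
?>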
-- brion vibber (brion @ pobox.com)
Oh, one last thing:
Looking at the latest Alexa traffic graphs reminded me of the recent
e-mail to wikitech-l telling the developers that the Wikipedia
infrastructure was (and I quote)
"U N S C A L A B L E"
I think that should now read "unscalable, apart from the ever-improving
performance and continued exponential growth in load", or something?
By the way, what _has_ caused the recent traffic spike? Added hardware?
Software improvements? Better load-balancing? Something else?
In any case, the results have been like letting the handbrake off on a
sports-car.
Kudos again to the developers.
-- Neil
Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)
Written as a single class, it takes a MediaWiki-style markuped (is that
a word?) source and generates the XML flavor Timwi and I have been using
in all our unfinished projects! ;-)
Try it out at
http://www.magnusmanske.de/wikipedia/wiki2xml.php
Just paste a wiki source text in, and get the XML. As you will notice,
it wasn't written for speed.
It is not a "real" parser, but the structure is similar to what a parser
generator would produce, except that it takes a few shortcuts here and there.
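For illustration, the general flavour is along these lines (a trivial regex sketch, not the actual class, and the tag names are placeholders rather than the XML format Timwi and I use):

<?php
// Toy markup-to-XML rewriter; only a hint of what the real class does.
function toy_wiki2xml( $text ) {
    $text = htmlspecialchars( $text );
    // '''bold''' before ''italic'', so the triple quotes win.
    $text = preg_replace( "/'''(.*?)'''/", '<bold>$1</bold>', $text );
    $text = preg_replace( "/''(.*?)''/", '<italic>$1</italic>', $text );
    // [[Target|label]] and [[Target]] internal links.
    $text = preg_replace( '/\[\[([^\]|]+)\|([^\]]+)\]\]/', '<link target="$1">$2</link>', $text );
    $text = preg_replace( '/\[\[([^\]]+)\]\]/', '<link target="$1">$1</link>', $text );
    return '<article>' . $text . '</article>';
}

echo toy_wiki2xml( "This is '''important''' and links to [[Main Page]]." );
?>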
This could be the heart of a *real* export function. Just write an
XML-to-PDF generator (and replace the templates, and get rid of the
categories and language links) and you're done! :-)
Magnus
Hey,
> Are there any ideas for some kind of incremental dump?
> Or are bandwith and disk storage no problem compared to the
> more complex implementation of incremental dumps?
> And are there any statistics about how much bandwidth is
> used by downloads and database dumps (i.e. not normal visitor traffic)?
Incremental dumps have always been a favourite request of new developers.
It might not be too sophisticated to provide streams of changed articles,
but for now proper synchronisation is only possible by replicating SQL commands.
We do currently have incremental dumps (eh, MySQL binlogs), but those,
unlike the public dump service, contain the internal tables (with all the sensitive information).
To provide them publicly we would either have to set up yet another MySQL slave
with limited replication, or somehow filter the binlogs, which is a PITA as well.
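A "stream of changed articles" would be simple enough in itself; something like the sketch below (table and column names are from the pre-1.5 schema and may be off, and this does nothing about deletions, moves or proper synchronisation, which is the hard part):

<?php
// Sketch: dump everything edited since the last run.
$since = '20050401000000'; // timestamp of the previous dump run
$db = mysql_connect( 'localhost', 'dumpuser', 'secret' );
mysql_select_db( 'enwiki', $db );
$res = mysql_query(
    "SELECT cur_namespace, cur_title, cur_timestamp, cur_text
       FROM cur
      WHERE cur_timestamp > '" . mysql_real_escape_string( $since, $db ) . "'", $db );
while ( $row = mysql_fetch_assoc( $res ) ) {
    printf( "== %d:%s @ %s ==\n%s\n",
        $row['cur_namespace'], $row['cur_title'],
        $row['cur_timestamp'], $row['cur_text'] );
}
?>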
There's yet another issue: with 1.5 a major database redesign will happen,
which will force us to provide non-SQL dumps. We're not even sure whether
revision texts will be kept in MySQL in the future.
Cheers,
Domas
Hello wikitech-l,
The Belarusian language (http://en.wikipedia.org/wiki/Belarusian_language)
now has two quite widely used alphabets: Cyrillic and Latin
(there is actually also an Arabic alphabet, but it is used too rarely to matter).
For now, the be: Wikipedia uses Cyrillic, but we really need a Latin
version for those who prefer that alphabet. We have strict
bidirectional rules for transforming any text between Cyrillic and Latin.
We are interested in creating a "live mirror" (automatic
converter) between the Cyrillic and Latin alphabets for the be: Wikipedia.
I mean, it would be great if anyone could read and submit any article
in either alphabet.
As far as I know, something similar was created for the different
variants of written Chinese, so this problem should already have been
worked on; a sketch of the kind of conversion I have in mind follows below.
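This is a heavily simplified, one-way fragment just to show the table-driven approach; the real Cyrillic <-> Latin rules are context-sensitive and this map is nowhere near complete:

<?php
// Simplified one-way Cyrillic -> Latin fragment, illustration only.
function be_cyr2lat( $text ) {
    static $map = array(
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'h', 'д' => 'd',
        'ж' => 'ž', 'з' => 'z', 'і' => 'i', 'к' => 'k', 'л' => 'l',
        'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'с' => 's',
        'т' => 't', 'у' => 'u', 'ў' => 'ŭ', 'ч' => 'č', 'ш' => 'š',
        'ы' => 'y',
    );
    // strtr() tries longer keys first, so multi-character rules
    // (digraphs etc.) can be added to the same table.
    return strtr( $text, $map );
}

echo be_cyr2lat( 'мова' ); // mova
?>

The reverse direction would be another table plus the context rules; the interesting part is hooking it into the parser and cache so that both alphabets stay readable and editable.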
I'm an experienced PHP+MySQL developer myself, so I can participate
directly in this project.
Can anyone share their thoughts or offer any help with this? It
is surely an interesting and quite important thing.
Thank you.
--
Best regards,
Monk ([[en:User:Monkbel]], mailto:monk@zoomcon.com)
Hi,
I'm trying to play around with the category page a little bit. Everything would
be a lot easier if I could just somehow get an array of all pages listed in
category x. Can you give me some hints on how to write such a function?
Unfortunately the database layout documentation doesn't mention categorylinks.
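Something like the following is roughly what I'm after (just a guess, and I'm not sure the table and column names are right):

<?php
// Guess at a query joining categorylinks against the cur table.
function get_pages_in_category( $db, $category ) {
    $category = mysql_real_escape_string( $category, $db );
    $res = mysql_query(
        "SELECT cur_title FROM categorylinks, cur
          WHERE cl_to = '$category' AND cl_from = cur_id
          ORDER BY cl_sortkey", $db );
    $pages = array();
    while ( $row = mysql_fetch_row( $res ) ) {
        $pages[] = $row[0];
    }
    return $pages;
}
?>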
cheers,
dennis
(History: This issue was initially directed to the vandalism list for lack of a clue. After discussing it with User:CaesarB, he recommended that I approach a developer. I hope this is the right way to do that.)
On April 12th, I visited the main page of en.wikipedia.org while not logged in. At the top of the page was a notice that "you have messages". When I followed the associated link, I was directed to the page http://en.wikipedia.org/wiki/User_talk:10.0.0.23, on which a "no more nonsense" warning had been posted (and is still there).
I have only a limited understanding of IP addressing and of Wikimedia's setup, but it seemed that this should not happen, since I am external to the Wikimedia network. To the best of my knowledge, I should have been identified as IP 204.4.13.x (probably 204.4.13.72, though this may be shared). Also, the presence of the warning on the 10.0.0.23 user talk page suggests that at least some edits were identified as coming from that address.
CaesarB suggested I approach a developer because, "[...] it might mean a misconfigured Squid proxy on Wikipedia's side [...]".
I'm still fairly new to the wiki, so I don't know whether this is the right venue to raise such a request. If I am not in the right place to pursue this, please direct me via the reply address (or at http://en.wikipedia.org/wiki/User_talk:CoyneT).
If this is the right venue, would someone please keep me posted?
Thanks.
Coyne Tibbets