Wikitech-l April 2005

wikitech-l@lists.wikimedia.org

130 participants
178 discussions

RE: [Wikitech-l] Re: dumps
by Domas Mituzas 15 Apr '05

15 Apr '05

>Of course we can split en: (by namespace, by first letter, by whatever >you want). Sure. Such splitting assumes that we'd be able to do multiple-server joins. Wiki has quite a lot of relations inside, and even having interwiki is quite a PITA. On the other hand, if we 'split' and use other database engines, we may not use relational databases at all. Those were designed for relations, we need storage. > What was the purpose of this dismissive remark? It is clearly better > than Wikipedia's current state, judging from the server speed and the > Frequency of database error messages. It has different task to deal with. If we had to deal with wiki pages as separate entities without too many connections with outside world, having more async information paths, etc, we could work with far less servers. But it's collaboration platform, and requires what collaboration needs - real time information management. And sure, some more async stuff should come in future, but that could mean moving lots of tasks from relational databases to various event brokers, etc. And yes, we could serve HTML dump with extreme performance. There'd be no database errors, and you'd love the speed. Single p4 may handle 4000reqs/s :) That's far surpassing our current cluster speed ;-) Domas

1 0

RE: [Wikitech-l] Lucene search
by Domas Mituzas 15 Apr '05

15 Apr '05

Hey, > How hard would it be to come up with these word-stem normalizers for other > languages (i.e. did you base Esperanto off of another similar language or > did you come up with it yourself relatively easily)? Is there a good > description somewhere on how to come up with them? That may require some linguistic abilities and some coding abilities :) There are several stemmers floating around, with one being used by various opensource software - snowball (http://snowball.tartarus.org/). It has English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish supports. Maybe there's possibility to adapt rules from there. Cheers, Domas

2 1

Lucene search
by Brion Vibber 15 Apr '05

15 Apr '05

Kate's Lucene-based search server is now up and running experimentally to cover searches on en.wikipedia.org. It's compiled with GCJ, so it's not polluted by any of that dirty icky not-quite-free Sun Java VM stuff. ;) For those of you new to the game, Lucene is a text search engine written in Java, sponsored by the Apache project: http://lucene.apache.org/ Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases. To compare our options I did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java. More info at: * http://www.livejournal.com/community/wikitech/9608.html * http://meta.wikimedia.org/wiki/User:Brion_VIBBER/MWDaemon At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. I'll either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend. -- brion vibber (brion @ pobox.com)

6 12

Re: [Wikipedia-l] Alexa.com blogs about Wikipedia
by Neil Harris 14 Apr '05

14 Apr '05

Oh, one last thing: Looking at the recent Alexa traffic graphs reminded me of the recent E-mail to Wikitech-l, telling the developers that the Wikipedia infrastructure was (and I quote) "U N S C A L A B L E" I think that should now read "unscalable, apart from the ever-improving performance and continued exponential growth in load", or something? By the way, what _has_ caused the recent traffic spike? Added hardware? Software improvements? Better load-balancing? Something else? In any case, the results have been like letting the handbrake off on a sports-car. Kudos again to the developers. -- Neil

2 1

Wiki-to-XML pseudo-parser in PHP
by Magnus Manske 14 Apr '05

14 Apr '05

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-) Written as a single class, it takes a MediaWiki-style markuped (is that a word?) source and generates the XML flavor Timwi and I have been using in all our unfinished projects! ;-) Try it out at http://www.magnusmanske.de/wikipedia/wiki2xml.php Just paste a wiki source text in, and get the XML. As you will notice, it wasn't written for speed. It is not a "real" parser, but the structure is simlar to what a parser generator would make, except taking a few shortcuts here and there. This could be the heart of a *real* export function. Just write a XML-to-PDF generator (and replace the templates, and get rid of the categories and language links) and you're done! :-) Magnus -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.1 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCVYLOCZKBJbEFcz0RAntrAJ0cbZTbRf5IIjyK2ltgv2oNZ28+cgCfWTjn tl8TiwYzjBFGHpHUnpkVJXQ= =1Y9V -----END PGP SIGNATURE-----

6 13

RE: [Wikitech-l] dumps
by Domas Mituzas 14 Apr '05

14 Apr '05

Hey, > Are there any ideas for some kind of incremental dump? > Or are bandwith and disk storage no problem compared to the > more complex implementation of incremental dumps? > And are there any statistics about how much bandwidth is > used by downloads and database dumps (i.e. not normal visitor traffic)? Incremental dumps have been always constant rave by new developers. Maybe it would be not so sophisticated to provide streams of changed articles, but doing proper sync yet is possible only by replicating SQL commands. We do have currently incremental dumps (eh, MySQL binlogs), but those, unlike public dump service, contain internal tables (with all sensitive information). Providing those into public we either would have to set yet another mysql slave, with limited replication, or somehow filter binlogs, which is PITA as well. There's yet another issue, with 1.5 major database redesign will happen, which will force us to provide non-SQL dumps. We're even not sure if revision texts will be kept in MySQL in future. Cheers, Domas

3 3

How to read database using only mysql?
by mm 14 Apr '05

14 Apr '05

I have a question. Can I read a local wikpedia database using only mtsql? Can somebody help me? How can I do this? Thanks, Mircea

2 1

Different alphabets for the same language
by monk＠zoomcon.com 14 Apr '05

14 Apr '05

Hello wikitech-l, Belarusian language (http://en.wikipedia.org/wiki/Belarusian_language) has now two quite widely used alphabet versions - Cyrillics and Latin (actually, it also has Arabic alphabet, but it is too rarely used). For now, be: wikipedia uses Cyrillics. But we really need Latin version for those, who prefers to use this alphabet. We have strict bidirectional rules to transform any text between Cyrillics <-> Latin. We are interested in creating of a "live mirror" (automatic translator) between Cyrillics and Latin alphabets for be: Wikipedia. I mean, it would be great, if anyone could read and submit any article in either alphabet. As far as i know, something similar was created for different alphabets of Chinese language, so this issue should be worked over already. I'm myself an experienced PHP+MySQL developer, so I can directly participate in this project. Can anyone provide their thoughts and any help about this issue? It is surely interesting and quite important thing. Thank you. -- Best regards, Monk ([[en:User:Monkbel]] mailto:monk@zoomcon.com

9 33

Category array
by Dennis Schaaf 14 Apr '05

14 Apr '05

Hi, I'm trying to play around with the category page a little bit. Everything would be a lot easier if I could just somehow get an array with all pages listed in category x. Can you guys give me some hints on how to create such a function? Unfortunately the databaselayout does not talk about categorylinks. cheers, dennis

3 4

"You have messages" directed me to User_talk:10.0.0.23
by Coyne Tibbets 14 Apr '05

14 Apr '05

(History: This issue was initially directed to vandalism list for lack of a clue. After discussion with User:CaesarB, he recommended that I approach a developer. I hope this is the right method to do that.) On April 12th, I visited the main page of en.wikipedia.org while not logged on. At the top of the page was notice that "you have messages". When I selected the associated link, I was directed to the page http://en.wikipedia.org/wiki/User_talk:10.0.0.23, upon which was posted a "no more nonsense" warning (which is still there). I have limited understanding of the IP and wikimedia, but it seemed that this should not happen, since I am external to wiki. To the best of my knowledge, I should have been identified as IP 204.4.13.x (probably IP 204.4.13.72 but this may be shared). Also, the presence of the warning at the 10.0.0.23 user talk page suggests that at least some edits were identified as coming from that address. CaesarB suggested I approach a developer because, "[...] it might mean a misconfigured Squid proxy on Wikipedia's side [...]". I'm still fairly new to wiki, so I don't know if this is the right venue to broach such a request or not. If am not in the right place to pursue this, please direct me at the reply address (or at http://en.wikipedia.org/wiki/User_talk:CoyneT). If this is the right venue, would someone please keep me posted? Thanks. Coyne Tibbets -- ___________________________________________________________ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm

1 0

← Newer
1
...
8
9
10
11
12
13
14
...
18
Older →

Jump to page:

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Wikitech-l April 2005