Wikitech-l April 2004

wikitech-l@lists.wikimedia.org

71 participants
126 discussions

Disk space on suda
by Brion Vibber 13 Apr '04

13 Apr '04

Disk space on suda is really, *really* tight. Every couple of days we find ourselves clearing off files because it's *full*. Would it be possible to toss in another disk, any disk, temporarily, to which we could migrate ~20 GB of the database plus future binlogs?

-- brion vibber (brion @ pobox.com)

1 0

User name, nick, real name
by Evan Prodromou 13 Apr '04

13 Apr '04

So, the way that MediaWiki is currently set up, we have two fields for identifying a contributor:

* user name
* nick

I think (but don't know) that the idea here is that your "user name" is your "real" name, like "Evan Prodromou", and your "nick" is going to be a nickname, handle, or pseudonym, like "Mister Bad". This may come from the tradition on some wikis, like Ward's Wiki, Meatball, and others, where using your real name is the norm.

It seems that on Wikipedia, other Wikimedia projects, and Wikitravel (which I'm most interested in), this is not the case. People treat a user name like a Unix, IRC, or other "user account": an abbreviated name or a pseudonym. The "nick" field is generally just used for making fancy signatures; in other cases, it's just used to provide a _second_ pseudonym or abbreviation.

Now, I'm the last person to put down pseudonyms. I think they're a crucial part of Internet culture. But real names can be useful for, say, getting credit as a contributor to an article.

Somewhere along the way here we lost the slot for adding a "real name" to a user account. You can't provide your real name even if you want to. A real name put in the user name slot is lost in the noise; I don't know, when you have a user account like "Bob Frapples", whether that's a clever pseudonym or actually your real name.

Contributors who want to have their real name recognized now put it on their user pages. But this makes it difficult for software to determine what a user's real name is.

I'd like to embrace the reality of the situation and have two identity fields, plus a display field:

* User account name -- a pseudonym or abbrev or whatever
* Real name -- preferred form of legal name
* Signature -- fancy formatting for signatures

For these reasons, I'd like to propose the following:

* We add a nullable user field "user_real_name".
* The login/account creation page has an additional field for "real name", with an explanation that it's optional, and only for attribution, etc.
* The preferences page lets you change your real name.
* We change the documentation for the user name to note that it's a nickname and doesn't need to be your real name.
* We change the documentation for the "nick" field to note its use as a "signature" format.

Automatic attribution tools can use the real name field if it's provided, or the preferred pseudonym ("Wikitravel user Hogwallop") if not. The user account name would continue to be shown everywhere it is now, and the "nick" field would continue to be used primarily (exclusively?) in the ~~~ signature areas.

The main thing is that if contributors want attribution under their real name, but identity in the system under a nickname, they get it.

Lastly, I think an easy way to change your user name is necessary, to make this shift in emphasis easier for those who want to. That's a whole can of worms, there, but I don't think it's impossible to deal with.

~ESP

-- Evan Prodromou <evan(a)wikitravel.org> Wikitravel - http://www.wikitravel.org/ The free, complete, up-to-date and reliable world-wide travel guide
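The attribution fallback proposed above (real name if provided, otherwise the account pseudonym) could look roughly like the following. This is only a sketch in Python, not MediaWiki code: the `User` class, the `attribution` function, and the `site` parameter are hypothetical names for illustration; only the nullable "user_real_name" field and the "Wikitravel user Hogwallop" wording come from the post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """Hypothetical stand-in for a user record with the three proposed fields."""
    user_name: str                        # account name: pseudonym or abbreviation
    user_real_name: Optional[str] = None  # nullable, as proposed in the post
    nick: Optional[str] = None            # signature formatting only


def attribution(user: User, site: str = "Wikitravel") -> str:
    """Return the real name if one was given, else a pseudonymous credit line."""
    real_name = (user.user_real_name or "").strip()
    if real_name:
        return real_name
    return f"{site} user {user.user_name}"


if __name__ == "__main__":
    print(attribution(User("Hogwallop")))
    print(attribution(User("Mister Bad", user_real_name="Evan Prodromou")))
```

Treating an empty or whitespace-only real name the same as a missing one is a design choice made here, so that a user who blanks the field on the preferences page falls back to the pseudonym.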

1 0

wpSection=new flood attack on zh
by Tim Starling 13 Apr '04

13 Apr '04

Ordinarily, the POST request for an edit must contain the correct wpEditTime parameter, otherwise an edit conflict will be triggered. This has the side-effect of forcing bots (malicious or otherwise) to request an edit page before they POST their text. However, if wpSection=new, that is, the "post a comment" feature is being used, the edit time is not checked.

At around 2003-04-08 14:30 UTC, an attack was performed on the Chinese Wikipedia, using this fact. The attacker sent a flood of POST requests with wpSection=new, apparently not waiting for the response from the server before sending the next one. This allowed him/her to vandalise about 2 pages per second. The content of the message was not corporate spam as previous bot attacks have been; rather, it was a puerile anti-China message to do with eating dogs and cats.

My suggestions for dealing with this kind of attack are:

* Limiting the rate at which any given IP address can send POST requests, i.e. throttling
* A facility for fast filter configuration, perhaps even at the sysop level. Frantically editing EditPage.php at 1:30am is not my idea of fun.
* Securing edit submission such that a bot must request an edit page first

-- Tim Starling
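The first suggestion, per-IP throttling of POST requests, can be sketched as a sliding-window counter. This is a minimal illustration in Python rather than anything from MediaWiki: the `PostThrottle` class, its limits, and the in-memory storage are all assumptions, and a real deployment across several web servers would need shared state.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class PostThrottle:
    """Allow at most max_posts POSTs per IP within a sliding time window."""

    def __init__(self, max_posts: int = 5, window_seconds: float = 60.0):
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # IP address -> timestamps of accepted POSTs

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if this IP is under its limit."""
        now = time.monotonic() if now is None else now
        stamps = self._recent[ip]
        # Forget accepted requests that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_posts:
            return False
        stamps.append(now)
        return True


if __name__ == "__main__":
    throttle = PostThrottle(max_posts=2, window_seconds=1.0)
    for t in (0.0, 0.5, 0.6, 1.0):
        print(t, throttle.allow("192.0.2.1", now=t))
```

At the attack rate described (about two posts per second from one client), a limit like the hypothetical default of five per minute would reject nearly all of the flood while leaving ordinary editing untouched.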

5 11

Worldwide Lexicon and Wikipedia
by Brian McConnell 13 Apr '04

13 Apr '04

Hello. I am writing to introduce a project that complements Wikipedia. Jimmy suggested I post here, so here goes...

The goal of the Worldwide Lexicon Project (www.worldwidelexicon.org) is to create a standard procedure for discovering and querying a wide range of language resources, including machine translation servers, dictionaries, encyclopedias (e.g. Wikipedia), and even human translators. Sounds ambitious, but all WWL does is create a machine-readable directory of language services and define a standard set of CGI parm/value pairs and XML response as a poor man's web services interface. Read about the REST method of building web services; we borrowed heavily from this model.

We would like to ask the Wikipedia community to consider supporting WWL. This is very easy to do. All that is required is to modify the existing search script to recognize standard parameters used in WWL queries, and to generate an XML response in place of HTML.

How simple is this? Here's an example. A WWL client wants to search a Spanish Wikipedia for "Antoni Gaudi". This is a two-step process. First the client, perhaps an embedded browser plug-in, queries a WWL directory server as follows:

http://www.trekmail.com/wwl/sn.asp?action=findservices&sl=esp&servicetype=c…

Note: if the client does not already know the location of a directory server, it simply loads http://root.worldwidelexicon.org to get a list of directory servers.

The directory server replies with an XML dataset containing a list of Spanish encyclopediae including, presumably, a Spanish wiki. Each record contains a baseurl for each server, which informs the client which script to invoke. The record for the Spanish Wikipedia might point to:

http://esp.wikipedia.org/search.php

The client then queries this script as follows:

http://esp.wikipedia.org/search.php?qt=wwl&action=searchtext&searchscope=title&sl=esp&stext=antonio+gaudi

The wiki server replies with an XML dataset containing matching records, full text, and pointers to HTML URLs for each record. All it does is respond with XML versus HTML when it sees the qt=wwl parm/value pair. The client application parses the XML and displays it as desired. (Note that I am using querystring notation; the queries could also be submitted via the POST method.)

If client side developers prefer, they can also access WWL services via JavaBean or ActiveX objects. These are basically a collection of wrapper functions that further simplify this process and also handle common errors. These will be available later this month when we begin testing a multilingual instant messaging client. These tools are not necessary, so developers can build applications in any development environment that allows you to open a URL and parse XML.

To support a basic WWL implementation requires trivial changes to the existing search script. In the future, you may also consider implementing an extended feature set that describes parent and child categories related to an entry. I believe this will also be easy to implement, but the basic implementation described above is a quick and easy job.

Why do this? What WWL does is create a peer-to-peer equivalent of the Google API. So application developers will be able to build tools that can talk to many types of WWL enabled resources. The range of possible applications is quite broad. One application (among many) that is well suited to Wikipedia is a translation memory. Commercial translation memory tools are expensive and often not well maintained. By combining WWL and Wikipedia, it will be possible to do the same thing with a lightweight client app.

If anyone is interested in creating a WWL front-end to Wikipedia, you can contact me at brian AT mcconnell.net. I will be glad to explain the system in detail and answer questions regarding implementation.

Thank you for your time.

Brian McConnell, Project Leader Worldwide Lexicon

PS - we will also be releasing a Java based Jabber client that talks to machine translation servers, and that also matches users with bilingual users who are willing to translate for other IM users.
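For the second step of the example (querying the search script), a client only needs to assemble the query string. The sketch below, in Python, rebuilds the example URL from the post using the parameter names given there (qt, action, searchscope, sl, stext). The function name and its defaults are hypothetical, the base URL is the one the post says a directory record "might point to", and the XML response is not parsed because the post does not specify its format.

```python
from urllib.parse import urlencode


def build_wwl_search_url(base_url: str, text: str,
                         language: str = "esp", scope: str = "title") -> str:
    """Build a WWL-style search URL from the parm/value pairs named in the post."""
    params = {
        "qt": "wwl",             # asks the script for XML instead of HTML
        "action": "searchtext",
        "searchscope": scope,
        "sl": language,          # source language code as used in the post
        "stext": text,
    }
    # urlencode percent-encodes values and turns spaces into "+".
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_wwl_search_url("http://esp.wikipedia.org/search.php", "antonio gaudi"))
```

Because the same pairs could be sent by POST, as the post notes, the `params` dictionary is the part worth reusing; only the last line is specific to querystring notation.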

3 5

Re: Re: Re: New db server
by user_Jamesday 12 Apr '04

12 Apr '04

I agree on the quote from my writing. I've changed that from:

"RAID 1 mirroring offers about twice the number of read seeks as write seeks (each drive seeks independently). RAID 5 does not offer more read seeks than write seeks because each stripe is on all disks and all must seek together to get the data"

to:

"RAID 1 mirroring or RAID 10 offers about twice the number of read seeks as write seeks (each drive or stripe seeks independently). RAID 5 does not offer more read seeks than a single drive, RAID 1 or RAID 10 can deliver because each stripe is on all disks and all must seek together to get the data. In addition, in RAID 5 writes are slowed because at least one read is required to get the parity data unless it has been cached."

so it doesn't ignore the write rate reduction in RAID 5. That might be significant if it turns out that we become write rate limited, and it might be one of the reasons why LiveJournal, which is almost exclusively write-limited, is switching from RAID 5 to RAID 10. It doesn't affect what I was intending to write about, though, which was read rates of the various RAID systems.

Experience at the Wikipedia is that Suda with a 3 disk RAID 5 setup is far slower than Geoffrin with a 4 disk RAID 10 setup. I'm interested in reading your views on why that is the case. Either way, though, I'm inclined to go with what we've seen of performance in the Wikipedia environment until we can make Suda with RAID 5 faster than it has been. If you can come up with some proposals which might do that, it's worth considering trying them, since the greater space efficiency of RAID 5 will be useful eventually.

RAID 5 compared to RAID 10 is interesting when it comes to sequential read rates because the RAID 5 system can read the data from more drives, so it can get a higher sequential transfer rate. The catch is that this is a database system and database systems are generally considered to be limited primarily by their seek rate, not their sequential transfer rate. There are some potential gotchas in that though - cases with large chunk/cluster sizes in the database and some access patterns might change it. "Transaction rate" rather than seek rate or sequential transfer rate has lots of significant details not spelled out, which is one reason why I stuck to the comparatively unambiguous seek and sustained transfer rate measures (though those have a fair amount of varying potential as well).

Yes, I agree that it's possible to have RAID 5 systems set up not to have striping across all drives in the RAID 5 system box. However, that's not how people normally think of RAID 5 - they are normally thinking in terms of one set of drives. The RAID 5 option offers fewer independent seeks than RAID 1, unless you start to do things like splitting the array stripes as you described. Not really sure what I'd call that but RAID 5 probably isn't it. Maybe a pair of RAID 5s. In any case, I expect that to offer fewer seeks than RAID 1 because that two drive minimum per stripe has to seek together and RAID 1 drives can seek independently.

I do not agree that RAID 5 offers the highest read transaction rate, in general. Please support that claim compared to RAID 1 and RAID 10 over in the article talk page. It'll be interesting to see your data and any you can point to which compares the systems. Since we're considering Wikipedia use, data with Wikipedia access patterns, including transfer sizes, is what really interests me. I don't know the typical transfer size per seek for Wikipedia, though.

In a past life I was disk then overall manager for CompuServe's benchmarks and standards community, so I'm always happy to discuss disk system performance - it's a fun subject for me. :) But probably best not done on this list. :)
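The seek-rate argument in this post can be put into numbers with a toy model. The Python sketch below encodes only the assumptions the post itself states (mirrored drives seek independently for reads, a write must hit both drives of a mirror, and a RAID 5 set striped across all drives seeks together); the thread disputes those assumptions, the function names are hypothetical, and the per-drive figure of 100 seeks per second is an arbitrary illustrative value, not a measurement of Suda or Geoffrin.

```python
def raid10_seek_rates(drives: int, per_drive_seeks: float) -> tuple:
    """Return (read_seeks, write_seeks) per second for RAID 1 or RAID 10.

    Assumes every drive can serve a different read at once, while each
    write occupies both drives of one mirrored pair.
    """
    if drives < 2 or drives % 2:
        raise ValueError("RAID 1/10 needs an even number of drives, at least 2")
    read_seeks = drives * per_drive_seeks
    write_seeks = (drives // 2) * per_drive_seeks
    return read_seeks, write_seeks


def raid5_read_seek_rate(drives: int, per_drive_seeks: float) -> float:
    """Return read seeks per second for RAID 5 under the post's assumption
    that every stripe spans all drives, so the set seeks as one drive."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return per_drive_seeks


if __name__ == "__main__":
    per_drive = 100.0  # hypothetical seeks per second for a single drive
    print("4-disk RAID 10 (reads, writes):", raid10_seek_rates(4, per_drive))
    print("3-disk RAID 5 reads:", raid5_read_seek_rate(3, per_drive))
```

Under these assumptions the four-disk RAID 10 set offers four times the read seeks of the three-disk RAID 5 set. RAID 5 write cost is left out because the post only says writes are slowed by the parity read, without quantifying it.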

2 1

Database Split & Revisions
by RameezDon 12 Apr '04

12 Apr '04

2 1

Re: New db server
by user_Jamesday 12 Apr '04

12 Apr '04

A summary of current performance issues and discussions in IRC, combined with some purchase options discussions is available at: http://meta.wikipedia.org/wiki/Upgrade_discussion_April_2004 Also linked from there are the new, excellent, Ganglia statistics Tim Starling set up a few days ago. I'll integrate the discussion from the mailing list which isn't yet covered there shortly.

2 1

Deletion log is parsed and breaks layout
by Thomas Luft 11 Apr '04

11 Apr '04

Hi, I don't know if this error was already reported; at least I found no bug report on SourceForge. When viewing the deletion log (http://de.wikipedia.org/wiki/Wikipedia:L%F6sch-Logbuch) of the German Wikipedia, there are several deleted entries which get parsed and break the layout of the page. I just checked en: and there it works normally?! Any clues?

Regards
Thomas aka Urbanus

2 1

Moreri
by Brion Vibber 11 Apr '04

11 Apr '04

There's a batch compression of de's old revisions running on moreri. This doesn't seem to enjoy coexisting with apache+wiki, sending the machine into *huge* loads (~50) every few minutes, so I've shut down apache on that machine to take it out of the rotation and avoid bogging things down. -- brion vibber (brion @ pobox.com)

1 0

IMDB update
by erik_moeller＠gmx.de 11 Apr '04

11 Apr '04

Hi! Because of IMDB's no-bot policy I inquired a couple of weeks ago what the best way would be to submit links to Wikipedia articles about movies, so they can add them to the "External reviews" section for each movie. Here's the reply:

----------------------------------------

I apologize for the delay in getting back to you (I was out of the office for the past 2 weeks).

The easiest thing to automatically submit several links is to send them to our mail server. You will need to send a specially formatted email to adds(a)imdb.com with details of the title and link. Please note that our mail server doesn't normally accept incoming email from unauthorized users so you will need to let us know the email address of the sender so we can allow access to it.

This is the syntax you will need to use.

URLTITLE
title|type|URL|description|
END

or

URLNAME
name|type|URL|description|
END

where:

title = Title of film exactly as it appears on IMDb.com
type = a 3-letter code that identifies the type of link
URL = the link
description = description of the link

"type" can be one of the following:

COM comments/reviews
IMG image
SND sound
MOV movie
FAQ Frequently Asked Questions list.
OFF official sites
POS movie posters
TRA movie trailers
MSC miscellaneous i.e. anything that doesn't fit into a type above.

a few examples:

URLTITLE
Alien (1979)|COM|http://crazy4cinema.com/Review/FilmsA/f_alien.html|Crazy for Cinema
Alien (1979)|COM|http://efilmcritic.com/hbs.cgi?movie=583|eFilmCritic
Alien (1979)|COM|http://www.igs.net/~mtr/haiku-reviews.html#Alien|Haiku Reviews
Cheaper by the Dozen (2003)|COM|http://www.suntimes.com/ebert/ebert_reviews/2003/12/122402.html|R… Ebert, Chicago Sun-Times
Jaws (1975)|MSC|http://www.sharks.net/bigger_boats.html|Real shark attacks
END

or

URLNAME
Garbo, Greta|IMG|http://www.goldensilents.com/stars/gretagarbo.html|Golden Silents Portrait Photos|
Eastwood, Clint|MSC|http://www.sensesofcinema.com/contents/directors/03/eastwood.html… of Cinema - Great Directors Critical Database|
Hitchcock, Alfred (I)|MSC|http://hitchcock.tv/|Alfred Hitchcock - The Master Of Suspense|
END

What I suggest is that you create a sample submission and send it to me along with the email address that you would like to use as a submitter, so I can check that everything looks ok and set things up on our side.

GC

------------------------------------

So what we need is a mailbot which takes a list of Wikipedia articles and the corresponding IMDB titles, apparently of the syntax "Title (year)" and generates and sends the respective mails. Preferably it would keep track of its submissions (could be done easily in Perl using a tie'd hash) so we can update the list. For extra points, it could try to auto-guess the title in the IMDB, say, by filtering out "(movie)" from the Wikipedia title and looking for the first [1-2][0,8,9][0-9][0-9] match in the article.

Mike, are you still interested?

Regards, Erik
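The formatting half of such a mailbot is small. The following Python sketch (the post suggests Perl; Python is used here purely for illustration) builds a URLTITLE block in the syntax quoted above and guesses a "Title (year)" string from a Wikipedia title and article text. The function names, the choice of the COM type code for Wikipedia links, the tightened year pattern (18xx, 19xx, or 20xx instead of the looser pattern in the post), and the example URL are all assumptions; sending the mail and tracking submissions are left out.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumption: a plausible release year is 18xx, 19xx, or 20xx.
YEAR_RE = re.compile(r"\b(1[89]\d\d|20\d\d)\b")


def guess_imdb_title(wiki_title: str, article_text: str) -> Optional[str]:
    """Strip a trailing "(movie)" and append the first year found in the text."""
    base = re.sub(r"\s*\(movie\)\s*$", "", wiki_title).strip()
    match = YEAR_RE.search(article_text)
    if match is None:
        return None  # no year found; needs a human to supply the IMDB title
    return f"{base} ({match.group(1)})"


def format_urltitle(entries: Iterable[Tuple[str, str, str]],
                    link_type: str = "COM") -> str:
    """Format (title, url, description) tuples as one URLTITLE submission."""
    lines = ["URLTITLE"]
    for title, url, description in entries:
        # Follows the quoted syntax line: title|type|URL|description|
        lines.append(f"{title}|{link_type}|{url}|{description}|")
    lines.append("END")
    return "\n".join(lines)


if __name__ == "__main__":
    title = guess_imdb_title("Alien (movie)", "Alien is a 1979 science fiction film.")
    print(format_urltitle([(title, "http://en.wikipedia.org/wiki/Alien_(movie)", "Wikipedia")]))
```

The first year in an article is not always the release year, so guessed titles would still need checking against IMDB before submission, as would the choice of type code.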

4 3
