A couple quick notes I tossed up on the tech blog: http://techblog.wikimedia.org/2009/07/intermittent-media-server-load-problem...
Domas thinks it's related to this problem with ZFS snapshots badly affecting NFS server performance in some cases: http://www.opensolaris.org/jive/thread.jspa?messageID=64379
Actual load from clients doesn't seem problematic, but the NFS horror can cause things to time out badly, which sometimes affects the main apaches as well as the image scalers. (Especially when, say, deleting a category of 100 image pages. :)
We've got it behaving reasonably well at the moment, but we'll want to keep an eye on things until we've reduced the coupling between things a bit...
-- brion
Hi!
Here's my view of what was happening; it may not be accurate, as this was the first time I'd touched this part of the cluster ;-)
1. ms1 had just 128 web threads configured, which could be occupied by looking up a file, serving it (and blocking if squid doesn't consume it fast enough), or blocking on FastCGI handlers
2. even though ms1 I/O was loaded, it wasn't loaded enough to justify 20s waits on empty file creation via NFS
3. read operations via NFS were relatively fast compared with write operations
4. pybal was quite aggressive in depooling scalers
5. scalers were in one of two states: blocked on NFS writes, or depooled due to a stampede of requests hitting one server while all the others were depooled
6. if a scaler was depooled, not only would it not get requests, it also wouldn't be able to write its output back (my assumption, based on their frozen states, not a verified fact)
7. due to 4-6, ms1 HTTP threads would be stuck in the 404-handler FastCGI (waiting for scalers to respond), thus also blocking the way out for existing files
8. ms1 was spending lots of CPU in the zfs`metaslab_alloc() tree (the full tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg - use FF3.5, then you can search for metaslab_alloc in it)
9. some digging around the internals (it is amazing how I forgot pretty much everything I learned in 6h of ZFS sessions last November ;-) showed that the costs could have been increased by the number of snapshots we keep.
What was done:
1. Increased the number of worker threads on ms1 (why the heck was it that small anyway?)
2. Made the load balancer far less eager to depool servers (thanks, sir Mark)
3. Disabled the ZIL (it didn't have much of the expected effect, though, as the problem was elsewhere)
4. Dropped a few of the oldest snapshots, thus targeting the metaslab_alloc() issue.
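To make item 4 concrete, here's a minimal sketch (Python, with made-up dataset and snapshot names - not the real ms1 layout) of selecting the oldest snapshots from `zfs list -t snapshot` style output; the `zfs destroy` commands are only printed, never run:

```python
# Pick the N oldest snapshots to drop, given `zfs list -t snapshot` style
# output (assumed sorted oldest-first, as zfs prints by default).
# Dataset/snapshot names below are invented for illustration.
def oldest_snapshots(zfs_list_output, n):
    names = [line.split()[0] for line in zfs_list_output.strip().splitlines()]
    return names[:n]

listing = """\
export/upload@2008-11-01   1.2T  -  9.5T  -
export/upload@2008-12-01   1.1T  -  9.6T  -
export/upload@2009-01-01   1.0T  -  9.8T  -
export/upload@2009-02-01   0.9T  -  9.9T  -"""

for snap in oldest_snapshots(listing, 2):
    print("zfs destroy " + snap)   # would need to be run by hand, with care
```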
Cheers, Domas
P.S. For anyone Solaris savvy (I am not, despite where I work), you know what this means:
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x99
  genunix`cv_wait+0x70
  zfs`space_map_load_wait+0x20
  zfs`space_map_load+0x36
  zfs`metaslab_activate+0x6f
  zfs`metaslab_group_alloc+0x18d
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_dva_allocate+0x62
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    3816

  unix`mutex_delay_default+0xa
  unix`mutex_vector_enter+0x99
  genunix`cv_wait+0x70
  zfs`space_map_load_wait+0x20
  zfs`space_map_load+0x36
  zfs`metaslab_activate+0x6f
  zfs`metaslab_group_alloc+0x18d
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_dva_allocate+0x62
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    4068

  unix`mutex_delay_default+0xa
  unix`mutex_vector_enter+0x99
  zfs`metaslab_group_alloc+0x136
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_write_allocate_gang_members+0x171
  zfs`zio_dva_allocate+0xcc
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    4615

    7500

  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x99
  zfs`metaslab_group_alloc+0x136
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_dva_allocate+0x62
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    7785

  unix`mutex_delay_default+0xa
  unix`mutex_vector_enter+0x99
  zfs`metaslab_group_alloc+0x136
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_dva_allocate+0x62
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    10487

  genunix`avl_walk+0x39
  zfs`space_map_alloc+0x21
  zfs`metaslab_group_alloc+0x1a2
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_write_allocate_gang_members+0x171
  zfs`zio_dva_allocate+0xcc
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    16297

  genunix`avl_walk+0x39
  zfs`space_map_alloc+0x21
  zfs`metaslab_group_alloc+0x1a2
  zfs`metaslab_alloc_dva+0xdb
  zfs`metaslab_alloc+0x6d
  zfs`zio_dva_allocate+0x62
  zfs`zio_execute+0x60
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
    26149
While editing, I saw this odd error, which I've never encountered before:
302 (Moved Temporarily)
Also, still seeing a fair number of:
Proxy Error
The proxy server received an invalid response from an upstream server. The proxy server could not handle the request POST /wikipedia/en/w/index.php.
Reason: Error reading from remote server
In either case, when I show history, the edit *has* been posted.
2009/7/13 Domas Mituzas midom.lists@gmail.com:
- ms1 was spending lots of CPU in zfs`metaslab_alloc() tree (full
tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg - use FF3.5, then you can search for metaslab_alloc in it.
This sounds *very* like a ZFS bug in Solaris 10 that we struck at work a while ago:
Precis: if the file system is very busy (being hammered) *and* it's over 85% full, the block allocator can get stuck trying to work out the *very best* allocation rather than one that'll do and let it get on with other work. To the point where you see CPU go through the roof, with 80% system CPU and a very unresponsive system. You can't stop this without rebooting the box.
Sun acknowledged it as a bug and it'll be fixed in a future release; they gave us a hotpatch. The workaround? Keep the ZFS filesystem in question under 70% full ...
This is an obscure bug and isn't reason to avoid ZFS in general - the bug only gets tickled in particular circumstances, when ZFS is having the heck beaten out of it. I'd still happily recommend ZFS for almost anything, because it really is *that cool*.
- d.
Dude,
Precis: if the file system is very busy (being hammered) *and* it's over 85% full, the block allocator can get stuck trying to work out the *very best* allocation rather than one that'll do and let it get on with other work. To the point where you see CPU go through the roof, with 80% system CPU and a very unresponsive system. You can't stop this without rebooting the box.
This is exactly what we're seeing, except that we could get out of it by dropping older snapshots.
Sun acknowledged it as a bug and it'll be fixed in a future release; they gave us a hotpatch. The workaround? Keep the ZFS filesystem in question under 70% full ...
:-)
This is an obscure bug and isn't reason to avoid ZFS in general - the bug only gets tickled in particular circumstances, when ZFS is having the heck beaten out of it. I'd still happily recommend ZFS for almost anything, because it really is *that cool*.
hehehehehe, 'the heck beaten out of it' sounds like what we tend to do to our systems at wikimedia ;-) by the way, if you know such details, what are you doing in editing community. get over to the dark side ;-))
Domas
2009/7/13 Domas Mituzas midom.lists@gmail.com:
hehehehehe, 'the heck beaten out of it' sounds like what we tend to do to our systems at wikimedia ;-) by the way, if you know such details, what are you doing in editing community. get over to the dark side ;-))
If the WMF can pay me £35k to sysadmin (currently looking for £45k but this is for charity), I am SO THERE.
If not, I have a family to feed and rent to pay ;-p
- d.
2009/7/13 David Gerard dgerard@gmail.com:
If the WMF can pay me £35k to sysadmin (currently looking for £45k but this is for charity), I am SO THERE. If not, I have a family to feed and rent to pay ;-p
And of course if the WMF has £0, as is more likely, feel free to ask me Solaris horrors and I'll do what I can when I can ;-)
(good lord, 8 yrs Solaris on my CV. Let's hope Oracle keeps it well-fed and it doesn't become the next VMS jobmarketwise.)
- d.
I've been looking at the id structure of dbpedia and wikipedia and finally found an example where case sensitivity issues really bite.
Cases like this with a "redirect" are a little obnoxious,
http://en.wikipedia.org/wiki/New_York_City http://en.wikipedia.org/wiki/New_york_city
largely because there isn't a redirect... The same page gets displayed at each URL. (OK, the "redirect" has a little extra stuff at the top saying that it's a redirect.)
dbpedia has separate resource pages for the above cases, so at least it's explaining the situation clearly -- reasoning systems that work with dbpedia need to be able to read this.
Here's a case that's just plain bad...
http://en.wikipedia.org/wiki/Direct_instruction http://en.wikipedia.org/wiki/Direct_Instruction
Last time I looked there were about 10,000 wikipedia urls that varied only by case. In this particular one, it's two articles about the same topic, but there could be some cases where the two articles are about something different.
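Detecting such case-only collisions in a title list is mechanical; here's a sketch in Python (the title list is made up, not the real ~10,000):

```python
from collections import defaultdict

def case_collisions(titles):
    # Group titles by their casefolded form; any group with more than one
    # member is a set of URLs that differ only by case.
    groups = defaultdict(list)
    for t in titles:
        groups[t.casefold()].append(t)
    return [g for g in groups.values() if len(g) > 1]

titles = ["Direct instruction", "Direct Instruction", "New York City"]
print(case_collisions(titles))   # -> [['Direct instruction', 'Direct Instruction']]
```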
On Tue, Jul 28, 2009 at 11:53 AM, Paul Houlepaul@ontology2.com wrote:
I've been looking at the id structure of dbpedia and wikipedia and finally found an example where case sensitivity issues really bite.
We should keep in mind that case isn't so clear-cut if you move away from English, though -- is "groß" the same as "GROSS" and thus the same as "gross"? How about languages that don't even have bijections between uppercase and lowercase if you stick to the same dialect? (I'm pretty sure there are some; don't some languages strip diacritics from uppercase letters?) There's probably some Unicode standard on normalization with respect to case, but it's not actually so simple in an international context.
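As a concrete illustration of the "groß" wrinkle (a sketch using Python's str.casefold, which implements Unicode full case folding; not a claim about what MediaWiki should use):

```python
# Unicode full case folding maps ß to "ss", so all three forms compare
# equal under casefold() even though they are distinct strings:
assert "groß".casefold() == "gross"
assert "GROSS".casefold() == "gross"
assert "groß".casefold() == "GROSS".casefold() == "gross".casefold()

# Folding is lossy: once folded, "groß" and "gross" are indistinguishable,
# so whether they are "the same title" is a policy question, not a Unicode one.
assert "gross".upper() == "GROSS"   # the ß is simply gone
```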
That said, I think case-insensitivity would be a good thing to support in the long run, optionally, and that it would probably be suitable for all Wikipedias. Or at least almost all, if there are languages out there where case insensitivity is a real headache -- hopefully not, since most languages don't have letter case at all. At any rate it would be good on enwiki.
But it would require a lot of tedious and error-prone conversion of old code. Everything tends to assume that a) $title->getPrefixedText() is what should be displayed to the user, but b) two titles are equal if and only if their $title->getPrefixedText()s are equal. Likewise for $title->getPrefixedDbKey(). Those would need to be systematically and thoroughly fixed. We'd also have to add a field to the page table or such to store the normalized form of the title, and fiddle with the indexes appropriately, and update all other tables to use the normalized form. A lot of work.
(But at least we could get rid of the silly Text/DbKey distinction while we're doing this. I've heard recent MySQL versions actually support storage of ASCII space characters in text fields!)
Case insensitivity shouldn't be a problem for any language, as long as you do it properly.
Turkish and other languages using dotless i, for example, will need a special rule - Turkish lowercase dotted i capitalizes to a capital dotted İ while lowercase undotted ı capitalizes to regular undotted I.
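A sketch of what such a special rule looks like in practice, assuming plain Python (whose default case mapping is the non-Turkish one):

```python
# Default Unicode case mapping is wrong for Turkish: it sends I -> i.
assert "I".lower() == "i"        # correct for English, wrong for Turkish

# A hand-rolled Turkish lowercasing: map İ -> i and I -> ı first,
# then fall back to the default mapping for everything else.
TURKISH = {ord("İ"): "i", ord("I"): "ı"}

def lower_tr(s):
    return s.translate(TURKISH).lower()

assert lower_tr("DİYARBAKIR") == "diyarbakır"
assert lower_tr("ISTANBUL") == "ıstanbul"   # English loan words suffer
```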
skype: node.ue
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamsonnode.ue@gmail.com wrote:
Case insensitivity shouldn't be a problem for any language, as long as you do it properly.
Turkish and other languages using dotless i, for example, will need a special rule - Turkish lowercase dotted i capitalizes to a capital dotted İ while lowercase undotted ı capitalizes to regular undotted I.
And so what if a wiki is multilingual and you don't know what language the page name is in? What if a Turkish wiki contains some English page names as loan words, for instance?
On 7/28/09 10:04 AM, Aryeh Gregor wrote:
On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamsonnode.ue@gmail.com wrote:
Case insensitivity shouldn't be a problem for any language, as long as you do it properly.
Turkish and other languages using dotless i, for example, will need a special rule - Turkish lowercase dotted i capitalizes to a capital dotted İ while lowercase undotted ı capitalizes to regular undotted I.
And so what if a wiki is multilingual and you don't know what language the page name is in? What if a Turkish wiki contains some English page names as loan words, for instance?
Indeed, good handling of case-insensitive matchings would be a big win for human usability, but it's not easy to get right in all cases.
The main problems are:
1) Conflicts when we really do consider something separate, but the case folding rules match them together
2) Language-specific case folding rules in a multilingual environment
Turkish I with/without dot and German ß not always matching to SS are the primary examples off the top of my head. Also, some languages tend to drop accent markers in capital form (eg, Spanish). What can or should we do here?
A nearer-term help would be to go ahead and implement what we talked about a billion years ago but never got around to -- a decent "did you mean X?" message to display when you go to an empty page but there's something similar nearby.
If it's at least trivial to click through from [[New york city]] to [[New York City]], that's better than having to search for it anew.
Of course we have some case-insensitive matching for near-matches on "go" searches... we could pull from that easily. [Note this is done via TitleKey for full case-insensitivity at present... and it probably doesn't handle Turkish correctly yet.]
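As a sketch of the general idea only - not how TitleKey or the "go" search are actually implemented - a case-insensitive near-match lookup might look like this (the title list is made up):

```python
import difflib

# Toy title index; in MediaWiki this would come from the page table
# (or a normalized-title table like TitleKey's).
titles = ["New York City", "New York", "York", "Direct Instruction"]

def did_you_mean(requested, titles, cutoff=0.8):
    """Suggest existing titles similar to a nonexistent one.
    Comparison is case-insensitive, so [[New york city]] finds
    [[New York City]]."""
    index = {t.casefold(): t for t in titles}
    hits = difflib.get_close_matches(requested.casefold(), index,
                                     n=3, cutoff=cutoff)
    return [index[h] for h in hits]

print(did_you_mean("New york city", titles))   # -> ['New York City']
```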
-- brion
Since when does Spanish drop accent markers in capital form? If you have seen anybody do this, it is just a misspelling. For example: http://es.wikipedia.org/wiki/%C3%93pera or http://es.wikipedia.org/wiki/%C3%81frica or http://es.wikipedia.org/wiki/Oc%C3%A9ano_%C3%8Dndico
I have been told that Greek drops accents in capital form but this may not be true. Other than that, though, I am not acquainted with any language that does such a thing (but of course that doesn't mean none exist).
Mark
skype: node.ue
The related Wikipedia article says it's an urban legend:
http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas
So it is wrong to drop these accents.
On 7/28/09 10:30 AM, Tei wrote:
The related Wikipedia article says it's an urban legend:
http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas
Dang! I've been taken in again by exposure to real-world practice instead of what's correct. ;)
(In any case, handling that case nicely is wise too.)
-- brion
Brion Vibber wrote:
On 7/28/09 10:30 AM, Tei wrote:
The related Wikipedia article says it's an urban legend:
http://es.wikipedia.org/wiki/Acentuaci%C3%B3n_de_las_may%C3%BAsculas
Dang! I've been taken in again by exposure to real-world practice instead of what's correct. ;)
Once upon a time, mechanical typewriters weren't able to properly accent capital letters.
(In any case, handling that case nicely is wise too.)
-- brion
On the Spanish Wikipedia there are bots creating redirects from lowercased, accent-stripped titles, so that articles show up when searched for without the exact spelling. I don't really like it, but where the software falls short, users get inventive.
2009/7/28 Mark Williamson node.ue@gmail.com:
Since when does Spanish drop accent markers in capital form? If you have seen anybody do this, it is just a misspelling. For example: http://es.wikipedia.org/wiki/%C3%93pera or http://es.wikipedia.org/wiki/%C3%81frica or http://es.wikipedia.org/wiki/Oc%C3%A9ano_%C3%8Dndico
I have been told that Greek drops accents in capital form but this may not be true. Other than that, though, I am not acquainted with any language that does such a thing (but of course that doesn't mean none exist).
Frisian (fy) does drop accents in capitals, FWIW.
Roan Kattouw (Catrope)
2009/7/28 Brion Vibber brion@wikimedia.org:
A nearer-term help would be to go ahead and implement what we talked about a billion years ago but never got around to -- a decent "did you mean X?" message to display when you go to an empty page but there's something similar nearby.
If it's at least trivial to click through from [[New york city]] to [[New York City]], that's better than having to search for it anew.
I think it would be really good to implement this, since it would also help us when creating and following interwiki links (see also point 3 that I mentioned here: http://lists.wikimedia.org/pipermail/wikitech-l/2009-July/044007.html)
Helder
On Tuesday 28 July 2009 19:16:22, Brion Vibber wrote:
Indeed, good handling of case-insensitive matchings would be a big win for human usability, but it's not easy to get right in all cases.
The main problems are:
- Conflicts when we really do consider something separate, but the case
folding rules match them together
- Language-specific case folding rules in a multilingual environment
Turkish I with/without dot and German ß not always matching to SS are the primary examples off the top of my head. Also, some languages tend to drop accent markers in capital form (eg, Spanish). What can or should we do here?
Similar to the automatic redirect, we could build an automatic disambiguation page. For example, someone on srwiki going to [[Dj]] would get:
Did you mean:
* [[Đ]]
* [[DJ]]
* [[D.J.]]
A nearer-term help would be to go ahead and implement what we talked about a billion years ago but never got around to -- a decent "did you mean X?" message to display when you go to an empty page but there's something similar nearby.
I've been thinking a lot about this. The best solution I've thought of would be to add a column "page_title_canonical" to the page table. When an article is created or moved, this canonical title is built from the real title. When an article is looked up and there is no match on page_title, build the canonical title from the URL and see if there is a match on page_title_canonical; if yes, display "did you mean X" (or even go there automatically, as if from a redirect, if there is only one match), or "did you mean X, X1" if there are multiple matches.
This canonical title would be built like this:
* Remove the disambiguator from the title if it exists
* Remove punctuation and the like
* Transliterate the title to the Latin alphabet
* Transliterate to pure ASCII
* Lowercase
* Order the words alphabetically
What could possibly go wrong?
Note that this would also be very helpful for non-Latin wikis - people often want Latin-only URLs, since non-Latin URLs are too long. I also recall a recent discussion about a wiki in a language with nonstandard spelling (nds?) where they use bots to create dozens or even hundreds of redirects to an article title - this would make that unnecessary too.
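A rough sketch of those normalization steps in Python; note that the NFKD trick below only strips diacritics (far weaker than real transliteration to Latin or ASCII), and the disambiguator is assumed to be a trailing parenthesized suffix:

```python
import re
import unicodedata

def canonical_title(title):
    # Remove a trailing parenthesized disambiguator, e.g. "Mercury (planet)"
    t = re.sub(r'\s*\([^)]*\)\s*$', '', title)
    # "Transliterate" to ASCII: NFKD decomposition, then drop anything
    # non-ASCII (strips diacritics only; a real implementation would need
    # per-language transliteration for non-Latin scripts)
    t = unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation and the like
    t = re.sub(r'[^A-Za-z0-9\s]', ' ', t)
    # Lowercase and order the words alphabetically
    return ' '.join(sorted(t.lower().split()))

print(canonical_title("New york City"))   # -> "city new york"
```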
2009/7/29 Nikola Smolenski smolensk@eunet.yu:
I actually did make this extension a couple of years ago, intended for the English Wiktionary, where we manually add an {{also}} template to the top of pages to link to other pages whose titles differ in minor ways such as capitalization, hyphenation, apostrophes, accents, and periods. I think I had it working with Hebrew and Arabic and a few other exotic languages besides.
It was running on Brion's test box for some time but getting little interest. It's been offline and unmaintained since Brion moved and I did a couple of overseas trips.
http://www.mediawiki.org/wiki/Extension:DidYouMean http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DidYouMean/ https://bugzilla.wikimedia.org/show_bug.cgi?id=8648
It hooked all the ways to create, delete, or move a page in order to maintain a separate table of normalized page titles, which it consulted when displaying a page. The code for display was designed for compatibility with the then-current Wiktionary templates and would need to be implemented in a more general way. A core version would probably just add a field to the existing table.
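The core idea might be sketched like this (normalization reduced to casefold for brevity; the real rules also covered hyphenation, apostrophes, accents, and periods, as described above):

```python
# Maintain a side table mapping a normalized form of each title to the real
# titles, updated on create/delete/move, and consult it at display time.
from collections import defaultdict

class TitleIndex:
    def __init__(self):
        self._index = defaultdict(set)

    def create(self, title):
        self._index[title.casefold()].add(title)

    def delete(self, title):
        self._index[title.casefold()].discard(title)

    def move(self, old, new):
        self.delete(old)
        self.create(new)

    def also(self, title):
        """Other existing pages whose titles normalize the same way."""
        return sorted(self._index[title.casefold()] - {title})

idx = TitleIndex()
idx.create("Direct instruction")
idx.create("Direct Instruction")
assert idx.also("Direct Instruction") == ["Direct instruction"]
```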
Andrew Dunbar (hippietrail)
Aryeh Gregor wrote:
(But at least we could get rid of the silly Text/DbKey distinction while we're doing this. I've heard recent MySQL versions actually support storage of ASCII space characters in text fields!)
Apparently this poor design choice was made due to some bogus concept of backwards compatibility with UseMod, or some similarly crappy wiki engine that stores articles in the filesystem, with filenames chosen to avoid distressing shellscript fanboys.
-- Tim Starling
2009/7/13 Domas Mituzas midom.lists@gmail.com:
Precis: if the file system is very busy (being hammered) *and* it's over 85% full, the block allocator can get stuck trying to work out the *very best* allocation rather than one that'll do and let it get on with other work. To the point where you see CPU go through the roof, with 80% system CPU and a very unresponsive system. You can't stop this without rebooting the box.
This is exactly what we're seeing, except that we could get out of it by dropping older snapshots.
Yeah - cutting down how full the file system is.
Sun acknowledged it as a bug and it'll be fixed in a future release; they gave us a hotpatch. The workaround? Keep the ZFS filesystem in question under 70% full ...
:-) hehehehehe, 'the heck beaten out of it' sounds like what we tend to do to our systems at wikimedia ;-)
It's useful testing, and you can be sure Sun will be interested in your results in detail, we're a reasonably famous site! A coworker spoke to the Sun kernel engineer tearing his hair out over this one ...
I fear the answer re: ZFS is to some extent "don't do that then" until it's fixed. Of course, you want snapshots. It's a tricky one.
- d.