Send Wikitech-l mailing list submissions to wikitech-l@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/wikitech-l or, via email, send a message with subject or body 'help' to wikitech-l-request@lists.wikimedia.org
You can reach the person managing the list at wikitech-l-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Wikitech-l digest..."
Today's Topics:
- Category sorting and first letters (Tim Starling)
- Re: From page history to sentence history (Bryan Tong Minh)
- Re: From page history to sentence history (Alex Brollo)
- WMDE Developer Meetup moved to May (Daniel Kinzler)
- Re: WYSIFTW status (Aryeh Gregor)
- Re: [Toolserver-l] WMDE Developer Meetup moved to May (Daniel Kinzler)
- Re: June 8th 2011, World IPv6 Day (Aryeh Gregor)
- Re: WMDE Developer Meetup moved to May (Chad)
- Re: From page history to sentence history (Aryeh Gregor)
- Re: From page history to sentence history (Anthony)
Message: 1
Date: Tue, 18 Jan 2011 02:00:09 +1100
From: Tim Starling <tstarling@wikimedia.org>
Subject: [Wikitech-l] Category sorting and first letters
To: wikitech-l@lists.wikimedia.org
Message-ID: <ih1lhs$pmn$1@dough.gmane.org>
Content-Type: text/plain; charset=UTF-8
In r80443 I added a feature allowing categories to be sorted using the Unicode Collation Algorithm (UCA). I wanted to briefly talk about the potential user impact, the design choices and the caveats.
Sorting was the easy part. The hard part was providing a "first letter" concept which would be reasonably sane. The idea I came up with was to compile a list of first letters, themselves sorted using the UCA. Then the "first letter" of a given string is the nearest letter in the list which sorts above the string.
For instance, if you have the letters A, B, and C, and the string Aardvark, sorting them gives:

A
Aardvark
B
C
So we know that A is the first letter of Aardvark because Aardvark sorts immediately below A. This algorithm gives us a number of nice properties:
- It automatically drops accents, since accented letters sort the same as unaccented letters (at the primary level). Same with case differences, hiragana/katakana, etc.
- You can work out the initial Jamo of a Hangul syllable character by just omitting the composed syllables from the "first letter" list. Previously this was done with a special-case hack in Language::firstChar().
- Vowel reordering in Thai and Lao is automatically supported. So "??" sorts under heading "?" and "??" sorts under heading "?".
- The collation can be expanded to support all sorts of other crazy features, and the first letter feature will keep working in a sane way. For instance, you could have an English collation which removed "the" from the start of a title.
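To make the first-letter lookup concrete, here is a minimal Python sketch of the scheme described above (not the actual r80443 PHP code). The primary_key() function is a toy stand-in for a real UCA primary-strength sort key, which in practice would come from a collation library such as ICU, and the header list is assumed to be pre-sorted:

    import bisect
    import unicodedata

    def primary_key(s):
        # Toy stand-in for a UCA primary-strength sort key: decompose,
        # drop combining marks (accents), and ignore case. Real code
        # would get this from an ICU collator instead.
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed
                       if not unicodedata.combining(c)).casefold()

    # The header list, assumed pre-sorted by primary key.
    headers = ["A", "B", "C"]
    header_keys = [primary_key(h) for h in headers]

    def first_letter(title):
        # The "first letter" is the nearest header sorting at or before
        # the title: find the insertion point, then step back one.
        i = bisect.bisect_right(header_keys, primary_key(title))
        return headers[i - 1] if i > 0 else headers[0]

    print(first_letter("Aardvark"))  # A  (sorts between A and B)
    print(first_letter("Ärger"))     # A  (accent ignored at primary level)
    print(first_letter("banana"))    # B  (case ignored)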
I compiled a list of 14,742 suitable header characters, identified by processing various Unicode data files. That list probably still needs lots of tweaks.
There is a downside to this scheme. The default UCA table gives all characters with a logical function similar to the digits 0-9 the same primary sort order as the corresponding ASCII digits. So a page like [[????]] on the Bihari Wikipedia will sort under the heading "1" instead of under the corresponding Devanagari digit. There may be other instances of accidental cultural imperialism. However, this can be fixed by compiling language-dependent lists of header characters.
The UCA default table is not meant to sort any language correctly; it's just a compromise collation. Support for language-specific collations can easily be added. Whether we get language-specific collations or not, I'd like to think about enabling this feature on Wikimedia.
The most glaring omission from the UCA default tables is sensible sorting of unified Han characters.
In a Chinese context, there's an obvious way to sort characters: by their order in the KangXi dictionary. The Unihan database gives such an ordering, and code point order follows it within each Unicode block, but not between blocks. So if you sort by code point, all the Han characters that aren't in the U+4E00 to U+9FFF block will sort incorrectly. That's what the default UCA does, with a few minor exceptions.
In a Japanese context, the way to sort ideographic characters is to convert them to phonetic hiragana and then to sort the resulting string. I don't know if there is any free software for doing this. On the Japanese Wikipedia, they achieve the same result by manually setting the sort key of every page (e.g. with {{DEFAULTSORT:}}) to the hiragana reading of the title.
There's lots of room here for other people to get involved, especially if you know a language other than English.
-- Tim Starling
Message: 2
Date: Mon, 17 Jan 2011 16:29:58 +0100
From: Bryan Tong Minh <bryan.tongminh@gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTi=w=6we2xngMMNikuFfMTH8KRtiVzXRSibJU-pX@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 3:49 PM, Anthony <wikimail@inbox.org> wrote:

> How would you define a particular sentence, paragraph or section of an
> article? The difficulty of the solution lies in answering that question.
Difficult, but doable. Jan-Paul's sentence-level editing tool is able to make the distinction. It would perhaps be possible to use that as a framework for sentence-level diffs.
Bryan
Message: 3
Date: Mon, 17 Jan 2011 16:40:28 +0100
From: Alex Brollo <alex.brollo@gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTi=WhAZ1d5ty9hbkdD-7LkfSd_Fy0VtEvjxAdPQn@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
2011/1/17 Bryan Tong Minh <bryan.tongminh@gmail.com>

> Difficult, but doable. Jan-Paul's sentence-level editing tool is able to
> make the distinction. It would perhaps be possible to use that as a
> framework for sentence-level diffs.

Difficult, but the diff between versions of a page already does it. Looking at diffs between page versions, I had firmly assumed that only the changed paragraphs were stored, so that a page was built up from accumulated diff segments. I had no idea how this could be done, but it all seemed like "magic"!
Alex
Message: 4
Date: Mon, 17 Jan 2011 17:11:12 +0100
From: Daniel Kinzler <daniel@brightbyte.de>
Subject: [Wikitech-l] WMDE Developer Meetup moved to May
To: wikitech-l@lists.wikimedia.org, toolserver-l@lists.wikimedia.org,
	MediaWiki announcements and site admin list <mediawiki-l@lists.wikimedia.org>
Cc: Nicole Ebber <nicole.ebber@wikimedia.de>, Pavel Richter <pavel.richter@wikimedia.de>
Message-ID: <4D346A20.107@brightbyte.de>
Content-Type: text/plain; charset=UTF-8
Hi all
After some discussion, Wikimedia Germany has decided not to hold a developers' meet-up around the Chapters conference in March. We just couldn't fit it in nicely with the venue and the overall organization. Don't despair though:
This is what we will do instead:
- There will be a hackathon hosted by Wikimedia Germany in (late) May, probably in Berlin, but that's not decided yet. It will mostly be about hacking, with a strong focus on GLAM-related stuff. There will be little in the way of presentations.
- There will be the hacking days attached to Wikimania in Haifa, August 3/4. I'm in charge of setting up the program for that, and I'll try to make it a nice mix of discussing technology and actually hacking. I would also like to have a get-together with techies and chapter folks at some point during Wikimania.
I hope that this way, we can give the hacking events the attention they deserve. Let me know what you think.
-- daniel
Message: 5
Date: Mon, 17 Jan 2011 11:31:27 -0500
From: Aryeh Gregor <Simetrical+wikilist@gmail.com>
Subject: Re: [Wikitech-l] WYSIFTW status
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTikudZhXBHndkeHEwsUqHvCqBZ2VESTKM7xoZTn2@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
<magnusmanske@googlemail.com> wrote:

> There is the question of what browsers/versions to test for. Should I
> invest large amounts of time optimising performance in Firefox 3, when
> FF4 will probably be released before WYSIFTW, and everyone and their
> cousin upgrades?
Design for only the fastest browsers. Other browsers could always just be dropped back to the old-fashioned editor.
Message: 6
Date: Mon, 17 Jan 2011 17:39:31 +0100
From: Daniel Kinzler <daniel@brightbyte.de>
Subject: Re: [Wikitech-l] [Toolserver-l] WMDE Developer Meetup moved to May
To: toolserver-l@lists.wikimedia.org
Cc: MediaWiki announcements and site admin list <mediawiki-l@lists.wikimedia.org>,
	wikitech-l@lists.wikimedia.org, Asaf Bartov <asaf.bartov@gmail.com>,
	Pavel Richter <pavel.richter@wikimedia.de>, Nicole Ebber <nicole.ebber@wikimedia.de>
Message-ID: <4D3470C3.4040304@brightbyte.de>
Content-Type: text/plain; charset=ISO-8859-1
On 17.01.2011 17:14, Asaf Bartov wrote:

> Correction: Haifa Hacking Days are to be held August 2nd-3rd. Wikimania
> itself will be Aug 4th-6th.
Gah! Thanks Asaf.
There I went and looked it up, and then wrote the wrong thing into the email. Curses.
-- daniel
Message: 7
Date: Mon, 17 Jan 2011 11:44:28 -0500
From: Aryeh Gregor <Simetrical+wikilist@gmail.com>
Subject: Re: [Wikitech-l] June 8th 2011, World IPv6 Day
To: Happy-melon <happy-melon@live.com>, Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTikk20OAKv-vreinxD-oBmfnzLbo97=xROQebpDX@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:12 PM, Happy-melon <happy-melon@live.com> wrote:

> I don't entirely understand the point of this. The plan seems to be
> """get a large enough fraction of 'the internet' to make a change which
> breaks for some people all at the same time, so that those people get
> angry with the ISPs that haven't got off their arses to fix said
> breakage, rather than angry with the broken sites""", which is fair
> enough.
No, the point is to test what happens if IPv6 is supported on a large scale. It's known from small-scale testing that this will break things for some small percentage of users, but no one's sure what the consequences are of switching this on fully for everyone.
> But AFAICT, the breakage won't occur if your connection can't 'do' IPv6,
> but only if your connection can't 'do' both IPv4 *and* IPv6 on the same
> site at the same time. Surely that's not actually the problem that we
> need to solve if we're to be able to migrate smoothly onto IPv6? When
> the IPv4 addresses run out, we need to be able to start setting up
> websites which are *only* v6, surely?
There are many more clients in the world than servers, and servers have always been able to get dedicated IPv4 addresses much more easily than clients. A server Internet connection in America will typically come with as many IPv4 addresses as you need, while you usually can't get a dedicated residential IP address unless you pay extra. (And America has more IP addresses allocated per capita than anywhere else in the world, since it originally developed the Internet.)
So as IPv4 addresses become scarcer, the pressure to use IPv6 only will fall mostly on residential users. Clients with only an IPv6 address will only be able to get direct connections to IPv6-enabled servers. The way servers are supposed to support this is to serve both A and AAAA records for the same domain, so that IPv4 clients use the A record and IPv6 clients use the AAAA record.
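As a small illustration of that dual-record setup (my sketch; the host name is only an example and may or may not publish AAAA records when you run it), Python's resolver interface shows both address families for a dual-stacked name:

    import socket

    # Look up every address published for one host name. A dual-stacked
    # site returns both AF_INET (A record) and AF_INET6 (AAAA record)
    # results; an IPv4-only site returns only AF_INET.
    host = "www.wikipedia.org"  # example name; AAAA presence may vary
    for family, _, _, _, sockaddr in socket.getaddrinfo(
            host, 80, proto=socket.IPPROTO_TCP):
        record = "AAAA" if family == socket.AF_INET6 else "A"
        print(record, sockaddr[0])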
Unfortunately, someone at some point decided that if the client supports both IPv4 and IPv6, and the server publishes both A and AAAA records, the client should connect via IPv6. In practice, almost no sites use IPv6, so the infrastructure is much less well-tested. Clients that think they have IPv6 connections might actually have the connection eaten by a middlebox, or just be slower or less reliable. So sites don't turn on the AAAA records in practice because it degrades service for clients with IPv6 connections, which means the servers aren't accessible to IPv6-only clients without workarounds.
IPv6 day is an attempt to see what happens if major sites publish AAAA records for a while. Stuff will break, but hopefully not too horribly, and it will give both site operators and ISPs the chance to analyze what's wrong with their IPv6 support and what they can do to fix it. This is a step toward major sites publishing AAAA records all the time, which is necessary to support IPv6-only clients.
Something like that, anyway. I'm hardly an expert on these things.
Message: 8
Date: Mon, 17 Jan 2011 11:45:33 -0500
From: Chad <innocentkiller@gmail.com>
Subject: Re: [Wikitech-l] WMDE Developer Meetup moved to May
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Cc: toolserver-l@lists.wikimedia.org,
	MediaWiki announcements and site admin list <mediawiki-l@lists.wikimedia.org>
Message-ID: <AANLkTim3Q5CS20O=CRVo0A2z7nNbqftrhaUFFgvBq2+g@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 11:11 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:

> - There will be a hackathon hosted by Wikimedia Germany in (late) May,
>   probably in Berlin, but that's not decided yet. It will mostly be
>   about hacking, with a strong focus on GLAM-related stuff. There will
>   be little in the way of presentations.
Late May? That's actually *really* awesome. Now I don't have to miss school to come :D
-Chad
Message: 9
Date: Mon, 17 Jan 2011 11:47:35 -0500
From: Aryeh Gregor <Simetrical+wikilist@gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTinBdUX_v4d0gvxzm=BF_LE+1aQrMmjhk8xsvFE8@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 5:55 AM, Alex Brollo <alex.brollo@gmail.com> wrote:

> Before I dug a little more into wiki mysteries, I was absolutely sure
> that wiki articles were stored in small pieces (paragraphs?), so that a
> small edit to a very long page would take exactly the same disk space as
> a small edit to a short page. But I soon discovered that things are
> different. :-)
Wikimedia stores old revisions using delta compression, so this is basically what happens: the size of the edit is what determines the size of the stored delta, not the size of the page. (I don't know how this works in detail, though.) IIRC, a default MediaWiki installation doesn't work this way.
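To make the idea concrete, here is a toy sketch (my illustration, not Wikimedia's actual storage code): a one-word edit to a long page yields a delta whose size tracks the edit, not the page.

    import difflib

    # Build a long page and a second revision with a one-word edit.
    page_v1 = "\n".join(f"Paragraph {i} of a very long article."
                        for i in range(1000))
    page_v2 = page_v1.replace("Paragraph 500 ", "Paragraph 500 (edited) ")

    # The diff between the two revisions is tiny compared to the page
    # itself, so storing it as a delta costs space proportional to the
    # edit, not to the page length.
    delta = "".join(difflib.unified_diff(page_v1.splitlines(keepends=True),
                                         page_v2.splitlines(keepends=True)))
    print(len(page_v1), len(delta))  # the delta is a few hundred bytes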
Message: 10
Date: Mon, 17 Jan 2011 12:41:22 -0500
From: Anthony <wikimail@inbox.org>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <AANLkTinfD+PEoAWN1T4XyZaeCwPO1_NeXm0EoDgLjzoH@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo <alex.brollo@gmail.com> wrote:

> 2011/1/17 Bryan Tong Minh <bryan.tongminh@gmail.com>
>
>> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
>> to make the distinction. It would perhaps be possible to use that as a
>> framework for sentence-level diffs.
>
> Difficult, but the diff between versions of a page already does it.
> Looking at diffs between page versions, I had firmly assumed that only
> the changed paragraphs were stored, so that a page was built up from
> accumulated diff segments. I had no idea how this could be done, but it
> all seemed like "magic"!
Paragraphs are much easier to recognize than sentences, since wikitext has a paragraph delimiter: a blank line. To truly recognize sentences, you basically have to engage in natural language processing, though you can probably get it right 90% of the time without too much effort.
And recognizing what's going on when a sentence changes *and* is moved from one paragraph to another requires an even greater level of natural language understanding. Again, though, you can probably get it right most of the time without too much effort.
Wikitext actually makes it easier for the most part, as you can use tricks such as the fact that the periods in [[I.M. Someone]] don't represent sentence delimiters, since they are contained in square brackets. But not all periods which occur in the middle of a sentence are contained in square brackets, and not all sentences end with a period.
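As a rough sketch of that trick (a hypothetical helper, not an existing tool), you can mask the periods inside [[...]] links before splitting on sentence-final punctuation; abbreviations outside links and sentences without a final period would still defeat it:

    import re

    # Mask periods inside [[...]] links so they don't count as sentence
    # ends, split on sentence-final punctuation, then unmask.
    def split_sentences(wikitext):
        masked = re.sub(r"\[\[[^\]]*\]\]",
                        lambda m: m.group(0).replace(".", "\x00"),
                        wikitext)
        parts = re.split(r"(?<=[.!?])\s+", masked)
        return [p.replace("\x00", ".") for p in parts]

    text = "See [[I.M. Someone]] for details. He wrote two books!"
    print(split_sentences(text))
    # ['See [[I.M. Someone]] for details.', 'He wrote two books!']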
I'd say "difficult but doable" is quite accurate, although with the caveat that even the state of the art tools available today are probably going to make mistakes that would be obvious to a human. I'm sure there are tools for this, and there are probably some decent ones that are open source. But it's not as simple as just adding an index.
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
End of Wikitech-l Digest, Vol 90, Issue 33