Send Wikitech-l mailing list submissions to
wikitech-l(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
or, via email, send a message with subject or body 'help' to
wikitech-l-request(a)lists.wikimedia.org
You can reach the person managing the list at
wikitech-l-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikitech-l digest..."
Today's Topics:
1. Category sorting and first letters (Tim Starling)
2. Re: From page history to sentence history (Bryan Tong Minh)
3. Re: From page history to sentence history (Alex Brollo)
4. WMDE Developer Meetup moved to May (Daniel Kinzler)
5. Re: WYSIFTW status (Aryeh Gregor)
6. Re: [Toolserver-l] WMDE Developer Meetup moved to May
(Daniel Kinzler)
7. Re: June 8th 2011, World IPv6 Day (Aryeh Gregor)
8. Re: WMDE Developer Meetup moved to May (Chad)
9. Re: From page history to sentence history (Aryeh Gregor)
10. Re: From page history to sentence history (Anthony)
----------------------------------------------------------------------
Message: 1
Date: Tue, 18 Jan 2011 02:00:09 +1100
From: Tim Starling <tstarling(a)wikimedia.org>
Subject: [Wikitech-l] Category sorting and first letters
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <ih1lhs$pmn$1(a)dough.gmane.org>
Content-Type: text/plain; charset=UTF-8
In r80443 I added a feature allowing categories to be sorted using the
Unicode Collation Algorithm (UCA). I wanted to briefly talk about the
potential user impact, the design choices and the caveats.
Sorting was the easy part. The hard part was providing a "first
letter" concept which would be reasonably sane. The idea I came up
with was to compile a list of first letters, themselves sorted using
the UCA. Then the "first letter" of a given string is the nearest
letter in the list which sorts above the string.
For instance if you have letters A, B, C, and a string Aardvark, if
you sort them you get:
A
Aardvark
B
C
So we know that A is the first letter of Aardvark because Aardvark
sorts immediately below A. This algorithm gives us a number of nice
properties:
* It automatically drops accents, since accented letters sort the same
as unaccented letters (at the primary level). Same with case
differences, hiragana/katakana, etc.
* You can work out the initial Jamo of a Hangul syllable character by
just omitting the composed syllables from the "first letter" list.
Previously this was done with a special-case hack in
Language::firstChar().
* Vowel reordering in Thai and Lao is automatically supported.
So "??" sorts under heading "?" and "??" sorts under
heading "?".
* The collation can be expanded to support all sorts of other crazy
features, and the first letter feature will keep working in a sane
way. For instance, you could have an English collation which removed
"the" from the start of a title.
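The lookup described above (nearest letter in the sorted list that sorts at or below the string) can be sketched with a binary search. This is only an illustrative sketch: plain string comparison stands in for UCA primary-strength comparison, so accent and case folding are not modelled here.

```python
import bisect

def first_letter(title, letters):
    # letters: pre-sorted list of candidate header letters.
    # The header for a title is the nearest letter that sorts at or
    # below it. NOTE: plain string comparison stands in for the UCA
    # primary-strength comparison, so accent/case folding is not shown.
    i = bisect.bisect_right(letters, title)
    return letters[i - 1] if i > 0 else None

letters = ["A", "B", "C"]
print(first_letter("Aardvark", letters))  # "A": Aardvark sorts between A and B
print(first_letter("Coyote", letters))    # "C"
```

In a real implementation the comparison would be done on UCA sort keys, which is what makes accented, case-variant, and kana-variant titles land under the expected heading for free.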
I compiled a list of 14,742 suitable header characters, identified by
processing various Unicode data files. That list probably still needs
lots of tweaks.
There is a down side to this scheme. The default UCA table gives all
characters with a similar logical function to the digits 0-9 the same
primary sort order as the corresponding ASCII digits. So a page like
[[????]] on the Bihari Wikipedia will sort under a heading of "1"
instead of "?". There may be other instances of accidental cultural
imperialism. However, this can be fixed by compiling
language-dependent lists of header characters.
The UCA default table is not meant to sort any language correctly,
it's just a compromise collation. Support for language-specific
collations can easily be added. Whether we get language-specific
collations or not, I'd like to think about enabling this feature on
Wikimedia.
The most glaring omission from the UCA default tables is sensible
sorting of the unified Han.
In a Chinese context, there's an obvious way to sort characters, and
that's by their order in the KangXi dictionary. The Unihan database
gives such an ordering, and it's used within code blocks. But it's not
used between code blocks. So if you sort by code point, all the Han
characters that aren't in the U+4E00 to U+9FFF block will sort
incorrectly. That's what the default UCA does, with a few minor
exceptions.
In a Japanese context, the way to sort ideographic characters is to
convert them to phonetic hiragana and then to sort the resulting
string. I don't know if there is any free software for doing this. On
the Japanese Wikipedia, they achieve the same result by manually
setting the sort key of every page to be the hiragana version of the
title.
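For reference, that manual convention goes through MediaWiki's sort-key mechanism: on a page such as [[東京都]] (Tokyo), editors add something like

```
{{DEFAULTSORT:とうきょうと}}
```

so the page sorts, and gets its category heading, as if its title were the hiragana reading. (The exact sort key shown here is illustrative.)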
There's lots of room here for other people to get involved, especially
if you know a language other than English.
-- Tim Starling
------------------------------
Message: 2
Date: Mon, 17 Jan 2011 16:29:58 +0100
From: Bryan Tong Minh <bryan.tongminh(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTi=w=6we2xngMMNikuFfMTH8KRtiVzXRSibJU-pX(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 3:49 PM, Anthony <wikimail(a)inbox.org> wrote:
How would you define a particular sentence, paragraph or section of an
article? The difficulty of the solution lies in answering that
question.
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Bryan
------------------------------
Message: 3
Date: Mon, 17 Jan 2011 16:40:28 +0100
From: Alex Brollo <alex.brollo(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTi=WhAZ1d5ty9hbkdD-7LkfSd_Fy0VtEvjxAdPQn(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
2011/1/17 Bryan Tong Minh <bryan.tongminh(a)gmail.com>
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking
at diffs between pages, I had simply assumed that only the changed paragraphs
were stored, so that a page was rebuilt from accumulated diff segments. I had
no idea how this could be done, but it was all "magic"!
Alex
------------------------------
Message: 4
Date: Mon, 17 Jan 2011 17:11:12 +0100
From: Daniel Kinzler <daniel(a)brightbyte.de>
Subject: [Wikitech-l] WMDE Developer Meetup moved to May
To: wikitech-l(a)lists.wikimedia.org, toolserver-l(a)lists.wikimedia.org,
MediaWiki announcements and site admin list
<mediawiki-l(a)lists.wikimedia.org>
Cc: Nicole Ebber <nicole.ebber(a)wikimedia.de>, Pavel Richter
<pavel.richter(a)wikimedia.de>
Message-ID: <4D346A20.107(a)brightbyte.de>
Content-Type: text/plain; charset=UTF-8
Hi all,
After some discussion, Wikimedia Germany decided not to hold a developer's
meet-up around the Chapter's conference in March. We just couldn't fit this in
nicely with the venue and the overall organization. Don't despair, though;
this is what we will do instead:
* There will be a hackathon hosted by Wikimedia Germany in (late) May, probably
in Berlin, but that's not decided yet. This will mostly be about hacking, with a
strong focus on GLAM-related stuff. There will be little in terms of presentations.
* There will be the hacking days attached to Wikimania in Haifa, August 3/4.
I'm in charge of setting up the program for that, and I'll try to make it a nice
mix of discussing technology and actually hacking. I would also like to have a
get-together with techies and chapter folks at some point during Wikimania.
I hope that this way, we can give the hacking events the attention they deserve.
Let me know what you think.
-- daniel
------------------------------
Message: 5
Date: Mon, 17 Jan 2011 11:31:27 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] WYSIFTW status
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTikudZhXBHndkeHEwsUqHvCqBZ2VESTKM7xoZTn2(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
There is the question of what browsers/versions
to test for. Should I
invest large amounts of time optimising performance in Firefox 3, when
FF4 will probably be released before WYSIFTW, and everyone and their
cousin upgrades?
Design for only the fastest browsers. Other browsers could always
just be dropped back to the old-fashioned editor.
------------------------------
Message: 6
Date: Mon, 17 Jan 2011 17:39:31 +0100
From: Daniel Kinzler <daniel(a)brightbyte.de>
Subject: Re: [Wikitech-l] [Toolserver-l] WMDE Developer Meetup moved
to May
To: toolserver-l(a)lists.wikimedia.org
Cc: MediaWiki announcements and site admin list
<mediawiki-l(a)lists.wikimedia.org>, wikitech-l(a)lists.wikimedia.org,
Asaf Bartov <asaf.bartov(a)gmail.com>, Pavel Richter
<pavel.richter(a)wikimedia.de>, Nicole Ebber <nicole.ebber(a)wikimedia.de>
Message-ID: <4D3470C3.4040304(a)brightbyte.de>
Content-Type: text/plain; charset=ISO-8859-1
On 17.01.2011 17:14, Asaf Bartov wrote:
Correction: Haifa Hacking Days are to be held
August 2nd-3rd.
Wikimania itself will be Aug 4th-6th.
Gah! Thanks Asaf.
There I went and looked it up, and then wrote the wrong thing into the email.
Curses.
-- daniel
------------------------------
Message: 7
Date: Mon, 17 Jan 2011 11:44:28 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] June 8th 2011, World IPv6 Day
To: Happy-melon <happy-melon(a)live.com>, Wikimedia developers
<wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTikk20OAKv-vreinxD-oBmfnzLbo97=xROQebpDX(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:12 PM, Happy-melon <happy-melon(a)live.com> wrote:
I don't entirely understand the point of this. The plan seems to be """get
a large enough fraction of 'the internet' to make a change which breaks for
some people all at the same time, so that those people get angry with the
ISPs that haven't got off their arses to fix said breakage, rather than
angry with the broken sites""", which is fair enough.
No, the point is to test what happens if IPv6 is supported on a large
scale. It's known from small-scale testing that this will break
things for some small percentage of users, but no one's sure what the
consequences are of switching this on fully for everyone.
But AFAICT, the breakage won't occur if your connection can't 'do' IPv6,
but only if your connection can't 'do' both IPv4 *and* IPv6 on the same
site at the same time. Surely that's not actually the problem that we need
to solve if we're to be able to migrate smoothly onto IPv6? When the IPv4
addresses run out, we need to be able to start setting up websites which
are *only* v6, surely?
There are many more clients in the world than servers, and servers
have always been able to get dedicated IPv4 addresses much more easily
than clients. A server Internet connection in America will typically
come with as many IPv4 addresses as you need, while you usually can't
get a dedicated residential IP address unless you pay extra. (And
America has more IP addresses allocated per capita than anywhere else
in the world, since it originally developed the Internet.)
So as IPv4 addresses become scarcer, the pressure to use IPv6 only
will fall mostly on residential users. Clients with only an IPv6
address will only be able to get direct connections to IPv6-enabled
servers. The way servers are supposed to do this is serve both A and
AAAA records for the same domain, so IPv4 clients use the A record and
IPv6 clients use the AAAA record.
Unfortunately, someone at some point decided that if the client
supports both IPv4 and IPv6, and the server publishes both A and AAAA
records, the client should connect via IPv6. In practice, almost no
sites use IPv6, so the infrastructure is much less well-tested.
Clients that think they have IPv6 connections might actually have the
connection eaten by a middlebox, or just be slower or less reliable.
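That client-side preference can be sketched as follows. This is a hypothetical sketch with illustrative addresses; a real stack follows the address-selection rules of RFC 3484 rather than a hard-coded preference.

```python
import socket

def pick_address(records, prefer_ipv6=True):
    """Given resolved (family, address) records for a domain that
    publishes both A and AAAA, pick one the way a dual-stack client
    classically does: prefer the AAAA (IPv6) result when present,
    else fall back to A. Sketch only; real stacks apply RFC 3484
    address selection."""
    order = ([socket.AF_INET6, socket.AF_INET] if prefer_ipv6
             else [socket.AF_INET, socket.AF_INET6])
    for family in order:
        for fam, addr in records:
            if fam == family:
                return addr
    return None

# e.g. records as getaddrinfo() might return for a dual-published site
# (addresses are illustrative):
records = [(socket.AF_INET, "208.80.154.224"),
           (socket.AF_INET6, "2620:0:861:ed1a::1")]
print(pick_address(records))  # the IPv6 address is chosen
```

The problem described above is precisely this branch: once the AAAA exists, dual-stack clients take the IPv6 path even when that path is broken or slow.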
So sites don't turn on the AAAA records in practice because it
degrades service for clients with IPv6 connections, which means the
servers aren't accessible to IPv6-only clients without workarounds.
IPv6 day is an attempt to see what happens if major sites publish AAAA
records for a while. Stuff will break, but hopefully not too
horribly, and it will give both site operators and ISPs the chance to
analyze what's wrong with their IPv6 support and what they can do to
fix it. This is a step toward major sites publishing AAAA records all
the time, which is necessary to support IPv6-only clients.
Something like that, anyway. I'm hardly an expert on these things.
------------------------------
Message: 8
Date: Mon, 17 Jan 2011 11:45:33 -0500
From: Chad <innocentkiller(a)gmail.com>
Subject: Re: [Wikitech-l] WMDE Developer Meetup moved to May
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Cc: toolserver-l(a)lists.wikimedia.org, MediaWiki announcements and site
admin list <mediawiki-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTim3Q5CS20O=CRVo0A2z7nNbqftrhaUFFgvBq2+g(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 11:11 AM, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
* There will be a hackathon hosted by Wikimedia
Germany in (late) May, probably
in Berlin, but that's not decided yet. This will mostly be about hacking, with a
strong focus on GLAM-related stuff. There will be little in terms of presentations.
Late May? That's actually *really* awesome. Now I don't have
to miss school to come :D
-Chad
------------------------------
Message: 9
Date: Mon, 17 Jan 2011 11:47:35 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTinBdUX_v4d0gvxzm=BF_LE+1aQrMmjhk8xsvFE8(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 5:55 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:
Before I dig a little more into wiki mysteries, I was absolutely sure that
wiki articles were stored in small pieces (paragraphs?) so that a small
edit to a very long page would take exactly the same disk space as a small
edit to a short page. But I soon discovered that things are
different. :-)
Wikimedia stores revision text using delta compression, so this is
basically what happens. The size of the edit is what determines the
size of the stored diff, not the size of the page. (I don't know how
this works in detail, though.) IIRC, default MediaWiki doesn't work
this way.
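The scaling claim can be illustrated with a line-based diff. This is only a rough stand-in for what Wikimedia actually does (its storage scheme is different); it just shows that the delta tracks the size of the edit, not the size of the page.

```python
import difflib

def delta_size(old, new):
    """Characters in a line-based unified diff between two revisions --
    a rough stand-in for delta storage, illustrating only the scaling."""
    diff = difflib.unified_diff(old.splitlines(True), new.splitlines(True))
    return sum(len(d) for d in diff)

# A 5000-line page and the same page with a single line changed:
long_page = "".join(f"line {i}\n" for i in range(5000))
small_edit = long_page.replace("line 2500\n", "edited 2500\n")
# The delta stays small regardless of how long the page is:
print(delta_size(long_page, small_edit))
```

A one-line edit produces a diff of a few hundred bytes whether the page is fifty lines or fifty thousand, which is the behaviour Alex expected.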
------------------------------
Message: 10
Date: Mon, 17 Jan 2011 12:41:22 -0500
From: Anthony <wikimail(a)inbox.org>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTinfD+PEoAWN1T4XyZaeCwPO1_NeXm0EoDgLjzoH(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:
2011/1/17 Bryan Tong Minh
<bryan.tongminh(a)gmail.com>
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking
at diffs between pages, I had simply assumed that only the changed paragraphs
were stored, so that a page was rebuilt from accumulated diff segments. I had
no idea how this could be done, but it was all "magic"!
Paragraphs are much easier to recognize than sentences, as wikitext
has a paragraph delimiter - a blank line. To truly recognize
sentences, you basically have to engage in natural language
processing, though you can probably get it right 90% of the time
without too much effort.
And recognizing what's going on when a sentence changes *and* is moved
from one paragraph to another requires an even greater level of natural
language understanding. Again, though, you can probably get it right
most of the time without too much effort.
Wikitext actually makes it easier for the most part, as you can use
tricks such as the fact that the periods in [[I.M. Someone]] don't
represent sentence delimiters, since they are contained in square
brackets. But not all periods which occur in the middle of a sentence
are contained in square brackets, and not all sentences end with a
period.
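A minimal illustration of the square-bracket trick (a hypothetical sketch; real segmentation needs natural language processing to handle abbreviations and other punctuation):

```python
import re

def split_sentences(wikitext):
    """Naive sentence splitter using one wikitext trick: periods
    inside [[...]] links never end a sentence. Sketch only -- it
    still mishandles abbreviations outside links, quotes, etc."""
    # Mask link interiors so their periods are invisible to the splitter.
    masked = re.sub(r"\[\[.*?\]\]",
                    lambda m: m.group(0).replace(".", "\x00"),
                    wikitext)
    # Split after sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Restore the masked periods.
    return [p.replace("\x00", ".") for p in parts if p]

text = "See [[I.M. Someone]] for details. Another sentence follows."
print(split_sentences(text))
# ['See [[I.M. Someone]] for details.', 'Another sentence follows.']
```

The link periods survive intact while the genuine sentence boundary is found; feeding the same text to a splitter that ignores the brackets would produce four fragments instead of two.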
I'd say "difficult but doable" is quite accurate, although with the
caveat that even the state of the art tools available today are
probably going to make mistakes that would be obvious to a human. I'm
sure there are tools for this, and there are probably some decent ones
that are open source. But it's not as simple as just adding an index.
------------------------------
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
End of Wikitech-l Digest, Vol 90, Issue 33
******************************************