Send Wikitech-l mailing list submissions to
wikitech-l(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
or, via email, send a message with subject or body 'help' to
wikitech-l-request(a)lists.wikimedia.org
You can reach the person managing the list at
wikitech-l-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikitech-l digest..."
Today's Topics:
1. Category sorting and first letters (Tim Starling)
2. Re: From page history to sentence history (Bryan Tong Minh)
3. Re: From page history to sentence history (Alex Brollo)
4. WMDE Developer Meetup moved to May (Daniel Kinzler)
5. Re: WYSIFTW status (Aryeh Gregor)
6. Re: [Toolserver-l] WMDE Developer Meetup moved to May
(Daniel Kinzler)
7. Re: June 8th 2011, World IPv6 Day (Aryeh Gregor)
8. Re: WMDE Developer Meetup moved to May (Chad)
9. Re: From page history to sentence history (Aryeh Gregor)
10. Re: From page history to sentence history (Anthony)
----------------------------------------------------------------------
Message: 1
Date: Tue, 18 Jan 2011 02:00:09 +1100
From: Tim Starling <tstarling(a)wikimedia.org>
Subject: [Wikitech-l] Category sorting and first letters
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <ih1lhs$pmn$1(a)dough.gmane.org>
Content-Type: text/plain; charset=UTF-8
In r80443 I added a feature allowing categories to be sorted using the
Unicode Collation Algorithm (UCA). I wanted to briefly talk about the
potential user impact, the design choices and the caveats.
Sorting was the easy part. The hard part was providing a "first
letter" concept which would be reasonably sane. The idea I came up
with was to compile a list of first letters, themselves sorted using
the UCA. Then the "first letter" of a given string is the nearest
letter in the list which sorts above the string.
For instance if you have letters A, B, C, and a string Aardvark, if
you sort them you get:
A
Aardvark
B
C
So we know that A is the first letter of Aardvark because Aardvark
sorts immediately below A. This algorithm gives us a number of nice
properties:
* It automatically drops accents, since accented letters sort the same
as unaccented letters (at the primary level). Same with case
differences, hiragana/katakana, etc.
* You can work out the initial Jamo of a Hangul syllable character by
just omitting the composed syllables from the "first letter" list.
Previously this was done with a special-case hack in
Language::firstChar().
* Vowel reordering in Thai and Lao is automatically supported.
So "??" sorts under heading "?" and "??" sorts under
heading "?".
* The collation can be expanded to support all sorts of other crazy
features, and the first letter feature will keep working in a sane
way. For instance, you could have an English collation which removed
"the" from the start of a title.
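The lookup described above (nearest letter in the sorted list that sorts at or below the string) can be sketched with a binary search. This is only an illustrative sketch: plain string comparison stands in for UCA primary-strength comparison, so accent and case folding are not modelled here.

```python
import bisect

def first_letter(title, letters):
    # letters: pre-sorted list of candidate header letters.
    # The header for a title is the nearest letter that sorts at or
    # below it. NOTE: plain string comparison stands in for the UCA
    # primary-strength comparison, so accent/case folding is not shown.
    i = bisect.bisect_right(letters, title)
    return letters[i - 1] if i > 0 else None

letters = ["A", "B", "C"]
print(first_letter("Aardvark", letters))  # "A": Aardvark sorts between A and B
print(first_letter("Coyote", letters))    # "C"
```

In a real implementation the comparison would be done on UCA sort keys, which is what makes accented, case-variant, and kana-variant titles land under the expected heading for free.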
I compiled a list of 14,742 suitable header characters, identified by
processing various Unicode data files. That list probably still needs
lots of tweaks.
There is a down side to this scheme. The default UCA table gives all
characters with a similar logical function to the digits 0-9 the same
primary sort order as the corresponding ASCII digits. So a page like
[[????]] on the Bihari Wikipedia will sort under a heading of "1"
instead of "?". There may be other instances of accidental cultural
imperialism. However, this can be fixed by compiling
language-dependent lists of header characters.
The UCA default table is not meant to sort any language correctly,
it's just a compromise collation. Support for language-specific
collations can easily be added. Whether we get language-specific
collations or not, I'd like to think about enabling this feature on
Wikimedia.
The most glaring omission from the UCA default tables is sensible
sorting of the unified Han.
In a Chinese context, there's an obvious way to sort characters, and
that's by their order in the KangXi dictionary. The Unihan database
gives such an ordering, and it's used within code blocks. But it's not
used between code blocks. So if you sort by code point, all the Han
characters that aren't in the U+4E00 to U+9FFF block will sort
incorrectly. That's what the default UCA does, with a few minor
exceptions.
In a Japanese context, the way to sort ideographic characters is to
convert them to phonetic hiragana and then to sort the resulting
string. I don't know if there is any free software for doing this. On
the Japanese Wikipedia, they achieve the same result by manually
setting the sort key of every page to be the hiragana version of the
title.
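For reference, that manual convention goes through MediaWiki's sort-key mechanism: on a page such as [[東京都]] (Tokyo), editors add something like

```
{{DEFAULTSORT:とうきょうと}}
```

so the page sorts, and gets its category heading, as if its title were the hiragana reading. (The exact sort key shown here is illustrative.)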
There's lots of room here for other people to get involved, especially
if you know a language other than English.
-- Tim Starling
------------------------------
Message: 2
Date: Mon, 17 Jan 2011 16:29:58 +0100
From: Bryan Tong Minh <bryan.tongminh(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTi=w=6we2xngMMNikuFfMTH8KRtiVzXRSibJU-pX(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 3:49 PM, Anthony <wikimail(a)inbox.org> wrote:
How would you define a particular sentence, paragraph or section of an
article? The difficulty of the solution lies in answering that
question.
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Bryan
------------------------------
Message: 3
Date: Mon, 17 Jan 2011 16:40:28 +0100
From: Alex Brollo <alex.brollo(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTi=WhAZ1d5ty9hbkdD-7LkfSd_Fy0VtEvjxAdPQn(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
2011/1/17 Bryan Tong Minh <bryan.tongminh(a)gmail.com>
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking
at diffs between pages, I had simply assumed that only the changed paragraphs
were stored, so that a page was rebuilt from accumulated diff segments. I had
no idea how this could be done, but it was all "magic"!
Alex
------------------------------
Message: 4
Date: Mon, 17 Jan 2011 17:11:12 +0100
From: Daniel Kinzler <daniel(a)brightbyte.de>
Subject: [Wikitech-l] WMDE Developer Meetup moved to May
To: wikitech-l(a)lists.wikimedia.org, toolserver-l(a)lists.wikimedia.org,
MediaWiki announcements and site admin list
<mediawiki-l(a)lists.wikimedia.org>
Cc: Nicole Ebber <nicole.ebber(a)wikimedia.de>, Pavel Richter
<pavel.richter(a)wikimedia.de>
Message-ID: <4D346A20.107(a)brightbyte.de>
Content-Type: text/plain; charset=UTF-8
Hi all,
After some discussion, Wikimedia Germany decided not to hold a developer's
meet-up around the Chapter's conference in March. We just couldn't fit this in
nicely with the venue and the overall organization. Don't despair, though;
this is what we will do instead:
* There will be a hackathon hosted by Wikimedia Germany in (late) May, probably
in Berlin, but that's not decided yet. This will mostly be about hacking, with a
strong focus on GLAM-related stuff. There will be little in terms of presentations.
* There will be the hacking days attached to Wikimania in Haifa, August 3/4.
I'm in charge of setting up the program for that, and I'll try to make it a nice
mix of discussing technology and actually hacking. I would also like to have a
get-together with techies and chapter folks at some point during Wikimania.
I hope that this way, we can give the hacking events the attention they deserve.
Let me know what you think.
-- daniel
------------------------------
Message: 5
Date: Mon, 17 Jan 2011 11:31:27 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] WYSIFTW status
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTikudZhXBHndkeHEwsUqHvCqBZ2VESTKM7xoZTn2(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
There is the question of what browsers/versions
to test for. Should I
invest large amounts of time optimising performance in Firefox 3, when
FF4 will probably be released before WYSIFTW, and everyone and their
cousin upgrades?
Design for only the fastest browsers. Other browsers could always
just be dropped back to the old-fashioned editor.
------------------------------
Message: 6
Date: Mon, 17 Jan 2011 17:39:31 +0100
From: Daniel Kinzler <daniel(a)brightbyte.de>
Subject: Re: [Wikitech-l] [Toolserver-l] WMDE Developer Meetup moved
to May
To: toolserver-l(a)lists.wikimedia.org
Cc: MediaWiki announcements and site admin list
<mediawiki-l(a)lists.wikimedia.org>, wikitech-l(a)lists.wikimedia.org,
Asaf Bartov <asaf.bartov(a)gmail.com>, Pavel Richter
<pavel.richter(a)wikimedia.de>, Nicole Ebber <nicole.ebber(a)wikimedia.de>
Message-ID: <4D3470C3.4040304(a)brightbyte.de>
Content-Type: text/plain; charset=ISO-8859-1
On 17.01.2011 17:14, Asaf Bartov wrote:
Correction: Haifa Hacking Days are to be held
August 2nd-3rd.
Wikimania itself will be Aug 4th-6th.
Gah! Thanks Asaf.
There I went and looked it up, and then wrote the wrong thing into the email.
Curses.
-- daniel
------------------------------
Message: 7
Date: Mon, 17 Jan 2011 11:44:28 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] June 8th 2011, World IPv6 Day
To: Happy-melon <happy-melon(a)live.com>, Wikimedia developers
<wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTikk20OAKv-vreinxD-oBmfnzLbo97=xROQebpDX(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Sun, Jan 16, 2011 at 7:12 PM, Happy-melon <happy-melon(a)live.com> wrote:
I don't entirely understand the point of this. The plan seems to be """get
a large enough fraction of 'the internet' to make a change which breaks for
some people all at the same time, so that those people get angry with the
ISPs that haven't got off their arses to fix said breakage, rather than
angry with the broken sites""", which is fair enough.
No, the point is to test what happens if IPv6 is supported on a large
scale. It's known from small-scale testing that this will break
things for some small percentage of users, but no one's sure what the
consequences are of switching this on fully for everyone.
But AFAICT, the breakage won't occur if your connection can't 'do' IPv6,
but only if your connection can't 'do' both IPv4 *and* IPv6 on the same
site at the same time. Surely that's not actually the problem that we need
to solve if we're to be able to migrate smoothly onto IPv6? When the IPv4
addresses run out, we need to be able to start setting up websites which
are *only* v6, surely?
There are many more clients in the world than servers, and servers
have always been able to get dedicated IPv4 addresses much more easily
than clients. A server Internet connection in America will typically
come with as many IPv4 addresses as you need, while you usually can't
get a dedicated residential IP address unless you pay extra. (And
America has more IP addresses allocated per capita than anywhere else
in the world, since it originally developed the Internet.)
So as IPv4 addresses become scarcer, the pressure to use IPv6 only
will fall mostly on residential users. Clients with only an IPv6
address will only be able to get direct connections to IPv6-enabled
servers. The way servers are supposed to do this is serve both A and
AAAA records for the same domain, so IPv4 clients use the A record and
IPv6 clients use the AAAA record.
Unfortunately, someone at some point decided that if the client
supports both IPv4 and IPv6, and the server publishes both A and AAAA
records, the client should connect via IPv6. In practice, almost no
sites use IPv6, so the infrastructure is much less well-tested.
Clients that think they have IPv6 connections might actually have the
connection eaten by a middlebox, or just be slower or less reliable.
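That client-side preference can be sketched as follows. This is a hypothetical sketch with illustrative addresses; a real stack follows the address-selection rules of RFC 3484 rather than a hard-coded preference.

```python
import socket

def pick_address(records, prefer_ipv6=True):
    """Given resolved (family, address) records for a domain that
    publishes both A and AAAA, pick one the way a dual-stack client
    classically does: prefer the AAAA (IPv6) result when present,
    else fall back to A. Sketch only; real stacks apply RFC 3484
    address selection."""
    order = ([socket.AF_INET6, socket.AF_INET] if prefer_ipv6
             else [socket.AF_INET, socket.AF_INET6])
    for family in order:
        for fam, addr in records:
            if fam == family:
                return addr
    return None

# e.g. records as getaddrinfo() might return for a dual-published site
# (addresses are illustrative):
records = [(socket.AF_INET, "208.80.154.224"),
           (socket.AF_INET6, "2620:0:861:ed1a::1")]
print(pick_address(records))  # the IPv6 address is chosen
```

The problem described above is precisely this branch: once the AAAA exists, dual-stack clients take the IPv6 path even when that path is broken or slow.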
So sites don't turn on the AAAA records in practice because it
degrades service for clients with IPv6 connections, which means the
servers aren't accessible to IPv6-only clients without workarounds.
IPv6 day is an attempt to see what happens if major sites publish AAAA
records for a while. Stuff will break, but hopefully not too
horribly, and it will give both site operators and ISPs the chance to
analyze what's wrong with their IPv6 support and what they can do to
fix it. This is a step toward major sites publishing AAAA records all
the time, which is necessary to support IPv6-only clients.
Something like that, anyway. I'm hardly an expert on these things.
------------------------------
Message: 8
Date: Mon, 17 Jan 2011 11:45:33 -0500
From: Chad <innocentkiller(a)gmail.com>
Subject: Re: [Wikitech-l] WMDE Developer Meetup moved to May
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Cc: toolserver-l(a)lists.wikimedia.org, MediaWiki announcements and site
admin list <mediawiki-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTim3Q5CS20O=CRVo0A2z7nNbqftrhaUFFgvBq2+g(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 11:11 AM, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
* There will be a hackathon hosted by Wikimedia
Germany in (late) May, probably
in Berlin, but that's not decided yet. This will mostly be about hacking, with a
strong focus on GLAM-related stuff. There will be little in terms of presentations.
Late May? That's actually *really* awesome. Now I don't have
to miss school to come :D
-Chad
------------------------------
Message: 9
Date: Mon, 17 Jan 2011 11:47:35 -0500
From: Aryeh Gregor <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTinBdUX_v4d0gvxzm=BF_LE+1aQrMmjhk8xsvFE8(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Mon, Jan 17, 2011 at 5:55 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:
Before I dig a little more into wiki mysteries, I was absolutely sure that
wiki articles were stored in small pieces (paragraphs?) so that a small
edit to a very long page would take exactly the same disk space as a small
edit to a short page. But I soon discovered that things are
different. :-)
Wikimedia stores revision text using delta compression, so this is
basically what happens. The size of the edit is what determines the
size of the stored diff, not the size of the page. (I don't know how
this works in detail, though.) IIRC, default MediaWiki doesn't work
this way.
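The scaling claim can be illustrated with a line-based diff. This is only a rough stand-in for what Wikimedia actually does (its storage scheme is different); it just shows that the delta tracks the size of the edit, not the size of the page.

```python
import difflib

def delta_size(old, new):
    """Characters in a line-based unified diff between two revisions --
    a rough stand-in for delta storage, illustrating only the scaling."""
    diff = difflib.unified_diff(old.splitlines(True), new.splitlines(True))
    return sum(len(d) for d in diff)

# A 5000-line page and the same page with a single line changed:
long_page = "".join(f"line {i}\n" for i in range(5000))
small_edit = long_page.replace("line 2500\n", "edited 2500\n")
# The delta stays small regardless of how long the page is:
print(delta_size(long_page, small_edit))
```

A one-line edit produces a diff of a few hundred bytes whether the page is fifty lines or fifty thousand, which is the behaviour Alex expected.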
------------------------------
Message: 10
Date: Mon, 17 Jan 2011 12:41:22 -0500
From: Anthony <wikimail(a)inbox.org>
Subject: Re: [Wikitech-l] From page history to sentence history
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<AANLkTinfD+PEoAWN1T4XyZaeCwPO1_NeXm0EoDgLjzoH(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:
2011/1/17 Bryan Tong Minh
<bryan.tongminh(a)gmail.com>
Difficult, but doable. Jan-Paul's sentence-level editing tool is able
to make the distinction. It would perhaps be possible to use that as a
framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking
at diffs between pages, I had simply assumed that only the changed paragraphs
were stored, so that a page was rebuilt from accumulated diff segments. I had
no idea how this could be done, but it was all "magic"!
Paragraphs are much easier to recognize than sentences, as wikitext
has a paragraph delimiter - a blank line. To truly recognize
sentences, you basically have to engage in natural language
processing, though you can probably get it right 90% of the time
without too much effort.
And recognizing what's going on when a sentence changes *and* is moved
from one paragraph to another requires an even greater level of natural
language understanding. Again, though, you can probably get it right
most of the time without too much effort.
Wikitext actually makes it easier for the most part, as you can use
tricks such as the fact that the periods in [[I.M. Someone]] don't
represent sentence delimiters, since they are contained in square
brackets. But not all periods which occur in the middle of a sentence
are contained in square brackets, and not all sentences end with a
period.
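A minimal illustration of the square-bracket trick (a hypothetical sketch; real segmentation needs natural language processing to handle abbreviations and other punctuation):

```python
import re

def split_sentences(wikitext):
    """Naive sentence splitter using one wikitext trick: periods
    inside [[...]] links never end a sentence. Sketch only -- it
    still mishandles abbreviations outside links, quotes, etc."""
    # Mask link interiors so their periods are invisible to the splitter.
    masked = re.sub(r"\[\[.*?\]\]",
                    lambda m: m.group(0).replace(".", "\x00"),
                    wikitext)
    # Split after sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Restore the masked periods.
    return [p.replace("\x00", ".") for p in parts if p]

text = "See [[I.M. Someone]] for details. Another sentence follows."
print(split_sentences(text))
# ['See [[I.M. Someone]] for details.', 'Another sentence follows.']
```

The link periods survive intact while the genuine sentence boundary is found; feeding the same text to a splitter that ignores the brackets would produce four fragments instead of two.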
I'd say "difficult but doable" is quite accurate, although with the
caveat that even the state of the art tools available today are
probably going to make mistakes that would be obvious to a human. I'm
sure there are tools for this, and there are probably some decent ones
that are open source. But it's not as simple as just adding an index.
------------------------------
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
End of Wikitech-l Digest, Vol 90, Issue 33
******************************************