Main changes:
1. All characters are now automatically checked for whether they require AMS, so dynamic loading of AMS should be more robust.
2. It is possible to use UTF-8 inside \mbox{}, not only ASCII. I managed to render Polish characters, but it seems that more work is needed to make LaTeX work with CJK.
Unicode packages are not available in the default teTeX installation, and have to be downloaded separately from http://www.unruh.de/DniQ/latex/unicode/
Unicode packages (ucs and inputenc:utf8) are loaded only if there is an \mbox containing non-ASCII characters.
I'm not really happy about this solution, but it should at least work for English, Polish and Esperanto, and \mbox is not the most important feature.
Btw, which Wikipedias on the new software haven't moved to UTF-8? It would be so much easier if everyone used the same encoding.
On Thu, 2002-12-05 at 19:21, Tomasz Wegrzanowski wrote:
Btw, which Wikipedias on the new software haven't moved to UTF-8? It would be so much easier if everyone used the same encoding.
utf-8:  cs* el* eo ja* ko* pl ru* tr* zh* meta
latin1: da de en es fr ms* nl sv sep11*
(* not used much yet; few or no contributors, articles)
I'm all in favor of moving to full utf-8, but there are still problems with popular browsers that break on UTF-8 chars. (See the Help page on meta.) We need a transparent conversion to/from latin1 to allow known bad browsers (or not known good) to work with something safer, eg latin1 + numeric references.
-- brion vibber (brion @ pobox.com)
On Thu, Dec 05, 2002 at 08:04:08PM -0800, Brion Vibber wrote:
utf-8:  cs* el* eo ja* ko* pl ru* tr* zh* meta
latin1: da de en es fr ms* nl sv sep11*
Do meta and sep11 belong in the list of languages, or in the list of namespaces? This question is in reference to recent discussion surrounding the page
http://www.wikipedia.org/wiki/User:Clutch/mod_wiki
Jonathan
On Thu, Dec 05, 2002 at 08:41:21PM -0800, Jonathan Walther wrote:
On Thu, Dec 05, 2002 at 08:04:08PM -0800, Brion Vibber wrote:
utf-8:  cs* el* eo ja* ko* pl ru* tr* zh* meta
latin1: da de en es fr ms* nl sv sep11*
Do meta and sep11 belong in the list of languages, or in the list of namespaces?
meta and sep11 are similar to "languages". They are not namespaces; they have a db of their own and a domain of their own. Links like [[m:bla bla bla]] are treated differently from language links. They are rendered inline, similar to a [http://www.blabla.com/] link.
JeLuF
On Sat, Dec 07, 2002 at 12:17:58AM +0100, Jens Frank wrote:
On Thu, Dec 05, 2002 at 08:41:21PM -0800, Jonathan Walther wrote:
On Thu, Dec 05, 2002 at 08:04:08PM -0800, Brion Vibber wrote:
utf-8:  cs* el* eo ja* ko* pl ru* tr* zh* meta
latin1: da de en es fr ms* nl sv sep11*
Do meta and sep11 belong in the list of languages, or in the list of namespaces?
meta and sep11 are similar to "languages". They are not namespaces; they have a db of their own and a domain of their own. Links like [[m:bla bla bla]] are treated differently from language links. They are rendered inline, similar to a [http://www.blabla.com/] link.
Can you explain that in more detail? What do you mean, "rendered inline"? From where I'm standing, if all the databases were folded into one wiki database, there would be no problem.
And I still haven't heard any suggestions for a different syntax for specifying an external wiki; it doesn't seem right to overload the namespace syntax to handle that function.
Jonathan
On Sat, 2002-12-07 at 00:11, Jonathan Walther wrote:
On Sat, Dec 07, 2002 at 12:17:58AM +0100, Jens Frank wrote:
meta and sep11 are similar to "languages". They are not namespaces; they have a db of their own and a domain of their own. Links like [[m:bla bla bla]] are treated differently from language links. They are rendered inline, similar to a [http://www.blabla.com/] link.
Can you explain that in more detail? What do you mean, "rendered inline"?
In the text instead of spirited up to a list at the top of the screen. eg the wikicode:
This [[m:Linking gone mad at meta-wikipedia|link to meta]] is in the text. But [[fr:Liens, liens sans fin|this French one]] isn't.
renders as:
Other languages: <French>
Some Page
from Wikipedia, the free encyclopedia
This <link to meta> is in the text. But isn't.
unless you're in a talk page or on meta or sep11, in which case you get:
Talk:Some Page
from Wikipedia, the free encyclopedia
This <link to meta> is in the text. But <this French one> isn't.
From where I'm standing, if all the databases were folded into one wiki database, there would be no problem.
And I still haven't heard any suggestions for a different syntax for specifying an external wiki; it doesn't seem right to overload the namespace syntax to handle that function.
WikiName:ArticleTitle is the standard interwiki link syntax; the language codes are just abbreviated forms of the wiki name.
That local namespaces and interwikis both involve some text followed by a colon shouldn't bug us too badly... after all, URLs start the same way! (And we really _should_ allow URL links in the same double-bracket style as wikilinks; newbies try it naturally and get confused when it doesn't work.)
-- brion vibber (brion @ pobox.com)
On Sat, Dec 07, 2002 at 12:47:05AM -0800, Brion Vibber wrote:
Can you explain that in more detail? What do you mean, "rendered inline"?
In the text instead of spirited up to a list at the top of the screen. eg the wikicode:
This [[m:Linking gone mad at meta-wikipedia|link to meta]] is in the text. But [[fr:Liens, liens sans fin|this French one]] isn't.
Ok, that makes sense. I had intended for language links to also be inline, unless they were at the very top of the page (somewhat like with #REDIRECT). Any objections to doing it that way?
WikiName:ArticleTitle is the standard interwiki link syntax; the language codes are just abbreviated forms of the wiki name.
How so? In that case, maybe we should change the syntax from lang:wiki:title to wiki:lang:title? Much easier conceptually, because then you can leave out the language and it will make sense if it's a unilingual wiki.
That local namespaces and interwikis both involve some text followed by a colon shouldn't bug us too badly... after all, URLs start the same way! (And we really _should_ allow URL links in the same double-bracket style as wikilinks; newbies try it naturally and get confused when it doesn't work.)
Yes, I am wondering why we use [[]] for wiki links, and [] for URLs proper. Actually, I am not wondering. It is a useful syntactic aid to parsing. If only the pipe '|' syntax for [] was the same as [[]], that would be great. Any objections?
Jonathan
On Sat, 2002-12-07 at 00:47, Jonathan Walther wrote: (language links & inline appearance)
Ok, that makes sense. I had intended for language links to also be inline, unless they were at the very top of the page (somewhat like with #REDIRECT). Any objections to doing it that way?
Sounds nice; a link vanishing from the middle of the text is confusing, but the edges are where 'magic' can be reasonably expected to take place. Note that a few pages have the lang links stuck at the bottom instead of the top, and whitespace is common.
WikiName:ArticleTitle is the standard interwiki link syntax; the language codes are just abbreviated forms of the wiki name.
How so?
"de:Deutschland" is shorter and more site-specific than "DeWikiPedia:Deutschland", but functionally the same (seeing the special prefix, we use a stored URL associated with the prefix and slap the remainder of the title onto the end of the URL as an external link, instead of treating it as an internal wiki link - ie checking for page existence, linking to our own server, using an edit link form if the page doesn't exist).
That we sometimes move the link to the top of the screen and name it for the language is just a UI matter.
Yes, I am wondering why we use [[]] for wiki links, and [] for URLs proper.
Historical cruft; our wikicode syntax is largely inherited from UseModWiki. I think the freelink syntax ([[foo|bar]]) was added after the single-bracket named URL links; the space isn't usable as the separator for free links since they can contain spaces in the link portion.
Actually, I am not wondering. It is a useful syntactic aid to parsing. If only the pipe '|' syntax for [] was the same as [[]], that would be great. Any objections?
The space should be maintained for compatibility, but allowing the pipe to work as expected would be a very good thing. At that point, using double brackets would give a working link with an extra pair of brackets around it; though some find that annoying. I would prefer having the double-brackets work 'as expected'.
-- brion vibber (brion @ pobox.com)
On Sat, Dec 07, 2002 at 12:11:46AM -0800, Jonathan Walther wrote:
On Sat, Dec 07, 2002 at 12:17:58AM +0100, Jens Frank wrote:
On Thu, Dec 05, 2002 at 08:41:21PM -0800, Jonathan Walther wrote:
rendered inline, similar to a [http://www.blabla.com/] link.
Can you explain that in more detail? What do you mean, "rendered inline"?
A language link is added at the top of the page in a list of "related articles". An inline link is placed in the article, exactly where an author has put it.
From where I'm standing, if all the databases were folded into one wiki database, there would be no problem.
Several other language teams have split off over time; some have merged with the main project again. If we join the dbs, some will feel that they won't be free to choose anymore.
And I still haven't heard any suggestions for a different syntax for specifying an external wiki; it doesn't seem right to overload the namespace syntax to handle that function.
Think of usability. Having similar things done in a similar way is very convenient for users. Perhaps it's not that convenient for a programmer, but from a user's point of view, what is he doing? Putting a link. To put a link, start with [[, so what do I want to link to, yes, meta, so put [[m: . That's simply easy.
While we're at it, couldn't we change the external link syntax from [http://www.bla.com/] to [[http://www.bla.com/]]? Helga is always doing it wrong, and from a usability point of view she is the benchmark.
Regards,
JeLuF
On Sat, 2002-12-07 at 01:25, Jens Frank wrote:
On Sat, Dec 07, 2002 at 12:11:46AM -0800, Jonathan Walther wrote:
From where I'm standing, if all the databases were folded into one wiki database, there would be no problem.
Several other language teams have split off over time; some have merged with the main project again. If we join the dbs, some will feel that they won't be free to choose anymore.
Please comment and encourage others to comment at http://meta.wikipedia.org/wiki/Thoughts_on_language_integration ! I want to hear from people who have concerns *so we can address them*.
It would be trivial to make partial database dumps for each language available for download, so if someone wants to exercise their right to fork and only wants to take material from one language, that's easy enough to accommodate.
Rejoining later wouldn't be more difficult than it is now; if every single contributor for that language had left for the forked wiki, we could simply replace every changed article by importing their db dump. If some people remained at Wikipedia while others worked on a forked project, we would have to integrate them.
Of course, forking isn't the only thing you can do -- getting involved and becoming part of the Wikipedia development and maintenance team to improve it where it's got problems is always an option.
-- brion vibber (brion @ pobox.com)
On Sat, Dec 07, 2002 at 03:12:59AM -0800, Brion Vibber wrote:
On Sat, 2002-12-07 at 01:25, Jens Frank wrote:
On Sat, Dec 07, 2002 at 12:11:46AM -0800, Jonathan Walther wrote:
From where I'm standing, if all the databases were folded into one wiki database, there would be no problem.
Several other language teams have split off over time; some have merged with the main project again. If we join the dbs, some will feel that they won't be free to choose anymore.
Please comment and encourage others to comment at http://meta.wikipedia.org/wiki/Thoughts_on_language_integration ! I want to hear from people who have concerns *so we can address them*.
I have changed my mind. Languages are definitely namespaces. Talk about an English article can go in the en_Talk namespace. Users, Meta, etc, can all be non-lingual.
I propose the following:
CREATE TABLE namespaces ( ns integer, name text, external boolean, urlprefix text );
INSERT INTO namespaces VALUES ( 1, 'en', false, 'http://www.wikipedia.org/en' );
INSERT INTO namespaces VALUES ( 2, 'en_Talk', false, 'http://www.wikipedia.org/en/Talk' );
INSERT INTO namespaces VALUES ( 3, 'Images', false, 'http://www.wikipedia.org/Images' );
INSERT INTO namespaces VALUES ( 4, 'fr', true, 'http://fr.wikipedia.org' );
I also propose only a single level of namespaces; none of this en:Image:foo stuff. If the image really needs to be "English" we can have an en_Image namespace, yielding en_Image:foo.
The external flag tells us that the link is "external"... not contained within the database, not dealt with by us. As official Wikipedias split off or come back into the fold, we can just change the flag in the namespaces table, and it will all just work.
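For illustration only (not meant as the final query), rendering a prefixed link under this scheme would be a single lookup:
-- Decide how to handle a [[fr:Something]] link:
SELECT external, urlprefix FROM namespaces WHERE name = 'fr';
-- external = true  -> build an external URL from urlprefix plus the title
-- external = false -> treat it as an ordinary internal page in that namespace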
Anyone see anything essential missing?
Jonathan
Clutch wrote in part:
I have changed my mind. Languages are definitely namespaces. Talk about an English article can go in the en_Talk namespace. Users, Meta, etc, can all be non-lingual.
I also propose only a single level of namespaces; none of this en:Image:foo stuff. If the image really needs to be "English" we can have an en_Image namespace, yielding en_Image:foo.
Anyone see anything essential missing?
What's missing for me is why. Wikipedia pages now have a 3-level name structure: language (or meta or sep11), namespace (possibly empty), and title. What's wrong with this?
-- Toby
On Sun, Dec 08, 2002 at 04:24:34PM -0800, Toby Bartels wrote:
I have changed my mind. Languages are definitely namespaces. Talk about an English article can go in the en_Talk namespace. Users, Meta, etc, can all be non-lingual.
I also propose only a single level of namespaces; none of this en:Image:foo stuff. If the image really needs to be "English" we can have an en_Image namespace, yielding en_Image:foo.
Wikipedia pages now have a 3-level name structure: language (or meta or sep11), namespace (possibly empty), and title. What's wrong with this?
What's wrong is that it is unnecessary complexity. Can you give any reason why it has to be 3 levels, instead of 2? Having the three levels makes the coding tricky when you have pages that only need the two and are language independent.
Jonathan
On Sun, 2002-12-08 at 16:27, Jonathan Walther wrote:
On Sun, Dec 08, 2002 at 04:24:34PM -0800, Toby Bartels wrote:
I have changed my mind. Languages are definitely namespaces. Talk about an English article can go in the en_Talk namespace. Users, Meta, etc, can all be non-lingual.
I would disagree. User pages should be available per-language as they are now. The namespace names also should be localizable, and must not be explicitly and forever tied to some particular foreign language (ie English). Special page names are English-only at the moment, and that is a deficiency that should be repaired.
I also propose only a single level of namespaces; none of this en:Image:foo stuff. If the image really needs to be "English" we can have an en_Image namespace, yielding en_Image:foo.
That's not an "English image", it's an English *image description* page. Description pages are text, which is in some language.
Wikipedia pages now have a 3-level name structure: language (or meta or sep11), namespace (possibly empty), and title. What's wrong with this?
What's wrong is that it is unnecessary complexity. Can you give any reason why it has to be 3 levels, instead of 2? Having the three levels makes the coding tricky when you have pages that only need the two and are language independent.
Language has to be explicitly known so we can output Content-Language headers for browser config and search engine filtering; select the default user interface language to use, in some cases use a slightly different layout (right-to-left support?); select which bookseller links to use for ISBN links, etc.
If language and namespace are squished into one level, we would still need a language<->namespace map, both to get the language of a given page and to select pages in multiple namespaces that belong to a language. This would also make it a lot harder to track certain kinds of pages across languages (ie, all talk pages; all user pages; all article pages), since we'd need a map of namespace to namespace functionality... which pretty much brings us back to square one -- the namespace-language key and namespace-functionality key would be functionally identical to the distinct language and namespace keys. There may be some advantage to an extra level of indirection there, but I don't know what.
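To make that concrete (a sketch with invented names, not a proposal): squishing the two keys into one just moves the same pair into a lookup table we'd still have to consult:
-- Two explicit keys per page (roughly what we have now):
--   page ( language, namespace, title, ... )
-- One squished key, plus the map we'd still need:
CREATE TABLE sections ( section text, language text, role text );
INSERT INTO sections VALUES ( 'en_Talk', 'en', 'Talk' );
-- "all French pages" or "all talk pages" now needs a join through sections,
-- which is functionally the same pair of keys, one level removed.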
In a use where you only ever have one language/section, you'd just only ever _use_ one section. Just as you could run the current phase III software and only ever use pages in the "" namespace, it would sit in the background and you'd pay it no mind.
-- brion vibber (brion @ pobox.com)
Clutch wrote:
Toby Bartels wrote:
Wikipedia pages now have a 3-level name structure: language (or meta or sep11), namespace (possibly empty), and title. What's wrong with this?
What's wrong is that it is unnecessary complexity. Can you give any reason why it has to be 3 levels, instead of 2? Having the three levels makes the coding tricky when you have pages that only need the two and are language independent.
Because all 3 levels have substantially different meanings. I would argue that a really logical structure would have *4* levels: language, namespace, talk (a boolean), and title. But the current user interface is designed with 3 in mind.
-- Toby
On Mon, Dec 09, 2002 at 10:14:09AM -0800, Toby Bartels wrote:
Because all 3 levels have substantially different meanings. I would argue that a really logical structure would have *4* levels: language, namespace, talk (a boolean), and title. But the current user interface is designed with 3 in mind.
Well, since we are doing a redesign, why not have 4? If that floats your boat. Just figure out what changes would need to be made to the schema on http://www.wikipedia.org/wiki/User:Clutch/mod_wiki
Jonathan
On Mon, Dec 09, 2002 at 10:17:49AM -0800, Jonathan Walther wrote:
On Mon, Dec 09, 2002 at 10:14:09AM -0800, Toby Bartels wrote:
Because all 3 levels have substantially different meanings. I would argue that a really logical structure would have *4* levels: language, namespace, talk (a boolean), and title. But the current user interface is designed with 3 in mind.
Well, since we are doing a redesign, why not have 4? If that floats your boat. Just figure out what changes would need to be made to the schema on http://www.wikipedia.org/wiki/User:Clutch/mod_wiki
Stop now!
Why are we doing another redesign? Weren't the 2 we had so far enough?
Please make only evolutionary changes - fix only what's broken, add new features without breaking existing ones, fix performance problems and don't introduce new ones. This way everyone will be happy.
On Mon, Dec 09, 2002 at 07:45:02PM +0100, Tomasz Wegrzanowski wrote:
Why are we doing another redesign? Weren't the 2 we had so far enough?
Please make only evolutionary changes - fix only what's broken, add new features without breaking existing ones, fix performance problems and don't introduce new ones. This way everyone will be happy.
I like Wikipedia's version of what it is to be a "Wiki", and want to use it for many other projects, groups, and committees. For that purpose, I am doing the redesign, so I and others can easily deploy the Wikipedia software in a wide variety of contexts.
So, whether Wikipedia ever uses it or not, I am doing this. But I want to ensure that we COULD host the Wikipedia on it. And because it is written in C as a custom Apache module, with Postgres backend, I will be able to make it very fast and efficient.
If we switch to Postgres, to get the benefits of Postgres we must do a redesign of the table schema, at least. Did anyone notice that Postgres is now fully SQL92 compliant? And a redesign of the table schema means we can design it so that generating a regular wiki page only involves ONE query, instead of the current TWELVE. A long term redesign of the software could definitely be beneficial.
The better we design the future Wikipedia software, the easier it will be to write, maintain, and extend.
In the process of the redesign, we hope to make it so that current issues with the Wiki software are natural and easy to solve. Join in if you like.
http://www.wikipedia.org/wiki/User:Clutch/mod_wiki
Jonathan
On Mon, Dec 09, 2002 at 10:48:24AM -0800, Jonathan Walther wrote:
So, whether Wikipedia ever uses it or not, I am doing this. But I want to ensure that we COULD host the Wikipedia on it. And because it is written in C as a custom Apache module, with Postgres backend, I will be able to make it very fast and efficient.
C won't help any; the database is the bottleneck, not the PHP script.
If we switch to Postgres, to get the benefits of Postgres we must do a redesign of the table schema, at least. Did anyone notice that Postgres is now fully SQL92 compliant? And a redesign of the table schema means we can design it so that generating a regular wiki page only involves ONE query, instead of the current TWELVE. A long term redesign of the software could definitely be beneficial.
We have done it twice, without much benefit. I'm all for the evolutionary way.
The better we design the future Wikipedia software, the easier it will be to write, maintain, and extend.
In the process of the redesign, we hope to make it so that current issues with the Wiki software are natural and easy to solve. Join in if you like.
Most important current issues are:
* poor markup
* no media independence
* 15-or-so-pass parser instead of good hierarchical LALR parser
* slow database
* poor mirroring ability
* no dedicated offline client
I'm currently working on first two (well, three).
Rewriting it in C won't help with any of these issues.
Moving current script to Postgres is easy modulo FULLTEXT index. Better think what to do with that one.
Oh, and think about how you would implement database mirroring, as this is what we lost the most in the move from a filesystem database (rsync) to MySQL.
On Mon, Dec 09, 2002 at 08:22:42PM +0100, Tomasz Wegrzanowski wrote:
If we switch to Postgres, to get the benefits of Postgres we must do a redesign of the table schema, at least. Did anyone notice that Postgres is now fully SQL92 compliant? And a redesign of the table schema means we can design it so that generating a regular wiki page only involves ONE query, instead of the current TWELVE. A long term redesign of the software could definitely be beneficial.
We have done it twice, without much benefit. I'm all for the evolutionary way.
Yes, but we were limited by MySQL each time. Have you used Postgres or Oracle or Sybase before? I have a hard time expressing briefly how great SQL92 compliance is compared to the subset that MySQL supports. Using MySQL for this is walking the long way around the bay when you could paddle the Postgres canoe directly across it in a fraction of the time.
Most important current issues are:
- poor markup
- no media independence
- 15-or-so-pass parser instead of good hierarchical LALR parser
- slow database
- poor mirroring ability
- no dedicated offline client
I'm currently working on first two (well, three).
I agree and I would love to see your thoughts on all of these issues, and how best to solve them. The last two items should be fairly simple, if we require users to install Apache on their local boxes.
How can we improve markup? I too feel it might be lacking somehow.
What do you mean by media independence?
Rewriting it in C won't help with any of these issues.
By using C, we can use lex and yacc to make an excellent LALR parser
Postgres will solve the slow database issue.
Moving current script to Postgres is easy modulo FULLTEXT index. Better think what to do with that one.
No, Postgres really does require a full redesign to reap the benefits.
Oh, and think about how you would implement database mirroring, as this is what we lost the most in the move from a filesystem database (rsync) to MySQL.
Replication (database mirroring) comes with Postgres; we have a choice of the dbmirror and dbbalancer contributed modules. We can choose synchronous or asynchronous; I recommend asynchronous for speed.
Jonathan
On Mon, Dec 09, 2002 at 11:46:30AM -0800, Jonathan Walther wrote:
On Mon, Dec 09, 2002 at 08:22:42PM +0100, Tomasz Wegrzanowski wrote:
We have done it twice, without much benefit. I'm all for the evolutionary way.
Yes, but we were limited by MySQL each time. Have you used Postgres or Oracle or Sybase before?
Yes.
I have a hard time expressing briefly how great SQL92 compliance is compared to the subset that MySQL supports. Using MySQL for this is walking the long way around the bay when you could paddle the Postgres canoe directly across it in a fraction of the time.
I'm more concerned about row-level locking, really. Could you try to make the minimal number of changes necessary to make Wikipedia run on Postgres, so we can see how much better locking would help? Then we could make it even faster.
Most important current issues are:
- poor markup
- no media independence
- 15-or-so-pass parser instead of good hierarchical LALR parser
- slow database
- poor mirroring ability
- no dedicated offline client
I'm currently working on first two (well, three).
I agree and I would love to see your thoughts on all of these issues, and how best to solve them. The last two items should be fairly simple, if we require users to install Apache on their local boxes.
Not really. SQL databases are hard to mirror - nothing close to rsync comes for free with them.
How can we improve markup? I too feel it might be lacking somehow.
Now I'm implementing <math> tags, see all these "TeX, version X" threads on wikitech-l for details.
What do you mean by media independence?
Mainly being able to export pure-HTML versions and plaintext dict versions.
Rewriting it in C won't help with any of these issues.
By using C, we can use lex and yacc to make an excellent LALR parser
You would have a hard time, as C doesn't have good text processing or symbolic tree processing.
texvc is in OCaml now.
Postgres will solve the slow database issue.
Moving current script to Postgres is easy modulo FULLTEXT index. Better think what to do with that one.
No, Postgres really does require a full redesign to reap the benefits.
80% of the benefit usually comes with 20% of the effort, of course only if you start from the right 20%.
But how do you plan to implement searching in Postgres?
Oh, and think about how you would implement database mirroring, as this is what we lost the most in the move from a filesystem database (rsync) to MySQL.
Replication (database mirroring) comes with Postgres; we have a choice of the dbmirror and dbbalancer contributed modules. We can choose synchronous or asynchronous; I recommend asynchronous for speed.
How do they compare to rsync?
Jonathan Walther wrote:
I like Wikipedia's version of what it is to be a "Wiki", and want to use it for many other projects, groups, and committees. For that purpose, I am doing the redesign, so I and others can easily deploy the Wikipedia software in a wide variety of contexts.
There are other wiki software projects that aim at a broad range of uses. This one is for running Wikipedia. If others can use it too, fine. Focus should remain on Wikipedia.
So, whether Wikipedia ever uses it or not, I am doing this. But I want to ensure that we COULD host the Wikipedia on it. And because it is written in C as a custom Apache module, with Postgres backend, I will be able to make it very fast and efficient.
OK, I understand that, and though I don't have experience with Postgres, I agree about the C part. I wrote a C++ wiki parser as proof-of-principle once, and it was fast as hell on my local machine, compared to the PHP software (which was using cached pages!). It got lost in some system crash or Linux update, though ;-)
If we switch to Postgres, to get the benefits of Postgres we must do a redesign of the table schema, at least. Did anyone notice that Postgres is now fully SQL92 compliant? And a redesign of the table schema means we can design it so that generating a regular wiki page only involves ONE query, instead of the current TWELVE. A long term redesign of the software could definitely be beneficial.
Now *that* would really help!
Magnus
On Mon, 2002-12-09 at 10:48, Jonathan Walther wrote:
And a redesign of the table schema means we can design it so that generating a regular wiki page only involves ONE query, instead of the current TWELVE. A long term redesign of the software could definitely be beneficial.
Surely it would take two:
* fetch current page's contents, view count, restrictions, and last edited date
* check existence, size/type of all linked pages
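Roughly like this (table and column names below are only illustrative, not a schema proposal):
-- assuming something like: cur ( cur_namespace, cur_title, cur_text, cur_counter, cur_restrictions, cur_timestamp )
-- 1) the page being viewed
SELECT cur_text, cur_counter, cur_restrictions, cur_timestamp
  FROM cur WHERE cur_namespace = 0 AND cur_title = 'Some_Page';
-- 2) existence and size of everything it links to
SELECT cur_title, LENGTH(cur_text)
  FROM cur WHERE cur_namespace = 0
   AND cur_title IN ( 'First_Link', 'Second_Link', 'Third_Link' );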
-- brion vibber (brion @ pobox.com)
On Mon, Dec 09, 2002 at 01:53:19PM -0800, Brion Vibber wrote:
Surely it would take two:
* fetch current page's contents, view count, restrictions, and last edited date
* check existence, size/type of all linked pages
Add a third one: we need to know if the user has read his Talk page since it was last modified.
For the second one, don't we just need to check the existence and type of the linked pages? Why bother about the size?
We do also need to get the size of the primary page.
What do the "restrictions" on a page consist of? Whether it is readable or editable?
Still, that is a 2/3 savings.
Jonathan
On Tue, 2002-12-10 at 05:43, Jonathan Walther wrote:
For the second one, don't we just need to check the existence and type of the linked pages? Why bother about the size?
For the stub detector. (Optional.)
We do also need to get the size of the primary page.
What for? We parse it until we hit the null byte at the end.
What do the "restrictions" on a page consist of? Whether it is readable or editable?
Currently, a comma-separated list of string keys which a user's user_rights field must contain to be allowed to edit the page.
-- brion vibber (brion @ pobox.com)
On Tue, Dec 10, 2002 at 02:54:49PM -0800, Brion Vibber wrote:
On Tue, 2002-12-10 at 05:43, Jonathan Walther wrote:
For the second one, don't we just need to check the existence and type of the linked pages? Why bother about the size?
For the stub detector. (Optional.)
Ah. I was assuming that empty articles wouldn't exist unless they had been purposely cleared in an edit war. *cough*RK*cough*
We do also need to get the size of the primary page.
What for? We parse it until we hit the null byte at the end.
Even though we are storing the text in UTF-8? Is that safe?
What do the "restrictions" on a page consist of? Whether it is readable or editable?
Currently, a comma-separated list of string keys which a user's user_rights field must contain to be allowed to edit the page.
Ok, fair enough. It is the permissions scheme that I was hoping could be redesigned, but I agree some sort of permissions scheme is needed.
A note about searching: with Postgres we can just set up the UTF-8 collation on the article text fields, and then we can use LIKE clauses to search for text in the articles and titles, and the characters will be collated correctly, without any extra coding on our part. Another reason I am eager to use Postgres. The search engine will be trivial.
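Something like this (table and column names are placeholders, assuming the article text sits in a text column):
-- Naive title/text search; the collation handles the non-ASCII comparisons for us.
SELECT cur_title FROM cur
 WHERE cur_title LIKE '%Warszawa%' OR cur_text LIKE '%Warszawa%'
 ORDER BY cur_title;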
Jonathan
On Tue, 2002-12-10 at 14:51, Jonathan Walther wrote:
On Tue, Dec 10, 2002 at 02:54:49PM -0800, Brion Vibber wrote:
On Tue, 2002-12-10 at 05:43, Jonathan Walther wrote:
For the second one, don't we just need to check the existence and type of the linked pages? Why bother about the size?
For the stub detector. (Optional.)
Ah. I was assuming that empty articles wouldn't exist unless they had been purposely cleared in an edit war. *cough*RK*cough*
The stub detector marks links to articles below an (arbitrary, user-selectable) size with a distinct link class.
We do also need to get the size of the primary page.
What for? We parse it until we hit the null byte at the end.
Even though we are storing the text in UTF-8? Is that safe?
The only time a null byte ever appears in UTF-8 is to represent the null character. Since the null is generally considered déclassé in human-readable text, I think we're safe. :)
A note about searching: with Postgres we can just set up the UTF-8 collation on the article text fields, and then we can use LIKE clauses to search for text in the articles and titles, and the characters will be collated correctly, without any extra coding on our part. Another reason I am eager to use Postgres. The search engine will be trivial.
Will that be reasonably fast? How would we rank pages in the search?
-- brion vibber (brion @ pobox.com)
On Tue, Dec 10, 2002 at 03:16:51PM -0800, Brion Vibber wrote:
Ah. I was assuming that empty articles wouldn't exist unless they had been purposely cleared in an edit war. *cough*RK*cough*
The stub detector marks links to articles below an (arbitrary, user-selectable) size with a distinct link class.
Well, IF we are not going to store article sizes separately, then we need to retrieve the text of all articles linked to when we render an article page. And THAT means I can do this with a single query after all, instead of three.
A note about searching: with Postgres we can just set up the UTF-8 collation on the article text fields, and then we can use LIKE clauses to search for text in the articles and titles, and the characters will be collated correctly, without any extra coding on our part. Another reason I am eager to use Postgres. The search engine will be trivial.
Will that be reasonably fast? How would we rank pages in the search?
Yes, quite fast. We can rank pages however we want; we could use an ORDER BY clause to sort results by article title, or by timestamp, or anything.
How is searching currently done?
Jonathan
On Tue, 2002-12-10 at 15:15, Jonathan Walther wrote: (stub detector needs size of linked articles)
Well, IF we are not going to store article sizes separately, then we need to retrieve the text of all articles linked to when we render an article page. And THAT means I can do this with a single query after all, instead of three.
Is retrieving, transferring, and separately counting the length of the text of potentially hundreds or thousands of linked articles *really* more efficient than a second query for LENGTH(text), a value which the database should already know and thus does not have to spend time zipping through strings looking for null bytes?
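I.e. (column names again purely illustrative), compare:
-- pulling every linked article's full text across the wire and measuring it ourselves:
SELECT cur_text FROM cur WHERE cur_title IN ( 'First_Link', 'Second_Link' );
-- versus just asking the database for the lengths:
SELECT cur_title, LENGTH(cur_text) FROM cur WHERE cur_title IN ( 'First_Link', 'Second_Link' );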
(using LIKE for searches)
Will that be reasonably fast? How would we rank pages in the search?
Yes, quite fast. We can rank pages however we want; we could use an ORDER BY clause to sort results by article title, or by timestamp, or anything.
How is searching currently done?
http://www.mysql.com/doc/en/Fulltext_Search.html
For multi-word searches we AND and OR multiple MATCH/AGAINSTs together in one query, which may not be the best way to do it. MySQL 4.0 has boolean features built right in, but we're using the more stable 3.x.
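For example (simplified, with made-up table and column names; the real schema differs):
-- two-word AND search against a MySQL FULLTEXT index:
SELECT si_title FROM searchindex
 WHERE MATCH(si_text) AGAINST('dinosaur')
   AND MATCH(si_text) AGAINST('cretaceous');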
-- brion vibber (brion @ pobox.com)
On Tue, Dec 10, 2002 at 04:07:04PM -0800, Brion Vibber wrote:
Is retrieving, transferring, and separately counting the length of the text of potentially hundreds or thousands of linked articles *really* more efficient than a second query for LENGTH(text), a value which the database should already know and thus does not have to spend time zipping through strings looking for null bytes?
In fact, I was assuming the opposite; that doing a second query to find the size and existence of all the links would be much more efficient.
I also think we see significant savings (if not now, in the future when there are more pageviews than edits) of CPU by storing the article size, instead of calculating it every time we access it.
How is searching currently done?
http://www.mysql.com/doc/en/Fulltext_Search.html
For multi-word searches we AND and OR multiple MATCH/AGAINSTs together in one query, which may not be the best way to do it. MySQL 4.0 has boolean features built right in, but we're using the more stable 3.x.
Thank you for pointing that out, Brion. I did some more research; PostgreSQL has a contributed fulltext searching module which we can enable to get the same functionality. It will definitely be preferable to using a LIKE clause.
Jonathan
Clutch wrote:
Toby Bartels wrote:
I would argue that a really logical structure would have *4* levels: language, namespace, talk (a boolean), and title.
Well, since we are doing a redesign, why not have 4? If that floats your boat. Just figure out what changes would need to be made to the schema on http://www.wikipedia.org/wiki/User:Clutch/mod_wiki
Well, *you* are doing a redesign, not *we* ^_^. Only [[User:Clutch]] has modified [[en:User:Clutch/mod_wiki]] -- although Eloquence (Erik Moeller) and Brion Vibber have commented. I'd rather work on PediaWiki itself (which I don't have time to do yet). So you can implement my idea on a really logical structure if you like, but I won't try to figure out how to do it myself -- at least not now.
-- Toby