Now that my own proposal to re-write the entire code in Perl again was met with great resistance, I propose to at least create an entirely new database structure, and then adapt the current code to it. I have studied the current database structure and see the following rather severe problems with it:
* BLOBs that store article text are combined in the same table as meta-data (e.g. date, username of a change, change summary, minor flag, etc.). This is bad because variable-length fields like BLOBs negatively affect the performance of reading the table. Pages like the watchlist should not have to bother with variable-length data such as article text and would run a lot faster if they could get their data entirely from fixed-length rows.
* Currently, all user properties and preferences, as well as article properties, are columns in a table. Although this is not a problem in terms of DB-reading performance, introducing a new user preference or other enhancement involves adding a new column, and adding a column becomes extremely database-intensive as the database grows. LiveJournal uses a very clever system that will easily remedy this. One table, 'userproplist', stores the possible user properties (userprops), and another table, 'userprop', stores what user has what userprop with what value. This way all that is needed for adding a new userprop is adding a single row to the 'userproplist' table. The same would apply analogously to articleprops. Once we have that, the user table will hopefully remain very small (in terms of number of columns), so looking up a username (to name just an example) would be ridiculously efficient. (A rough sketch of these tables follows this list.)
* BLOBs that fulfill the same function (article text) are scattered across two tables (cur and old). This is bad because it means variable-length text has to be moved across tables on every edit. Very slow. Better to give every version/revision of an article (i.e. each item of article text) a permanent ID and use those IDs in the 'cur' and 'old' tables instead. Then have one large table, 'articletext' perhaps, mapping the IDs to their actual BLOBs (also sketched below). This eliminates the need to ever delete a BLOB (except perhaps when actually deleting an article with all its history, which is rare enough). Additionally, there isn't really a need for separate 'cur' and 'old' tables, especially when MemCacheD can take care of most recent versions.
* You are using a 'recentchanges' table which, I presume, gets also updated with every edit. This, I assume, is the idea behind it: it allows the 'Recent Changes' page to quickly grab the most recent changes without having to find them elsewhere in the DB. Contrary to intuition, this is a bad idea. It is always better to optimise for fewer DB writes even if it means a few more DB reads, because writes are so much slower. (I am so sure of this because LiveJournal has had this experience with their "Friends Page": grabbing entries from all the friends' journals all over the place in the DB is faster than updating a "hint table" with every newly created entry.)
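To make this a bit more concrete, here is a very rough sketch of what the userprop and revision-text tables could look like. None of the table or column names are final -- they are just invented placeholders to show the shape of the data, not the actual table-creation script:

  -- Sketch only: all names below are placeholders
  -- Generic user properties, LiveJournal-style (articleprops would work the same way)
  CREATE TABLE userproplist (
    upl_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    upl_name VARCHAR(255) NOT NULL,        -- e.g. 'skin', 'rows', 'cols'
    UNIQUE KEY (upl_name)
  );

  CREATE TABLE userprop (
    up_user  INT UNSIGNED NOT NULL,        -- user id
    up_prop  INT UNSIGNED NOT NULL,        -- userproplist.upl_id
    up_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (up_user, up_prop)
  );

  -- Every revision's text gets a permanent ID; the BLOB never has to move again
  CREATE TABLE articletext (
    at_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    at_text MEDIUMBLOB NOT NULL
  );

  -- Metadata only; pages like the watchlist never have to touch the BLOBs
  CREATE TABLE revision (
    rev_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    rev_article   INT UNSIGNED NOT NULL,   -- which article this revision belongs to
    rev_text      INT UNSIGNED NOT NULL,   -- articletext.at_id
    rev_user      INT UNSIGNED NOT NULL,
    rev_timestamp CHAR(14) NOT NULL,
    rev_minor     TINYINT NOT NULL DEFAULT 0,
    rev_comment   VARCHAR(255) NOT NULL DEFAULT '',
    KEY (rev_article, rev_timestamp),
    KEY (rev_timestamp)
  );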
In addition to these existing problems, there are of course things the database cannot currently handle but which were planned. While we're changing the DB, we could also add the following functionality to it:
* Store translated website text, so translators don't have to dig through PHP code and submit a file to the mailing list.
* A global table for bidirectional inter-wiki links. People should not have to add the same link to so many articles. In fact, taking this a step further, people should not even have to enter text like '[[fr:démocratie]]' into the article text when it's not part of the article text. There should be drop-downs underneath the big textbox listing languages, and little text boxes next to them for the target article name. (Rough sketches of both tables follow this list.)
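Again only as a sketch (invented names, nothing final), the two additions above could be as simple as:

  -- Sketch only: all names below are placeholders
  -- Translated interface text, editable through the wiki itself
  CREATE TABLE sitetext (
    st_lang  CHAR(10) NOT NULL,            -- 'en', 'fr', 'de', ...
    st_key   VARCHAR(255) NOT NULL,        -- e.g. 'editthispage'
    st_value TEXT NOT NULL,
    PRIMARY KEY (st_lang, st_key)
  );

  -- One global store for inter-language links; each pair is stored once
  -- and can be queried from either side, which makes the links bidirectional
  CREATE TABLE interlang (
    il_lang_a    CHAR(10) NOT NULL,
    il_article_a VARCHAR(255) NOT NULL,
    il_lang_b    CHAR(10) NOT NULL,
    il_article_b VARCHAR(255) NOT NULL,
    PRIMARY KEY (il_lang_a, il_article_a, il_lang_b),
    KEY (il_lang_b, il_article_b)
  );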
Are you all still convinced that adapting the current code to all these radical changes is easier than rewriting it all from scratch? :-)
Anyway. I'm tired.
I'm going to bed.
Good night.
Timwi
On Sat, 12 Jul 2003, Timwi wrote:
- BLOBs that store article text are combined in the same table as
meta-data (e.g. date, username of a change, change summary, minor flag, etc.). This is bad because variable-length fields like BLOBs negatively affect the performance of reading the table.
How much of a difference does this make when we're usually taking single rows found via an index?
One table, 'userproplist', stores the possible user properties (userprops), and another table, 'userprop', stores what user has what userprop with what value. This way all that is needed for adding a new userprop is adding a single row to the 'userproplist' table.
Potentially interesting...
- BLOBs that fulfill the same function (article text) are scattered
across two tables (cur and old).
Yeah, this is bad mojo. They should be combined in one 'revisions' table.
- You are using a 'recentchanges' table which, I presume, gets also
updated with every edit. This, I assume, is the idea behind it: it allows the 'Recent Changes' page to quickly grab the most recent changes without having to find them elsewhere in the DB.
There were two reasons for this: first, cur and old being separate and no good way in MySQL to read them together. Second, old MySQL couldn't do an inverse sort on an index, so sorting by reverse timestamp was, well, suckage. :) This is remedied already using a hackish 'inverse_timestamp' column, which itself is no longer really needed on MySQL 4, which can sort descending on indexes.
Recentchanges would be irrelevant with cur and old combined as they ought to be, plus a key in the revisions table for "first edit" so we know which to mark as "new".
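For illustration only, with a combined revisions table like the one Timwi sketched (placeholder column names, and assuming a plain index on the timestamp), Recent Changes would boil down to something like:

  SELECT rev_article, rev_user, rev_timestamp, rev_minor, rev_comment
  FROM revision
  ORDER BY rev_timestamp DESC    -- MySQL 4 can read the index in descending order
  LIMIT 50;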
- Store translated website text, so translators don't have to dig
through PHP code and submit a file to the mailing list.
We certainly could do this, though there are performance concerns whenever the idea is brought up. Caching the strings in shared memory may alleviate this.
There's been some talk of adapting the translation system we use at some of Esperanto cxe Interreto's sites, such as http://lernu.net/, so the interface and source-file-scanner doesn't have to be written from scratch. I haven't looked at the table structure used, but I imagine it's a fairly straightforward language-key-string triplet set.
- A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles.
There's an experimental table for interwiki links, but it's not entirely the best setup. It's questionable whether bidirectional is really right, though, as there's not always a 1:1 matchup between articles.
Are you all still convinced that adapting the current code to all these radical changes is easier than rewriting it all from scratch? :-)
Yes, certainly.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
On Sat, 12 Jul 2003, Timwi wrote:
- BLOBs that store article text are combined in the same table as
meta-data (e.g. date, username of a change, change summary, minor flag, etc.). This is bad because variable-length fields like BLOBs negatively affect the performance of reading the table.
How much of a difference does this make when we're usually taking single rows found via an index?
Hm. Good point. I haven't thought about it in that much detail -- I've only been "taught" this from experience on LJ. I think it has something to do with hard disk seeks and stuff -- very technical. Regardless, though, it should be clear that at least *updating* a table with variable-length fields is quite a lot more complex than updating one without.
- Store translated website text, so translators don't have to dig
through PHP code and submit a file to the mailing list.
We certainly could do this, though there are performance concerns whenever the idea is brought up. Caching the strings in shared memory may alleviate this.
It's worked perfectly on LJ ever since it was introduced. Additionally, I'm keeping MemCacheD in mind while thinking this all through. It would probably keep all the text in memory all the time.
There's been some talk of adapting the translation system we use at some of Esperanto cxe Interreto's sites, such as http://lernu.net/, so the interface and source-file-scanner doesn't have to be written from scratch.
Well, one of the problems I find with that system is that it is too easily vandalisable. If you vandalise a single wiki page, that doesn't matter too much; it can be reverted within a few minutes. But if you vandalise, say, the wording of the "Edit this page" link, it would affect everybody who visits Wikipedia during the minutes it takes someone to revert it.
I can see several ways of doing this:
* Restrict access to assorted people. Of course, this is un-wiki-like and not really much better than modifying LanguageXY.php.
* Make changes to the translatable strings not take effect until they have been kept unchanged for 24 hours. A trusted few could be given the privilege of changing the strings directly (in case, for example, vandalism goes unnoticed for 24 hours).
However, for this to work, changes in the translation system should appear on the Recent Changes pages along with the wiki pages. Perhaps they *should* be wiki pages in their own namespace (String:XYZ?). That, in turn, would deviate from the concept that lernu.net uses.
I haven't looked at the table structure used, but I imagine it's a fairly straightforward language-key-string triplet set.
I don't know about lernu.net either, but as for LJ, it's a little bit more complex than that. If you're interested, LJ's database is here: http://www.livejournal.com/doc/server/ljp.dbschema.ref.html The tables beginning with "ml_" are the ones pertaining to the translation system.
- A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles.
There's an experimental table for interwiki links, but it's not entirely the best setup. It's questionable whether bidirectional is really right, though, as there's not always a 1:1 matchup between articles.
Colour me ignorant, but why shouldn't there always be a 1:1 matchup? Maybe there isn't now, but articles certainly can (and perhaps they should) be changed to comply. Or do you know of a particularly striking example where it should not?
Oh, by the way. How would you prefer to do the conversion from the old database to the new? Myself, I thought perhaps we could have the software do this whenever an article is edited by a user. This way, we don't have to take Wikipedia down for the time it takes to convert the entire database.
Are you all still convinced that adapting the current code to all these radical changes is easier than rewriting it all from scratch? :-)
Yes, certainly.
Okay then. I'll take your word for it and learn some PHP. I'll create a preliminary SQL table-creation script for all this tomorrow. Which is really today, but I should really go to bed first...
Good night, Timwi
Timwi wrote:
...
- A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles. In fact, taking this a step further, people should not even have to enter text like '[[fr:démocratie]]' into the article text when it's not part of the article text. There should be drop-downs underneath the big textbox listing languages, and little text boxes next to them for the target article name.
Are you all still convinced that adapting the current code to all these radical changes is easier than rewriting it all from scratch? :-)
Anyway. I'm tired.
I'm going to bed.
Good night.
Timwi
While it seems to be a very good thing, in fact it can't really work, because the relation between articles on the different wikis is not a one-to-one relation. For example, on en: the Roman god and the Greek one share the same article, but on fr: this is not the case; they each have their own article. How can a global table resolve this sort of problem?
---- Luc Van Oostenryck aka [[Looxix]]
Luc Van Oostenryck wrote:
Timwi wrote:
- A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles. In fact, taking this a step further, people should not even have to enter text like '[[fr:démocratie]]' into the article text when it's not part of the article text. There should be drop-downs underneath the big textbox listing languages, and little text boxes next to them for the target article name.
While it seems to be a very good thing, in fact it can't really work, because the relation between articles on the different wikis is not a one-to-one relation. For example, on en: the Roman god and the Greek one share the same article, but on fr: this is not the case; they each have their own article. How can a global table resolve this sort of problem?
If the French Wikipedia has separate articles for them, it probably means several articles are warranted, in which case the English ones should be split to match. If not, the French ones should be combined. Or is there a good reason why French *needs* them separate when English doesn't?
If you have an answer to that, then use redirection pages. As I see it, French has two articles [[fr:Zeus]] and [[fr:Jupiter (mythologie)]], while English has [[en:Zeus]] and lets [[en:Jupiter (god)]] redirect to that. What's wrong with linking [[fr:Zeus]] <=> [[en:Zeus]] and [[fr:Jupiter (mythologie)]] <=> [[en:Jupiter (god)]]?
Timwi
On Sun, Jul 13, 2003 at 01:58:51AM +0200, Timwi wrote:
Luc Van Oostenryck wrote:
Timwi wrote:
- A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles. In fact, taking this a step further, people should not even have to enter text like '[[fr:démocratie]]' into the article text when it's not part of the article text. There should be drop-downs underneath the big textbox listing languages, and little text boxes next to them for the target article name.
While it seems to be a very good thing, in fact it can't really work, because the relation between articles on the different wikis is not a one-to-one relation. For example, on en: the Roman god and the Greek one share the same article, but on fr: this is not the case; they each have their own article. How can a global table resolve this sort of problem?
If the French Wikipedia has separate articles for them, it probably means several articles are warranted, in which case the English ones should be split to match. If not, the French ones should be combined. Or is there a good reason why French *needs* them separate when English doesn't?
If you have an answer to that, then use redirection pages. As I see it, French has two articles [[fr:Zeus]] and [[fr:Jupiter (mythologie)]], while English has [[en:Zeus]] and lets [[en:Jupiter (god)]] redirect to that. What's wrong with linking [[fr:Zeus]] <=> [[en:Zeus]] and [[fr:Jupiter (mythologie)]] <=> [[en:Jupiter (god)]]?
There's no reason why one language's Wikipedia should do something one way just because some other language's does it that way. Forcing 1:1 correspondence because it's easier to code with it, and generally dictating policy based on purely technical reasons, is very rarely a good idea.
I think we should allow named multiple links and just do it automatically.
Markup:
fr "Zeus" says [[en:Zeus]]
fr "Jupiter (mythologie)" says [[en:Zeus]]
pl "Zeus" [[en:Zeus]]
Links on en (if nothing is specified for French and Polish): [[fr:Zeus|French (Zeus)]], [[fr:Jupiter (mythologie)|French (Jupiter (mythologie))]], [[pl:Zeus|Polish]]
If we have a relation more complex than 1:N, we will link too much. For an imaginary example:
Polish: [[Historia - lata 1918-1939]], [[Historia - lata 1939-1945]]
English: [[History - years 1918-1941]], [[History - years 1941-1945]]
[[Historia - lata 1918-1939]] links to [[History - years 1918-1941]]
[[Historia - lata 1939-1945]] links to both English articles
[[History - years 1918-1941]] links to both Polish articles
[[History - years 1941-1945]] links to [[Historia - lata 1939-1945]]
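Schematically (the table and column names here are only invented for illustration), the link table just has to avoid making either side unique, and then N:M cases like the one above store fine:

  -- Sketch only: each row is one directed inter-language link.
  -- Nothing makes (language, title) unique on either side, so N:M is allowed.
  CREATE TABLE langlink (
    ll_from_lang  CHAR(10) NOT NULL,
    ll_from_title VARCHAR(255) NOT NULL,
    ll_to_lang    CHAR(10) NOT NULL,
    ll_to_title   VARCHAR(255) NOT NULL,
    KEY (ll_from_lang, ll_from_title),
    KEY (ll_to_lang, ll_to_title)
  );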
Tomasz Wegrzanowski wrote:
On Sun, Jul 13, 2003 at 01:58:51AM +0200, Timwi wrote:
If the French Wikipedia has separate articles for them, it probably means several articles are warranted, in which case the English ones should be split to match. If not, the French ones should be combined. Or is there a good reason why French *needs* them separate when English doesn't?
If you have an answer to that, then use redirection pages. As I see it, French has two articles [[fr:Zeus]] and [[fr:Jupiter (mythologie)]], while English has [[en:Zeus]] and lets [[en:Jupiter (god)]] redirect to that. What's wrong with linking [[fr:Zeus]] <=> [[en:Zeus]] and [[fr:Jupiter (mythologie)]] <=> [[en:Jupiter (god)]]?
There's no reason why one language's Wikipedia should do something one way just because some other language's does it that way.
Why? The only reason Wikipedia exists in several languages is to make it accessible to speakers of other languages, not to create different content based on a certain audience. Ideally, the actual contents of the Wikipedias should be the same, only in different languages. In particular, the set of covered topics (read: article titles) should be the same, because the set of expressible topics as well as the importance or relevance of each topic do *not* depend on language.
Forcing 1:1 correspondence because it's easier to code with it, and generally dictating policy based on purely technical reasons, is very rarely a good idea.
As outlined in the above paragraph, my reasoning is not solely based on coding and technical reasons. My greatest concern is consistency, conformance to user expectation, and minimising user confusion.
Let's take a real-life example. I'm German; now suppose my English were rather poor. I read some website which has a link to an English Wikipedia article. I see the "Other languages:" list of links at the top; I want it to go straight to the German version of what I'm seeing, or of what the original website intended to link to. I should not have to make an extra decision (which of the several German links to follow), nor should I have to worry that an article I go to will contain stuff I'm not expecting to see, i.e. stuff the original website didn't intend to link to.
Another example. Suppose I were bilingual and liked to compare German and English versions of articles for fun (or for some project or whatever). So I read through a German article first. Then I click on the link to the English article. I would expect there to be a link straight back to the German article I had just read; if there were two links to two German articles, I'd be confused, and I'd be unable to compare any two articles, not only because their content isn't the same, but because none of the three is even about exactly the same topic.
Links on en (if nothing is specified for French and Polish): [[fr:Zeus|French (Zeus)]], [[fr:Jupiter (mythologie)|French (Jupiter (mythologie))]], [[pl:Zeus|Polish]]
As I said above, I really don't think it's a good idea to load an unnecessary decision upon the user (which article to go to), especially when the links are somewhere in the middle of a long list of links to all sorts of other languages.
If we have a relation more complex than 1:N, we will link too much. For an imaginary example:
Polish: [[Historia - lata 1918-1939]], [[Historia - lata 1939-1945]]
English: [[History - years 1918-1941]], [[History - years 1941-1945]]
[[Historia - lata 1918-1939]] links to [[History - years 1918-1941]]
[[Historia - lata 1939-1945]] links to both English articles
[[History - years 1918-1941]] links to both Polish articles
[[History - years 1941-1945]] links to [[Historia - lata 1939-1945]]
As you said yourself, that's linking too much. It would confuse users a lot. I think in this case [[en:1918-1941]] should clearly link to [[pl:1918-1939]] because that covers most of it. Then [[pl:1918-1939]] in turn should have navigational links to what comes before 1918 and what comes after 1939.
However, as you also said yourself, this was a fictional example. From the impression I got, we're supposed to have one page for each single year, one for each decade, one for each century, etc. I can see where using arbitrary periods of time could lead to a heated POV debate because people will think the boundaries are chosen with reference to events that are the most important only to a subset of the audience.
Again, the contents of the Wikipedias should be the same. They should only be in different languages.
Timwi
Timwi wrote in part:
Tomasz Wegrzanowski wrote:
There's no reason why one language's Wikipedia should do something one way just because some other language's does it that way.
Why? The only reason Wikipedia exists in several languages is to make it accessible to speakers of other languages, not to create different content based on a certain audience. Ideally, the actual contents of the Wikipedias should be the same, only in different languages.
Ideally, yes. Ideally, Wikipedia should be available in all languages, with complete versions on all topics of human knowledge. None of this is going to happen in /my/ lifetime. One language's version of Wikipedia, while it develops, shouldn't force the development of another language's version, even though both are approaching the goal of complete coverage. At least not until there's sound auto-translation software, because humans aren't going to be willing to do the work.
In particular, the set of covered topics (read: article titles) should be the same, because the set of expressible topics as well as the importance or relevance of each topic do *not* depend on language.
And if it isn't -- what then? (Meaning forever, what then.) Tomasz's plan for automatic interlanguage links, whatever its flaws, at least addresses this possibility. Your plan, however, won't allow certain links to be written at all until the discrepancy between the wikis' organisation is sorted out.
Interlanguage links are the /first/ step on the road to helping each language's version contain all the info that each other language's version has -- it can't be held hostage to article organisation, which on [[en:]] at least is rearranged often. (Not to mention that it depends on [[en:]]'s naming conventions, which themselves rely heavily on the /English/ language.)
-- Toby
The mailing lists admin interface and the french wiki have been stuck for about an hour.
Could someone do something about that please ?
--- Tomasz Wegrzanowski taw@users.sourceforge.net wrote:
On Sun, Jul 13, 2003 at 03:24:42AM +0200, Timwi wrote:
Again, the contents of the Wikipedias should be the same. They should only be in different languages.
Let's just say that I completely disagree with that statement.
Seconded. Neither the content, nor the titles.
Anthere wrote:
--- Tomasz Wegrzanowski taw@users.sourceforge.net wrote:
On Sun, Jul 13, 2003 at 03:24:42AM +0200, Timwi wrote:
Again, the contents of the Wikipedias should be the same. They should only be in different languages.
Let's just say that I completely disagree with that statement.
Seconded. Neither the content, nor the titles.
Thirded, of course.
Luc Van Oostenryck, aka Looxix on fr: and en:
On Sun, Jul 13, 2003 at 03:24:42AM +0200, Timwi wrote:
Again, the contents of the Wikipedias should be the same. They should only be in different languages.
Well, this is perhaps a simplistic way of saying "there should be more cross-article pollination in parallel among different languages."
To say that there ought to be some kind of rule regarding what content actually exists in all of those articles -- in terms of them being unified -- would present serious problems.
It's bad enough that people complain now about "bad English" on the en.wiki -- (SIMPLE.WIKIPEDIA, people..) It would be equally bad for a non-speaker to get into tiffs with people about the content of their pedias.
Certainly for each language there needs to be a loyalty to the GNU FDL -- period. As far as other things, like saying "George Bush's Mother is a Goat" in Arabic... well, that will have to be sorted out by the academic standards that they can reasonably set there.
Should there be an Arabic article on Salman Rushdie? Certainly. But we'll not concern ourselves with whether that article takes the same slant as it does on the en.wiki... I would strongly disagree with translating all of the various cheesy media articles -- top 100 lists, anime summaries, etc. -- into each language. Like anyone could do that anyway... We're just concerned with the core stuff, and the rest is fluff.
-S-
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used as sources for them.
Any estimates in manhours or cash ?
Hello,
Well, I would like to be hired... since I have not found any job yet.
Youssef
On Sat, Jul 12, 2003 at 07:54:31PM +0200, Oualmakran Youssef wrote:
Hello,
Well, I would like to be hired... since I have not found any job yet.
Well, nobody's being hired yet, I was just curious ;) I suspect that translation fees vary a lot depending on source language (English is quite popular, so it isn't likely to cost extra) and country (we could get Wikipedia in 5 African languages for the cost of 1 European).
But what order of magnitude of work and money is that ?
Well, nobody's being hired yet, I was just curious ;) I suspect that translation fees vary a lot depending on source language (English is quite popular, so it isn't likely to cost extra) and country (we could get Wikipedia in 5 African languages for the cost of 1 European).
But what order of magnitude of work and money is that ?
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used
as sources for them.
Paid translation is not an option -- in fact, if it wasn't for readily and cheaply available machine translation, regardless of how fluent one is in a second language, it would be just too much work. So, generally we're looking at a (Wiki-wide) project that is perhaps 92% machine translated - at least. After that, it can be left to the normal corrections of fluent Wikipedians...
As far as languages sucking off of the English content -- is this going to be one-way or two-way? How can edit conflicts across languages be resolved?
WP wasn't set up as a cross-language platform -- in fact, quite the opposite. It was set up as a series of islands, and each island has a single rowboat. Some of us think it might be nice to have a real, coordinated and more efficient ferry system. There would have to be some centralization where the main articles are concerned, and in this imagined island chain, the big island is not necessarily the one that should have all the centralized functions. I.e., the English wiki should not automatically have the last word -- by the defaults of a one-way process.
-S-
In message 20030713191019.GA3465@tavaiah, Tomasz Wegrzanowski writes
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used as sources for them.
Any estimates in manhours or cash ?
As an experiment I've just spent half the afternoon translating most of the English article on Argentina to cy.wikipedia.org/wiki/Yr_Ariannin -- I can't think of anything more soul-destroying if it was done on a large scale.
As an experiment I've just spent half the afternoon translating most of the English article on Argentina to cy.wikipedia.org/wiki/Yr_Ariannin
Shouldn't the text on the map (e.g. "South Atlantic Ocean") also be translated?
In message bet129$1uf$1@main.gmane.org, Timwi writes
As an experiment I've just spent half the afternoon translating most of the English article on Argentina to cy.wikipedia.org/wiki/Yr_Ariannin
Shouldn't the text on the map (e.g. "South Atlantic Ocean") also be translated?
Ideally yes, but the same map is used in the English, German, Dutch, Polish and Swedish wikipedias, so I'm not going to lose too much sleep over it!
Arwel Parry wrote:
In message bet129$1uf$1@main.gmane.org, Timwi writes
As an experiment I've just spent half the afternoon translating most of the English article on Argentina to cy.wikipedia.org/wiki/Yr_Ariannin
Shouldn't the text on the map (e.g. "South Atlantic Ocean") also be translated?
Ideally yes, but the same map is used in the English, German, Dutch, Polish and Swedish wikipedias, so I'm not going to lose too much sleep over it!
Hm. It wasn't me who said "Just because one language Wikipedia does it this way, doesn't mean all the others have to". ;-)
I'll translate the German one :-)
Timwi
On 13 Jul 2003 at 21:10, Tomasz Wegrzanowski wrote:
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used as sources for them.
Any estimates in manhours or cash ?
Algorithm:
1. count all chars in the en-wp
2. divide this number by 1500*
3. multiply by the price offered by the translator or translation service**
* 1500 chars per page is the volume of one standard page of a sworn translation
** The price for translation en->pl offered by one of Polish Internet translation services is 31 zlotys per page (= 7.9 USD). We can try to optimize this
The size of wp-en sql dump is 421407730 bytes (July 8, 2003) if the ratio of rough text to sql dump & wikicode of wikipage is (let's say) 0.90, then the price for the translation en->pl would be ~253000 zlotys (~= 64400 USD)
Regards Youandme
On 13 Jul 2003 at 22:28, Youandme wrote:
Algorithm:
- count all chars in the en-wp
- divide this number by 1500*
- multiply by the price offered by the translator or translation service**
- 1500 chars per page is the volume of one standard page of a sworn translation
Ooops... it should be of course: non-sworn translation!
** The price for translation en->pl offered by one of Polish Internet translation services is 31 zlotys per page (= 7.9 USD). We can try to optimize this
And this is the price for non-sworn translation
The size of wp-en sql dump is 421407730 bytes (July 8, 2003) if the ratio of rough text to sql dump & wikicode of wikipage is (let's say) 0.90, then the price for the translation en->pl would be ~253000 zlotys (~= 64400 USD)
So nothing changes here.
Sorry for the mistake Youandme
On Sun, Jul 13, 2003 at 10:28:04PM +0200, Youandme wrote:
On 13 Jul 2003 at 21:10, Tomasz Wegrzanowski wrote:
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used as sources for them.
Any estimates in manhours or cash ?
Algorithm:
- count all chars in the en-wp
- divide this number by 1500*
- multiply by the price offered by the translator or translation service**
- 1500 chars per page is the volume of one standard page of a sworn translation
** The price for translation en->pl offered by one of Polish Internet translation services is 31 zlotys per page (= 7.9 USD). We can try to optimize this
The size of wp-en sql dump is 421407730 bytes (July 8, 2003) if the ratio of rough text to sql dump & wikicode of wikipage is (let's say) 0.90, then the price for the translation en->pl would be ~253000 zlotys (~= 64400 USD)
I can't reproduce these numbers -- isn't there a computation error?
Some googling (tłumaczenie angielski, first hit) showed 1800 chars (including punctuation) per page, 30 zloty per page.
select sum(length(cur_text)) from cur; 378 523 585
pages (1800 chars = 1 page) 210 290.88
cost (30 zloty / page) 6 308 726.4 zloty
cost (4 zloty / EUR, hehe) 1 577 181.6 EUR
Of course it relies on the absurd assumption that 1 byte = 1 character, and that it's really 210 thousand pages.
Any idea what the character:byte ratio is like? Rendering 1000 random articles into HTML and counting chars would probably work.
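Once we have a measured ratio, the whole estimate is a single query anyway; something like this, where the 0.55 character:byte ratio is only a made-up placeholder until somebody measures it:

  SELECT
    SUM(LENGTH(cur_text))                    AS bytes,
    SUM(LENGTH(cur_text)) * 0.55             AS est_chars,  -- assumed char:byte ratio
    SUM(LENGTH(cur_text)) * 0.55 / 1800      AS est_pages,  -- 1800 chars per page
    SUM(LENGTH(cur_text)) * 0.55 / 1800 * 30 AS est_zloty   -- 30 zloty per page
  FROM cur;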
On 13 Jul 2003 at 23:02, Tomasz Wegrzanowski wrote:
On Sun, Jul 13, 2003 at 10:28:04PM +0200, Youandme wrote:
The size of wp-en sql dump is 421407730 bytes (July 8, 2003) if the ratio of rough text to sql dump & wikicode of wikipage is (let's say) 0.90, then the price for the translation en->pl would be ~253000 zlotys (~= 64400 USD)
I can't reproduce these numbers -- isn't there a computation error?
Oops... another error. I've forgotten to multiply by the cost of one page. So my estimate is 31 times higher: 7 843 000 zlotys (~= 1 997 000 USD)
Some googling (tłumaczenie angielski, first hit) showed 1800 chars (including punctuation) per page, 30 zloty per page.
Standards are different but I agree that 1800 chars per page (instead of 1500 per page) is more common in Poland (my past experience says so).
select sum(length(cur_text)) from cur; 378 523 585
pages (1800 chars = 1 page) 210 290.88
cost (30 zloty / page) 6 308 726.4 zloty
cost (4 zloty / EUR, hehe) 1 577 181.6 EUR
So you see, we are talking about MILLIONS! :) Who will pay for that? :)
Youandme
On Sun, Jul 13, 2003 at 11:33:03PM +0200, Youandme wrote:
On 13 Jul 2003 at 23:02, Tomasz Wegrzanowski wrote:
On Sun, Jul 13, 2003 at 10:28:04PM +0200, Youandme wrote:
The size of wp-en sql dump is 421407730 bytes (July 8, 2003) if the ratio of rough text to sql dump & wikicode of wikipage is (let's say) 0.90, then the price for the translation en->pl would be ~253000 zlotys (~= 64400 USD)
I can't reproduce these numbers -- isn't there a computation error?
Oops... another error. I've forgotten to multiply by the cost of one page. So my estimate is 31 times higher: 7 843 000 zlotys (~= 1 997 000 USD)
Some googling (tłumaczenie angielski, first hit) showed 1800 chars (including punctuation) per page, 30 zloty per page.
Standards are different but I agree that 1800 chars per page (instead of 1500 per page) is more common in Poland (my past experience says so).
select sum(length(cur_text)) from cur; 378 523 585
pages (1800 chars = 1 page) 210 290.88
cost (30 zloty / page) 6 308 726.4 zloty
cost (4 zloty / EUR, hehe) 1 577 181.6 EUR
So you see, we are talking about MILLIONS! :) Who will pay for that? :)
That's before multiplying by char:byte ratio, which I guess is something like 1:5 (but we should measure, not guess).
Oh, and we can ignore the RamBot stuff that makes up some 20% of Wikipedia, and get some bulk discount ;)
Including that, and 20% discount, that's 807 516.98 zloty or 182 580.07 euro (4.42281 zloty / euro, according to http://oanda.com/)
Still a lot of money.
Tomasz Wegrzanowski wrote in part:
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
Actually, I think that the hard way is /good/.
If you're writing an article in, say, Polish, and you read, say, English, then by all means, use an English article (if any exists) as a source. But otherwise, I think that your time is best spent (and this is true whether you pay for your time or not) writing a new article based on the available sources. This is because it gives fresh ideas and fresh input. Then when somebody comes along that /does/ know both languages, the different present takes on the subject can be moved in /both/ directions.
Note that the idea of time management is quite important to my idea. Rewriting an article from scratch when a good article is available is not a wise use of time; translate when you can do that. But when the translation can't be done now, for whatever reason, then we should welcome the new article from scratch.
(I'm not sure that anything that I've said actually disagrees with you.)
-- Toby
To this end, we might as well spawn a new project... A project that takes the en wikipedia data dump and passes it through Babelfish or the like.
Or, we could have an option on the edit page. "import from: ", where you could specify the wiki that you want to import from, the source language, the destination language (or no translation whatsoever). It could load the machine-translated text into the edit window.
Just a thought...
Jason
Tomasz Wegrzanowski wrote:
Just wondering - how much would it cost to hire translators that would translate all contents of English Wikipedia to some 20 or so languages ? Common sense suggests that translating is much easier than developing anew, yet we keep doing it the hard way.
I'm not saying that translated English Wikipedia should replace other 'pedias, but these translations could be used as sources for them.
Any estimates in manhours or cash ?
Hello,
I think that we need more free translation tools:
* [[Dictionary]]
* [[Translation memory]]
* [[Engine translation]]
I think that translation memory and engine translation are hard to implement for texts with long sentences. But I think they are very useful in some cases, and that our wiki-developers can (easily) develop tools for at least one case: dates.
Most data in years/months pages are very short sentences. For example, the deaths/births sections should fairly easily be translated by a program with human help. For example:
* president of the United States --> Président des États-Unis (French)
* American actor --> acteur américain.
We should store that kind of data in a translation memory since the translation does not change with context.
Regards, Youssef aka youssefsan
Oualmakran Youssef wrote:
Most data in years/months pages are very short sentences. For example, the deaths/births sections should fairly easily be translated by a program with human help.
Machines still only give a very rough first draft, and very brain-dead translations that can sometimes have us rolling in the aisles with laughter.
For example:
* president of the United States --> Président des États-Unis (French)
* American actor --> acteur américain.
We should store that kind of data in a translation memory since the translation does not change with context.
And how do you account for the tendency of some to translate "actor" as "comédien", whether or not the acting is comedy as most of us would tend to interpret that term at first glance? Then there is the question of what to do with the feminine of actor and president, where political correctness can be a problem.
Translations are rarely so straightforward.
Jason Richey wrote:
To this end, we might as well spawn a new project... A project that takes the en wikipedia data dump and passes it through Babelfish or the like.
Could we please return to the real world? Machine translations are sometimes vaguely useful for getting the gist of something, but by and large, they are pure crud. Dealing with such a machine translation would be WORSE for the writer than starting with a blank page.
In message 3F1662E7.1020701@planetunreal.com, tarquin writes
Jason Richey wrote:
To this end, we might as well spawn a new project... A project that takes the en wikipedia data dump and passes it through Babelfish or the like.
Could we please return to the real world? Machine translations are sometimes vaguely useful for getting the gist of something, but by and large, they are pure crud. Dealing with such a machine translation would be WORSE for the writer than starting with a blank page.
I quite agree, apart from the fact that I can't envisage Babelfish "or the like" ever being set up to translate many of the lesser-used languages we have wikipedias set up for.
I'm glad to see there are two of you -- there is certainly an open debate about this. Your opinions are noted.
But the statement...
"dealing with such a machine translation would be WORSE for the writer than starting with a blank page."
...is a bit loosey-goosey.. It seems almost technophobic, and it misses some points, namely 1. the general thrust of the wikipedia as collaborative, and interactive -- 2. that something on the page is most often better than *nothing on the page -- 3. that machine translation is improving at an extremely rapid pace. 4. this is the general consensus -- that the WP take upon itself a more international scope.
I don't agree with "running everything through Babelfish" -- this is not the best option, even if it were feasible -- (I think that was an oversimplification - not a well-crafted sentence.) The better options are a subject we've already mentioned -- namely a platform dedicated to cross-language articles, and a means to make it more efficient to use our own personal translation ware... I suggested the use of the simple wiki for this -- there are some issues with that as well.
Ultimately, there is no generalized view of machine translation -- Systran (google) is rather good for one-way conversion between one European language and another. Even Jim Breen's kanji translator is good at mass-converting Kanji to their meanings, in a list. That it would be too hard for a "writer" to connect the dots with some grammar is also a rather fishy statement. Wikipedians are NOT writers, anyway -- we're *editors.
-S-
--- Arwel Parry arwel@cartref.demon.co.uk wrote:
In message 3F1662E7.1020701@planetunreal.com, tarquin writes
Jason Richey wrote:
To this end, we might as well spawn a new project... A project that takes the en wikipedia data dump and passes it through Babelfish or the like.
Could we please return to the real world? Machine translations are sometimes vaguely useful for getting the gist of something, but by and large, they are pure crud. Dealing with such a machine translation would be WORSE for the writer than starting with a blank page.
I quite agree, apart from the fact that I can't envisage Babelfish "or the like" ever being set up to translate many of the lesser-used languages we have wikipedias set up for.
-- Arwel Parry http://www.cartref.demon.co.uk/
steve vertigo wrote:
...is a bit loosey-goosey.. It seems almost technophobic,
I am hardly technophobic!
and it misses some points, namely 1. the general thrust of the wikipedia as collaborative, and interactive
I don't see how this is related.
-- 2. that something on the page is most often better than *nothing on the page
A bad translation would make readers think the rest is of the same quality.
-- 3. that machine translation is improving at an extremely rapid pace.
I'll believe it when I see it. HUMANS have difficulty translating. It's a very imprecise art. A computer would need to understand *meaning* to effectively translate anything beyond very basic sentences. I HOPE Wikipedia articles are written with a little more style than that!
4. this is the general consensus -- that the WP take upon itself a more international scope.
How do you get that from my point? I staunchly support WP's international scope! E.g., I have been arguing for MONTHS that we must move the English pedia from the www. to the en. URL.
Please can we drop this line of debate -- it's pointless. Bring it back up when machine translation is viable.
tarquin wrote:
I'll believe it when I see it. HUMANS have difficulty translating. It's a very imprecise art. A computer would need to understand meaning to effectively translate anything beyond very basic sentences. I HOPE Wikipedia articles are written with a little more style than that!
Systran, which you recommended, gives this in French:
Les HUMAINS ont la traduction de difficulté. C'est un art très imprécis. Un ordinateur devrait comprendre la signification pour traduire efficacement n'importe quoi au delà des phrases très de base. J'espère que des articles de Wikipedia sont écrits avec peu plus de modèle que cela !
Now that is junk. A French wikipedian wanting to clean this up might not have a clue for some of the incorrect words.
They would need to refer back to the english version to see what was meant in the first place -- in other words, they would have to be bilingual. And a bilingual user might as well write from scratch.
If you want concrete examples -- they would have no way of knowing "style" was meant by "modèle" without translating back into English and doing some guesswork. They would probably think it means "I hope that wikipedia articles are written with a better template".
There's a fundamental rule in journalism that you're missing, Tarq -- (and Wikipedians ARE journalists -- in the best sense, perhaps) The rule is:
"Get Something On the Page -- Period."
Imagine a big burly fat guy with a cigar yelling that in your ear as you're typing it up against your deadline. I'm not saying that I can go around recommending machtransware as a cure for cancer -- I am saying it is fast arriving -- and can be useful to some people -- those talented at turning crap into custard. If one is not so talented -- and does not wish to be bothered with using it -- this should not translate into them being bothered that it's being used by others. Its regular and assisted use can solve the issue of perhaps seeding a modest degree of interest in some of the other WP's -- which remain as they are, stillborn. Translate that.
Respectfully, -V-
steve vertigo wrote:
There's a fundamental rule in journalism that you're missing, Tarq -- (and Wikipedians ARE journalists -- in the best sense, perhaps) The rule is:
"Get Something On the Page -- Period."
Imagine a big burly fat guy with a cigar yelling that in your ear as you're typing it up against your deadline.
Now imagine you hand him this:
"Wikipedia is project encyclopedia working together. It is available many languages and is permitted with the GNU Free Manual Permission."
And so on.
Burly Fat Guy: "You're fired, Vertigo!"
It's far better to "get something on the page" by:
1. writing your own article (even if it's just a stub)
2. doing the translation yourself
I'm not saying that I can go around recommending machtransware as a cure for cancer -- I am saying it is fast arriving -- and can be useful to some people -- those talented at turning crap into custard. If one is not so talented -- and does not wish to be bothered with using it -- this should not translate into them being bothered that it's being used by others. Its regular and assisted use can solve the issue of perhaps seeding a modest degree of interest in some of the other WP's -- which remain as they are, stillborn. Translate that.
If one is so talented, why couldn't he or she use machine translation as a translation aid individually, without putting the unaltered results into Wikipedia? Given where the technology is at this point, this seems to be the prudent choice.
Stephen G.
Stevertigo wrote:
Certainly for each language there needs to be a loyalty to the GNU FDL -- period. As far as other things, like saying "George Bush's Mother is a Goat" in Arabic... well, that will have to be sorted out by the academic standards that they can reasonably set there.
If Jimbo ever finds out about this, then there'll be trouble. (In practice, it may take him awhile -- unless you are vigilant!) This violates NPOV -- even if all Arabic speakers agree! We can't blame [[ar:]] writers if they don't know this, but once a writer knows that this opinion is denied by a reasonable population of people -- such as most citizens of the country that he leads! -- then the writer is obligated to attribute the opinion rather than to state it as fact.
By Wikipedia's founding standards, NPOV is as nonnegotiable as the FDL -- even /more/ nonnegotiable, in fact! (If a future version of the FDL allows a move to, say, Creative Commons by-sa, then Wikipedia may very well do just that. See recent discussion on <textbook-l>.)
-- Toby
On Sun, Jul 13, 2003 at 03:24:42AM +0200, Timwi wrote:
Again, the contents of the Wikipedias should be the same. They should only be in different languages.
Let's just say that I completely disagree with that statement.
I must say I agree. Another point to be made is organization: not all wikipedias will be organized the same, and we can't expect them to be; the language they are written in and the culture of the people who write them will obviously affect how they are organized, and probably their contents too. What is more, let's say (hypothetically) we have the category "Chemistry", but people over in the X wikipedia actually have 2 different words for what we call chemistry, and neither is a perfect translation, with concepts being split between the two. This is a real possibility and we have to be sensitive to that; 1:1 linking is not always ideal.
Lightning