I often come across invalid XML in the codebase, such as unquoted
attributes or unclosed tags. This is a request to my fellow developers to
help ensure that our output meets the XHTML spec at all times. One way
to help ensure this is to add:
$wgMimeType = 'application/xhtml+xml';
to LocalSettings.php. If you're using a Gecko browser for testing, this
will activate its strict mode, causing it to throw parsing errors if it
finds invalid XML.
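For instance, something like the following in LocalSettings.php (just a
sketch; $xhtmlTesting is a local convenience variable of my own, not a
MediaWiki setting, and browsers that don't understand XHTML should be
pointed at a copy without it):

<?php
# LocalSettings.php (excerpt) -- sketch only.
# Serving pages as application/xhtml+xml makes Gecko use its strict XML
# parser, so malformed markup shows up as a hard parse error instead of
# being silently fixed up by the tag-soup HTML parser.

$xhtmlTesting = true;   # local flag for test installs, not a MediaWiki setting

if ( $xhtmlTesting ) {
    $wgMimeType = 'application/xhtml+xml';
}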
I set up a copy of nl.wikipedia.org on my test PC from the public dumps,
and ran the updater to upgrade it to 1.5. This is a medium-sized wiki,
in Latin-1 encoding.
The good news:
* It worked -- the updater ran through to completion without exploding.
* After setting $wgLegacyEncoding = 'windows-1252', it seems to properly
convert article text encoding to UTF-8 on page load.
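For anyone trying the same upgrade, the relevant LocalSettings.php excerpt
would be something like this (a sketch; use whatever legacy charset your
1.4 wiki actually stored):

<?php
# LocalSettings.php (excerpt) -- sketch for a 1.4 -> 1.5 upgrade of a wiki
# whose pre-upgrade text was stored in a legacy 8-bit encoding.
# Old text rows are transcoded from this charset to UTF-8 when loaded.
$wgLegacyEncoding = 'windows-1252';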
The bad news:
* The UTF-8 conversion necessary for the other database fields (titles,
usernames, comments, etc.) hasn't quite been finished yet, so it wasn't run automatically.
* The updater ran for a few minutes shy of 10 hours. Most of that time
was spent shuffling cur entries into the old table, where they
eventually become plain old text entries. The pulling of revision data
out of old (by now renamed to text) seemed to take a smaller portion of
the time, but I foolishly didn't time the individual steps.
Most CPU time was spent in I/O wait state in the MySQL server. This
machine has IDE disks purchased for size and cost rather than speed,
relatively little memory (512M), and a MySQL configuration I haven't
attempted to optimize for memory usage; I also kept doing things like
installing Debian in VMware in the foreground... ;)
It probably ought to go faster on the big Wikimedia servers, but I can't
say just how much.
There may be ways to further optimize the conversion process; dropping
some of the indexes first, for instance, might be an overall win if it
makes the importing faster.
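Something along these lines, roughly -- just a sketch, untested; the index
name and credentials are placeholders, and whether it's actually a net win
would need measuring:

<?php
# Sketch of the drop-indexes-then-rebuild idea around the bulk copy.
# Index name and credentials are placeholders; check SHOW CREATE TABLE
# for the real index definitions first.
$db = mysqli_connect( 'localhost', 'wikiuser', 'secret', 'wikidb' );

# Drop a secondary index so the bulk copy doesn't maintain it row by row.
mysqli_query( $db, 'ALTER TABLE revision DROP INDEX rev_timestamp' );

# ... run the updater / bulk import here ...

# Rebuild the index in a single pass afterwards.
mysqli_query( $db, 'ALTER TABLE revision ADD INDEX rev_timestamp (rev_timestamp)' );

mysqli_close( $db );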
Even in the ideal case it'll be kinda slow to run these, but it really
is necessary... at least the schema change should make future changes
less painful.
For the final live updates we'll probably want to do them one at a time,
keeping all other wikis open for editing, and the in-conversion one open
for read-only on a backup.
With the way we've got shared document roots this might require some odd
configuration shuffling to load up either 1.4 or 1.5 code depending on
update state, but I think it should be possible.
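One way that shuffling could look, purely as a sketch (the paths and the
flag-file mechanism are invented for illustration, not how our config
actually works): a thin dispatcher in the shared docroot that picks the
code tree per wiki:

<?php
# Hypothetical dispatcher in the shared document root. Paths and the
# flag-file mechanism are made up for illustration only.
$site = $_SERVER['SERVER_NAME'];             # e.g. nl.wikipedia.org

# A wiki is marked as converted by dropping a flag file for it.
if ( file_exists( "/etc/mediawiki/converted/$site" ) ) {
    require '/srv/mediawiki-1.5/index.php';  # new schema, read-write
} else {
    require '/srv/mediawiki-1.4/index.php';  # old schema until converted
}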
-- brion vibber (brion @ pobox.com)
Hi list,
I'm trying to install MediaWiki 1.4.3, and during installation I'm
getting the following collation error :(
Logging table has correct title encoding.
Initialising "MediaWiki" namespace...
A database error has occurred
Query: SELECT cur_title,cur_is_new,cur_user_text FROM `cur` WHERE
cur_namespace=8 AND cur_title
IN('1movedto2','1movedto2_redir'...'Yourvariant','Zhconversiontable')
Function:
Error: 1271 Illegal mix of collations for operation ' IN ' (localhost)
Backtrace:
GlobalFunctions.php line 507 calls wfBacktrace()
Database.php line 383 calls wfDebugDieBacktrace()
Database.php line 333 calls DatabaseMysql::reportQueryError()
InitialiseMessages.inc line 150 calls DatabaseMysql::query()
InitialiseMessages.inc line 78 calls initialiseMessagesReal()
updaters.inc line 205 calls initialiseMessages()
index.php line 539 calls do_all_updates()
I've even gone as far as creating the `wikidb` database manually:
mysql> CREATE DATABASE wikidb DEFAULT CHARACTER SET Latin1 COLLATE
Latin1_bin;
and then running the install script. We are using PHP 5.0.1 and MySQL
4.0.18; any ideas would be greatly appreciated.
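For reference, a quick way to see which collations are actually in play on
a collation-aware (4.1-style) server, which is what the error message
suggests -- sketch only, credentials are placeholders and the table name is
taken from the failing query above:

<?php
# Quick collation check -- sketch only; credentials are placeholders.
$db = mysqli_connect( 'localhost', 'wikiuser', 'secret', 'wikidb' );

# Server, connection and database default collations.
$res = mysqli_query( $db, "SHOW VARIABLES LIKE 'collation%'" );
while ( $row = mysqli_fetch_row( $res ) ) {
    echo "$row[0] = $row[1]\n";
}

# Collation actually used by the table the installer is querying.
$res = mysqli_query( $db, 'SHOW CREATE TABLE cur' );
$row = mysqli_fetch_row( $res );
echo $row[1] . "\n";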
TIA
Jason
__________
Jason Lane (Development)
ONSPEED
"Faster Internet"
Direct: +44 (0)20 7952 4035
General: +44 (0)8707 585 859
Fax: +44 (0)870 705 1393
http://www.onspeed.com
Tim Starling wrote:
: VFD puts great strain on the server as it is, because the server is
: forced to regularly render the whole page, consisting of a few megabytes
: of HTML. If you can break it down into small sections, I think that will
: be a win for performance, even if there is less opportunity for caching.
Tim, this is being bitterly resisted by VFD habitues on VFD talk. Please
stop by and give numbers if needed to convince:
http://en.wikipedia.org/wiki/Wikipedia_talk:Votes_for_deletion#Can_we_pleas…
Jamesday may also care to weigh in - VFD is composed of metatemplates,
which are a known noxious entity.
Please let them know that insisting their 1.5 megabyte page be the
*default* deletion portal is actually a bad idea.
- d.
> Belnet/Belgium -- 1 rack of space, unlimited bandwidth, they are ready
> to go Monday, they can do full hands-on, etc., including replacing
> borken hard drives and so on like that. They are excited to move
> forward quickly. In this case, we must supply the hardware. We can
> either buy hardware (with the German money?) or I can ask someone to buy
> it for us (see Big Company X, below).
> Amsterdam - a large NGO wants to do a big press announcement when I'm
> there in Holland at the end of this month. They are providing a set of
> servers which have already been ordered. I do not know the exact
> specifications, perhaps someone else can tell me?
If we are talking Europe, I think the key here is to consider where the
traffic comes from and who has good connectivity to that audience.
Belnet is an educational network, so it will be able to provide the best
connectivity to Belgian educational users (approx. 500,000) and
connectivity to other networks via BNIX, which is generally used by networks
in the Benelux countries. Outside of this it is going to be slower. And I
would expect they would want a cap on their external connectivity - they are
using Cable & Wireless and Cogent, amongst others.
I cannot comment on the Dutch NGO, but again this seems very
Benelux-centric. I think it is important to consider where the primary
hotspots of traffic are:
- UK
- Germany
- Sweden
- Netherlands
If you study each NAP across Europe, you will see that the largest in terms
of traffic is LINX (London). If you then look at the participants on that
NAP, you will see that it's not UK-centric: a number of US, UK,
German, Dutch and Scandinavian networks are connected there. If you translate
this back into numbers of users, this one NAP alone gives you many
millions of users, and the majority of tier-1 networks across Europe.
Looking at the German market, most of the traffic/users are on Deutsche
Telekom's network, which also has multiple interconnects with LINX. The
tier-1 German networks are also well connected internationally.
In Sweden you have a different situation: the majority of traffic (general,
not Wikipedia-specific) I have measured seems to remain within the region
(SE/NO/DK/FI). This also seems to be how Bredbandsbolaget (the main broadband
provider) has dimensioned their network. This could also be down
to language - Swedish, Norwegian and Danish are similar enough for
their neighbors' content to be of interest to them as well.
Statistics
----------
Country        Population      Users
Germany          82726188   46312662
UK               59889407   35179141
Italy            58608565   28610000
France           60293927   24848009
Spain            43435136   14590180
Netherlands      16316019   10806328
Poland           38133891   10600000
Sweden            9043990    6656716
Belgium          10443012    5100000
Austria           8163782    4630000
Greece           11212468    3800000
Denmark           5411596    3720000
Portugal         10463170    3600000
Czech            10230271    3530000
Finland           5246920    3260000
Hungary          10083477    3050000
Ireland           4027303    2060000
Slovakia          5379455    1820000
Latvia            2306489     936000
Slovenia          1956916     800000
Lithuania         3430836     695000
Estonia           1344840     621000
Cyprus             950947     250000
Luxembourg         455581     170000
Malta              384594     120000
In terms of penetration:
Country        Population      Users
Sweden            9043990    6656716
Denmark           5411596    3720000
Netherlands      16316019   10806328
Finland           5246920    3260000
UK               59889407   35179141
Austria           8163782    4630000
Germany          82726188   46312662
Ireland           4027303    2060000
Italy            58608565   28610000
Belgium          10443012    5100000
Estonia           1344840     621000
France           60293927   24848009
Slovenia          1956916     800000
Latvia            2306489     936000
Luxembourg         455581     170000
Czech            10230271    3530000
Portugal         10463170    3600000
Greece           11212468    3800000
Slovakia          5379455    1820000
Spain            43435136   14590180
Malta              384594     120000
Hungary          10083477    3050000
Poland           38133891   10600000
Cyprus             950947     250000
Lithuania         3430836     695000
Assuming that demand for Wikipedia content is similar as a percentage of
population (it's not, but I do not have any figures to comment
on that yet), I would strongly consider the following first:
UK: LINX and XchangePoint. This will give access to all the tier-1 networks
in Europe: http://green.linx.net/cgi-bin/peering_matrix2.cgi
Sweden: DGIX. This will give access to all the tier 1/2 networks in
Scandinavia. http://www.netnod.se/connected.htm
If anyone has any detailed traffic analysis for Europe, or would like to set
one up, please let me know.
//Eden
While the Java-based Lucene search server, compiled with GCJ, gives pretty
good performance on search results, the index builder performs pretty
slowly: 3-10x slower than the same code running under Sun's JDK.
Since I've gotten relatively good performance out of the C# version
running under Mono, I've gone ahead and imported that into CVS and
started bringing it up to date, and plan to use it at least for the
indexing.
It's in the 'mwsearch' module in CVS, should anyone feel like taking a look.
-- brion vibber (brion @ pobox.com)
I thought [[m:Article validation feature]] would be showing up, at least
for data gathering, in 1.5. Is it on test.leuksman.com, or is there just no
interface to it unless you know where it is?
- d.
When I try to log my bot in to the test wiki at
http://test.leuksman.com/, it fails. The bot gets a "400 Bad Request"
answer back. Does anybody have an idea what the problem is?
Andre Engels