On the German Wikipedia, a list of the qualifiers used in article titles was
discussed, and most of the participants think it would be an interesting feature.
I know I have expressed that badly, so here is an example:
Cell (biology) is a homonym (cell) with a qualifier (biology). To get a
proper list of those qualifiers and modify or eliminate wrong ones, a list
would be very helpful.
The result of the discussion was that a unique list of the qualifiers from
titles (table cur) and bl_to (table brokenlinks) would make sense.
Unfortunately I can only give you a proper PostgreSQL select statement (covering
only table cur), but possibly someone can easily port it to MySQL:
======== WARNING: THIS IS NOT A VALID STATEMENT FOR WIKIPEDIA ==========
SELECT DISTINCT substring(cur_title FROM '.+\\((.+)\\)') AS p FROM cur;
========================= I WARNED YOU =================================
-- "\\(" ==> ( (needed for quoting)
-- (.+) ==> the () - construct is used to select the part substring will
return.
For the titles "foo", "foo (bar)", "foo2 (bar)" and "bar (foo)" the result
will be "bar" and "foo" (plus one NULL row, because titles without a
qualifier, such as "foo", do not match the pattern).
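
For anyone who wants to try the pattern outside a database, here is a minimal Python sketch of the same extraction. The function name and the sample titles are only for illustration; it uses the same regular expression as the statement above and simply skips titles that have no qualifier instead of returning NULL for them.

```python
import re

# Same pattern as in the SELECT: some text, then "(qualifier)".
QUALIFIER = re.compile(r".+\((.+)\)")

def distinct_qualifiers(titles):
    """Return the sorted set of qualifiers found in the given titles."""
    found = set()
    for title in titles:
        match = QUALIFIER.search(title)
        if match:  # titles without a qualifier are skipped
            found.add(match.group(1))
    return sorted(found)

print(distinct_qualifiers(["foo", "foo (bar)", "foo2 (bar)", "bar (foo)"]))
```

A real run would read the titles from the cur table rather than from a hard-coded list.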
This should only show the as-is state of these qualifiers! There is no
intention for any automated process to enforce them, because no Wikipedian
should get an error like "Qualifier not allowed" or similar. This page is only
for administrative and informational purposes!
Of course additional features, like showing the matched pages, would be
nice, but the discussion needs to go further on that, IMHO.
If there are more questions, just ask and I will try to answer them.
Smurf
--
------------------------- Anthill inside! ---------------------------
I've finally gone ahead and hacked up that preliminary page caching I've
been talking about; see new changes to Article.php & co. As an emergency
measure I've put it up on larousse/www.wikipedia.org with only minimal
testing. So far it's working great -- system load is way down, response
time seems good.
Presently it operates only on regular page views by users who are not
logged in. I've tweaked the header in the corner so it no longer shows
the IP address, so every anon's page will appear the same. (If someone's
added to their talk page, this is detected and the cache is disabled, so
the 'You have new messages' link will show and take them to the talk
page.)
For pages that are determined cacheable, we check a cache directory for
a file: if it exists and is not obsoleted by the 'last touched'
timestamp already established for dealing with browser caching, we just
load it up and pass it straight through. If there's no file or it's
obsolete, we install an output buffer handler, and at the end we catch
the whole page output and save it to the file.
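
The check-then-regenerate flow described above can be sketched roughly as follows. This is a Python illustration, not the actual Article.php code; the names serve_cached, cache_path, last_touched and render are all made up for the example, and the file's modification time stands in for whatever comparison the real code does against the 'last touched' timestamp.

```python
import os
from typing import Callable

def serve_cached(cache_path: str, last_touched: float,
                 render: Callable[[], str]) -> str:
    """Return page HTML, reusing the cache file if it is still fresh."""
    try:
        # A cache file is usable only if it was written after the page
        # was last touched (both given as Unix timestamps here).
        fresh = os.path.getmtime(cache_path) >= last_touched
    except OSError:
        fresh = False  # no cache file yet

    if fresh:
        with open(cache_path, encoding="utf-8") as f:
            return f.read()

    # Missing or obsolete: render the page, then save the whole output.
    html = render()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Note that an obsolete file is never deleted in this sketch; it is just overwritten the next time the page is rendered.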
Caveats:
- Invalidation of cached pages is controlled by the same mechanism that
invalidates browser caches, and will be subject to some of the same bugs
there. Problem areas may include undeletion, the talk/article page
links, and anywhere where the link tables are broken. Some redirects may
be funny, but hopefully not. :)
- I'm pretty sure I excluded all the non-cacheable page view variants. I
might have missed something, in which case bad pages could creep into the
cache space.
- There's a site-wide cache invalidation date settable in localsettings.
I haven't actually tested it :) and there should probably be a sysopable
or developerable clear-all-caches special page. This also needs to be
worked in to affect the browser cache as well.
- It should also be possible to explicitly clear the cache of a page and
force it to regenerate in case it's screwed up. Perhaps a little button
or something.
- Some pages, like the main page, should be invalidated periodically or
else never cached, because they contain special variables (time, article
count) which may change.
- This only affects non-logged-in users so far. But that makes up the
greater part of our traffic, so that's okay for now. It makes the server
faster for the rest of us. :)
- The cache directory is divided up like the upload directory is; so
there should be 4096 separate dirs. Should be plenty for keeping ext3
from going mad and killing us all for a while yet.
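
To illustrate how a cache can be spread over 4096 directories, here is a small Python sketch. The exact layout used on the server is not spelled out here, so everything in it is an assumption: it hashes the title with MD5 and uses the first three hex digits as the directory name, which gives 16^3 = 4096 possible buckets. The function name and the ".html" suffix are hypothetical too.

```python
import hashlib
import os

def cache_file_path(cache_root: str, title: str) -> str:
    """Map a page title to a file inside one of 4096 bucket directories."""
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    bucket = digest[:3]  # three hex digits -> 16**3 = 4096 buckets
    # The digest is also used as the file name, so odd characters in
    # titles never reach the filesystem.
    return os.path.join(cache_root, bucket, digest + ".html")

print(cache_file_path("cache", "Main_Page"))
```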
Other notes:
- Hypothetically we could fall back on cached pages if unable to contact
the database.
- Cache files are not deleted on invalidation; they're just assumed
obsolete, and replaced when needed.
- There's a fun new bug where logouts (or perhaps timeouts) leave a
session in a funny state where the interface works as not-logged-in, but
edits are saved with the formerly used user name (but still with 0 as
the user id, so contribs doesn't work).
See our now much happier servers:
[brion@larousse w]$ uptime
15:26:42 up 14:57, 3 users, load average: 7.36, 9.50, 7.76
[brion@larousse w]$ free
             total       used       free     shared    buffers     cached
Mem:       1030952     996824      34128          0     176848     554108
-/+ buffers/cache:     265868     765084
Swap:      1020088      72416     947672
[brion@larousse w]$ vmstat 1 15
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 72416 34104 176848 554248 0 10 8 88 393 210 73 3 24
0 0 0 72428 34020 176848 554300 0 172 0 172 761 394 60 6 34
0 0 0 72428 34636 176848 554312 0 0 0 616 380 187 30 1 69
0 0 0 72428 34644 176848 554356 0 156 0 156 496 338 43 4 53
0 0 0 72416 34480 176852 554416 0 0 28 0 512 308 64 2 34
1 0 0 72420 34472 176852 554456 0 200 0 200 479 274 48 2 50
2 0 2 72420 34472 176852 554464 0 0 0 208 527 341 82 6 12
3 0 0 72392 34712 176852 554512 0 204 0 536 552 420 55 4 41
1 0 0 72392 34720 176852 554548 0 0 0 0 436 287 42 0 58
1 0 0 72372 34732 176852 554564 0 208 0 208 536 315 32 8 60
2 0 0 72368 34568 176852 554612 0 0 20 0 521 320 53 3 45
1 0 0 72380 33980 176852 554620 0 224 0 840 491 318 29 6 65
1 0 0 72384 33716 176856 554688 0 108 12 108 487 302 55 1 44
2 0 0 72384 33856 176856 554712 0 0 0 0 410 237 38 0 62
0 0 0 72384 33836 176856 554724 0 192 0 192 299 163 14 2 84
[brion@pliny brion]$ uptime
3:28pm up 4 days, 5:28, 1 user, load average: 2.46, 3.49, 3.13
[brion@pliny brion]$ free
             total       used       free     shared    buffers     cached
Mem:       2068912    1973376      95536          0      35360    1155584
-/+ buffers/cache:     782432    1286480
Swap:      2047992     436568    1611424
[brion@pliny brion]$ vmstat 1 15
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 436568 95020 35392 1155900 12 8 41 7 18 43 47 5 47
3 2 0 436568 90744 35444 1156460 4 0 572 728 477 730 36 5 59
2 4 1 436568 89072 35508 1157508 0 0 1084 1012 754 866 38 7 55
4 0 0 436568 88900 35516 1157696 0 0 180 60 515 656 37 3 60
5 0 0 436568 84020 35540 1157808 0 0 104 444 477 815 51 6 43
2 0 0 436568 80556 35544 1157816 0 0 0 204 401 596 47 3 50
1 0 0 436568 79796 35556 1157860 0 0 56 124 394 378 19 2 78
2 0 0 436568 79756 35560 1157900 0 0 32 180 513 655 23 7 70
0 1 0 436568 79740 35604 1157924 0 0 24 1165 377 357 34 10 56
1 0 0 436568 76008 35612 1157996 0 0 72 16 383 366 7 2 90
3 0 0 436568 67984 35616 1158148 0 0 152 0 380 391 14 3 83
0 1 0 436568 64404 35632 1158560 0 0 416 132 401 504 26 4 70
2 1 0 436568 67392 35692 1159556 0 0 992 856 538 639 13 6 82
1 5 3 436568 66912 35704 1159808 0 0 240 1782 573 737 59 6 35
0 1 1 436568 66880 35732 1159928 0 0 116 2194 707 732 43 9 48
Pliny's got room to expand, and Larousse's end still has optimization
that can be done. So things are looking good!
-- brion vibber (brion @ pobox.com)
I've looked at Brion Vibber's "ps auxwww" output
(thanks!!). Although the MySQL daemons
take up the lion's share of memory, they don't take much of the
%CPU, even in aggregate. Instead, the CPU seems to be
taken up by the Apache daemons (/usr/local/apache/bin/httpd).
It doesn't appear that one daemon takes up all the time; it
appears spread out to some extent (a little bit by each,
though there IS a lot of variance).
Presumably this is due to each one executing the PHP scripts.
Clearly speeding execution of the PHP scripts would help.
One way is to reduce the work they have to do
(e.g., caching the HTML). Another is coding the hotspot
(e.g., as a loaded C module). But doing it right requires
identifying what the hotspot is in the PHP scripts.
Is there a way to enable performance monitoring in PHP, like
gprof in C, to figure out where the hotspots in the PHP scripts are?
Failing that, I guess you could insert monitoring points in
various places (painful, painful).
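
For what it's worth, hand-inserted monitoring points need not be too painful. Below is a rough sketch of the idea, written in Python rather than PHP purely for illustration; the names timed and report and the labels are invented. Each instrumented region adds its elapsed time to a per-label total, and the totals are listed slowest first.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # label -> accumulated seconds
counts = defaultdict(int)    # label -> number of times entered

@contextmanager
def timed(label):
    """Accumulate wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start
        counts[label] += 1

def report():
    """Return (label, seconds) pairs, slowest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Usage: wrap the suspected hotspots, then look at the report.
with timed("parse wikitext"):
    time.sleep(0.02)  # stand-in for real work
with timed("build links"):
    pass
print(report())
```

Wall-clock timing like this would also include any time spent waiting on the database, so a database round trip wrapped in its own label would show up separately.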
Of course, this doesn't mean that moving wikitext from MySQL
to the filesystem, or using the filesystem as an HTML cache,
is a bad idea. I don't know how transmitting
data from MySQL to the scripts is accounted for; the transit
time between script and MySQL may be hidden in the script performance
measures.
We don't want the servers on ibiblio, but how about the mailing lists?
Putting all your eggs in one casket .. basket .. is never a good idea, no?
Regards,
Erik
(newbie here, so excuse any ignorance of the wiki-isms)
I want to set up a full documentation base, using Wiki<something>, to
supplement my work as a sysadmin. I wonder if someone has done this
before, covering everything from compile-time flags to system notes
to upgrade instructions, etc. I see Wiki as a potentially good
application/tool for this.
Of course, my question is specific to the code used (phase3) in Wikipedia.
Thanks,
Forrest
This issue has raised its head again on the manual of style page. And it
still bugs me that we're mixing up presentation with semantics.
Recap: the current situation is this:
== heading ==
Text: there will be little space between this and the heading
== heading ==
Text following a blank line. There will be a gap between this and the
heading
So a quick question:
1. under what circumstances do we *definitely* need space between a
heading and the following content?
and
2. under what circumstances do we *definitely NOT*?
The only one I can remember is that in tables (countries, elements,
etc), we want *no* space.
any others?
because if not, we can resolve this problem with CSS:
h1, h2, h3 { margin-bottom: 0.5em; } /* usual space; the value is a placeholder */
table h1, table h2, table h3 { margin-bottom: 0; } /* no space */
While setting things up for the automated testing system, Lee made some
changes to the HTML code, adding id attributes & such so the tests could find
things more easily.
I'm not sure exactly how the code should be behaving, but there does
seem to be a difference, as reported at
http://www.wikipedia.org/wiki/Wikipedia%3ANew_server_madness :
On edit pages, we've changed from:
<form .... name='editform'>
to
<form id="editform" ....>
However the latter form seems to break this JavaScript fragment in the
page's onLoad handler:
document.editform.wpTextbox1.focus()
Mozilla gives the error "document.editform has no properties".
IE 5.5 gives "'document.editform.wpTextbox1' is null or not an object".
-- brion vibber (brion @ pobox.com)