Hi,
Can anyone point me to the in-links and out-links for a page in the
wiki database (downloaded from the wiki dumps)?
Thank you in advance!
Hi,
Just wanted to share some bits of what we've been doing this week -
hopping around and analyzing our performance and application workflow
from multiple sides (a kind of "Hello 2008!!!" systems performance
review).
It all started with the application object cache - the caching arena
was bumped up from 55GB to 160GB - and here more work had to be done
to make our parser output cacheable. Any use of magic words (and most
templates do use them) would decrease cache TTLs to 1 hour, so the
vast increase in caching space didn't help much. Once this was fixed,
though, pages are reparsed just once every few days. Additionally, we
moved the revision text caching for external storages to a global
pool, instead of maintaining local caches on each of those nodes.
That allows us to reuse the memory on old external store boxes for
caching more actively fetched revisions, instead of the archived ones.
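The magic-word effect is roughly this (a hypothetical sketch, not the
actual parser cache internals):

function chooseParserCacheTTL( $usesMagicWords ) {
    $staticTTL = 7 * 86400; // static output can be kept for days
    $volatileTTL = 3600;    // time-dependent output expires in an hour
    return $usesMagicWords ? $volatileTTL : $staticTTL;
}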
Another major review was done on extension loading - there, by
delaying or eliminating expensive initializations, especially for
(relatively :) very-rarely-used extensions, we shaved at least 20ms
off base site loading time (and the service request average). That
also resulted in a huge CPU use reduction. Special thanks go to the
folks on #mediawiki (Aaron, Nikerabbit, siebrand, Simetrical, and
others) who joined this effort of analysis, education and
engineering :) There are still more difficult extensions to handle,
but I hope they will evolve to be more adaptive performance-wise.
This was a long-standing regression caused by the increasing quality
of translations, which resulted in a bigger data set to handle on
every page load.
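The deferred-initialization idea, as a minimal sketch (hypothetical
hook and function names, not any actual extension's code):

$wgHooks['SomeRarelyUsedHook'][] = 'efMyExtensionOnHook';

function efMyExtensionExpensiveInit() {
    // one-time setup: loading message files, building lookup tables, etc.
}

function efMyExtensionOnHook() {
    static $initialized = false;
    if ( !$initialized ) {
        efMyExtensionExpensiveInit(); // paid only when actually used
        $initialized = true;
    }
    // ... actual hook work ...
    return true;
}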
A small but noticeable bit was the simplification of the
mediawiki:pagecategories message on en.wikipedia.org. Logic as simple
as "show 'Category:' if there is just one category, and 'Categories:'
otherwise" requires the parser to be used, which adds lots and lots
of overhead for every page served. Those few milliseconds needed for
that absolutely grammatically correct label could be counted in
thousands of dollars. :)
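The cheap alternative looks roughly like this (a sketch with
hypothetical message keys): a plain count check and two static
messages, instead of running {{PLURAL:}} through the parser on every
page view.

function categoriesLabel( array $categories ) {
    return count( $categories ) == 1
        ? wfMsgHtml( 'pagecategory-singular' )  // hypothetical key
        : wfMsgHtml( 'pagecategory-plural' );   // hypothetical key
}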
There were a few other victims in this unequal fight. TitleBlacklist
didn't survive the performance audit - the current architecture of
this feature does work in places it never should, and as the initial
performance guidelines for it were not followed, it got disabled for
a while. Also, some CentralNotice functionality was not optimized for
the way it was used after the fundraiser, so for now that feature is
disabled too. Of course, these features will be re-enabled - they
just need more work before they can run live.
On another front - in the core software - the database connection
flow was reviewed, and a few adjustments were made which reduce
master server load quite a bit and cut down the communication done
with all database servers (transaction coordination was too verbose
before - now it is far more relaxed).
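The idea, sketched with hypothetical method names (not the actual
change): only round-trip a COMMIT to connections that have an open
transaction.

function commitAll( array $connections ) {
    foreach ( $connections as $db ) {
        if ( $db->trxLevel() ) { // skip idle connections entirely
            $db->commit();
        }
    }
}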
Here again, some of the application flow is still irrational - and
may get quite a bit of refactoring/fixing in the future. Tim pointed
out that my knowledge of the xdebug profiler is seriously outdated
(my mind was stuck at 2.0.1 features, whereas 2.0.2 introduced quite
significant changes that make life easier) ;-) Another shocking
revelation was that the CPU microbenchmarks provided by the MediaWiki
internal profiler were not accurate at all - the getrusage() call we
use provides information rounded to 10ms, and most functions execute
far faster than that. It was really amusing that I trusted numbers
which looked rational and reasonable only because of the huge
profiling scale and eventual statistical magic. This complicates
profiling in general a bit, as there's no easy way to determine
whether a given wait happened because of I/O blocking or because of
context switches.
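A quick demonstration of the granularity problem:

$before = getrusage();
for ( $i = 0; $i < 10000; $i++ ) {
    md5( 'some short string' ); // far faster than 10ms per call
}
$after = getrusage();
$userMicros = ( $after['ru_utime.tv_sec'] - $before['ru_utime.tv_sec'] ) * 1000000
            + ( $after['ru_utime.tv_usec'] - $before['ru_utime.tv_usec'] );
echo "user CPU: {$userMicros}us\n"; // moves in 10000us (10ms) steps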
A few images from the performance analysis work:
http://flake.defau.lt/mwpageview.png
http://flake.defau.lt/mediawikiprofile.png (somewhere here you should
see why TitleBlacklist died)
This one made me giggle:
http://flake.defau.lt/mwmodernart.png
Tim was questioning whether people are using wikitext for scientific
calculations, or whether that was just another case of the crazy
over-templating we are used to seeing.
Templates such as Commons' 'picture of the day' cause output like
this =) Actually, the new parser code makes far nicer graphs (at
least from a performance engineering perspective).
And one of the biggest changes happened on our Squid caching layer -
because of how different browsers request data, we generally had
different cache sets for IE, Firefox, Opera, Googlebot, KHTML, etc.
Now we normalize the 'Accept-Encoding' header specified by browsers,
which makes most connections fall into a single class.
In theory this may at least double our caching efficiency. In
practice, we will see - the change has been live on just one cluster
for just a few hours.
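The normalization amounts to something like this (a hypothetical PHP
sketch; the real change lives in the Squid layer):

function normalizeAcceptEncoding( $header ) {
    // everything that can take gzip shares one cache class
    if ( preg_match( '/\bgzip\b/', $header ) ) {
        return 'gzip';
    }
    return ''; // uncompressed class for everything else
}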
As a side effect, we turned off the 'refresh' button on your browsers.
Sorry - please let us know if anything is seriously wrong with that
(if you feel offended about your constitutional refreshing rights -
use purge instead :)
Additionally, I've heard there has been quite a bit of development on
the new parser, as well as networking in Amsterdam ;-)
Quite a few people also noticed the huge flamewar of 'oh noes, dev
enabled a feature despite our lack of consensus'. Now we're sending
people to the board for all the minor changes they ask for :-)
Oh, and Mark changed the scale on our 'backend service time' graph,
which is used to measure our health and performance - the upper
limit is now at 0.3s (which used to be our minimum a few years ago)
instead of the old 1s:
http://www.nedworks.org/~mark/reqstats/svctimestats-weekly.png
So, that's the fun we've seen this week in site operations :)
Cheers,
Domas
P.S. I'll spend next week in Disneyworld instead ;-)
(I mentioned this on IRC just now, but other than a "me, too",
there was no response, so I'm posting here for posterity.)
Periodically today I've gotten utterly blank pages -- perhaps
1 or 2% of the time. I wonder if there's a squid or two that's
dead or acting up?
Gentlemen, it is you who are ruining network standards.
HEAD http://en.wikipedia.org/wiki/Some_Non_Existent_Page --> 200 OK
It is clearly a case of 404 Not Found.
You can still give the same "You can create this article" message AND
return a truthful HTTP code.
Else how is one to use a link checker on your links? Why are MediaWiki
wikis special?
Yes, 200 OK for action=edit and for disambiguation pages, but not for
the basic, clear case in the spirit of 404 Not Found.
What if all ==External links== always returned 200? How could a bot
detect linkrot? Do unto others...
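Concretely, something like this would do it (a sketch; hypothetical
surrounding code):

if ( !$title->exists() ) {
    header( 'HTTP/1.1 404 Not Found' );
    // ...then render the usual "you can create this article" body;
    // link checkers see 404, humans see the same page as before.
}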
On Jan 24, 2008 10:35 AM, <huji(a)svn.wikimedia.org> wrote:
> + // Entry from drop down menu + additional comment
> + $reason .= ': ' . $this->DeleteReason;
Should the ': ' string be localized (below as well as here)? Also, as
far as I can tell this will result in summaries like:
Deleted old revision $1: Boilerplate reason: Custom reason
Having two colons is a bit odd.
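One option, sketched here (this assumes a localizable separator
message such as 'colon-separator' exists or gets added):

$reason .= wfMsgForContent( 'colon-separator' ) . $this->DeleteReason;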
> + $mDeletereasonother = Xml::label( wfMsg( 'filedelete-otherreason' ), 'wpReason' );
> + $mDeletereasonotherlist = wfMsgHtml( 'filedelete-reason-otherlist' );
> + $scDeleteReasonList = wfMsgForContent( 'filedelete-reason-dropdown' );
> + $mDeleteReasonList = '';
> + $delcom = Xml::label( wfMsg( 'filedelete-comment' ), 'wpDeleteReasonList' );
It seems incredibly confusing to use local variables whose names begin
with $m. An initial lowercase 'm' prefix is used to indicate member
variables. Either these should be made member variables, or the 'm'
should be dropped. You also have variables that are named identically
except for the 'm' ($deleteReasonList vs. $mDeleteReasonList), which
is even more confusing.
filedelete-comment and filedelete-otherreason seem to allow arbitrary HTML.
> + $value = trim( htmlspecialchars($option) );
Consider not escaping ampersands here, so that entities can be used --
really we only want to ban tags.
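Sketch of what I mean - escape only the angle brackets so entities
like &eacute; survive:

$value = trim( str_replace( array( '<', '>' ),
                            array( '&lt;', '&gt;' ),
                            $option ) );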
> + } elseif ( substr( $value, 0, 1) == '*' && substr( $value, 1, 1) != '*' ) {
> + // A new group is starting ...
> + $value = trim( substr( $value, 1 ) );
> + $deleteReasonList .= "$optgroup<optgroup label=\"$value\">";
> + $optgroup = "</optgroup>";
> + } elseif ( substr( $value, 0, 2) == '**' ) {
It would probably be simpler to read if you reversed these two
elseifs. Then you could drop the second part of the (current) first
one's condition, and just check substr( $value, 0, 1 ) == '*'.
Also, maybe a clearer name for $optgroup? Like $close or
$closeoptgroup or something? The current one is okay, I guess.
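The reversed order, as a runnable sketch with hypothetical
surrounding code:

function classifyLine( $value ) {
    if ( substr( $value, 0, 2 ) == '**' ) {
        return 'option'; // an entry inside the current group
    } elseif ( substr( $value, 0, 1 ) == '*' ) {
        return 'group';  // a new group starts; no second condition needed
    }
    return 'other';
}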
> + if ( $mDeleteReasonList === $value)
> + $selected = ' selected="selected"';
Indentation is wrong here. The second line should be indented by one more tab.
A. Rinkleff is the user previously known as User:Lir, a legendary
Wikipedia troll. I've placed him on moderation pre-emptively, because
he's the sort of person who really warrants it. THE TROLL FROM HELL.
- d.
Hey all you Wikimedians and MediaWikians --
We're now hiring for a software development position at our new San
Francisco office. This is an entry-level position, but existing
experience with MediaWiki or other LAMP-style development will be a big
help.
Since we're still a small office, the new guy will also need to help
people in the office with basic IT issues; we have a mixed environment
with Mac, Linux, and Windows machines, and varying degrees of
tech-savviness.
We're planning to open up a couple more dev positions over the coming
months as budget allows; those will not necessarily be locked to the
California office, but for now we need someone who can be on-site every
day to lend a hand.
Fuller job description at:
http://wikimediafoundation.org/wiki/Job_openings/Software_Developer_/_IT_Su…
Drop me a mail (offlist!) if you're interested, or pass it on if you
know someone who might be; we'll be scheduling interviews in the next
few weeks.
(And yes, we will be posting the position on Craigslist as well.)
-- brion vibber (brion @ wikimedia.org)
I was just reading this:
http://www.riehle.org/wp-content/uploads/2008/01/a5-junghans.pdf
And wondering if there is any desire (let alone plans) to move to a
system of storing a different internal representation (eg, XML) and
separating the display logic out. One obvious benefit would be making
it easier to produce different outputs without having to write
multiple parsers. Are there others? Would Wikipedia benefit from
supporting an interchange format?
Just fishing.
Steve
minuteelectron(a)svn.wikimedia.org wrote:
> Revision: 30162
> Author: minuteelectron
> Date: 2008-01-26 00:37:42 +0000 (Sat, 26 Jan 2008)
>
> Log Message:
> -----------
> Fix bug 9246 by watching a page when the upload/reupload "Watch this
> page" checkbox is checked and unwatching a page when it is not.
A problem with this is that it *un*watches a previously watched image
under the following circumstances:
* 'watch pages I edit' is not enabled (eg, default state)
* go to Special:Upload and select the file
* hit 'upload'
The initial check state is unchecked (since there was no initial
destination name set), and this doesn't get updated to reflect the
existing watch state of the previous image.
There are a couple of possible ways around this. One is to compare the
form's actual initial check state with the submitted check state and
only apply an unwatch if there was a difference.
Another might be to do a watch state update via AJAX when a new
destination filename is set in the form. This would allow the
checkmark's default state to be set 'properly' for those with JS enabled
in modern browsers.
Perhaps a combination should be used.
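The first approach might look like this (a sketch with hypothetical
field names): carry the form's initial check state in a hidden field
and only touch the watch state when the user actually toggled the box.

$initial   = $wgRequest->getCheck( 'wpWatchthisInitial' ); // hidden field
$submitted = $wgRequest->getCheck( 'wpWatchthis' );
if ( $submitted !== $initial ) {
    if ( $submitted ) {
        $wgUser->addWatch( $title );
    } else {
        $wgUser->removeWatch( $title );
    }
}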
For now I'll revert the change, as I think not unwatching things is less
destructive than unwatching things unexpectedly.
-- brion vibber (brion @ wikimedia.org)
Hello all,
Please take a few seconds to have a look at
http://bugzilla.wikimedia.org/show_bug.cgi?id=12681 and help with your
comments about whether we should apply this change or not. Applying it
will prevent people from spoofing the "new message" alert, but at the
same time, it will make the new message bar appear where it never
appeared before, which may not be desired.
Thanks in advance,
Hojjat (aka Huji)