Neil, many thanks for the bots--I've been running them too, over my
cable modem and on my server, so we're getting plenty of hits. We are
getting good timing data in the logfiles, but at this point I'm most
interested in testing robustness. I'd really like it if you could
hack up a script that did random page edits as well--I tried to hack
your python a bit to do that, parsing out the text in the <textarea>
and sending it back, but I'm just not up on python enough to figure
out why it wasn't behaving the way I expected.
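For concreteness, something along these lines is what I was trying to
get working -- a minimal sketch that fetches the edit form, pulls the
article text out of the <textarea>, appends a marker, and POSTs it
back. The URL scheme and form field names ("action=edit", "text" and so
on) are guesses that need checking against the real edit form, and one
likely gotcha is that the textarea content probably comes back with &,
< and > HTML-escaped and needs decoding before it is resubmitted:

    # Sketch of a random-edit bot.  URL and field names are placeholders.
    import random
    import re
    import urllib.parse
    import urllib.request

    BASE = "http://beta.wikipedia.com/wiki.phtml"   # placeholder URL

    def random_edit(title):
        # 1. Fetch the edit form for the article.
        url = "%s?title=%s&action=edit" % (BASE, title)
        page = urllib.request.urlopen(url).read().decode("latin-1")

        # 2. Pull the current text out of the <textarea>.
        m = re.search(r"<textarea[^>]*>(.*?)</textarea>", page, re.DOTALL)
        if m is None:
            raise ValueError("no textarea on edit page for %s" % title)
        text = m.group(1)
        # A real version should also decode &amp;, &lt;, &gt; here,
        # otherwise the text gets more garbled with every round trip.

        # 3. Append a marker and send the form back as a POST.
        text += "\nstressbot edit %d" % random.randint(0, 999999)
        form = urllib.parse.urlencode({"title": title, "action": "submit",
                                       "text": text}).encode("ascii")
        urllib.request.urlopen(BASE, form).read()

    random_edit("Sandbox")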
I'd also like to have a version of your bot that sends the text you
get back through a validator. I'll look for one that would be
relatively simple to plug in.
BTW, you can get server-side timing data from a comment at the end of
each returned page: that will tell you how much time the server
actually took between the start of the script and serving the page,
so it eliminates the delays caused by the client script and the
actual transfer. Might be interesting to compare the two.
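For example, the fetch routine in the bots could record both numbers in
one pass -- something like the sketch below, which assumes the
server-side figure shows up as a number of seconds inside the last HTML
comment on the page (the exact comment format needs checking):

    # Time a page fetch on the client and also pull the server-side
    # timing figure out of the trailing HTML comment, if present.
    import re
    import time
    import urllib.request

    def timed_fetch(url):
        start = time.time()
        page = urllib.request.urlopen(url).read().decode("latin-1")
        client_secs = time.time() - start

        comments = re.findall(r"<!--(.*?)-->", page, re.DOTALL)
        server_secs = None
        if comments:
            m = re.search(r"([0-9]+\.[0-9]+)", comments[-1])
            if m:
                server_secs = float(m.group(1))
        return client_secs, server_secs

Logging both per request would make the client-vs-server comparison easy.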
I'm testing the postit script with and without non-ASCII chars, to see
if the apparent speed difference is real.
I'm not running any other load generators. Each test is an average over
roughly 50 pages. I'll alternate between the with and without states.
With: 2.39 secs average
Without: 2.91 secs average
With: 2.84 secs average
Without: 2.98 secs average
With: 2.64
Without: 3.18
Well, there does seem to be a difference, but it still might be noise.
All with: 7.87 / 3 = 2.62 seconds average
All without: 9.07 / 3 = 3.02 seconds average
Difference: 1.20 / 3 = 0.40 seconds
I then did two long runs:
With: 2.74 (averaged over 151 pages)
Without: 3.21 (averaged over 153 pages)
Difference: 0.47 seconds
Another two long runs:
With: 2.78 (averaged over 151 pages)
Without: 3.21 (averaged over 151 pages)
It looks like the difference is probably real; my guess is that
non-ASCII text is more likely to contain things that make the regexp
engine look ahead and backtrack while checking whether something is a
link.
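If we want to check that theory, a toy micro-benchmark along these
lines (Python rather than the wiki's actual PHP code, and with a
deliberately simplified link pattern) would at least show whether
merely scanning non-ASCII text for links is slower, or whether the cost
is elsewhere:

    # Illustration only: time a simplified link-matching pass over ASCII
    # text and over text containing non-ASCII characters.
    import re
    import timeit

    link_re = re.compile(r"\[\[([^\]|]*)(\|[^\]]*)?\]\]")

    ascii_text = "plain words around a [[Sample Link]] and filler " * 500
    nonascii_text = "wörter ünd sätze um einen [[Beispiel Link]] " * 500

    def scan(text):
        return len(link_re.findall(text))

    for label, text in [("ascii", ascii_text), ("non-ascii", nonascii_text)]:
        secs = timeit.timeit(lambda t=text: scan(t), number=200)
        print("%-10s %.3f s" % (label, secs))

If the two timings come out about the same, the extra cost is more
likely in the link-handling code than in the raw scan.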
This is _not_ a big performance issue at the moment, as the Wikipedia
load is read-mostly. However, it's worth keeping in mind as a place to
optimize in the future.
By then, we'll probably be running a three-tier architecture, with the
DB running on a separate machine from the PHP scripts, and so even then
this may be a low-priority issue.
Neil
I've noticed that beta.wikipedia.com has not got a robots.txt file.
Yes, I know that most recent robots will read the metadata in the page,
but I'm willing to bet that some of the dimmer or older ones don't.
Should we have one?
Also, there's no favicon.ico. I enclose a file Walone2.ico which should
work if renamed to favicon.ico and placed at
http://beta.wikipedia.com/favicon.ico
Neil
Here is another data point regarding the performance of beta.wikipedia.com
I'm now running
7 stressbots (page readers)
2 postits (page writers)
concurrently accessing beta.wikipedia.com via a 512K DSL connection.
I'm getting the following statistics:
average page read time 2.9 seconds
average page write time 4.9 seconds
corresponding to an
average page read rate of 7/2.9 = 2.4 pages/sec
average page write rate of 2 / 4.9 = 0.41 pages/sec
making a total sustained transaction rate of around 2.8 hits/sec, or
around 240,000 hits/day or over 7 million hits a month.
However, my inbound traffic is around 61 kbytes/sec while doing this;
since 512 kbit/s is only about 64 kbytes/sec, my DSL link is currently
the bottleneck, not the server.
Dropping the concurrency to
3 stressbots
1 postit
gives:
average page read time 1.9 seconds
average page write time 3.1 seconds
corresponding to an
average page read rate of 3/1.9 = 1.57 pages/sec
average page write rate of 1 / 3.1 = 0.32 pages/sec
total transaction rate: 1.9 hits/sec
for an inbound traffic rate of about 26 kbytes/s, where my DSL link is
no longer the bottleneck, but the system is under less load.
To really stress test the server, we will need several clients to run at
once on several different links. I'm going to stop the test now.
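For anyone who wants to run a client from another link, the core of a
reader bot is small. A sketch like the following (base URL, page titles
and counts are placeholders) is enough to reproduce the average page
time and pages/sec figures above:

    # Minimal concurrent page reader in the spirit of the stressbots:
    # N threads fetch random pages and we report average read time and
    # the implied sustained read rate.
    import random
    import threading
    import time
    import urllib.request

    BASE = "http://beta.wikipedia.com/wiki.phtml?title="   # placeholder
    PAGES = ["Main_Page", "Sandbox", "Physics"]             # placeholders
    THREADS = 7
    FETCHES_PER_THREAD = 50

    times = []
    lock = threading.Lock()

    def reader():
        for _ in range(FETCHES_PER_THREAD):
            t0 = time.time()
            urllib.request.urlopen(BASE + random.choice(PAGES)).read()
            with lock:
                times.append(time.time() - t0)

    workers = [threading.Thread(target=reader) for _ in range(THREADS)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    elapsed = time.time() - start

    print("average page read time: %.2f s" % (sum(times) / len(times)))
    print("sustained read rate:    %.2f pages/sec" % (len(times) / elapsed))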
It would be useful for testing if we could have a page that gave current
Linux operating system stats, perhaps as a sysop-only page?
Neil
What can we do to speed up the process? Some people at the German
Wikipedia are getting frustrated (also by the server problems). I hope
people don't leave the project because the server is unreachable or slow
much too often, but I don't know. Hopefully the problems will be over
after the move to the new server.
We have translated the most important bug reports at
http://test-de.wikipedia.com/wiki/wikipedia:Bug_report
(German version at
http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete_Fehler )
I hope we have found most of them.
Maybe my last mail was overlooked, but I'd still like to have sysop
status at test-de.wikipedia.com, so that I'm able to play around with
article renaming and all those other nice features we have now.
Username: Kurt Jansson
Maybe someone could also give sysop status to Ben-Zin, who is very
active in testing the new software.
Once I'm a sysop, can I give these rights to other Wikipedians?
Thanks!
Kurt
Jan writes:
> Ultimately the best solution would be to have a table wanted(title,
> #pages) with an index on #pages (and a unique index on title), then
> MySQL wouldn't need to sort at all. I don't know off the top of my head
> if there are any other queries that depend on 'brokenlinks', but I
> don't believe so, and if not I would recommend replacing
> it.
I believe we need the full brokenlinks information in order to
initialize the links table once a new article is written, so that
"What links here" will immediately work for new articles.
Nevertheless, I think we should have a wanted table as above in
addition. Space is really no issue, but time is, and Most Wanted is
likely to be one of our more commonly called slow functions.
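To make the proposal concrete, here is a hedged sketch of the table and
the Most Wanted query as I read Jan's suggestion; the column names,
sizes and connection details are assumptions, not an agreed schema:

    # Sketch of the proposed "wanted" table: one row per nonexistent
    # title, plus a count of the pages that link to it ("#pages" above).
    import MySQLdb   # the MySQL-python / mysqlclient module

    conn = MySQLdb.connect(host="localhost", user="wikiuser",
                           passwd="secret", db="wikidb")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS wanted (
            title VARCHAR(255) NOT NULL,
            refs  INT UNSIGNED NOT NULL,
            UNIQUE KEY (title),
            KEY (refs)
        )
    """)

    # With the index on refs, Most Wanted becomes an index walk from the
    # high end instead of a full sort over brokenlinks.
    cur.execute("SELECT title, refs FROM wanted ORDER BY refs DESC LIMIT 50")
    for title, refs in cur.fetchall():
        print(title, refs)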
Axel
> by blindly executing TeX when someone edits a page, we are assuming
> that they haven't included any malicious code in their TeX source.
TeX has two dangerous commands: shell escapes and writing to an
arbitrary file. Both can be globally disabled (and are disabled by
default in most TeX distributions). However, it is fairly easy to write
TeX which eats memory like crazy (TeX allows recursion :-), so we
would have to somehow restrict the resources available to the TeX
process. But we are of course right now already wide open to all sorts
of denial-of-service attacks.
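As a sketch of what "restrict the resources" could look like in practice
(the flag names and limit values are illustrative and depend on the TeX
installation), the wrapper that runs TeX could keep shell escape
disabled and put hard CPU and memory caps on the child process:

    # Run latex with shell escape off and hard resource limits, so a
    # runaway (e.g. deeply recursive) input is killed instead of eating
    # the server's memory.
    import resource
    import subprocess

    def limit_resources():
        # Runs in the child just before latex is exec'd.
        resource.setrlimit(resource.RLIMIT_CPU, (10, 10))        # 10 s CPU
        mem = 64 * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))       # 64 MB

    def render(texfile, workdir):
        subprocess.call(
            ["latex", "-no-shell-escape", "-interaction=batchmode", texfile],
            cwd=workdir,
            preexec_fn=limit_resources,
            timeout=30,   # wall-clock safety net on top of the CPU limit
        )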
Axel