Gregory Maxwell wrote:
On 3/29/06, Tim Starling <tstarling(a)wikimedia.org> wrote:
[quoted out of order]
Who here needs more than 5 requests per second? Who needs a latency of
less than a few hundred milliseconds? What exactly do you want full text
replication for?
5 requests per second isn't enough to keep up with reading 2-6 full
texts per recent change on enwiki which is what my bad edit detection
tool does (depending on the content, more than two because it will
look deep enough to help people fully revert contiguous multi-edit
vandalism).
Enwiki has a bit over 1.2 main namespace edits per second on a 24 hour
average. I need to keep up with the instantaneous rate on a 1 minute-ish
window, which is much greater, if I don't want the vandalism checking
to lag annoyingly far behind realtime.
This ran with negligible impact on toolserver when we still had text there...
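As a rough check of the figures quoted above, the sketch below multiplies the 24-hour average edit rate by the 2-6 full texts read per change. It assumes no caching and uses only the average rate; the peak rate over a one-minute window is not given in the thread, so it is left out. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check using only the figures quoted in the thread.
AVG_EDITS_PER_SEC = 1.2   # enwiki main-namespace edits, 24-hour average
TEXTS_PER_EDIT = (2, 6)   # full texts read per recent change (low, high)
RATE_LIMIT = 5.0          # proposed cap, in requests per second


def required_rate(edits_per_sec, texts_per_edit):
    """Text fetches per second needed to keep pace with an edit rate."""
    return edits_per_sec * texts_per_edit


low = required_rate(AVG_EDITS_PER_SEC, TEXTS_PER_EDIT[0])
high = required_rate(AVG_EDITS_PER_SEC, TEXTS_PER_EDIT[1])
print(f"average load: {low:.1f} to {high:.1f} requests/s (limit {RATE_LIMIT})")
```

On the average rate alone, the uncached load spans roughly 2.4 to 7.2 requests per second, so the upper end already exceeds the proposed limit before any bursts are considered.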
Including caching it should be enough, that's one of the reasons I
picked the number. You can store the text of recent edits in a disk
cache. If there are two edits to the same article in quick succession,
you'll still have the text from the first in the cache when it comes to
analysing the second. You could cache all text from the last few hours,
then limit your analysis to that time frame.
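A minimal sketch of the caching idea, assuming a hypothetical fetch(rev_id) function that retrieves revision text over the rate-limited interface. It keeps entries in memory as a stand-in for the disk cache described above, and drops anything older than a fixed window. The class and parameter names are illustrative, not part of any existing tool.

```python
import time
from collections import OrderedDict


class RecentTextCache:
    """Keep revision text for a fixed time window; fetch only on a miss."""

    def __init__(self, max_age_seconds, fetch, clock=time.time):
        self.max_age = max_age_seconds
        self.fetch = fetch            # hypothetical: rev_id -> text
        self.clock = clock            # injectable so the cache can be tested
        self.entries = OrderedDict()  # rev_id -> (stored_at, text), oldest first

    def _evict_expired(self):
        cutoff = self.clock() - self.max_age
        while self.entries:
            stored_at, _ = next(iter(self.entries.values()))
            if stored_at >= cutoff:
                break
            self.entries.popitem(last=False)  # drop the oldest entry

    def get(self, rev_id):
        self._evict_expired()
        if rev_id in self.entries:
            return self.entries[rev_id][1]
        text = self.fetch(rev_id)  # the only path that costs a request
        self.entries[rev_id] = (self.clock(), text)
        return text
```

With a window of a few hours, a second edit to the same article shortly after the first finds the earlier text already cached, so only the new revision costs a request.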
An optimisation would be to produce a full-text feed, say via a TCP
stream. The apaches could send it on save using a message passing
library. But we'd have to evaluate that in terms of the resources it
would save. And compare it to the replication option of course.
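One way such a feed could be framed on a TCP stream is with a length prefix per message, sketched below. The wire format (an 8-byte revision id and a 4-byte payload length, followed by UTF-8 text) is purely an assumption for illustration; nothing in this thread specifies a format or a message passing library.

```python
import io
import struct

# Hypothetical header: revision id (uint64) and payload length (uint32),
# both in network byte order.
HEADER = struct.Struct("!QI")


def encode_message(rev_id, text):
    """Sender side: frame one saved revision as header plus UTF-8 payload."""
    payload = text.encode("utf-8")
    return HEADER.pack(rev_id, len(payload)) + payload


def read_exact(stream, n):
    """Read exactly n bytes or raise if the stream ends part-way."""
    data = b""
    while len(data) < n:
        chunk = stream.read(n - len(data))
        if not chunk:
            raise EOFError("stream ended mid-message")
        data += chunk
    return data


def read_messages(stream):
    """Receiver side: yield (rev_id, text) pairs until a clean end of stream."""
    while True:
        first = stream.read(1)
        if not first:
            return  # stream closed between messages
        header = first + read_exact(stream, HEADER.size - 1)
        rev_id, length = HEADER.unpack(header)
        yield rev_id, read_exact(stream, length).decode("utf-8")


# Usage with an in-memory stream standing in for the TCP connection.
feed = io.BytesIO(encode_message(1, "first text") + encode_message(2, "second"))
for rev_id, text in read_messages(feed):
    print(rev_id, len(text))
```

On a real connection the same reader could be handed the file object returned by a socket's makefile("rb") method.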
[snip]
It would be easier if we had a VLAN, so that we didn't have to set up 5
ssh tunnels. Does anyone know anything about VLANs? Does anyone care
enough about this project to research it?
That's a subject area well within my professional skillset, I'd be
more than willing to look into it.
[snip]
Why are the list archives private? Why isn't it listed on the
mail.wikipedia.org index page? Why isn't it in gmane? Why have I never
heard of it before? Gregory Maxwell has been whinging that nobody is
listening to him on a list that nobody can read.
Well it wasn't just a whine, I did repeat my offer to help. I also
CCed wikidev list, though I can understand why a single request there
would be lost.
Nor were my requests, made in both the wikimedia-dev IRC channel and
the toolserver IRC channel, placed in an inaccessible place.
Are you saying you think the archives should be private?
[snip]
It should be possible to set up 5 MySQL instances and have each of them
replicating from a different master. Is anyone volunteering to set up
those instances? Maybe we need to give root access to someone who
actually cares about this stuff.
I, in effect, volunteered to do it when I pointed out that would solve
that issue on this list previously. I've set up replication with other
RDBMSes, never MySQL... but there is already a configured instance to
work from. I don't see how this could even be considered a
challenge... it's the ongoing maintenance that carries the real
burden.
OK, let's move forward with this then.
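As a sketch of what five side-by-side instances involve, the snippet below generates one option-file group per instance, each with its own port, socket, data directory and server ID, in the [mysqldN] layout that mysqld_multi reads. All paths, port numbers and IDs are placeholders chosen for illustration, and pointing each instance at its master is a separate step not shown here.

```python
def instance_config(n, base_port=3306, data_root="/var/lib/mysql-replica"):
    """Settings that must differ between instances sharing one host.

    All values are placeholders, not the toolserver's real layout.
    """
    return {
        "group": f"mysqld{n}",
        "port": base_port + n,
        "socket": f"/tmp/mysql-replica{n}.sock",
        "datadir": f"{data_root}{n}",
        "server-id": 100 + n,  # must be unique within a replication setup
    }


def render(configs):
    """Render the settings as option-file text, one group per instance."""
    lines = []
    for cfg in configs:
        lines.append(f"[{cfg['group']}]")
        for key in ("port", "socket", "datadir", "server-id"):
            lines.append(f"{key} = {cfg[key]}")
        lines.append("")
    return "\n".join(lines)


configs = [instance_config(n) for n in range(1, 6)]
print(render(configs))
```

Generating the groups from one function keeps the five instances from colliding on a port, socket or data directory by accident.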
Neither the e.V. nor Kate made any particular attempt to involve the
other Wikimedia system administrators in this project from its
conception. I was certainly sceptical about zedler's value as a tool
server compared to the use we could have made of it as part of the core
cluster. I've now heard about one project that I'm interested in, and I
have an open mind about the rest, but you still have to make the case.
Specifically: how does your project benefit Wikipedia? Why should I
support it?
You should support it for the following reasons:
1) There are many projects on toolserver considered valuable by many
Wikipedia editors. And many of these projects are far less useful, or
not even possible, without realtime or near realtime database access.
(in particular, high speed access to the *links tables is very useful
and difficult to replace with anything available remotely)
2) The existence of the toolserver allows interested third parties to
perform work which would otherwise be requested of the developers.
Work which is of questionable value can be proven by those who care,
real workload on developers can be reduced, and more tasks which are
valuable but not valuable enough to justify interrupting core
developers can be completed... and finally latency for such requests
is reduced.
3) Toolserver acts as an incubator to build qualified developers. I,
and several other people, now have a far more solid understanding of
the mediawiki database schema as a result of toolserver... I've
learned many things which wouldn't be important working on a static
dump but are important working on realtime data, and presumably the
real site. Many of the toolserver users are directly using portions
of mediawiki code, as well. It is intuitively obvious that this will
increase the number of people knowledgeable about mediawiki, and thus
benefit Wikimedia and mediawiki.
I was more interested in the details of the projects, actually.
If you do not believe that toolserver is being effectively utilized,
then by all means begin plans to repurpose it.
It's not exactly up to me. I had no influence in choosing the role of
this server. I'm only talking about use of my time.
My response will simply be to address Jimbo, Angela, and Anthere
directly, remind them of
their questions which I used toolserver access to answer, and offer to
donate another comparable piece of hardware under the condition that
it be used exclusively for toolserver use and maintained under a
continued administrative plan I approve of... I have absolutely no
doubt that such a negotiation would reach an agreeable conclusion, as
toolserver is very widely found useful. Quite frankly, you will not be
able to stonewall requests that developers provide the minimal level of
support necessary to support this type of service.
It would be far more productive if rather than demanding we re-justify
the obvious need for toolserver, you offer to work with us to
determine a plan of action which will put the burden of the service
maximally on those interested in it, and minimally on those not
interested in it.
Offering to work with you appears to be exactly what I'm doing, with
this post and my original one. I'm only asking to hear about the
projects you're working on. I thought it was a fair request. Was I wrong?
If it isn't clear, I'm more than completely willing to take a
substantial role in the ongoing maintenance of the existing
toolserver. However, I'm not the ideal candidate: I escaped running
Solaris boxes when I left the world of being a fulltime sysadmin, and
I don't particularly look forward to going back to it... and I am not
a fan of MySQL and find it particularly poorly suited for the sort of
queries we perform on toolserver (and this is demonstrably true, query
planning for anything with a subquery is poor enough to be considered
outright broken). Both of these factors will require me to learn
things I would otherwise not, but I am willing to do so because I am
not burdened by a disbelief in the importance of toolserver.
Good enough. I had virtually no knowledge of linux (or any other
unix-based OS) when I started on Wikipedia, I also had to learn PHP and
SQL. So you're doing better than me.
Further, my persistent and annoying questions about why we continue to
work around MySQL's shortcomings (for example, non-BMP UTF-8 support)
rather than at least evaluating other opensource RDBMSes, had clearly
caused tension between myself and, at least, our developers who are
otherwise employed by MySQL AB.
How about we save that argument for another day?
-- Tim Starling