On 3/29/06, Tim Starling <tstarling(a)wikimedia.org> wrote:
Including caching it should be enough, that's one of the reasons I
picked the number. You can store the text of recent edits in a disk
cache. If there are two edits to the same article in quick succession,
you'll still have the text from the first in the cache when it comes to
analysing the second. You could cache all text from the last few hours,
then limit your analysis to that time frame.
An optimisation would be to produce a full-text feed, say via a TCP
stream. The apaches could send it on save using a message passing
library. But we'd have to evaluate that in terms of the resources it
would save. And compare it to the replication option of course.
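
One possible shape for such a full-text feed, sketched under assumptions: each saved edit is sent as a JSON payload preceded by a 4-byte length, so a reader can split the TCP byte stream back into messages. The field names and the framing are illustrative only; the thread does not specify a wire format or a message passing library.

```python
import json
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length (assumed framing)


def encode_edit(title, rev_id, text):
    """Serialise one saved edit as a length-prefixed JSON message."""
    payload = json.dumps(
        {"title": title, "rev_id": rev_id, "text": text}
    ).encode("utf-8")
    return HEADER.pack(len(payload)) + payload


def decode_edits(buffer):
    """Split received bytes into complete messages.

    Returns (messages, leftover), where leftover is the incomplete tail
    to prepend to the next chunk read from the socket.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if len(buffer) - start < length:
            break  # payload not fully received yet
        messages.append(json.loads(buffer[start:start + length].decode("utf-8")))
        offset = start + length
    return messages, buffer[offset:]
```

The sender would write encode_edit(...) to the socket on save; the consumer would feed each received chunk plus the previous leftover through decode_edits().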
I have some doubts there... but ultimately we're in a situation where
toolserver will be pulling every version (if due to nothing more than
the filter bot), plus there are users who will wish to read every
current revision for things like geo-reference extraction, and users
who will read every version in the history of an article (history flow
analysis).
So sure, 'cached' will work, but how much more cached can you get
beyond replicated?
Why are the list archives private? Why isn't it listed on the
mail.wikipedia.org index page? Why isn't it in gmane? Why have I never
heard of it before? Gregory Maxwell has been whinging that nobody is
listening to him on a list that nobody can read.
Well it wasn't just a whine, I did repeat my offer to help. I also
CCed wikidev list, though I can understand why a single request there
would be lost.
Nor were my requests in both the wikimedia-dev IRC channel and the
toolserver IRC channel placed somewhere inaccessible.
Are you saying you think the archives should be private?
No, I wasn't... just that my complaints were not just here (and
obviously whoever is currently authoritative should be on this
list...). But now that I've considered it some, this list has been
used at least once to say "hey twits, I looked at your code and it's
vulnerable to SQL injection" ... so keeping it closed except to users
who otherwise have access isn't necessarily bad.
[multiple mysql instances]
OK, let's move forward with this then.
Ok
[snip generic justification from me]
I was more interested in the details of the projects, actually.
Well I can speak for what I've done... I think other people are doing
more interesting things than I am, in part because I keep running into
stupid problems. (Loss of text access, found a subselect bug in mysql
which we've since had fixed, impatience with solaris)
I've got an irc bot that is currently inactive due to lack of
text-access which detects suspect edits based on a number of criteria
and has a remarkably low false positive rate... and was widely used (I
still get emails every week or two asking me to turn it back on, even
though it's been down for some time now).
Based on the same code I have a tool which does realtime grammar
checking, though I was still working on getting that production
ready...
I have a fair number of rather boring reports, things that perform set
operations with categories, pagelinks, imagelinks, templatelinks, etc
tables... They answer questions which are used by many users to
implement procedures on enwikipedia... such as "show me all pages
which have one of these 'indicates a human' categories, but none
of these 'indicates a dead human' categories, and which aren't tagged with
the living people category" (which related changes is used on for libel
patrolling). The lower computation complexity ones are performed on
request in realtime, other ones are either cached or triggered out of
cron.
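
To make the category report above concrete, here is a rough sketch of that set operation, run against a toy table shaped like MediaWiki's categorylinks (cl_from is the page id, cl_to the category name). It uses sqlite3 purely so the example is self-contained; the function name is hypothetical, and this is not the query actually run on toolserver.

```python
import sqlite3


def humans_missing_living_tag(conn, human_cats, dead_cats, living_tag):
    """Page ids in any 'human' category, in no 'dead human' category,
    and not tagged with the living-people category."""

    def marks(items):
        # One "?" placeholder per category name.
        return ",".join("?" for _ in items)

    # EXCEPT is applied left to right: (humans - dead) - already tagged.
    sql = f"""
        SELECT cl_from FROM categorylinks WHERE cl_to IN ({marks(human_cats)})
        EXCEPT
        SELECT cl_from FROM categorylinks WHERE cl_to IN ({marks(dead_cats)})
        EXCEPT
        SELECT cl_from FROM categorylinks WHERE cl_to = ?
        ORDER BY cl_from
    """
    params = [*human_cats, *dead_cats, living_tag]
    return [row[0] for row in conn.execute(sql, params)]
```

The cheaper reports of this kind can run on request; the same function could equally be called from a cron job that writes its result to a cached page.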
I update WP:1000, and I also answer random bullshit questions posed
by just about anyone who can manage to ask something clear enough that
I can turn it into SQL
(http://en.wikipedia.org/wiki/Wikipedia_talk:Userbox_policy_poll#Meta_analys…).
I also answer such trivia in cases which aren't so unimportant... like
injecting data into the FUD about april first.
I've built a web interface similar to http://www.placeopedia.com/
(though far less polished), using TIGER/Line (thus only US) data
stored in a PostGIS database, but that's not on toolserver anymore
because I had dependency trouble with some of the mapping libraries
using solaris 10 and gave up for other projects for a while. I have a
couple of other tools ready to go, simple stuff like per-article edit
activity rate graphers and such, but all of my stats stuff is done
using [[R_programming_language]] which had serious issues on Solaris
10 last time I gave it a shot. There really is enough work that stuff
that doesn't just work on Solaris gets moved to the back.
I also run a wikipedia bot that does a number of tasks which is not
itself on toolserver, but which operates itself based on queries
performed on toolserver. Almost all the tasks I have it do are ones
which would require reading hundreds of pages via http were I unable
to drive it with queries.
I also have a 1/4th finished framework which will eventually be used
for collaborative review. Initially I'm going to target it for stupid
stuff like reviewing blocks. I actually haven't shown it to anyone
yet, it's yet another project where I'm hung up on lack of text
access, but you can look... The main page is at
http://tools.wikimedia.de/~gmaxwell/audit/ and one of the applications is
at
http://tools.wikimedia.de/~gmaxwell/cgi-bin/audit_block_audit.py
and there's a static view of the app (so you don't have to set up an account,
which is kinda annoying when replag is high) at
http://tools.wikimedia.de/~gmaxwell/audit_block_audit.html ... It's
almost functional, I have to work out some policy decisions on how to
handle deleted edits for sysop users vs non-sysop users before I can
make that little bit go live. The whole purpose of such tools is to
bring all the relevant information into one place so a user can just
work down a list making decisions... Without text access, it's hard to
bring all the relevant information into one place.
Yes, I can http get it but it would be silly to implement that if
we're going to get access back in some form, and if we're not going to
get it back I'm going to give up toolserver anyways.
There are other users, with tools like the countdown deletion tool and
article spell checkers. I'll let them speak for themselves.
If you do not believe that toolserver is being effectively utilized,
then by all means begin plans to repurpose it.
It's not exactly up to me. I had no influence in choosing the role of
this server. I'm only talking about use of my time.
Ah. My apologies there, it's just that toolserver has been neglected
for so long and already a good chunk of my work has been made useless,
it only seemed logical that the next step would be to just kill it
completely.
Offering to work with you appears to be exactly what I'm doing, with
this post and my original one. I'm only asking to hear about the
projects you're working on. I thought it was a fair request, was I wrong?
No, but I took it wrong.
Good enough. I had virtually no knowledge of linux (or any other
unix-based OS) when I started on Wikipedia, I also had to learn PHP and
SQL. So you're doing better than me.
Wow, newbie.
How about we save that argument for another day?
Fine by me! :)