On 3/29/06, Tim Starling tstarling@wikimedia.org wrote:
Including caching, it should be enough; that's one of the reasons I picked the number. You can store the text of recent edits in a disk cache. If there are two edits to the same article in quick succession, you'll still have the text from the first in the cache when it comes to analysing the second. You could cache all text from the last few hours, then limit your analysis to that time frame.
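To make the idea concrete, here's a minimal sketch of such a recent-text cache. The class and method names are mine, not anything that exists in MediaWiki or on toolserver, and a real version would spill to disk rather than hold everything in memory:

```python
import time

class RecentTextCache:
    """Keep the text of edits from the last few hours so a
    follow-up edit to the same page can be compared against the
    previous revision without refetching it."""

    def __init__(self, max_age_seconds=4 * 3600):
        self.max_age = max_age_seconds
        self.entries = {}  # page title -> (timestamp, text)

    def store(self, title, text, now=None):
        now = time.time() if now is None else now
        self.entries[title] = (now, text)

    def fetch(self, title, now=None):
        """Return the cached text, or None if absent or expired."""
        now = time.time() if now is None else now
        entry = self.entries.get(title)
        if entry is None or now - entry[0] > self.max_age:
            self.entries.pop(title, None)
            return None
        return entry[1]

    def prune(self, now=None):
        """Drop everything older than the analysis window."""
        now = time.time() if now is None else now
        self.entries = {t: e for t, e in self.entries.items()
                        if now - e[0] <= self.max_age}
```

An analysis bot would `store()` each edit's text as it arrives and `fetch()` the same title when the next edit comes in; anything outside the window is simply gone.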
An optimisation would be to produce a full-text feed, say via a TCP stream. The apaches could send it on save using a message passing library. But we'd have to evaluate that in terms of the resources it would save. And compare it to the replication option of course.
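If such a feed were built, the hard part is mostly framing: each save has to arrive as one self-delimiting record on the stream. A sketch of one possible wire format (length-prefixed records; this format is my invention, not an existing MediaWiki feed):

```python
import struct

def frame_edit(title, text):
    """Encode one saved edit as a length-prefixed record:
    a 4-byte big-endian payload length, then the UTF-8 bytes
    of "title\\ntext"."""
    payload = (title + "\n" + text).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def read_edit(buf):
    """Try to decode one record from the front of buf.
    Returns ((title, text), remaining_bytes) on success, or
    (None, buf) if the record is still incomplete."""
    if len(buf) < 4:
        return None, buf
    (length,) = struct.unpack(">I", buf[:4])
    if len(buf) < 4 + length:
        return None, buf
    title, _, text = buf[4:4 + length].decode("utf-8").partition("\n")
    return (title, text), buf[4 + length:]
```

The apaches would write `frame_edit(...)` to the socket on save; a consumer on toolserver would accumulate bytes and loop on `read_edit()`. Whether that beats replication on resource cost is exactly the evaluation question above.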
I have some doubts there... but ultimately we're at a situation where toolserver will be pulling every version (if due to nothing more than the filter bot), plus there are users who will wish to read every current revision for things like geo-reference extraction, and users who will read every version in the history of an article (history flow analysis).
So sure, 'cached' will work, but how much more cached can you get beyond replicated?
Why are the list archives private? Why isn't it listed on the mail.wikipedia.org index page? Why isn't it in gmane? Why have I never heard of it before? Gregory Maxwell has been whinging that nobody is listening to him on a list that nobody can read.
Well, it wasn't just a whine; I did repeat my offer to help. I also CCed the wikidev list, though I can understand why a single request there would be lost.
Nor were my requests in the wikimedia-dev IRC channel and the toolserver IRC channel placed anywhere inaccessible.
Are you saying you think the archives should be private?
No, I wasn't... just that my complaints were not made only here (and obviously whoever is currently authoritative should be on this list...). But now that I've considered it some, this list has been used at least once to say "hey twits, I looked at your code and it's vulnerable to SQL injection"... so keeping it closed except to users who otherwise have access isn't necessarily bad.
[multiple mysql instances]
OK, let's move forward with this then.
Ok
[snip generic justification from me]
I was more interested in the details of the projects, actually.
Well, I can speak for what I've done... I think other people are doing more interesting things than I am, in part because I keep running into stupid problems (loss of text access, a subselect bug I found in MySQL which we've since had fixed, impatience with Solaris).
I've got an IRC bot, currently inactive due to lack of text access, which detects suspect edits based on a number of criteria and has a remarkably low false positive rate... and it was widely used (I still get emails every week or two asking me to turn it back on, even though it's been down for some time now).
Based on the same code I have a tool which does realtime grammar checking, though I was still working on getting that production ready...
I have a fair number of rather boring reports: things that perform set operations with the categories, pagelinks, imagelinks, templatelinks, etc. tables. They answer questions which many users rely on to implement procedures on enwikipedia... such as 'show me all pages which have one of these "indicates a human" categories, but none of these "indicates a dead human" categories, and which aren't tagged with the living people category' (which related changes is used on for libel patrolling). The lower-complexity ones are performed on request in realtime; the others are either cached or triggered out of cron.
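The living-people report above is just set difference over category membership. A toy sketch using sqlite3 (the table and column names are a simplification of MediaWiki's categorylinks schema, and the category lists are made up for illustration):

```python
import sqlite3

# Toy stand-in for MediaWiki's categorylinks table: one row
# per (page, category) pair. This is a simplification, not
# the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from TEXT, cl_to TEXT)")
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    ("Alice Example", "1950_births"),    # human, no death cat, untagged
    ("Bob Example",   "1890_births"),
    ("Bob Example",   "1960_deaths"),    # human, but dead
    ("Carol Example", "1970_births"),
    ("Carol Example", "Living_people"),  # human, already tagged
])

human_cats = ("1890_births", "1950_births", "1970_births")
dead_cats = ("1960_deaths",)

# Pages with an "indicates a human" category, minus pages with
# an "indicates a dead human" category, minus pages already in
# Living_people.
rows = conn.execute("""
    SELECT DISTINCT cl_from FROM categorylinks
    WHERE cl_to IN (?, ?, ?)
      AND cl_from NOT IN (SELECT cl_from FROM categorylinks
                          WHERE cl_to IN (?))
      AND cl_from NOT IN (SELECT cl_from FROM categorylinks
                          WHERE cl_to = 'Living_people')
""", human_cats + dead_cats).fetchall()
```

Here only "Alice Example" survives the two subtractions, which is the set of pages a patroller would then go tag. On the real tables the category lists run to dozens of entries, which is what pushes the slower variants into cron.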
I update WP:1000, and I also answer random bullshit questions posed by just about anyone who can manage to ask something clearly enough that I can turn it into SQL (http://en.wikipedia.org/wiki/Wikipedia_talk:Userbox_policy_poll#Meta_analysi...). I also answer such trivia in cases which aren't so unimportant... like injecting data into the FUD about April first.
I've built a web interface similar to http://www.placeopedia.com/ (though far less polished), using TIGER/Line (thus US-only) data stored in a PostGIS database, but that's not on toolserver anymore because I had dependency trouble with some of the mapping libraries on Solaris 10 and gave up for a while in favor of other projects. I have a couple of other tools ready to go, simple stuff like per-article edit activity rate graphers and such, but all of my stats work is done using [[R_programming_language]], which had serious issues on Solaris 10 last time I gave it a shot. There really is enough work that stuff which doesn't just work on Solaris gets moved to the back.
I also run a wikipedia bot that does a number of tasks; it is not itself on toolserver, but it operates based on queries performed on toolserver. Almost all the tasks I have it do are ones which would require reading hundreds of pages via HTTP were I unable to drive it with queries.
I also have a quarter-finished framework which will eventually be used for collaborative review. Initially I'm going to target it at stupid stuff like reviewing blocks. I actually haven't shown it to anyone yet; it's yet another project where I'm hung up on lack of text access, but you can look. The main page is at http://tools.wikimedia.de/~gmaxwell/audit/, one of the applications is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/audit_block_audit.py, and there's a static view of the app (so you don't have to set up an account, which is kinda annoying when replag is high) at http://tools.wikimedia.de/~gmaxwell/audit_block_audit.html. It's almost functional; I have to work out some policy decisions on how to handle deleted edits for sysop vs. non-sysop users before I can make that little bit go live. The whole purpose of such tools is to bring all the relevant information into one place so a user can just work down a list making decisions... and without text access, it's hard to bring all the relevant information into one place.
Yes, I can HTTP GET it, but it would be silly to implement that if we're going to get access back in some form, and if we're not going to get it back in some form I'm going to give up on toolserver anyway.
There are other users, with tools like the countdown deletion tool and article spell checkers. I'll let them speak for themselves.
If you do not believe that toolserver is being effectively utilized, then by all means begin plans to repurpose it.
It's not exactly up to me. I had no influence in choosing the role of this server. I'm only talking about use of my time.
Ah. My apologies there; it's just that toolserver has been neglected for so long, and a good chunk of my work has already been made useless, so it only seemed logical that the next step would be to just kill it completely.
Offering to work with you appears to be exactly what I'm doing, with this post and my original one. I'm only asking to hear about the projects you're working on. I thought it was a fair request; was I wrong?
No, but I took it wrong.
Good enough. I had virtually no knowledge of linux (or any other unix-based OS) when I started on Wikipedia, I also had to learn PHP and SQL. So you're doing better than me.
Wow, newbie.
How about we save that argument for another day?
Fine by me! :)