Some comments


Tim Starling

29 Mar 2006

5:06 p.m.

I found out about this list a few days ago, and I've read back through some of the archives. I have a few comments.

Why are the list archives private? Why isn't it listed on the mail.wikipedia.org index page? Why isn't it in gmane? Why have I never heard of it before? Gregory Maxwell has been whinging that nobody is listening to him on a list that nobody can read. Kate can be a bit secretive at times, and this was at least at one time her pet project, but maybe now that she seems to have abandoned it, then it's time to change the structure.

Neither the e.V. nor Kate made any particular attempt to involve the other Wikimedia system administrators in this project from its conception. I was certainly sceptical about zedler's value as a tool server compared to the use we could have made of it as part of the core cluster. I've now heard about one project that I'm interested in, and I have an open mind about the rest, but you still have to make the case. Specifically: how does your project benefit Wikipedia? Why should I support it?

Daniel Kinzler wrote:

...

Yesterday, Kate told me that the problem with replication from the Asian cluster is that MySQL can only connect to one replication master. I have googled a bit, and it appears that that is not true (at least for MySQL 5.1): http://dev.mysql.com/doc/refman/5.1/en/replication-intro.html says: Multiple-master replication is possible, but raises issues not present in single-master replication. See Section 6.15, “Auto-Increment in Multiple-Master Replication”.

Multiple-master replication in this context could more aptly be called circular replication. This is where you have, say, 3 servers, A replicating B, B replicating C, C replicating A. Then you can write to any of the three servers, and the writes will be propagated to the other 2 servers. This is quite useless for the toolserver, where we have 5 masters which will never replicate from each other in a circle.

It should be possible to set up 5 MySQL instances and have each of them replicating from a different master. Is anyone volunteering to set up those instances? Maybe we need to give root access to someone who actually cares about this stuff.

It would be easier if we had a VLAN, so that we didn't have to set up 5 ssh tunnels. Does anyone know anything about VLANs? Does anyone care enough about this project to research it?

Regarding Daniel's WikiProxy: I have reviewed the code, and I have the following comments:

* Use curl, not file_get_contents().
* With curl you can set a short timeout; with file_get_contents() it will be 3 minutes. Set a timeout of a few seconds, and then use exponential backoff. Requests get lost sometimes, retries help.
* Tell curl to proxy the request via rr.pmtpa.wikimedia.org:80. This will skip the knams squid cluster and save a few milliseconds.

For applications using it: if it's too slow, use a few parallel threads. Anything up to about 5 requests per second should be OK.

Who here needs more than 5 requests per second? Who needs a latency of less than a few hundred milliseconds? What exactly do you want full text replication for?

-- Tim Starling
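[Editor's sketch] The WikiProxy advice above (short timeout, exponential backoff, request routed through a proxy) might look roughly like the following. This is an illustration in Python, not the WikiProxy code, which is PHP and is not shown in this thread. The function names, the 3-second timeout, the retry count and the 0.5-second starting delay are all assumptions; only the proxy host comes from the post.

```python
import time
import urllib.request

# Proxy host named in the post; everything else here is an assumed default.
PROXY = "http://rr.pmtpa.wikimedia.org:80"


def fetch_once(url, timeout=3.0):
    """Do one HTTP GET through the proxy, giving up after `timeout` seconds."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": PROXY})
    )
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_backoff(url, fetch=fetch_once, retries=4, base_delay=0.5,
                       sleep=time.sleep):
    """Call fetch(url) up to `retries` times, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:  # covers timeouts and urllib.error.URLError
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the last error
            sleep(delay)
            delay *= 2
```

The `fetch` and `sleep` parameters exist only so the retry logic can be exercised without touching the network; a caller would normally use the defaults.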


Gregory Maxwell

29 Mar

6:29 p.m.

On 3/29/06, Tim Starling <tstarling(a)wikimedia.org> wrote: [quoted out of order]

...

Who here needs more than 5 requests per second? Who needs a latency of less than a few hundred milliseconds? What exactly do you want full text replication for?

5 requests per second isn't enough to keep up with reading 2-6 full texts per recent change on enwiki, which is what my bad edit detection tool does (depending on the content, more than two because it will look deep enough to help people fully revert contiguous multi-edit vandalism). Enwiki has a bit over 1.2 main namespace edits per second on a 24 hour average. I need to keep up with the instantaneous rate on a 1 minute-ish window, which is much greater, if I don't want the vandalism checking to lag annoyingly far behind realtime. This ran with negligible impact on toolserver when we still had text there... [snip]
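[Editor's sketch] The arithmetic behind this objection can be written out. Using only the figures quoted above (about 1.2 edits per second on average, 2 to 6 full texts per edit, a 5 requests per second ceiling), the average load alone is roughly 2.4 to 7.2 requests per second. The burst rate over a one-minute window is described only as "much greater" and is not given, so it is left out. The function names are illustrative.

```python
AVG_EDITS_PER_SEC = 1.2   # enwiki main-namespace 24-hour average quoted in the post
TEXTS_PER_EDIT = (2, 6)   # full texts read per recent change (low, high)
REQUEST_LIMIT = 5.0       # requests per second suggested as acceptable


def required_rate(edits_per_sec, texts_per_edit):
    """Requests per second needed to read `texts_per_edit` texts for every edit."""
    return edits_per_sec * texts_per_edit


def max_texts_per_edit(limit, edits_per_sec):
    """Largest number of texts per edit that still fits under `limit`."""
    return limit / edits_per_sec


low = required_rate(AVG_EDITS_PER_SEC, TEXTS_PER_EDIT[0])
high = required_rate(AVG_EDITS_PER_SEC, TEXTS_PER_EDIT[1])
print(f"average load: {low:.1f} to {high:.1f} requests/second")
print(f"limit is exceeded above {max_texts_per_edit(REQUEST_LIMIT, AVG_EDITS_PER_SEC):.2f} texts per edit")
```

Even at the average edit rate, the ceiling is crossed once the tool reads more than about four texts per edit; any burst above the average lowers that threshold further.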

...

It would be easier if we had a VLAN, so that we didn't have to set up 5 ssh tunnels. Does anyone know anything about VLANs? Does anyone care enough about this project to research it?

That's a subject area well within my professional skillset, I'd be more than willing to look into it. [snip]

...

Why are the list archives private? Why isn't it listed on the mail.wikipedia.org index page? Why isn't it in gmane? Why have I never heard of it before? Gregory Maxwell has been whinging that nobody is listening to him on a list that nobody can read.

Well it wasn't just a whine, I did repeat my offer to help. I also CCed the wikidev list, though I can understand why a single request there would be lost. Nor were my requests in the wikimedia-dev IRC channel and the toolserver IRC channel made in an inaccessible place. [snip]

...

It should be possible to set up 5 MySQL instances and have each of them replicating from a different master. Is anyone volunteering to set up those instances? Maybe we need to give root access to someone who actually cares about this stuff.

I, in effect, volunteered to do it when I pointed out on this list previously that it would solve that issue. I've set up replication with other RDBMSes, never MySQL... but there is already a configured instance to work from. I don't even see how this could be considered a challenge... it's the ongoing maintenance that carries the real burden.

...

Neither the e.V. nor Kate made any particular attempt to involve the other Wikimedia system administrators in this project from its conception. I was certainly sceptical about zedler's value as a tool server compared to the use we could have made of it as part of the core cluster. I've now heard about one project that I'm interested in, and I have an open mind about the rest, but you still have to make the case. Specifically: how does your project benefit Wikipedia? Why should I support it?

You should support it for the following reasons:

1) There are many projects on toolserver considered valuable by many Wikipedia editors. And many of these projects are far less useful, or not even possible, without realtime or near realtime database access. (In particular, high speed access to the *links tables is very useful and difficult to replace with anything available remotely.)

2) The existence of the toolserver allows interested third parties to perform work which would otherwise be requested of the developers. Work which is of questionable value can be proven by those who care, real workload on developers can be reduced, and more tasks which are valuable but not valuable enough to justify interrupting core developers can be completed... and finally latency for such requests is reduced.

3) Toolserver acts as an incubator to build qualified developers. I, and several other people, now have a far more solid understanding of the mediawiki database schema as a result of toolserver... I've learned many things which wouldn't be important working on a static dump but are important working on realtime data, and presumably the real site. Many of the toolserver users are directly using portions of mediawiki code, as well. It is intuitively obvious that this will increase the number of people knowledgeable about mediawiki, and thus benefit Wikimedia and mediawiki.

If you do not believe that toolserver is being effectively utilized, then by all means begin plans to repurpose it. My response will simply be to address Jimbo, Angela, and Anthere directly, remind them of their questions which I used toolserver access to answer, and offer to donate another comparable piece of hardware under the condition that it be used exclusively for toolserver use and maintained under a continued administrative plan I approve of... I have absolutely no doubt that such a negotiation would reach an agreeable conclusion, as toolserver is very widely found useful. Quite frankly, you will not be able to stonewall requests that developers provide the minimal level of support necessary to support this type of service. It would be far more productive if rather than demanding we re-justify the obvious need for toolserver, you offer to work with us to determine a plan of action which will put the burden of the service maximally on those interested in it, and minimally on those not interested in it.

If it isn't clear, I'm more than completely willing to take a substantial role in the ongoing maintenance of the existing toolserver. However, I'm not the ideal candidate: I escaped running Solaris boxes when I left the world of being a fulltime sysadmin, and I don't particularly look forward to going back to it... and I am not a fan of MySQL and find it particularly poorly suited for the sort of queries we perform on toolserver (and this is demonstrably true, query planning for anything with a subquery is poor enough to be considered outright broken). Both of these factors will require me to learn things I would otherwise not, but I am willing to do so because I am not burdened by a disbelief in the importance of toolserver.

Further, my persistent and annoying questions about why we continue to work around MySQL's shortcomings (for example, non-BMP UTF-8 support) rather than at least evaluating other opensource RDBMSes, have clearly caused tension between myself and, at least, our developers who are otherwise employed by MySQL AB.

Tim Starling

30 Mar

3:26 a.m.

Gregory Maxwell wrote:

...

On 3/29/06, Tim Starling <tstarling(a)wikimedia.org> wrote: [quoted out of order]

Who here needs more than 5 requests per second? Who needs a latency of less than a few hundred milliseconds? What exactly do you want full text replication for?

Including caching it should be enough, that's one of the reasons I picked the number. You can store the text of recent edits in a disk cache. If there are two edits to the same article in quick succession, you'll still have the text from the first in the cache when it comes to analysing the second. You could cache all text from the last few hours, then limit your analysis to that time frame. An optimisation would be to produce a full-text feed, say via a TCP stream. The apaches could send it on save using a message passing library. But we'd have to evaluate that in terms of the resources it would save. And compare it to the replication option of course.
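[Editor's sketch] The caching idea in this reply (keep the text of recent edits for a few hours so that a second edit to the same article can be compared against the first without another fetch) could be sketched as below. Tim suggests a disk cache; this in-memory version is a stand-in chosen to keep the example short. The class name, the three-hour default and the use of revision IDs as keys are assumptions.

```python
import time
from collections import OrderedDict


class RecentTextCache:
    """Keep revision text for a limited time, dropping the oldest entries first."""

    def __init__(self, max_age_seconds=3 * 3600, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock
        self._entries = OrderedDict()  # rev_id -> (stored_at, text), oldest first

    def put(self, rev_id, text):
        """Store (or refresh) the text for one revision."""
        self._evict()
        self._entries[rev_id] = (self.clock(), text)
        self._entries.move_to_end(rev_id)

    def get(self, rev_id):
        """Return the cached text, or None if it is missing or has expired."""
        self._evict()
        entry = self._entries.get(rev_id)
        return entry[1] if entry else None

    def _evict(self):
        """Remove every entry older than max_age."""
        cutoff = self.clock() - self.max_age
        while self._entries:
            rev_id, (stored_at, _) = next(iter(self._entries.items()))
            if stored_at >= cutoff:
                break
            del self._entries[rev_id]
```

A tool would call `get()` before issuing an HTTP request and `put()` after a successful fetch, so only revisions outside the time window cost a request.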

...

[snip]

It would be easier if we had a VLAN, so that we didn't have to set up 5 ssh tunnels. Does anyone know anything about VLANs? Does anyone care enough about this project to research it?

That's a subject area well within my professional skillset, I'd be more than willing to look into it. [snip]

Are you saying you think the archives should be private?

...

[snip]

OK, let's move forward with this then.

...

I was more interested in the details of the projects, actually.

...

If you do not believe that toolserver is being effectively utilized, then by all means begin plans to repurpose it.

It's not exactly up to me. I had no influence in choosing the role of this server. I'm only talking about use of my time.

...

My response will simply be to address Jimbo, Angela, and Anthere directly, remind them of their questions which I used toolserver access to answer, and offer to donate another comparable piece of hardware under the condition that it be used exclusively for toolserver use and maintained under a continued administrative plan I approve of... I have absolutely no doubt that such a negotiation would reach an agreeable conclusion, as toolserver is very widely found useful. Quite frankly, you will not be able to stonewall requests that developers provide the minimal level of support necessary to support this type of service. It would be far more productive if rather than demanding we re-justify the obvious need for toolserver, you offer to work with us to determine a plan of action which will put the burden of the service maximally on those interested in it, and minimally on those not interested in it.

Offering to work with you appears to be exactly what I'm doing, with this post and my original one. I'm only asking to hear about the projects you're working on. I thought it was a fair request, was I wrong?

...

If it isn't clear, I'm more than completely willing to take a substantial role in the ongoing maintenance of the existing toolserver. However, I'm not the ideal candidate: I escaped running Solaris boxes when I left the world of being a fulltime sysadmin, and I don't particularly look forward to going back to it... and I am not a fan of MySQL and find it particularly poorly suited for the sort of queries we perform on toolserver (and this is demonstrably true, query planning for anything with a subquery is poor enough to be considered outright broken). Both of these factors will require me to learn things I would otherwise not, but I am willing to do so because I am not burdened by a disbelief in the importance of toolserver.

Good enough. I had virtually no knowledge of linux (or any other unix-based OS) when I started on Wikipedia, I also had to learn PHP and SQL. So you're doing better than me.

...

Further, my persistent and annoying questions about why we continue to work around MySQL's shortcomings (for example, non-BMP UTF-8 support) rather than at least evaluating other opensource RDBMSes, have clearly caused tension between myself and, at least, our developers who are otherwise employed by MySQL AB.

How about we save that argument for another day? -- Tim Starling

Gregory Maxwell

4:45 a.m.

On 3/29/06, Tim Starling <tstarling(a)wikimedia.org> wrote:

...

I have some doubts there... but ultimately we're in a situation where toolserver will be pulling every version (if due to nothing more than the filter bot), plus there are users who will wish to read every current revision for things like geo-reference extraction, and users who will read every version in the history of an article (history flow analysis). So sure, 'cached' will work, but how much more cached can you get beyond replicated?

...

Are you saying you think the archives should be private?

No, I wasn't... just that my complaints were not just here (and obviously whoever is currently authoritative should be on this list...). But now that I've considered it some, this list has been used at least once to say "hey twits, I looked at your code and it's vulnerable to SQL injection" ... so keeping it closed except to users who otherwise have access isn't necessarily bad.

[multiple mysql instances]

...

OK, let's move forward with this then.

Ok [snip generic justification from me]

...

I was more interested in the details of the projects, actually.

Well I can speak for what I've done... I think other people are doing more interesting things than I am, in part because I keep running into stupid problems. (Loss of text access, found a subselect bug in MySQL which we've since had fixed, impatience with Solaris.)

I've got an IRC bot that is currently inactive due to lack of text access, which detects suspect edits based on a number of criteria and has a remarkably low false positive rate... and was widely used (I still get emails every week or two asking me to turn it back on, even though it's been down for some time now). Based on the same code I have a tool which does realtime grammar checking, though I was still working on getting that production ready...

I have a fair number of rather boring reports, things that perform set operations with the categories, pagelinks, imagelinks, templatelinks, etc. tables... They answer questions which are used by many users to implement procedures on enwikipedia... such as 'show me all pages which have one of these "indicates a human" categories but without one of these "indicates a dead human" categories which isn't tagged with the living people category' (which related changes is used on for libel patrolling). The lower computational complexity ones are performed on request in realtime, other ones are either cached or triggered out of cron.

I update WP:1000, and I also answer random bullshit questions posed by just about anyone who can manage to ask something clear enough that I can turn it into SQL (http://en.wikipedia.org/wiki/Wikipedia_talk:Userbox_policy_poll#Meta_analys…). I also answer such trivia in cases which aren't so unimportant... like injecting data into the FUD about April first.

I've built a web interface similar to http://www.placeopedia.com/ (though far less polished), using TIGER/Line (thus only US) data stored in a PostGIS database, but that's not on toolserver anymore because I had dependency trouble with some of the mapping libraries using Solaris 10 and gave up for other projects for a while.

I have a couple of other tools ready to go, simple stuff like per-article edit activity rate graphers and such, but all of my stats stuff is done using [[R_programming_language]] which had serious issues on Solaris 10 last time I gave it a shot. There really is enough work that stuff that doesn't just work on Solaris gets moved to the back.

I also run a wikipedia bot that does a number of tasks which is not itself on toolserver, but which operates itself based on queries performed on toolserver. Almost all the tasks I have it do are ones which would require reading hundreds of pages via http were I unable to drive it with queries.

I also have a 1/4th finished framework which will eventually be used for collaborative review. Initially I'm going to target it for stupid stuff like reviewing blocks. I actually haven't shown it to anyone yet, it's yet another project where I'm hung up on lack of text access, but you can look... main page is at http://tools.wikimedia.de/~gmaxwell/audit/ one of the applications is at http://tools.wikimedia.de/~gmaxwell/cgi-bin/audit_block_audit.py and a static view of the app (so you don't have to set up an account, which is kinda annoying when replag is high, http://tools.wikimedia.de/~gmaxwell/audit_block_audit.html) ... It's almost functional, I have to work out some policy decisions on how to handle deleted edits for sysop users vs non-sysop users before I can make that little bit go live. The whole purpose of such tools is to bring all the relevant information into one place so a user can just work down a list making decisions... Without text access, it's hard to bring all the relevant information into one place. Yes, I can http get it but it would be silly to implement that if we're going to get access back in some form, and if we're not going to get it back I'm going to give up toolserver anyways.

There are other users, with tools like the countdown deletion tool, article spell checkers. I'll let them speak for themselves.
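[Editor's sketch] The category report described in this message (pages in a "human" category, not in a "dead human" category, and not yet tagged as living people) is a set operation over category membership. The version below runs against an in-memory SQLite table so it is self-contained. The table name `categorylinks` and the columns `cl_from` (page ID) and `cl_to` (category name) follow the MediaWiki schema of the time but are reduced to two columns here, and the category names and page IDs are made up. Gregory's actual queries are not shown in the thread.

```python
import sqlite3

# Hypothetical category lists; the real ones are not given in the thread.
HUMAN = ["1950_births", "American_novelists"]
DEAD = ["2001_deaths"]
LIVING = "Living_people"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "1950_births"), (1, "Living_people"),   # already tagged
    (2, "1950_births"), (2, "2001_deaths"),     # dead
    (3, "American_novelists"),                  # human, untagged
])


def untagged_living_people(conn, human, dead, living):
    """Page IDs in any `human` category, in no `dead` category, lacking `living`."""
    def marks(n):
        return ",".join("?" * n)

    sql = f"""
        SELECT DISTINCT cl_from FROM categorylinks
        WHERE cl_to IN ({marks(len(human))})
          AND cl_from NOT IN (
              SELECT cl_from FROM categorylinks WHERE cl_to IN ({marks(len(dead))}))
          AND cl_from NOT IN (
              SELECT cl_from FROM categorylinks WHERE cl_to = ?)
        ORDER BY cl_from
    """
    return [row[0] for row in conn.execute(sql, [*human, *dead, living])]


print(untagged_living_people(conn, HUMAN, DEAD, LIVING))
```

With the sample rows only page 3 qualifies. The subquery form is used for readability; given the complaint earlier in the thread about MySQL's subquery planning, a production version on the toolserver would more likely have been written with joins.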

...

If you do not believe that toolserver is being effectively utilized, then by all means begin plans to repurpose it.

It's not exactly up to me. I had no influence in choosing the role of this server. I'm only talking about use of my time.

Ah. My apologies there, it's just that toolserver has been neglected for so long and already a good chunk of my work has been made useless, it only seemed logical that the next step would be to just kill it completely.

...

No, but I took it wrong.

...

Good enough. I had virtually no knowledge of linux (or any other unix-based OS) when I started on Wikipedia, I also had to learn PHP and SQL. So you're doing better than me.

Wow, newbie.

...

How about we save that argument for another day?

Fine by me! :)

Gregory Maxwell

4:56 a.m.

On 3/29/06, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:

...

I also run a wikipedia bot that does a number of tasks which is not itself on toolserver, but which operates itself based on queries performed on toolserver. Almost all the tasks I have it do are ones which would require reading hundreds of pages via http were I unable to drive it with queries.

http://tools.wikimedia.de/~kate/cgi-bin/count_edits?user=Roomba&dbname=… (based on the sort of activity Roomba does, the workload savings vs screen scraping is remarkable... for example one of its more frequent activities is tagging orphaned fair use images. Once a day it finds all of them, and keeps a running list. Images which are persistently orphaned for a while get tagged. The query typically takes a couple of seconds; without toolserver Roomba would need to walk the categories to find all the image pages, then load them to find what links to them... This would result in about a million additional http requests a week. Limited to 5 requests/second, it just wouldn't work at all)


toolserver-l@lists.wikimedia.org
