Having spent a good portion of my academic life in the field of Pattern
Recognition, and having read the CRM114 articles in depth, my gut tells me
that some of the other methods (throttling, etc.) would likely reduce
Wikipedia vandalism sufficiently that a Markovian approach would give
little additional benefit. Making vandalism easy to fix, like undoing all
edits from a particular user/IP with one switch, also sounds very useful.
I do like the idea of flagging suspect pages and I also really liked the
idea of flagging anonymous users at a higher suspect rate than logged in
users. I suspect that a "heuristic" (lots of rules) model similar to the
one used by SpamAssassin, where you can plug in new rules for new threats,
would probably be the best way to solve the vandalism problem for Wikipedia
over time (although I don't have a really good idea from the discussion thus
far as to whether it is already a serious issue or not).
For example, one heuristic would be "anonymous user": add a couple of points
to the "spam" score. Another heuristic, "this user/IP has made lots of edits
really quickly", would add a few points. If a user was on a "safe" list as a
mass spell checker/grammar checker, then you would subtract a bunch of points.
Heuristics for the type of article might be useful too; I suspect that
political articles, for example, are particularly targeted by vandals. The
point is, a heuristic engine adapts over time, whereas a Markovian model
would perhaps serve as ONE good heuristic within the larger engine. Markovian
or Bayesian engines tend to get "tired" over time. My Bayesian email spam
filter is now so tired of spam that it sees nearly everything as spam,
whether it is or isn't (of course, it was trained on nearly 300,000 spams
before it reached that point).
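To make that concrete, here is a minimal sketch of such a pluggable heuristic
engine in Python; the rules, weights, and Edit fields are all invented for
illustration, not a proposed implementation:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Edit:
    anonymous: bool        # made while not logged in
    edits_last_hour: int   # recent edit rate for this user/IP
    trusted: bool          # on a "safe" list (mass spell checker etc.)

# Each rule maps an edit to a score contribution; new rules for new
# threats plug in by appending to this list, SpamAssassin-style.
RULES: List[Callable[[Edit], float]] = [
    lambda e: 2.0 if e.anonymous else 0.0,             # anonymous user
    lambda e: 3.0 if e.edits_last_hour > 30 else 0.0,  # rapid-fire editing
    lambda e: -5.0 if e.trusted else 0.0,              # known-good user
]

def vandalism_score(edit: Edit) -> float:
    """Sum all rule contributions; higher means more suspect."""
    return sum(rule(edit) for rule in RULES)

# Edits scoring above some threshold would be listed on a generated
# "suspected vandalism" special page for human review.
print(vandalism_score(Edit(anonymous=True, edits_last_hour=40, trusted=False)))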
When a heuristic engine is coupled with human interaction and double
checking (as would be the natural case with Wikipedia) this becomes a great
system. You would just have a special suspected vandalism page that is
generated much as the special statistics type pages are generated now.
Knowing the seriousness of the problem would drive whether or not this
feature should be developed quickly, and that is knowledge I don't possess
at this point.
-Kelly
At 08:17 PM 3/14/2004, you wrote:
>So, I just installed the CRM114 Markovian spam filtering software:
>
> http://crm114.sourceforge.net/
>
>The whole thing is based on Bayesian filtering, which is just a way to
>make very dumb software make really smart decisions. With sufficient
>training, a very simple piece of software can make very accurate
>distinctions between spam and non-spam email messages. See Paul
>Graham's famous "A Plan for Spam" about this:
>
> http://www.paulgraham.com/spam.html
>
>The CRM114 stuff is Markovian, which means it's _even_dumber_ than
>Bayesian stuff, and makes _even_smarter_ decisions. More or less.
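To see the Bayesian idea in miniature: a toy word-frequency classifier (a
sketch of the principle only, nothing like CRM114's actual Markovian
algorithm) could look like this in Python:

import math
from collections import Counter

# Toy naive Bayes text classifier; training just counts words per class.
spam_words, ham_words = Counter(), Counter()

def train(text: str, is_spam: bool) -> None:
    (spam_words if is_spam else ham_words).update(text.lower().split())

def spam_score(text: str) -> float:
    # Log-likelihood ratio with add-one smoothing; > 0 reads "looks spammy".
    s_total = sum(spam_words.values()) + 1
    h_total = sum(ham_words.values()) + 1
    score = 0.0
    for word in text.lower().split():
        score += math.log((spam_words[word] + 1) / s_total)
        score -= math.log((ham_words[word] + 1) / h_total)
    return score

train("buy cheap pills now", is_spam=True)
train("meeting notes attached", is_spam=False)
print(spam_score("cheap pills"))   # positive: classified as spam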
>
>Anyways, one thing that's mentioned on the crm114 page is that folks
>use the same technology for different kinds of text sorting. Like, for
>system administrators, they can sort log file entries into ones
>they're interested in and ones they're not.
>
>And I was thinking: you know, it'd be nice to be able to flag
>acceptable and problematic articles in MediaWiki Web sites. Like, say,
>an admin sees some vandalism going on, and goes to fix it. One of the
>checkmarks on saving is "Vandalism fix" or some such. This would tag
>the previous version as... ungood. Something.
>
>And then after a while the software gets good at understanding what's
>ungood and what's not. And there could be a tracking page to say,
>"These seem to be pages in an ungood state." And it would be easier to
>find those and fix 'em.
>
>~ESP
>
>--
>Evan Prodromou <evan(a)wikitravel.org>
>Wikitravel - http://www.wikitravel.org/
>The free, complete, up-to-date and reliable world-wide travel guide
>_______________________________________________
>Wikitech-l mailing list
>Wikitech-l(a)Wikipedia.org
>http://mail.wikipedia.org/mailman/listinfo/wikitech-l
I'm sorry if my ignorance is showing here, but I went to both SourceForge
and gnutella.com and I still don't have any idea what Gnutella is, other
than that it's a peer-to-peer networking protocol for file sharing. In other
words, I can't figure out how Gnutella differs from Ares, the old
Napster, or BitTorrent from an architectural point of view.
I am more familiar with BitTorrent, which is very useful for distributing
copies of very large files (which a PDF version of Wikipedia would
certainly qualify as) while using hardly any bandwidth on the main
server. It's very cool that way. Perhaps Gnutella is similarly cool, but I
can't find a "what is Gnutella" web page.
Quoting: "The practical implication is that the BitTorrent system makes it
easy to
distribute very large files to large numbers of people while placing minimal
bandwidth requirements on the original "seeder." That is because everyone
who wants the file is sharing with one another, rather than downloading from
a central source. A separate file-sharing network known as eDonkey uses a
similar system."
Would someone who is familiar with both Gnutella and BitTorrent tell me why
using BitTorrent for such a project would be stupid? It would certainly use
less of Wikipedia's already strained bandwidth.
-Kelly
At 04:10 PM 3/9/2004, you wrote:
> >>>>> "IF" == Itai Fiat <itai(a)mail.portland.co.uk> writes:
>
> IF> Hello, This is my first post to wikitech-l (yay!), and a
> IF> presumptuous one at that. I recently had an idea, and have
> IF> written a short (the frame of reference being the
> IF> Magna Carta) article describing it at
> IF> http://meta.wikipedia.org/wiki/Gnutella (for confusion's
> IF> sake, the content of the article will not be recounted
> IF> here). While I have no problem developing this myself, I was
> IF> wondering if anybody has any objections - or, better yet,
> IF> suggestions or cash - they'd like to contribute in advance.
>
>So, I added comments on-site, but here's my feeling:
>
>I think a p2p distribution mechanism for Wikipedia is an EXCELLENT
>idea. It would/could do a lot to lower the load on the main servers.
>
>I also think that if I had a fancy new-generation p2p client, I'd make
>a go of re-distributing Wikipedia content. It would be a great way to
>promote my network, especially if it has support for in-network
>Websites (like Freenet does).
>
>But I don't think this is core functionality for Wikimedia, and I
>doubt that it'll get a lot of support around here. I think probably
>the best idea is to download the database dumps, and maybe experiment
>from there.
>
>You may want to shop this idea around to one of the open source P2P
>network groups, to see what they think about it. Frankly, there's more
>benefit for a P2P network than there is for Wikipedia, so they'll
>probably be more interested.
>
>Anyways, good idea. You're gonna need to stick with it to see it
>happen, though. The best ideas are like that.
>
>~ESP
>
>--
>Evan Prodromou <evan(a)wikitravel.org>
>Wikitravel - http://www.wikitravel.org/
>The free, complete, up-to-date and reliable world-wide travel guide
>_______________________________________________
>Wikitech-l mailing list
>Wikitech-l(a)Wikipedia.org
>http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Hi,
it seems that Wikipedia's HTTP response contains this header:
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
If I understand this right, this means the browser must revalidate the page
with the server before reusing any cached copy; in effect, the page is never
served straight from the browser cache.
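For anyone who wants to verify what the server actually sends, a quick check
using only Python's standard library (the URL is just an example):

import urllib.request

# Issue a HEAD request and print the caching directives the server sends.
req = urllib.request.Request("http://en.wikipedia.org/wiki/Main_Page",
                             method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Cache-Control"))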
Has this header always been there? Or has it been added recently? If so,
can it be removed again? :)
The reason I'm bringing this up is that I've noticed that Firefox has
suddenly started reloading all pages whenever I use the "Back" function.
It only started doing that today. Hence, I suppose the header wasn't
there before.
If the header isn't new and has always been there, does anyone have any
idea what else might be causing Firefox's change in behaviour?
Thanks,
Timwi
"Timwi" <timwi(a)gmx.net> schrieb:
> Rich Holton wrote:
>
> > Ummm... [[wikipedia:Orphaned Articles]] is distinct
> > from [[wikipedia:Orphaned articles]] (look closely).
>
> Yeah. Someone pointed it out to me on my User talk page. I suppose we
> can really delete [[Wikipedia:Orphaned articles]] and move
> [[Wikipedia:Orphaned Articles]] over to the intuitive capitalisation
> that matches our naming convention in the article namespace.
>
> > It does sound similar to what you describe, though
> > more muddled. Here we had:
> > -Heading plus sections A & B
> > -Heading
> > -Heading plus sections A thru Z (ie whole correct
> > page)
> > -Sections D thru Z
>
> ** PLING! **
>
> Whee, I know what's going on. And I'm really surprised this didn't come
> up before, but anyway. This is just a theory, but it seems so 100%
> plausible to me that it's just gotta be right:
>
> I think what is happening is this:
>
> - You click the little "edit" link for section editing. In your example,
> you edited the section "C".
> - You get an edit conflict. The top window on the edit conflict screen
> then displays the entire text of the article as it was submitted by
> the person that was quicker than you.
> - However, a hidden form element still contains the section number.
> Thus, when you submit, the software thinks you're still
> section-editing, although you're actually sending the entire article!
> - Thus, it replaces section "C" with the entire article text. This is
> why all sections except for "C" are duplicated.
>
> This definitely needs to be fixed :-) I suppose the easiest fix would be
> to just remove that hidden form element and have users fiddle with the
> entire article text. A more sensible fix would be to actually display
> only that section. However, that latter solution will be extremely
> complicated in cases where the person that was quicker than you added or
> removed a section above the one you were editing, so the sections are
> renumbered...
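A toy illustration of the bug and the "easy fix" described above; none of
this is actual MediaWiki code (which is PHP), and the form fields are
simplified for the sake of the sketch:

# Simplified model: an article is a list of section texts, and an edit
# submission carries a hidden "section" field plus the submitted text.
def apply_submission(sections: list, form: dict) -> list:
    if form.get("section"):             # hidden field survived a conflict
        i = int(form["section"]) - 1
        sections[i] = form["text"]      # whole article pasted into one slot!
    else:
        sections = form["text"].split("\n\n")
    return sections

# The fix: when the edit-conflict screen presents the full article text,
# clear the hidden section field so the resubmission counts as a
# whole-page edit rather than a section edit.
conflict_form = {"section": "", "text": "Intro\n\nA\n\nB\n\nC"}
print(apply_submission(["Intro", "A", "B", "old C"], conflict_form))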
If that is indeed the problem, it must be a relatively new bug - I have done
this kind of thing several times, and as far as I know only once did it
result in the given bug. An alternative explanation is of course that I have
caused the bug several times without realizing I did so...
Andre
Hello,
I played a little bit with the (German) old_table (from 20040305) and have
some questions about it.
1. As I understand it, the first entry of each row is an id, incremented for
each new row. Since some of the numbers are missing, I assume these are rows
which were deleted. Is this right?
2. There is a column timestamp and a column inverse timestamp. For what
reason do we need the inverse timestamp? (See the sketch after question 6.)
There also seems to be an inconsistency. The entry with the id 494209 (the
article about "Optik" in namespace 0) has timestamp "20031231041409" but
inverse timestamp "80008783828360". Maybe this is a new-year bug? Maybe it
should be corrected manually?!
3. Which namespace is represented by which of the numbers 0 to 9?
4. The column for the "user-comment" in the beginning often contains '*' but
later nothing. Was it just the behaviour of the old software to represent no
comment by '*'?
5. The column for the "user_id" in the beginning contains just 0, even for
well-known users. Later (after conversion_script appears) the correct id is
printed. The column for the "user_name" shows the same strange behaviour: it
contains DNS names instead of IP addresses. Was this the behaviour of the old
software?
6. The column "old_flags" seems to contain nothing (it's almost always empty,
but sometimes contains a single '0'). What's the purpose of this column? What
flags can it contain?
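On question 2: my understanding (an assumption, worth verifying against the
source) is that the inverse timestamp is 99999999999999 minus the timestamp,
so that old MySQL, which had no descending indexes, could sort newest-first
with an ascending index. A quick check of the "Optik" row under that
assumption:

def invert_timestamp(ts: str) -> str:
    # Assumed scheme: each digit d becomes 9 - d, i.e. 99999999999999 - ts.
    return "".join(str(9 - int(d)) for d in ts)

# The predicted value disagrees with the dump, supporting the suspicion
# that the stored inverse timestamp for this row is wrong:
print(invert_timestamp("20031231041409"))  # 79968768958590, not 80008783828360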
--Ivo Köthnig
"Jimmy Wales" <jwales(a)bomis.com> schrieb:
> Good grief. The patent system is out of control.
>
> Actually, since the original captcha idea is in fact clever, and since
> I'm not a totally anti-patent person, I could see it being reasonable to
> reward the inventors of the idea with a 2 year patent. 17 years is an
> infinity.
>
> I wonder if there are things that we've developed in house here at
> Wikimedia that we should patent, just to license our patents freely
> of course, and also to make fun of the patent system.
The problem is that patents cost money and time to obtain. On the other hand,
if we find a way to issue them in some form that says "free, provided you
don't stand on our mat with your patents either", it might be useful.
Now, I don't know the inner workings of the Wikipedia software well
enough to know what might be patented there, but on the outside:
* Best to patent, but most likely to fall short on being the first, would
of course be the combination of wiki-like technology with databases:
something like using a database to enable users to change webpages and
storing them in a simplified format that can be worked into HTML.
* Namespaces: having webpages in a database with both a title and an
identifier distinguishing various kinds of pages. Could be patented as a
way of allowing users to make comments on webpages (on a separate page).
* Reverting: having an interface to show changes to webpages and easily
restore them to an older version.
There's probably more, but it's past 4 am here...
Andre Engels
There's been a spambot active for the last few days. It uses a number (2-15)
of open proxies simultaneously to edit at a high rate, often on small
unattended wikis. It finds pages by spidering. Each anonymous proxy seems to
represent an autonomously spidering bot -- the bots often repeat each
other's work, adding the offending external link multiple times. Over the
last few days it's probably made over 1000 edits. The site in question is a
Chinese Internet marketing company called EMMSS.
The open proxy blocking code I've been developing since the discussion on
open proxies at wikien-l still has some teething problems. It should help if
it's tweaked somewhat for this particular situation (I originally intended
it for human vandals). However, there are a few other features on my wishlist
which I think would help to deal with this sort of problem.
One is a combined recent changes page, showing all Wikimedia wikis. This would
allow people to watch for anomalous activity on usually quiet wikis. I
imagine this would share code with the socket-based IRC bot which is in
development.
Another much-requested feature is enhancement of the rollback function,
especially to allow for page deletion. The bot created many pages,
apparently by following red links. In my opinion, the ideal feature would be
to allow the user to supply a list of IP addresses and usernames, and then
to revert every edit from those users with a single click. People cleaning
up after this spambot often had to revert manually because the bot had
edited the same page with multiple IP addresses. For completeness it would
be nice to include page-move reversion.
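A rough sketch of that one-click mass revert; the data model here is invented
for illustration and is not MediaWiki's actual schema:

from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Revision:
    author: str   # username or IP address
    text: str

def mass_rollback(history: List[Revision],
                  offenders: Set[str]) -> Optional[str]:
    """Return the text of the newest revision not made by any listed
    user/IP, or None if every revision is theirs (a bot-created page,
    which should simply be deleted)."""
    for rev in reversed(history):        # history is oldest-first
        if rev.author not in offenders:
            return rev.text
    return None

history = [Revision("alice", "good text"),
           Revision("1.2.3.4", "good text + spam link"),
           Revision("5.6.7.8", "good text + two spam links")]
print(mass_rollback(history, {"1.2.3.4", "5.6.7.8"}))  # -> "good text"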
For now we've been dealing with this using filters and editor manpower.
Filters are bad; they're against the wiki security model. I'd much rather
put more power into the hands of users to deal with these situations, than
to continue a filter-based arms race.
If anyone wants to help out with features such as these, please speak up.
In case anyone's wondering, the main response to this problem so far has
been to contact a developer, who then makes any willing editors
temporary sysops on the wikis under attack.
-- Tim Starling
In my discussion with Ilse (based on which I recently sent the request to
reduce the put_throttle), we also got to the subject of XML feeds. I
mentioned that Yahoo was already getting one, and my contact at Ilse said he
would be interested in one as well. Thus my questions:
* Would sending out XML feeds to other parties with a good reason for
interest be a good idea, or would it be detrimental? (For example, how does
the load on the servers from sending out an XML feed compare to being
spidered by a search engine?)
* If the answer is positive, who can I or my contact get in touch with to
discuss the possibility?
Andre Engels
Hi,
I have uploaded a few chess images by taking screenshots, which is quite
cumbersome.
WikiProject Chess says:
Having uniform chess diagrams will be easy once LaTeX chess support is
here. Word has it that it will soon be implemented on wikipedia. It is
already implemented on http://wikisophia.org.
Are there any plans to integrate the wikisophia.org implementation into
MediaWiki?
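I don't know what wikisophia's implementation uses, but "LaTeX chess support"
usually means something like the skak package; a minimal diagram source would
be:

\documentclass{article}
\usepackage{skak}  % common LaTeX chess package (assumed, not confirmed,
                   % to be what the wikisophia implementation uses)
\begin{document}
\newgame                        % start from the initial position
\mainline{1. e4 e5 2. Nf3 Nc6}  % typeset the moves, updating the position
\showboard                      % render the current position as a diagram
\end{document}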
Cheers
Arvind
--
It's all GNU to me