Having spent a good portion of my academic life in the field of Pattern
Recognition, and having read the CRM114 articles in depth, my gut tells me
that some of the other methods (throttling, etc.) would likely reduce
Wikipedia vandalism sufficiently that a Markovian approach would give
little additional benefit. Making vandalism easy to fix, like undoing all
edits from a particular user/IP with one switch, also sounds very useful.
I do like the idea of flagging suspect pages and I also really liked the
idea of flagging anonymous users at a higher suspect rate than logged in
users. I suspect that a "heuristic" (lots of rules) model similar to the
one used by SpamAssassin, where you can plug in new rules for new threats,
would probably be the best way to solve the vandalism problem for Wikipedia
over time (although I don't have a really good idea from the discussion thus
far as to whether it is already a serious issue or not).
For example, one heuristic would be "anonymous user": add a couple of points
to the "spam" score. Another heuristic, "this user/IP has made lots of edits
really quickly", would add a few points. If a user was on a "safe" list as a
mass spell checker/grammar checker, then you would subtract a bunch of points.
Heuristics for the type of article might be useful too; I suspect that
political articles, for example, are particularly targeted by vandals. The
point is, a heuristic engine adapts over time, whereas a Markovian model
would perhaps serve as ONE good heuristic within the larger engine. Markovian
or Bayesian engines tend to get "tired" over time. My Bayesian email spam
filter is now so tired of spam that it sees nearly everything as spam,
whether it is or isn't (of course, it was trained on nearly 300,000 spams
before it reached that point).
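To make that concrete, here is a minimal sketch of such a pluggable heuristic
engine in Python; the rules, weights, and Edit fields are all invented for
illustration, not a proposed implementation:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Edit:
    anonymous: bool        # made while not logged in
    edits_last_hour: int   # recent edit rate for this user/IP
    trusted: bool          # on a "safe" list (mass spell checker etc.)

# Each rule maps an edit to a score contribution; new rules for new
# threats plug in by appending to this list, SpamAssassin-style.
RULES: List[Callable[[Edit], float]] = [
    lambda e: 2.0 if e.anonymous else 0.0,             # anonymous user
    lambda e: 3.0 if e.edits_last_hour > 30 else 0.0,  # rapid-fire editing
    lambda e: -5.0 if e.trusted else 0.0,              # known-good user
]

def vandalism_score(edit: Edit) -> float:
    """Sum all rule contributions; higher means more suspect."""
    return sum(rule(edit) for rule in RULES)

# Edits scoring above some threshold would be listed on a generated
# "suspected vandalism" special page for human review.
print(vandalism_score(Edit(anonymous=True, edits_last_hour=40, trusted=False)))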
When a heuristic engine is coupled with human interaction and double
checking (as would be the natural case with Wikipedia) this becomes a great
system. You would just have a special suspected vandalism page that is
generated much as the special statistics type pages are generated now.
Knowing the seriousness of the problem would drive whether or not this
feature should be developed quickly, and that is knowledge I don't possess
at this point.
-Kelly
At 08:17 PM 3/14/2004, you wrote:
>So, I just installed the CRM114 Markovian spam filtering software:
>
> http://crm114.sourceforge.net/
>
>The whole thing is based on Bayesian filtering, which is just a way to
>make very dumb software make really smart decisions. With sufficient
>training, a very simple piece of software can make very accurate
>distinctions between spam and non-spam email messages. See Paul
>Graham's famous "A Plan for Spam" about this:
>
> http://www.paulgraham.com/spam.html
>
>The CRM114 stuff is Markovian, which means it's _even_dumber_ than
>Bayesian stuff, and makes _even_smarter_ decisions. More or less.
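To see the Bayesian idea in miniature: a toy word-frequency classifier (a
sketch of the principle only, nothing like CRM114's actual Markovian
algorithm) could look like this in Python:

import math
from collections import Counter

# Toy naive Bayes text classifier; training just counts words per class.
spam_words, ham_words = Counter(), Counter()

def train(text: str, is_spam: bool) -> None:
    (spam_words if is_spam else ham_words).update(text.lower().split())

def spam_score(text: str) -> float:
    # Log-likelihood ratio with add-one smoothing; > 0 reads "looks spammy".
    s_total = sum(spam_words.values()) + 1
    h_total = sum(ham_words.values()) + 1
    score = 0.0
    for word in text.lower().split():
        score += math.log((spam_words[word] + 1) / s_total)
        score -= math.log((ham_words[word] + 1) / h_total)
    return score

train("buy cheap pills now", is_spam=True)
train("meeting notes attached", is_spam=False)
print(spam_score("cheap pills"))   # positive: classified as spam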
>
>Anyways, one thing that's mentioned on the crm114 page is that folks
>use the same technology for different kinds of text sorting. Like, for
>system administrators, they can sort log file entries into ones
>they're interested in and ones they're not.
>
>And I was thinking: you know, it'd be nice to be able to flag
>acceptable and problematic articles in MediaWiki Web sites. Like, say,
>an admin sees some vandalism going on, and goes to fix it. One of the
>checkmarks on saving is "Vandalism fix" or some such. This would tag
>the previous version as... ungood. Something.
>
>And then after a while the software gets good at understanding what's
>ungood and what's not. And there could be a tracking page to say,
>"These seem to be pages in an ungood state." And it would be easier to
>find those and fix 'em.
>
>~ESP
>
>--
>Evan Prodromou <evan(a)wikitravel.org>
>Wikitravel - http://www.wikitravel.org/
>The free, complete, up-to-date and reliable world-wide travel guide
>_______________________________________________
>Wikitech-l mailing list
>Wikitech-l(a)Wikipedia.org
>http://mail.wikipedia.org/mailman/listinfo/wikitech-l
I'm sorry if my ignorance is showing here, but I went to both SourceForge
and gnutella.com and I still don't have any idea what Gnutella is, other
than that it's a peer-to-peer networking protocol for file sharing. In other
words, I can't figure out how Gnutella differs from Ares, the old
Napster, or BitTorrent from an architectural point of view.
I am more familiar with BitTorrent, which is very useful for distributing
copies of very large files (which a PDF version of Wikipedia would
certainly qualify as) while using hardly any bandwidth on the main
server. It's very cool that way. Perhaps Gnutella is similarly cool, but I
can't find a "what is Gnutella" web page.
Quoting: "The practical implication is that the BitTorrent system makes it
easy to
distribute very large files to large numbers of people while placing minimal
bandwidth requirements on the original "seeder." That is because everyone
who wants the file is sharing with one another, rather than downloading from
a central source. A separate file-sharing network known as eDonkey uses a
similar system."
Would someone who is familiar with both Gnutella and BitTorrent tell me why
using BitTorrent for such a project would be stupid? It would certainly use
less of Wikipedia's already strained bandwidth.
-Kelly
At 04:10 PM 3/9/2004, you wrote:
> >>>>> "IF" == Itai Fiat <itai(a)mail.portland.co.uk> writes:
>
> IF> Hello, This is my first post to wikitech-l (yay!), and a
> IF> presumptuous one at that. I recently had an idea, and have
> IF> written a short (the frame of reference being the
> IF> Magna Carta) article describing it at
> IF> http://meta.wikipedia.org/wiki/Gnutella (for confusion's
> IF> sake, the content of the article will not be recounted
> IF> here). While I have no problem developing this myself, I was
> IF> wondering if anybody has any objections - or, better yet,
> IF> suggestions or cash - they'd like to contribute in advance.
>
>So, I added comments on-site, but here's my feeling:
>
>I think a p2p distribution mechanism for Wikipedia is an EXCELLENT
>idea. It would/could do a lot to lower the load on the main servers.
>
>I also think that if I had a fancy new-generation p2p client, I'd make
>a go of re-distributing Wikipedia content. It would be a great way to
>promote my network, especially if it has support for in-network
>Websites (like Freenet does).
>
>But I don't think this is core functionality for Wikimedia, and I
>doubt that it'll get a lot of support around here. I think probably
>the best idea is to download the database dumps, and maybe experiment
>from there.
>
>You may want to shop this idea around to one of the open source P2P
>network groups, to see what they think about it. Frankly, there's more
>benefit for a P2P network than there is for Wikipedia, so they'll
>probably be more interested.
>
>Anyways, good idea. You're gonna need to stick with it to see it
>happen, though. The best ideas are like that.
>
>~ESP
>
>--
>Evan Prodromou <evan(a)wikitravel.org>
>Wikitravel - http://www.wikitravel.org/
>The free, complete, up-to-date and reliable world-wide travel guide
>_______________________________________________
>Wikitech-l mailing list
>Wikitech-l(a)Wikipedia.org
>http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Hi,
it seems that Wikipedia's HTTP response contains this header:
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
If I understand this right, this means the browser must revalidate the page
with the server before reusing any cached copy; in effect, the page is never
served straight from the browser cache.
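For anyone who wants to verify what the server actually sends, a quick check
using only Python's standard library (the URL is just an example):

import urllib.request

# Issue a HEAD request and print the caching directives the server sends.
req = urllib.request.Request("http://en.wikipedia.org/wiki/Main_Page",
                             method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Cache-Control"))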
Has this header always been there? Or has it been added recently? If so,
can it be removed again? :)
The reason I'm bringing this up is that I've noticed that Firefox has
suddenly started reloading all pages whenever I use the "Back" function.
It only started doing that today. Hence, I suppose the header wasn't
there before.
If the header isn't new and has always been there, does anyone have any
idea what else might be causing Firefox's change in behaviour?
Thanks,
Timwi
"Timwi" <timwi(a)gmx.net> schrieb:
> Rich Holton wrote:
>
> > Ummm... [[wikipedia:Orphaned Articles]] is distinct
> > from [[wikipedia:Orphaned articles]] (look closely).
>
> Yeah. Someone pointed it out to me on my User talk page. I suppose we
> can really delete [[Wikipedia:Orphaned articles]] and move
> [[Wikipedia:Orphaned Articles]] over to the intuitive capitalisation
> that matches our naming convention in the article namespace.
>
> > It does sound similar to what you describe, though
> > more muddled. Here we had:
> > -Heading plus sections A & B
> > -Heading
> > -Heading plus sections A thru Z (ie whole correct
> > page)
> > -Sections D thru Z
>
> ** PLING! **
>
> Whee, I know what's going on. And I'm really surprised this didn't come
> up before, but anyway. This is just a theory, but it seems so 100%
> plausible to me that it's just gotta be right:
>
> I think what is happening is this:
>
> - You click the little "edit" link for section editing. In your example,
> you edited the section "C".
> - You get an edit conflict. The top window on the edit conflict screen
> then displays the entire text of the article as it was submitted by
> the person that was quicker than you.
> - However, a hidden form element still contains the section number.
> Thus, when you submit, the software thinks you're still
> section-editing, although you're actually sending the entire article!
> - Thus, it replaces section "C" with the entire article text. This is
> why all sections except for "C" are duplicated.
>
> This definitely needs to be fixed :-) I suppose the easiest fix would be
> to just remove that hidden form element and have users fiddle with the
> entire article text. A more sensible fix would be to actually display
> only that section. However, that latter solution will be extremely
> complicated in cases where the person that was quicker than you added or
> removed a section above the one you were editing, so the sections are
> renumbered...
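A toy illustration of the bug and the "easy fix" described above; none of
this is actual MediaWiki code (which is PHP), and the form fields are
simplified for the sake of the sketch:

# Simplified model: an article is a list of section texts, and an edit
# submission carries a hidden "section" field plus the submitted text.
def apply_submission(sections: list, form: dict) -> list:
    if form.get("section"):             # hidden field survived a conflict
        i = int(form["section"]) - 1
        sections[i] = form["text"]      # whole article pasted into one slot!
    else:
        sections = form["text"].split("\n\n")
    return sections

# The fix: when the edit-conflict screen presents the full article text,
# clear the hidden section field so the resubmission counts as a
# whole-page edit rather than a section edit.
conflict_form = {"section": "", "text": "Intro\n\nA\n\nB\n\nC"}
print(apply_submission(["Intro", "A", "B", "old C"], conflict_form))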
If that is indeed the problem, it must be a relatively new bug - I have done
this kind of thing several times, and as far as I know only once did it
result in the given bug. An alternative explanation is of course that I have
caused the bug several times without realizing I did so...
Andre
Hello,
I played a little bit with the (German) old_table (from 20040305) and have
some questions about it.
1. As I understand it, the first entry of each row is an id, incremented for
each new row. Since some of the numbers are missing, I assume these are rows
which were deleted. Is this right?
2. There is a column timestamp and a column inverse timestamp. For what
reason do we need the inverse timestamp? (See the sketch after question 6.)
There also seems to be an inconsistency. The entry with the id 494209 (the
article about "Optik" in namespace 0) has timestamp "20031231041409" but
inverse timestamp "80008783828360". Maybe this is a new-year bug? Maybe it
should be corrected manually?!
3. Which namespace is represented by which of the numbers 0 to 9?
4. The column for the "user-comment" in the beginning often contains '*' but
later nothing. Was it just the behaviour of the old software to represent no
comment by '*'?
5. The column for the "user_id" in the beginning contains just 0, even for
well-known users. Later (after conversion_script appears) the correct id is
printed. The column for the "user_name" shows the same strange behaviour: it
contains DNS names instead of IP addresses. Was this the behaviour of the old
software?
6. The column "old_flags" seems to contain nothing (it's almost always empty,
but sometimes contains a single '0'). What's the purpose of this column? What
flags can it contain?
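On question 2: my understanding (an assumption, worth verifying against the
source) is that the inverse timestamp is 99999999999999 minus the timestamp,
so that old MySQL, which had no descending indexes, could sort newest-first
with an ascending index. A quick check of the "Optik" row under that
assumption:

def invert_timestamp(ts: str) -> str:
    # Assumed scheme: each digit d becomes 9 - d, i.e. 99999999999999 - ts.
    return "".join(str(9 - int(d)) for d in ts)

# The predicted value disagrees with the dump, supporting the suspicion
# that the stored inverse timestamp for this row is wrong:
print(invert_timestamp("20031231041409"))  # 79968768958590, not 80008783828360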
--Ivo Köthnig
"Jimmy Wales" <jwales(a)bomis.com> schrieb:
> Good grief. The patent system is out of control.
>
> Actually, since the original captcha idea is in fact clever, and since
> I'm not a totally anti-patent person, I could see it being reasonable to
> reward the inventors of the idea with a 2 year patent. 17 years is an
> infinity.
>
> I wonder if there are things that we've developed in house here at
> Wikimedia that we should patent, just to license our patents freely
> of course, and also to make fun of the patent system.
The problem is that patents cost money and time to obtain. On the other hand,
if we find a way to issue them in some form that says "free, provided you
don't stand on our mat with your patents either", it might be useful.
Now, I don't know the inner workings of the Wikipedia software well
enough to know what might be patented there, but on the outside:
* Best to patent, but most likely to fall short on being the first, would
of course be the combination of wiki-like technology with databases:
something like using a database to enable users to change webpages and
storing them in a simplified format that can be worked into HTML.
* Namespaces: having webpages in a database with both a title and an
identifier distinguishing various kinds of pages. Could be patented as a
way of allowing users to make comments on webpages (on a separate page).
* Reverting: having an interface to show changes to webpages and easily
restore them to an older version.
There's probably more, but it's past 4 am here...
Andre Engels
There's been a spambot active for the last few days. It uses a number (2-15)
of open proxies simultaneously to edit at a high rate, often on small
unattended wikis. It finds pages by spidering. Each anonymous proxy seems to
represent an autonomously spidering bot -- the bots often repeat each
other's work, adding the offending external link multiple times. Over the
last few days it's probably made over 1000 edits. The site in question is a
Chinese Internet marketing company called EMMSS.
The open proxy blocking code I've been developing since the discussion on
open proxies at wikien-l still has some teething problems. It should help if
it's tweaked somewhat for this particular situation (I originally intended
it for human vandals). However, there are a few other features on my wishlist
which I think would help to deal with this sort of problem.
One is a combined recent changes page, showing all Wikimedia wikis. This would
allow people to watch for anomalous activity on usually quiet wikis. I
imagine this would share code with the socket-based IRC bot which is in
development.
Another much-requested feature is enhancement of the rollback function,
especially to allow for page deletion. The bot created many pages,
apparently by following red links. In my opinion, the ideal feature would be
to allow the user to supply a list of IP addresses and usernames, and then
to revert every edit from those users with a single click. People cleaning
up after this spambot often had to revert manually because the bot had
edited the same page with multiple IP addresses. For completeness it would
be nice to include page-move reversion.
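A rough sketch of that one-click mass revert; the data model here is invented
for illustration and is not MediaWiki's actual schema:

from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Revision:
    author: str   # username or IP address
    text: str

def mass_rollback(history: List[Revision],
                  offenders: Set[str]) -> Optional[str]:
    """Return the text of the newest revision not made by any listed
    user/IP, or None if every revision is theirs (a bot-created page,
    which should simply be deleted)."""
    for rev in reversed(history):        # history is oldest-first
        if rev.author not in offenders:
            return rev.text
    return None

history = [Revision("alice", "good text"),
           Revision("1.2.3.4", "good text + spam link"),
           Revision("5.6.7.8", "good text + two spam links")]
print(mass_rollback(history, {"1.2.3.4", "5.6.7.8"}))  # -> "good text"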
For now we've been dealing with this using filters and editor manpower.
Filters are bad; they're against the wiki security model. I'd much rather
put more power into the hands of users to deal with these situations, than
to continue a filter-based arms race.
If anyone wants to help out with features such as these, please speak up.
In case anyone's wondering, the main response to this problem so far has
been to contact a developer, who then makes any willing editors
temporary sysops on the wikis under attack.
-- Tim Starling
In my discussion with Ilse (based on which I recently sent the request to
reduce the put_throttle), we also got to the subject of XML feeds. I
mentioned that Yahoo was already getting one, and my contact at Ilse said he
would be interested in one as well. Thus my questions:
* Would sending out XML feeds to other parties with a good reason for
interest be a good idea, or would it be detrimental? (For example, how does
the load on the servers from sending out an XML feed compare to being
spidered by a search engine?)
* If the answer is positive, who can I or my contact get in touch with to
discuss the possibility?
Andre Engels
Hi,
I have uploaded a few chess images by taking screenshots, which is quite
cumbersome.
WikiProject Chess says:
Having uniform chess diagrams will be easy once LaTeX chess support is
here. Word has it that it will soon be implemented on wikipedia. It is
already implemented on http://wikisophia.org.
Are there any plans to integrate the wikisophia.org implementation into
MediaWiki?
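I don't know what wikisophia's implementation uses, but "LaTeX chess support"
usually means something like the skak package; a minimal diagram source would
be:

\documentclass{article}
\usepackage{skak}  % common LaTeX chess package (assumed, not confirmed,
                   % to be what the wikisophia implementation uses)
\begin{document}
\newgame                        % start from the initial position
\mainline{1. e4 e5 2. Nf3 Nc6}  % typeset the moves, updating the position
\showboard                      % render the current position as a diagram
\end{document}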
Cheers
Arvind
--
It's all GNU to me