I've come up with an algorithm to speed up the search when you don't know
the article title (a case this doesn't handle), but there is no getting
around the need for a monster index.
The easiest way to do this is to make the LuceneSearch extension grok the
full history dump and then layer the search algorithm on top of that,
building on standard Lucene search.
On Sun, Feb 1, 2009 at 11:07 PM, phoebe ayers <phoebe.wiki(a)gmail.com> wrote:
On Wed, Jan 21, 2009 at 4:36 PM, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:
2009/1/22 Erik Moeller <erik(a)wikimedia.org>:
Because I don't think it's good to discuss attribution as an abstract
principle, just as an example, the author attribution for the article
[[France]] is below, excluding IP addresses. According to the view
that attribution needs to be given to each pseudonym, this entire
history would have to be included with every copy of the article.
Needless to say, in a print product, this would occupy a very
significant amount of space. Needless to say, equally, it's a
significant obligation for a re-user. And, of course, Wikipedia keeps
growing and so do its attribution records.
Well, the attribution list is about 1/6 the length of the article (in
terms of bytes). Given that it can be in significantly smaller font
size, doesn't have lots of whitespace and has no images, it's going to
take up far less than 1/6 as much space on the page. It will be a
significant amount of space, but not an impractical one (to the extent
that copying and pasting into Word gives meaningful results, the
article takes up 35 pages, the attribution list takes up 2).
Which is fine if you're reprinting the whole article, but what if
you're just reprinting the lede, or some other section of an article?
Should a reuser still be required to reprint 2 pages of credits for a
paragraph of article? That seems onerous. Note that just reprinting a
*section* of an article is how many print reuse cases have worked to
date (the German encyclopedia and our CafePress bumper stickers come to
mind), and this case is not something that we've discussed much so
far.
And having just actually done this, with a real book and a real
publisher, in "How Wikipedia Works," I can attest that it's a
non-trivial amount of work to get author lists for articles --
removing duplication, IPs, formatting, etc. is all a good deal of work
-- and I like to think I understand how histories work. It would be a
much bigger task for someone who didn't understand histories or the
license.
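To give a rough idea of the cleanup steps mentioned above (removing
duplication and IPs, then formatting), here is a minimal Python sketch. It
assumes the revision authors have already been pulled out of the page
history as a plain list of strings; the function names and the sample data
are hypothetical, and this is not how the book's author lists were actually
produced.

```python
import ipaddress


def is_ip(name):
    """Return True if the contributor name is an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def author_list(revision_authors):
    """Drop blank names, IP addresses, and repeats; return a sorted list.

    Assumption: one entry per revision, in any order, already extracted
    from the history by some other means.
    """
    seen = set()
    authors = []
    for raw in revision_authors:
        name = raw.strip()
        if not name or is_ip(name):
            continue
        if name not in seen:
            seen.add(name)
            authors.append(name)
    # Case-insensitive sort so the credits read alphabetically.
    return sorted(authors, key=str.casefold)


def format_credits(authors):
    """Join the names into a single compact credit line."""
    return ", ".join(authors)


# Hypothetical sample history, for illustration only.
history = ["Alice", "192.0.2.17", "Bob", "Alice", "2001:db8::1", " carol "]
credits = format_credits(author_list(history))
```

Even this small sketch leaves out the harder parts, such as renamed
accounts, bots, and deciding which editors contributed to the section
actually being reprinted.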
The Wikiblame tool, if it were made widely accessible and prominently
integrated into the site, seems like a promising solution. In the
meantime, I think we ought to consider what "proper credit" is for
just reusing a part of an article, versus the whole thing.
-- phoebe
_______________________________________________
foundation-l mailing list
foundation-l(a)lists.wikimedia.org