A couple of months ago, I raised on this list the issue of "no-indexing" Wikipedia pages outside the mainspace, principally including project-space pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these pages such that they will not be picked up by Google or other search engines.
The desirability of this change has been noted by many people, including very experienced Wikipedians. As we all know, the popularity of Wikipedia and the sheer density of its internal links mean that when a Wikipedia page contains the name of a living individual, then unless the person is either extremely notable or happens to have a common name, that page will almost inevitably become a high-ranking, if not the highest-ranking, search engine result for that individual. This raises issues enough when the search result is a BLP or other mainspace article, but it is totally unacceptable when the high-ranking result destined to follow the individual around forever is something like:
- An AfD deciding to delete an article about a person because of her perceived lack of any sufficiently notable or meaningful accomplishments in life (these can be courtesy-blanked on request, but how many subjects even know how to ask); or
- An RfA, involving a contributor who happens to edit under his real name, which fails because the user was deemed unqualified for adminship; or
- An arbitration case, in which an editor was severely criticized or even banned for violations of Wikipedia policy - regrettable, but not something for which it would serve any purpose to tar the person's RL reputation forever; or
- A long and heated discussion in an ancient ANI thread, again involving a contributor who edits using her name, involving some ancient wiki-grievance long forgotten ... until the contributor applies for a scholarship or a job and someone Googles her name; or
- An ArbCom election in which the user came in 17th place; or
- An SSP report in which a user editing under a new name is indelibly linked to a username based on his real name, which he chose to abandon months or years earlier because of precisely these very concerns; or
- A discussion on the ANI noticeboard of defamatory or privacy-invading material in a BLP or other article, which it is rightfully decided to delete from the article itself ... except it remains preserved in the noticeboard discussion (I do see that this aspect of the problem has been addressed on the BLP noticeboard archives, but this type of discussion occurs on ANI and elsewhere as well); or
- Various other places where these issues, involving both article subjects and Wikipedia contributors, continue to arise on a frequent basis.
It has been observed that being named on Wikipedia, whether for legitimate reasons or otherwise, has a powerful potential to damage a person's life. (See for example the BLP policy and its talkpage, the ArbCom decisions in RfAr/Badlydrawnjeff and RfAr/Footnoted quotes, or discussion on various criticism sites.) As noted, this raises a troublesome enough suite of issues when the person in question has been accurately discussed in the encyclopedia itself. It is really not acceptable when it occurs as a happenstance of an ancillary discussion of an article subject or of a contributor (even a misbehaving or a now-unwelcome contributor).
I have read more than enough complaints from people who have found themselves in many of the unfortunate situations I describe here. If they are Wikipedians, they sometimes come to rue the day they ever thought of contributing, much less contributing under a name linked to their real identity. If they are article subjects with no particular connection to Wikipedia, they must surely find the situation maddening. By comparison, the benefits to the general public of being able to read through internal Wikipedia discussions of this nature as the result of a casual Google search must be reckoned, at the best, as slight.
In the prior thread, I believe there was significant support for implementing coding necessary to cause "no-indexing" of projectspace and possibly userspace and other-space pages. The main counter-arguments were:
- That some project-space pages DO warrant indexing. Examples given were the notability policy and the BLP policy. The solution to this is to have a "yes-index" feature that would override the no-index code on a particular project-space page where indexing was agreed to be affirmatively desirable. Community discussion could come up with a list of those particular pages in a week or so.
- That Wikipedia currently lacks a top-quality internal search capability, and therefore we need to be able to use external search engines such as Google to perform administrator functions and the like. There is some merit to this observation; I certainly have used Google to hunt down references I remembered when I was writing arbitration decisions, for example. But internal administrative convenience is not a good argument to disregard real harm that we are inadvertently causing to specific individuals. The developers can and probably should be tasked, as a high priority, with improving the search capabilities; but it has been too long since the problems I have described in this e-mail were identified, and it is time they were solved.
- The most cynical response has been that Wikipedia thrives on Google-rank created by internal links and is not going to do anything that would lessen its page-ranks, whether out of pride or for some conjectured eventual mercenary reason. Actually, this was not a counter-argument presented on Wikien; it's a cynical speculation about motivations that was presented on a criticism site. I give it no credence, but it would be easy enough to disprove once and for all.
Wikipedia and its community are often criticized for irresponsibly neglecting the negative effects of the project on some of its subjects and some of its contributors. We have here an opportunity to take an incremental but meaningful step toward addressing a group of related, significant concerns. I would like to urge that the on-again, off-again discussion of this proposal proceed to a conclusion either here or on-wiki and that some definitive action be taken in the near future.
(Finally, I would appreciate if responses could focus on the substance of this post and not on the identity of its author.)
Regards, Newyorkbrad
On Wed, Jul 23, 2008 at 10:47 AM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
A couple of months ago, I raised on this list the issue of "no-indexing" Wikipedia pages outside the mainspace, principally including project-space pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these pages such that they will not be picked up by Google or other search engines.
Note that much of this is already done, see our robots file:
http://en.wikipedia.org/robots.txt
Currently all AFD, RFA, RFC and RFAR subpages (but not the main AFD page, the main RFA page, etc.) are blocked from indexing. Of your examples, the admin noticeboards and userspace are probably the big examples of pages that are still indexed but that we might not want to be.
Note that the robots file can easily be updated by a request on bugzilla [1] if there is consensus for it.
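For readers who haven't looked at the file, the entries in question take roughly this form (the paths here are illustrative, from memory, not copied verbatim from the live file):

```
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia%3AArticles_for_deletion/
Disallow: /wiki/Wikipedia:Requests_for_adminship/
Disallow: /wiki/Wikipedia:Requests_for_arbitration/
```

Since a Disallow rule matches by URL prefix, this blocks every subpage under those titles (the trailing slash) while leaving the main AFD and RFA pages themselves, whose URLs lack the slash, indexable - which is exactly the behavior described above.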
- That Wikipedia currently lacks a top-quality internal search capability,
and therefore we need to be able to use external search engines such as Google to perform administrator functions and the like. There is some merit
On this point, there's been great improvement in MediaWiki's search capabilities this year with the MWSearch backend coming online.
----
[1] Like this request, for example: https://bugzilla.wikimedia.org/show_bug.cgi?id=10288
On Wed, Jul 23, 2008 at 7:24 AM, Stephen Bain stephen.bain@gmail.com wrote:
On Wed, Jul 23, 2008 at 10:47 AM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
A couple of months ago, I raised on this list the issue of "no-indexing" Wikipedia pages outside the mainspace, principally including
project-space
pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these
pages
such that they will not be picked up by Google or other search engines.
Note that much of this is already done, see our robots file:
http://en.wikipedia.org/robots.txt
Currently all AFD, RFA, RFC and RFAR subpages (but not the main AFD page, the main RFA page, etc.) are blocked from indexing. Of your examples, the admin noticeboards and userspace are probably the big examples of pages that are still indexed but that we might not want to be.
Just to pick everyone's favorite topic as an example:
http://www.google.com/search?hl=en&pwst=1&q=+site:en.wikipedia.org+%...
What is the benefit to allowing Google to index DRV, talk pages, and user/user talk pages? Aside from the MediaWiki native search function not always being that great, the only negative to blocking or restricting search engines to cover strictly article space would be a possible loss of Google juice, which should not be a concern.
- Joe
On 7/23/08, Joe Szilagyi szilagyi@gmail.com wrote:
On Wed, Jul 23, 2008 at 7:24 AM, Stephen Bain stephen.bain@gmail.com wrote:
On Wed, Jul 23, 2008 at 10:47 AM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
A couple of months ago, I raised on this list the issue of
"no-indexing"
Wikipedia pages outside the mainspace, principally including
project-space
pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these
pages
such that they will not be picked up by Google or other search engines.
Note that much of this is already done, see our robots file:
http://en.wikipedia.org/robots.txt
Currently all AFD, RFA, RFC and RFAR subpages (but not the main AFD page, the main RFA page, etc.) are blocked from indexing. Of your examples, the admin noticeboards and userspace are probably the big examples of pages that are still indexed but that we might not want to be.
Just to pick everyone's favorite topic as an example:
http://www.google.com/search?hl=en&pwst=1&q=+site:en.wikipedia.org+%...
What is the benefit to allowing Google to index DRV, talk pages, and user/user talk pages? Aside from the MediaWiki native search function not always being that great, the only negative to blocking or restricting search engines to cover strictly article space would be a possible loss of Google juice, which should not be a concern.
- Joe
Does the current exclusion of XfD's include DRV as well?
Newyorkbrad
On Wed, Jul 23, 2008 at 8:31 AM, Newyorkbrad (Wikipedia) <newyorkbrad@gmail.com> wrote:
Does the current exclusion of XfD's include DRV as well?
Nope:
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:...
- Joe
On 7/23/08, Joe Szilagyi szilagyi@gmail.com wrote:
What is the benefit to allowing Google to index DRV, talk pages, and user/user talk pages? Aside from the MediaWiki native search function not always being that great, the only negative to blocking or restricting search engines to cover strictly article space would be a possible loss of Google juice, which should not be a concern.
As far as I'm concerned, Google juice, i.e. page-rank and whatnot can go jump in the lake.
1. Build a search engine of Google-esque calibre (boolean +A +B -"C D" etc. to search any and all WMF projects of the user's choosing),
2. Put it on the toolserver,
3. Configure the toolserver's robots.txt to unwelcome Google, at least from indexing anything related to the toolserver search engine.
4. Configure all WMF projects' robots.txt to welcome Google indexing only of main-space, article, portal, etc. "content" pages.
5. (optional, sounds quite tricky) Split the category namespace. Figure out some way to train google-bot to:
index content categories like:
* Category:English popes
* Category:Bob Dylan songs
* Category:Pacific Ocean

ignore logistical crap like:
* Category:Articles with unsourced statements since December 2006
* Category:Unsuccessful requests for adminship
* Category:Suspected Wikipedia sockpuppets of Janis Doe
* Category:Start-Class biography (sports and games) articles
* Category:Sports templates by country
etc. etc.
could somebody think of a reliable way to do this, short of creating a separate name-space?
—C.W.
On Wed, Jul 23, 2008 at 12:09 PM, Charlotte Webb charlottethewebb@gmail.com wrote:
- (optional, sounds quite tricky) Split the category namespace.
Figure out some way to train google-bot to:
[snip]
That could be addressed with a __NOINDEX__ parser directive that could be applied on a page by page basis for things like that... the complication there is that eventually people would abuse it to hide things in places we normally expect to be indexed.
On 7/23/08, Gregory Maxwell gmaxwell@gmail.com wrote:
On Wed, Jul 23, 2008 at 12:09 PM, Charlotte Webb charlottethewebb@gmail.com wrote:
- (optional, sounds quite tricky) Split the category namespace.
Figure out some way to train google-bot to:
[snip]
That could be addressed with a __NOINDEX__ parser directive that could be applied on a page by page basis for things like that... the complication there is that eventually people would abuse it to hide things in places we normally expect to be indexed.
Changing the default indexing status of a page (index to no-index or vice versa) could theoretically be made an admin-only function (and would count as use of an administrator tool, with the accountability implied thereby). However, this implies longer-term mediawiki programming changes of unknown complexity, so certainly shouldn't become a barrier to other progress.
Newyorkbrad
On Wed, Jul 23, 2008 at 12:26 PM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
Changing the default indexing status of a page (index to no-index or vice versa) could theoretically be made an admin-only function (and would count as use of an administrator tool, with the accountability implied thereby). However, this implies longer-term mediawiki programming changes of unknown complexity, so certainly shouldn't become a barrier to other progress.
Right. Perfect is the enemy of good. Get the defaults sane per-namespace and then we'll be motivated to figure out how to set the defaults.
I'd propose that only Main, Portal, Category, and Image be indexed. With Category eventually slimmed down via some more selective process, and Wikipedia: eventually puffed up. ... though I'd support any and all proposals that cut down on the indexing of 'meta' namespaces.
On Wed, Jul 23, 2008 at 12:09 PM, Charlotte Webb charlottethewebb@gmail.com wrote:
could somebody think of a reliable way to do this, short of creating a separate name-space?
From a technical standpoint we just need a way to either:
1. Insert any <meta/> tag from the article content. (Bad idea for security reasons.)
2. Make some template-esque tag like {{{noindex}}} that will instruct the engine to include the following tag in the <head/> element:
<meta name="robots" content="noindex" />
Note that nofollow should not be present, because we do want it to crawl to the linked articles. We just don't want the category indexed.
This tag should follow transclusion rules -- then we could just insert this into whatever template we currently use to mark such categories (if any).
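To make concrete how a well-behaved crawler treats that meta tag, here is a small self-contained Python sketch (my own illustration, not any real crawler's code) that extracts the robots directives from a page and decides whether to index it and whether to follow its links:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.update(
                d.strip().lower() for d in a.get("content", "").split(","))

def crawl_policy(html):
    """Return whether a crawler may index this page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {"index": "noindex" not in parser.directives,
            "follow": "nofollow" not in parser.directives}

page = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
print(crawl_policy(page))  # noindex, but links on the page are still followed
```

This is why omitting nofollow matters: a noindex-only page drops out of search results while still passing the crawler along to the linked articles.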
P.S. Regarding Gregory's response (that came in while writing this) potential abuse is not really a concern. We have a block button. The trick is coming up with a policy or guideline on usage so people know what's acceptable and what's not.
Alternately (thinking while I type here, bear with me) we could have a MediaWiki: page listing pages that we don't want indexed. Possibly specifying a template would catch all pages that template is transcluded to? Then it could be protected if it became an issue.
Chris Howie schreef:
P.S. Regarding Gregory's response (that came in while writing this) potential abuse is not really a concern. We have a block button.
Indeed.
The worst abuse that can happen is that vandals un-noindex libelous information, so that it shows up in Google.
But consider that as long as article space is indexed (which it should be -- we shouldn't put anything in the main namespace that we wouldn't be happy about showing up in Google) vandals will always be able to do just that, by adding the info to an article.
Our strategy for the latter tactic is to block those vandals; there is no need for stronger measures for vandals who tamper with no-index.
Eugene
On Wed, Jul 23, 2008 at 12:20 PM, Chris Howie cdhowie@gmail.com wrote: [snip]
- Make some template-esque tag like {{{noindex}}} that will instruct
the engine to include the following tag in the <head/> element:
Parser directives like __NOINDEX__ are the MediaWiki-esque way of accomplishing things like this... but it's all the same.
P.S. Regarding Gregory's response (that came in while writing this) potential abuse is not really a concern. We have a block button. The trick is coming up with a policy or guideline on usage so people know what's acceptable and what's not.
It's not just me pointing this out... proposals like this have been previously rejected on this basis:
https://bugzilla.wikimedia.org/show_bug.cgi?id=9415 https://bugzilla.wikimedia.org/show_bug.cgi?id=8068
Blocking is a good tool to stop abuse but it only works once we've found it. Someone could sneakily create __noindex__ pages, especially via transcluding no-indexing templates.
Also of relevance to this discussion please see: https://bugzilla.wikimedia.org/show_bug.cgi?id=11443
Alternately (thinking while I type here, bear with me) we could have a MediaWiki: page listing pages that we don't want indexed. Possibly specifying a template would catch all pages that template is transcluded to? Then it could be protected if it became an issue.
Having to read some enormous page every page-load wouldn't be good. It would be better to do the right thing on average per-namespace then use something in the pages to control exceptions.
On Wed, Jul 23, 2008 at 12:35 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Wed, Jul 23, 2008 at 12:20 PM, Chris Howie cdhowie@gmail.com wrote:
P.S. Regarding Gregory's response (that came in while writing this) potential abuse is not really a concern. We have a block button. The trick is coming up with a policy or guideline on usage so people know what's acceptable and what's not.
It's not just me pointing this out... proposals like this have been previously rejected on this basis:
https://bugzilla.wikimedia.org/show_bug.cgi?id=9415 https://bugzilla.wikimedia.org/show_bug.cgi?id=8068
Blocking is a good tool to stop abuse but it only works once we've found it. Someone could sneakily create __noindex__ pages, especially via transcluding no-indexing templates.
People do sneaky mainspace vandalism too.
Alternately (thinking while I type here, bear with me) we could have a MediaWiki: page listing pages that we don't want indexed. Possibly specifying a template would catch all pages that template is transcluded to? Then it could be protected if it became an issue.
Having to read some enormous page every page-load wouldn't be good. It would be better to do the right thing on average per-namespace then use something in the pages to control exceptions.
That is how I meant it -- a page of exceptions. In the case of categories, it could point at just a template we put on non-encyclopedic categories, if "noindex-by-transclusion" can work.
On Wed, Jul 23, 2008 at 12:42 PM, Chris Howie cdhowie@gmail.com wrote:
https://bugzilla.wikimedia.org/show_bug.cgi?id=9415 https://bugzilla.wikimedia.org/show_bug.cgi?id=8068
Blocking is a good tool to stop abuse but it only works once we've found it. Someone could sneakily create __noindex__ pages, especially via transcluding no-indexing templates.
People do sneaky mainspace vandalism too.
Indeed. In mainspace. And hide it from being found by noindexing it. ::shrugs::. It's not primarily my argument. Go read the bugzilla entries I linked to.
Having to read some enormous page every page-load wouldn't be good. It would be better to do the right thing on average per-namespace then use something in the pages to control exceptions.
That is how I meant it -- a page of exceptions. In the case of categories, it could point at just a template we put on non-encyclopedic categories, if "noindex-by-transclusion" can work.
An explicit list of exemptions could reasonably grow very large, and it would need to be scanned for membership every time a page is parsed. I would be somewhat surprised if there were not >1000 meta-categories already. Go look at how __NOTOC__ works; that would be the most logical way of doing this in MediaWiki. Thoughts?
On Wed, Jul 23, 2008 at 12:48 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
An explicit list of exemptions could reasonably grow very large, and it would need to be scanned for membership every time a page is parsed. I would be somewhat surprised if there were not >1000 meta-categories already. Go look at how __NOTOC__ works; that would be the most logical way of doing this in MediaWiki. Thoughts?
I'm not pushing a big page over a __NOTOC__ type syntax (yes, I know what it is too). But with a big page we would get protection capabilities, which we would not have with a parser extension. That was the primary argument behind that suggestion. It depends whether we value protection of the noindex system more or maintainability of the list more.
On 7/23/08, Gregory Maxwell gmaxwell@gmail.com wrote:
An explicit list of exemptions could reasonably grow very large, and it would need to be scanned for membership every time a page is parsed. I would be somewhat surprised if there were not >1000 meta-categories already. Go look at how __NOTOC__ works; that would be the most logical way of doing this in MediaWiki. Thoughts?
Alright, we could insert something like this into the header templates used for non-content "meta" categories such as the examples I gave above.
Tracking down abuses need not be so tedious. If this __NOINDEX__ symbol adds the html which tells Google-bot to move on because there's nothing to see here... it could easily add some sort of visible confirmation (by user preference or javascript gadget or something) showing some kind of on/off-style symbol (NOT FAIR USE, not the stylized [G] favicon mind you...!) to indicate whether or not the page is indexed so you don't have to check the html manually.
—C.W.
On 7/23/08, Charlotte Webb charlottethewebb@gmail.com wrote:
...so you don't have to check the html manually.
But let's not get ahead of ourselves, we need a better internal search engine first. I can see the usefulness of searching across several projects (to look for cross-wiki pattern vandalism to revert or look for other language versions of something you just wrote, etc.) so this is why i suggested putting it on the toolserver.
On the other hand a dedicated domain name like "search.wikimedia.org" is equally if not more appealing.
—C.W.
On 7/23/08, Charlotte Webb charlottethewebb@gmail.com wrote:
On 7/23/08, Charlotte Webb charlottethewebb@gmail.com wrote:
...so you don't have to check the html manually.
But let's not get ahead of ourselves, we need a better internal search engine first. I can see the usefulness of searching across several projects (to look for cross-wiki pattern vandalism to revert or look for other language versions of something you just wrote, etc.) so this is why i suggested putting it on the toolserver.
On the other hand a dedicated domain name like "search.wikimedia.org" is equally if not more appealing.
—C.W.
Any thoughts following up on the status of our ability to create improved internal searching as discussed here? Implementation of the no-index proposal should not await (and I gather is not awaiting) improvements in this feature, but a commitment to proceed with such improvements would eliminate virtually the only serious objection that has been raised to it.
Newyorkbrad
2008/7/25 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
Any thoughts following up on the status of our ability to create improved internal searching as discussed here? Implementation of the no-index proposal should not await (and I gather is not awaiting) improvements in this feature, but a commitment to proceed with such improvements would eliminate virtually the only serious objection that has been raised to it.
Our search has improved markedly in the past year and continues to get better. If people can give examples of where our internal search just isn't up to the task, those will be of great use to the devs.
- d.
2008/7/25 David Gerard dgerard@gmail.com:
2008/7/25 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
Any thoughts following up on the status of our ability to create improved internal searching as discussed here? Implementation of the no-index proposal should not await (and I gather is not awaiting) improvements in this feature, but a commitment to proceed with such improvements would eliminate virtually the only serious objection that has been raised to it.
Our search has improved markedly in the past year and continues to get better. If people can give examples of where our internal search just isn't up to the task, those will be of great use to the devs.
serious: No spell checking.
less serious: I can't enter chemical structures.
On Fri, Jul 25, 2008 at 9:31 AM, geni geniice@gmail.com wrote:
serious: No spell checking.
If memory serves me correctly, spell checking is a feature currently available in the MWSearch system, though last I heard it had performance issues, so if it exists it's probably not turned on yet on the Wikimedia projects.
On Wed, Jul 23, 2008 at 12:35 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Having to read some enormous page every page-load wouldn't be good. It would be better to do the right thing on average per-namespace then use something in the pages to control exceptions.
The ability to noindex namespaces is already in MediaWiki and could be turned on at will. See $wgNamespaceRobotPolicies. $wgArticleRobotPolicies is also relevant, although not really useful unless we have a very short fixed list of exceptions that can be maintained by sysadmins (which we don't).
It wouldn't be too hard to fix bug 8068 and add a __NOINDEX__ keyword, though. We would of course want an __INDEX__ keyword as well. (Probably we can leave out __FOLLOW__ and __NOFOLLOW__ as options. Using the namespace default for all pages should be fine there.) Maybe I'll make that my project for today.
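The interaction between per-namespace defaults and per-page keywords could look something like this Python sketch (the namespace defaults and resolution order here are my assumptions for illustration, not actual MediaWiki behavior):

```python
# Hypothetical per-namespace defaults, in the spirit of $wgNamespaceRobotPolicies.
NAMESPACE_DEFAULTS = {
    "Main": "index,follow",
    "Wikipedia": "noindex,follow",
    "User": "noindex,follow",
}

def robot_policy(namespace, wikitext):
    """Resolve the robots policy for a page: an explicit __NOINDEX__ or
    __INDEX__ keyword in the wikitext overrides the namespace default.
    (__NOINDEX__ must be checked first, since the string "__INDEX__"
    is a substring of "__NOINDEX__".)"""
    default = NAMESPACE_DEFAULTS.get(namespace, "index,follow")
    follow = default.split(",")[1]
    if "__NOINDEX__" in wikitext:
        return "noindex," + follow
    if "__INDEX__" in wikitext:
        return "index," + follow
    return default
```

So, for example, the BLP policy page could carry __INDEX__ to opt back into indexing despite a noindex default on the Wikipedia: namespace, while a courtesy-blanked page in mainspace could carry __NOINDEX__.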
Perhaps the ideal solution is one that omits these pages from search results UNLESS the user specifically requests them in some way (for example, a search term including "site:en.wikipedia.org")... but I kinda doubt that's realistic, on my part.
I'll be sorry to lose some ability to find information easily, but it's an unfortunate conundrum that articles and discussion pages are given the same weight, and absent some better way to fix the resulting problems... at least our internal search capabilities have gradually been improving, as I think has consistently been the main worry when this has come up in the past.
Should policy/guideline/essay/help pages continue to be indexed? It primarily seems to be process/activity pages which are problematic in this regard.
Are user pages uniformly problematic, or mainly those for banned/blocked users? If it's an isolated group at all, we could just as easily use the __NOINDEX__ keyword in relevant templates.
-Luna
2008/7/26 Luna lunasantin@gmail.com:
Perhaps the ideal solution is one that omits these pages from search results UNLESS the user specifically requests them in some way (for example, a search term including "site:en.wikipedia.org")... but I kinda doubt that's realistic, on my part.
I'll be sorry to lose some ability to find information easily, but it's an unfortunate conundrum that articles and discussion pages are given the same weight,
Very unlikely. Notice how rarely a talk page beats out its article page.
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a depository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
On 7/25/08, geni geniice@gmail.com wrote:
2008/7/26 Luna lunasantin@gmail.com:
Perhaps the ideal solution is one that omits these pages from search results UNLESS the user specifically requests them in some way (for example, a search term including "site:en.wikipedia.org")... but I kinda doubt that's realistic, on my part.
I'll be sorry to lose some ability to find information easily, but it's an unfortunate conundrum that articles and discussion pages are given the same weight,
Very unlikely. Notice how rarely a talk page beats out its article page.
-- geni
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
On Wed, Jul 30, 2008 at 3:12 AM, David Katz dkatz2001@gmail.com wrote:
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a depository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
I believe most mirrors use an actual database dump (though some might scrape). In that case they would get all the content on Wikipedia, wherever it is.
On Wed, Jul 30, 2008 at 7:45 AM, Chris Howie cdhowie@gmail.com wrote:
On Wed, Jul 30, 2008 at 3:12 AM, David Katz dkatz2001@gmail.com wrote:
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a depository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
I believe most mirrors use an actual database dump (though some might scrape). In that case they would get all the content on Wikipedia, wherever it is.
Is there any way we can set things so that mirrors only mirror the actual mainspace content and not talk, user, and WP administrative pages?
2008/7/30 David Katz dkatz2001@gmail.com:
Is there any way we can set things so that mirrors only mirror the actual mainspace content and not talk, user, and WP administrative pages?
Yeah, not put it in the dumps. However, effectively proprietising Wikipedia in such a manner (remember, the project space is GFDL too) doesn't really sit well with the "free content" thing.
Most mirrors wouldn't want that stuff anyway. Those that do are either Doing It Wrong or really do want to mirror the project space too.
- d.
David Gerard wrote:
2008/7/30 David Katz dkatz2001@gmail.com:
Is there any way we can set things so that mirrors only mirror the actual mainspace content and not talk, user, and WP administrative pages?
Yeah, not put it in the dumps. However, effectively proprietising Wikipedia in such a manner (remember, the project space is GFDL too) doesn't really sit well with the "free content" thing.
Most mirrors wouldn't want that stuff anyway. Those that do are either Doing It Wrong or really do want to mirror the project space too.
- d.
Umh, this might be a stupid question, but since I don't know any better, I'll ask anyway...
What would be wrong with having sectioned dumps? With mainspace and article talk pages in one dump section, userpages, userspace subpages, and user talk pages in another section of the dump, and a third dump for all the wonderful "Wikipedia" namespace pages and their talk pages.
That way everyone can pick and mix...
Yours,
Jussi-Ville Heiskanen
On Wed, Jul 30, 2008 at 12:16 PM, Jussi-Ville Heiskanen cimonavaro@gmail.com wrote:
Umh, this might be a stupid question, but since I don't know any better, I'll ask anyway...
What would be wrong with having sectioned dumps? With mainspace and article talk pages in one dump section, userpages, userspace subpages, and user talk pages in another section of the dump, and a third dump for all the wonderful "Wikipedia" namespace pages and their talk pages.
That way everyone can pick and mix...
What a great idea!
(which is why it's already done that way :) ... Also, I don't believe that most mirrors which actually use the dumps copy anything but the articles, though I don't have any data to back that up)
Depends. Some take the lot on the basis that it gives them more content to put ads next to. And if we go noindex on all that stuff, it will stop getting whacked by the duplicate content penalty.
This is all very interesting.
I wonder if I am reading this correctly.
Do I understand correctly that those who download not just our mainspace (you know, the real Wikipedia stuff, articles of encyclopaedic value) but the non-mainspace as well, do so with full knowledge that it isn't really encyclopaedic matter, and download it anyway?
On the gripping hand, the arguments I have heard against adjusting the licensing of the non-mainspace pages have been on the basis of not providing free web-hosting, so everything has to be copy-left.
Somehow I don't think that equation passes the sniff test.
Particularly in the light of the fact that the MediaWiki help-pages are already definitely *not* copy-left, but decisively PD.
Yours,
Jussi-Ville Heiskanen
On Wed, Jul 30, 2008 at 3:35 PM, Jussi-Ville Heiskanen cimonavaro@gmail.com wrote: [snip]
Do I understand correctly that those who download not just our mainspace (you know, the real Wikipedia stuff, articles of encyclopaedic value) but the non-mainspace as well, do so with full knowledge that it isn't really encyclopaedic matter, and download it anyway?
They have the option to download only articles. We can't guess their understanding or motivation. I think the more obvious theory is that they are just grabbing the first thing that works. ;)
On the gripping hand, the arguments I have heard against adjusting the licensing of the non-mainspace pages have been on the basis of not providing free web-hosting, so everything has to be copy-left.
Somehow I don't think that equation passes the sniff test.
Particularly in the light of the fact that the MediaWiki help-pages are already definitely *not* copy-left, but decisively PD.
You're free to make your contributions more liberally licensed, just not less.
If you want to post information about yourself under a restrictive license, there are lots of low to no cost web hosts that allow it. So long as you're a contributor the projects are very permissive about making your userpage just a link to your website, as far as I've seen.
Beyond the "avoiding free webhosting", keeping the project spaces freely licensed contributes to keeping freely licensed content part of the culture and superordinate goal.
In any case, if nazi-pedia is really trying to make it look like you're a contributor there, they could do amply well without copying your Wikipedia userpage. :) Licensing is not the right tool to use against fraud. It's a wrong fit.
2008/7/30 Gregory Maxwell gmaxwell@gmail.com:
In any case, if nazi-pedia is really trying to make it look like you're a contributor there, they could do amply well without copying your Wikipedia userpage. :) Licensing is not the right tool to use against fraud. It's a wrong fit.
Speaking of which, I vaguely remember that en.metapedia.org used to be a mirror/fork, but now doesn't appear to be. (Many articles appear to be Wikipedia-originated and are labeled as such and licensed under GFDL.) Anyone got any idea what's up with that?
- d.
Gregory Maxwell wrote:
On Wed, Jul 30, 2008 at 3:35 PM, Jussi-Ville Heiskanen cimonavaro@gmail.com wrote: [snip]
Do I understand correctly that those who download not just our mainspace (you know, the real Wikipedia stuff, articles of encyclopaedic value) but the non-mainspace as well, do so with full knowledge that it isn't really encyclopaedic matter, and download it anyway?
They have the option to download only articles. We can't guess their understanding or motivation. I think the more obvious theory is that they are just grabbing the first thing that works. ;)
Who says we can't guess? And why do you follow that with what is palpably just your personal guess?
I don't think it follows Occam's Razor to assume that people in search of profit would make a special exception in the case of Wikipedia and act in a directly naïve way. That is simply asking too much of credulity, even if I know some net-rippers-off can be astoundingly stupid. There can be a presumption that most of them do one or the other, but assuming they make the naïve choice by default is "the most ridiculous thing I ever heard".
On the gripping hand, the arguments I have heard against adjusting the licensing of the non-mainspace pages have been on the basis of not providing free web-hosting, so everything has to be copy-left.
Somehow I don't think that equation passes the sniff test.
Particularly in the light of the fact that the MediaWiki help-pages are already definitely *not* copy-left, but decisively PD.
You're free to make your contributions more liberally licensed, just not less.
If you want to post information about yourself under a restrictive license, there are lots of low to no cost web hosts that allow it. So long as you're a contributor the projects are very permissive about making your userpage just a link to your website, as far as I've seen.
Beyond the "avoiding free webhosting", keeping the project spaces freely licensed contributes to keeping freely licensed content part of the culture and superordinate goal.
I think you missed the part where I was asking *specifically* about _copy-left_ and *not* "freely licensed". No biggie, easy to miss.
Then again, maybe the situation is more nuanced, and the question is not really about less or more "free" but about the precise licence, where people can even disagree about which licence is the most "free". I certainly can consider many "copy-left" licences to be "encumbered" in certain specific manners, and still wrap my head around the mindset of those people who contend that going whole-hog PD lets downstream users hobble the content that is derived later.
The fact is that choosing any specific licence as a requirement, or even choosing some minimum which has to be compatible with the chosen licence for non-"content" pages, does constitute a restriction; though arguendo a restriction against allowing restriction.
In any case, if nazi-pedia is really trying to make it look like you're a contributor there, they could do amply well without copying your Wikipedia userpage. :) Licensing is not the right tool to use against fraud. It's a wrong fit.
Well, of course here you are extrapolating that something I said a while ago was a hidden reference in a post that did not explicitly reference it at all. Nicely done.
Hand on my heart, I didn't even think about the Nazipedia thing in talking about userspace licensing this time. I in fact didn't think about this at all in personal terms; I was genuinely trying to explore the real licensing landscape we have to work with, not just you and me, but all of our contributors. Take that as you will, believe it or not.
Yours,
Jussi-Ville Heiskanen
On Wed, Jul 30, 2008 at 4:25 PM, Jussi-Ville Heiskanen cimonavaro@gmail.com wrote:
Who says we can't guess? And why do you follow that with what is palpably just your personal guess?
I don't think it follows Occam's Razor to assume that people in search of profit would make a special exception in the case of Wikipedia and act in a directly naïve way. That is simply asking too much of credulity, even if I know some net-rippers-off can be astoundingly stupid. There can be a presumption that most of them do one or the other, but assuming they make the naïve choice by default is "the most ridiculous thing I ever heard".
Having talked to some of the less brilliant people trying to set up mirrors, I think you may be attributing too much of the outcome to intent, if you're attributing any of the outcome to intent at all. ;)
We do have quite a few recourses against mirrors who behave unreasonably. For example, we could hold them to the strict letter of their licensing requirements, which are so frequently followed sloppily.
It would probably be helpful if we more explicitly discouraged mirror sites from copying the user pages. I think the download page currently says "you probably don't want this one", but we don't really ask them not to. We should make it clear what we do and don't want before we decide that someone is probably being malevolent.
You're free to make your contributions more liberally licensed, just not less.
If you want to post information about yourself under a restrictive license, there are lots of low to no cost web hosts that allow it. So long as you're a contributor the projects are very permissive about making your userpage just a link to your website, as far as I've seen.
Beyond the "avoiding free webhosting", keeping the project spaces freely licensed contributes to keeping freely licensed content part of the culture and superordinate goal.
I think you missed the part where I was asking *specifically* about _copy-left_ and *not* "freely licensed". No biggie, easy to miss.
I very much did notice you said specifically copyleft... which resulted in the very first sentence. "You're free to make your contributions more liberally licensed, just not less." If you want to say your userpage is also available as public domain you're welcome to do so.
So the actual requirement is that you must offer your userpage contributions under the GFDL but you can also publish those contributions under any number of additional licenses.
So you're perfectly free to moot the copyleft on your own stuff by offering non-copyleft free license terms... which is why I took your commentary to be free licensing vs not, rather than say much about copyleft.
[snip]
The fact is that choosing any specific licence as a requirement, or even choosing some minimum which has to be compatible with the chosen licence for non-"content" pages, does constitute a restriction; though arguendo a restriction against allowing restriction.
Sure, it's a restriction that you must at least offer a particular license over your userpages. But it's a restriction which improves consistency, discourages particular kinds of wasteful discussion (zomg, you can't copy from my user page! yours is incompatibly licensed!), and which generally removes a flexibility that has little relevance to our mission. (Though perhaps more relevance to someone looking for a free webhost! :) )
In any case, if nazi-pedia is really trying to make it look like you're a contributor there, they could do amply well without copying your Wikipedia userpage. :) Licensing is not the right tool to use against fraud. It's a wrong fit.
Well, of course here you are extrapolating that something I said a while ago was a hidden reference in a post that did not explicitly reference it at all. Nicely done.
Communication accomplished well results in an understanding of both the direct and the implied. ... But here I was just guessing, since I wasn't quite sure where you were going.
Hand on my heart, I didn't even think about the Nazipedia thing in talking about userspace licensing this time. I in fact didn't think about this at all in personal terms; I was genuinely trying to explore the real licensing landscape we have to work with, not just you and me, but all of our contributors. Take that as you will, believe it or not.
Gladly. ... And in any case, the concern that Nazipedia can copy our userpages in a way that makes our contributors look like Nazipedia supporters is a valid concern; it's just not one which I think we can or should address with licensing. Sorry for jumping ahead of you incorrectly.
2008/7/30 Jussi-Ville Heiskanen cimonavaro@gmail.com:
This is all very interesting.
I wonder if I am reading this correctly.
Do I understand correctly that those who download not just our mainspace (you know, the real Wikipedia stuff, articles of encyclopaedic value) but the non-mainspace as well, do so with full knowledge that it isn't really encyclopaedic matter, and download it anyway?
Probably. Wikipedia mirrors have various approaches. The "bulk content to hang ads around" approach would consider the non-mainspace stuff to be worth having. Others, looking to run encyclopedic mirrors, less so. Still others take selected pages to pad existing content. Then there are various sites that use Wikipedia to feed Markov chain generators or as blog posts.
2008/7/30 David Katz dkatz2001@gmail.com:
On Wed, Jul 30, 2008 at 7:45 AM, Chris Howie cdhowie@gmail.com wrote:
On Wed, Jul 30, 2008 at 3:12 AM, David Katz dkatz2001@gmail.com wrote:
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a repository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
I believe most mirrors use an actual database dump (though some might scrape). In that case they would get all the content on Wikipedia, wherever it is.
Is there any way we can set things so that mirrors only mirror the actual namespace content and not talk, user and WP administrative pages?
Nope. There are a couple of ways you could do it in theory, but the side effects would be unacceptable.
2008/7/30 David Katz dkatz2001@gmail.com:
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a repository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
It wouldn't, and mirrors generally use database dumps anyway.
On Wed, Jul 30, 2008 at 3:12 AM, David Katz dkatz2001@gmail.com wrote:
Would this prevent mirror sites from 'scraping' noindex pages? This would be a definite improvement, as talk pages are often a repository of libel, and whilst we can remove and oversight material on WP pages, we're powerless to do anything on a mirror.
Why is libel tolerated on the talk pages in the first place? I don't see many people arguing for noindexing the project pages because of libel.
On Wed, Jul 30, 2008 at 6:51 PM, Anthony wikimail@inbox.org wrote:
It's not tolerated so much as a) it often takes longer to notice and b) the removal of BLP violations from an article often results in a debate on the talk pages in which the BLP violation is not only repeated but expanded.
On 7/23/08, Stephen Bain stephen.bain@gmail.com wrote:
On Wed, Jul 23, 2008 at 10:47 AM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
A couple of months ago, I raised on this list the issue of "no-indexing" Wikipedia pages outside the mainspace, principally including project-space pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these pages such that they will not be picked up by Google or other search engines.
Note that much of this is already done, see our robots file:
http://en.wikipedia.org/robots.txt
Currently all AFD, RFA, RFC and RFAR subpages (but not the main AFD page, the main RFA page, etc.) are blocked from indexing. Of your examples, the admin noticeboards and userspace are probably the big ones that are still indexed and that we might not want to be.
Note that the robots file can easily be updated by a request on bugzilla [1] if there is consensus for it.
- That Wikipedia currently lacks a top-quality internal search capability, and therefore we need to be able to use external search engines such as Google to perform administrator functions and the like. There is some merit
On this point, there's been great improvement in MediaWiki's search capabilities this year with the MWSearch backend coming online.
[1] Like this request, for example: https://bugzilla.wikimedia.org/show_bug.cgi?id=10288
-- Stephen Bain stephen.bain@gmail.com
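For readers unfamiliar with the mechanics behind the robots file mentioned above: Disallow lines match by path prefix, which is why subpages can be blocked while the parent page stays indexed. A minimal sketch using Python's standard robotparser; the two rules below are illustrative, modeled on the behaviour described in this thread, not a copy of the live en.wikipedia.org file:

```python
# Sketch: how a well-behaved crawler interprets prefix-based
# Disallow rules like the ones on en.wikipedia.org/robots.txt.
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the real file is much longer.
RULES = """\
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia:Requests_for_adminship/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

BASE = "http://en.wikipedia.org"
# An individual AfD subpage falls under the Disallow prefix: blocked.
afd_subpage = rp.can_fetch("*", BASE + "/wiki/Wikipedia:Articles_for_deletion/Foo")
# The main AfD page has no trailing slash, so no prefix matches: crawlable.
afd_main = rp.can_fetch("*", BASE + "/wiki/Wikipedia:Articles_for_deletion")
```

This is exactly the "subpages but not the main page" split described above: one trailing slash in the Disallow line does all the work.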
Thank you for this update. I think there may have been progress that I have missed in the past couple of months. When I posted on this topic a few months ago, either some of these types of pages were not yet no-indexed, or no one mentioned the fact, or if they did I overlooked it.
Other pages that should be excluded from indexing (if they aren't already) include SSP, RfCU, the old PAIN archives, WQA, and I'm sure people can put together a list of a few more.
As for userspace, I think the optimal solution would be to allow the individual user to opt in or out of indexing, if that is doable without too much fuss. (And indefblocked or banned users would automatically be no-indexed, to give those with identifiable usernames one fewer grievance to pursue after they have left us.) Query whether "in" or "out" would be the better default.
Newyorkbrad
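Whatever default is chosen, a per-page or per-user opt-out ultimately surfaces as a robots meta tag in the rendered HTML, which is what search engines actually act on. A minimal sketch of the check a crawler (or an auditing script) would perform; the checker class here is hypothetical tooling for illustration, not part of MediaWiki:

```python
# Sketch: detecting whether a rendered page carries a robots
# "noindex" directive in a <meta name="robots"> tag.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Record the content attribute of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.robots = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content", "")

def is_noindexed(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.robots is not None and "noindex" in finder.robots.lower()

# Hypothetical page fragments: one opted out of indexing, one not.
opted_out = '<html><head><meta name="robots" content="noindex,nofollow"/></head></html>'
indexed = '<html><head><title>User:Example</title></head></html>'
```

Unlike robots.txt, the meta tag travels with the page itself, so it also covers pages reached through paths the robots file doesn't anticipate.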
2008/7/23 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
As for userspace, I think the optimal solution would be to allow the individual user to opt in or out of indexing, if that is doable without too much fuss. (And indefblocked or banned users would automatically be no-indexed, to give those with identifiable usernames one fewer grievance to pursue after they have left us.) Query whether "in" or "out" would be the better default.
I am of the belief that only mainspace and limited project space (e.g., policies) should be indexed by default, and that all other areas should be no-index by default. Userspace should default to no-index, in particular. I recall a very contentious BLP-related discussion that took place over several months and was discussed not only in project space and on the talk pages of related articles, but also on multiple user pages. When I do a search for that subject now ("name of celebrity" +"BLP issue" +wikipedia), what comes up is the many discussions on userpages, perpetuating the BLP problem.
Risker
On Wed, Jul 23, 2008 at 11:29 AM, Risker risker.wp@gmail.com wrote:
I am of the belief that only mainspace and limited project space (e.g., policies) should be indexed by default, and that all other areas should be no-index by default. Userspace should default to no-index, in particular. I recall a very contentious BLP-related discussion that took place over several months and was discussed not only in project space and on the talk pages of related articles, but also on multiple user pages. When I do a search for that subject now ("name of celebrity" +"BLP issue" +wikipedia), what comes up is the many discussions on userpages, perpetuating the BLP problem.
Agreed: http://lists.wikimedia.org/pipermail/wikien-l/2007-September/081682.html (and some other posts in that month; the link there at the time was to a Google search showing that the #1 hit for 'Thomas Dalton' was a "this user has been banned from Wikipedia" userpage notice)
I'd also add portal namespace to the indexable stuff. But yea... indexing Main + Portal + named pages elsewhere would be really good. It would produce the right results for the vast majority of the searchers.
(and I for one don't mind that this thread was begun by an obviously evil troll! ;) )
On Tue, Jul 22, 2008 at 5:47 PM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
A couple of months ago, I raised on this list the issue of "no-indexing" Wikipedia pages outside the mainspace, principally including project-space pages such as XfDs, AN/ANI, RfA's, RfAr's, and the like, but possibly including userspace as well. By no-indexing, I refer to coding these pages such that they will not be picked up by Google or other search engines.
Regards, Newyorkbrad
Newyorkbrad is a member of Wikipedia Review; he is therefore a troll, or possibly brainwashed, and must not be listened to. Moderators, thank you for moderating the post initially. Too bad we can't just ban all the bad people so all us good people can work in peace.
regards ...
2008/7/23 Grease Monkee welloiledmachine@gmail.com:
Sorry, Grease Monkee, I think the thread you were looking for was the one entitled "Dangerous factionalism." Perhaps the mods can move it for you.
Risker
On Wed, Jul 23, 2008 at 8:32 AM, Risker risker.wp@gmail.com wrote:
Oh no, I'm quite sure this is the right thread. We MUST NOT allow our minds to be poisoned by these people, and the only way to do it is to STAMP OUT their voice on this list. I mean after all, the next thing you know something crazy might happen, like admitting that a WikipediaReviewer actually had a good idea.