[WikiEN-l] Please remove user data and talk pages from database dumps!

SPUI drspui at gmail.com
Sat Dec 24 00:48:48 UTC 2005


Tupsharru Tupsharru wrote:
> I once registered as a user of Wikipedia, and I know that anything I write there *may* be copied and re-used according to the GFDL. However, I did not sign up for the Pornopedia, Nazipedia or Spamopedia.
> 
> What is written on user pages and user talkpages is also released under the GFDL, and if somebody wants to copy it or quote it, fine (as long as it is attributed)! But there is no reason to automate this process or make it easy for webspammers and other creeps to do so. I do not want my user page to be copied to various Wikipedia mirrors, as happened a while ago with the Nazi copy of Wikipedia. I would be even less happy if I had signed up under my real name. The appearance of a name in such a context may actually be harmful to somebody's reputation. 
> 
> 1. My first suggestion: just *make sure that when the database is copied, user information does not come along with it, including userpages, user talkpages and even the history of a page*. I notice from some of the mirrors out there, that the only contributor visible in the history of an article is the last one before the dump, somebody who may just have corrected a typo. As it doesn't give proper attribution in any case, we may just as well get rid of that too. Just make sure the history page of every downloaded article refers back to Wikipedia, where the full history can be found.
> 
> 2. Second suggestion: is there any reason why *any* discussion pages need to come with the normal database dump? The Nazi 'pedia (which is down now) took these and search-and-replaced "Wikipedia" with its own name everywhere, giving the misleading impression that a lot of Wikipedia users had been active in discussions on a Nazi website. This may be seriously harmful to somebody's reputation if found through a Google search by somebody not familiar with the GFDL and how Wikipedia works. It is probably illegal in some way to do what they did (as Wikipedia will no longer be properly credited) but I just don't see anybody going to court to stop it, and we certainly don't need to facilitate abuse of mirrored discussion pages with consequences for the reputation or privacy of individual users. Again, please *replace all discussion pages in the database dump with a very clear and visible link back to Wikipedia*, not just the minuscule one down at the bottom of every page. Most downloaders are not going to bother removing that link, as all they want Wikipedia content for is to get Google hits and drive traffic to their websites.
> 
> 3. Remove the user namespace from the reach of Google's indexing bots. It should be available to our internal search, but there is no reason it should get hits from Google. Userspace contains all kinds of semi-private conversations and unfinished drafts which are really only of internal use and interest.
> 
> I question whether some other type of free but non-commercial license wouldn't be more suitable for user pages, but that may not be realistic for various reasons. But the removal of these pages from the dump really shouldn't require a change in license. It will just force somebody who wants to copy the content to do so manually. The webspammers obviously won't bother with that.
> 
There are certain advantages to including everything in the dumps. For 
instance, if Wikipedia gets shut down for some reason, it would be much 
easier to start it up elsewhere if we had a full dump.

As for Google indexing, many of the pages in my User space, like 
http://en.wikipedia.org/wiki/User:SPUI/Amtrak_shite , are useful, and I 
would hope that Google does index them.


