I can make a better case for hiding things from internal search than I did on the bug. I'll send it here and copy it to the mailing list:
The biggest case I can think of for excluding text from search is the license information on Commons. Please take that as an example. It may be the only example, but I think it is a pretty important one.

1. The license information doesn't add a whole lot to the results. Try searching Commons with Cirrus for "distribute", "transmit", or "following" and you'll very quickly start to see the text of the CC license, and the searches find 14 million results. Heaven forbid you want to find "distributed transmits" or something. You'll almost exclusively get the license highlighted and you'll still find 14 million results. This isn't _horrible_ because the top results all have "distribute" or "transmit" in the title, but it isn't great.

2. Knock-on effect from #1: because relevance is calculated based on the inverse of the number of documents that contain the word, every term in the CC license is worth less than words not in the license. I can't point to any example of why that is bad but I feel it in my bones. Feel free to ignore this. I'm probably paranoid. (There are some rough numbers after this list.)

3. Entirely self-serving: given #1, the contents of the license take up an awful lot of space for very little benefit. If I had more space I could make Cirrus a beta on more wikis. It is kind of a lame reason, and I'm attacking the space issue from other angles, so maybe it'll be moot long before we get this deployed and convince the community that it is worth doing.

4. Really, really self-serving: if .nosearch is the right solution and is useful, then it is super duper easy to implement. Like one line of code, a few tests, and bam. It's already done, just waiting to be rebased and merged. It was so easy it would have taken longer to estimate the effort than to propose an implementation. (There's a rough sketch of the idea below.)
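To put some rough numbers on #2: Cirrus is Elasticsearch/Lucene underneath, and the classic idf weight is roughly 1 + ln(numDocs / (docFreq + 1)). The totals below are made up purely for illustration (call it ~20 million indexed files; the 14 million comes from the searches above):

    import math

    def idf(num_docs, doc_freq):
        # Lucene-style classic idf: rarer words are worth more.
        return 1 + math.log(num_docs / (doc_freq + 1))

    # A CC license word that shows up in ~14 of ~20 million files...
    print(idf(20000000, 14000000))   # ~1.36
    # ...versus a word that shows up in only ~10 thousand files.
    print(idf(20000000, 10000))      # ~8.60

So by that back-of-the-envelope math a license word carries maybe a sixth of the weight of a reasonably rare word. That's the bone feeling in #2.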
I really wouldn't be surprised if someone could come up with a great reason why #1 is silly and we just shouldn't do it.
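For what it's worth, here is roughly what the stripping in #4 looks like. This isn't the actual Cirrus change (that one is PHP inside the extension); the library and function name here are just stand-ins for the idea:

    from bs4 import BeautifulSoup

    def indexable_text(rendered_html):
        soup = BeautifulSoup(rendered_html, "html.parser")
        # Drop every element carrying the nosearch class, text and all,
        # before the page text goes off to the search index.
        for element in soup.select(".nosearch"):
            element.decompose()
        return soup.get_text(separator=" ", strip=True)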
The big problem with the nosearch class implementation is that it'd be pretty simple to abuse and hard to catch the abuse, because the text is still on the page. One of the nice things about the solution, though, is that you could use a web browser's debugger to highlight all the text excluded from search by writing a simple CSS rule.
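For example, something like this pasted into the browser's style editor would light up everything a page is hiding from search (the colors are arbitrary, obviously):

    .nosearch {
        background-color: yellow;
        outline: 2px solid red;
    }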
I think that is all I have on the subject,
Nik/manybubbles
On Wed, Feb 19, 2014 at 1:29 AM, Chad innocentkiller@gmail.com wrote:
On Tue, Feb 18, 2014 at 9:50 PM, MZMcBride z@mzmcbride.com wrote:
Chad wrote:
I'm curious how people would go about hiding text from the internal MediaWiki search engine (not external robots). Right now I'm thinking of doing a rather naïve .nosearch class that would be stripped before indexing. I can see potential for abuse, though.
Does anyone have any bright ideas?
It's difficult to offer advice without knowing why you're trying to do what it is you're trying to do. You've described a potential solution, but I'm not sure what problem you're trying to solve. Are there some example use-cases or perhaps there's a relevant bug in Bugzilla?
Ah, here's the bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=60484
-Chad