I can make a better case for hiding things from internal search than I did
on the bug. I'll send it here and copy it to the mailing list:
The biggest case I can think of for excluding text from search is the
license information on Commons. Take that as an example; it may be the
only one, but I think it's a pretty important one.
1. The license information doesn't add a whole lot to the results. Try
searching Commons with Cirrus for "distribute", "transmit", or "following"
and you'll very quickly start to see the text of the CC license. The
searches find 14 million results. Heaven forbid you want to find
"distributed transmits" or something: you'll almost exclusively get the
license highlighted and you'll still find 14 million results. This isn't
_horrible_, because the top results all have "distribute" or "transmit" in
the title, but it isn't great.
2. Knock-on effect from #1: because relevance is calculated based on the
inverse of the number of documents that contain a word, every term in the
CC license is worth less than words not in the license. I can't point to
any example of why that is bad, but I feel it in my bones. Feel free to
ignore this; I'm probably paranoid.
3. Entirely self-serving: given #1, the contents of the license take up an
awful lot of space for very little benefit. If I had more space I could
make Cirrus a beta on more wikis. It's kind of a lame reason, and I'm
attacking the space issue from other angles, so maybe it'll be moot long
before we get this deployed and convince the community that it is worth
doing.
4. Really, really self-serving: if .nosearch is the right solution and is
useful, then it is super duper easy to implement. Like one line of code, a
few tests, and bam. It's already done, just waiting to be rebased and
merged. It was so easy it would have taken longer to estimate the effort
than to propose an implementation.
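The weighting effect described in #2 can be illustrated with the classic inverse document frequency formula. This is only a sketch: Lucene (which backs Cirrus) uses a smoothed variant of IDF, and the corpus numbers below are made up for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Classic inverse document frequency: a term found in more
    documents gets a smaller weight per match."""
    return math.log(total_docs / docs_with_term)

# Hypothetical corpus size, roughly Commons-scale.
total = 20_000_000

# A CC-license word like "distribute" appears on ~14M file pages...
license_word = idf(total, 14_000_000)

# ...while a genuinely distinctive word appears on a few thousand.
distinctive_word = idf(total, 5_000)

# Every match on the license word contributes far less to relevance.
print(round(license_word, 2), round(distinctive_word, 2))  # 0.36 8.29
```

So license boilerplate doesn't just clutter results; it also drags down the weight of every word it contains, everywhere that word appears.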
I really wouldn't be surprised if someone came up with a great reason why
#1 is silly and we just shouldn't do it.
The big problem with the nosearch class implementation is that it'd be
pretty simple to abuse and hard to catch the abuse, because the text is
still on the page. One of the nice things about the solution is that you
could use a web browser's debugger to highlight all the text excluded from
search by writing a simple CSS rule targeting the class.
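For what it's worth, the stripping in #4 amounts to something like the sketch below. The real change would live in CirrusSearch's PHP text extraction, so this Python stand-in is purely illustrative; the only thing taken from the proposal is the .nosearch class name.

```python
from html.parser import HTMLParser

class NoSearchStripper(HTMLParser):
    """Collect page text, skipping anything inside an element that
    carries class="nosearch". Illustrative only; a real implementation
    would also handle bare void tags like <br>, which would unbalance
    this simple depth counter."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a .nosearch subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "nosearch" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

html = '<p>Keep me.</p><div class="nosearch"><p>License boilerplate.</p></div>'
p = NoSearchStripper()
p.feed(html)
print("".join(p.parts).strip())  # Keep me.
```

The debugger trick mentioned above is the flip side of the same idea: because the markup is untouched, anyone can make the excluded regions visible in the browser with one CSS rule against the same class.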
I think that is all I have on the subject,
Nik/manybubbles
On Wed, Feb 19, 2014 at 1:29 AM, Chad <innocentkiller(a)gmail.com> wrote:
On Tue, Feb 18, 2014 at 9:50 PM, MZMcBride <z(a)mzmcbride.com> wrote:
Chad wrote:
I'm curious how people would go about hiding text from the internal
MediaWiki search engine (not external robots). Right now I'm thinking of
doing a rather naïve .nosearch class that would be stripped before
indexing. I can see potential for abuse though.
Does anyone have any bright ideas?
It's difficult to offer advice without knowing why you're trying to do
what it is you're trying to do. You've described a potential solution, but
I'm not sure what problem you're trying to solve. Are there some example
use-cases, or perhaps there's a relevant bug in Bugzilla?
Ah, here's the bug:
https://bugzilla.wikimedia.org/show_bug.cgi?id=60484
-Chad
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l