I can make a better case for hiding things from internal search than I did
on the bug. I'll send it here and copy it to the mailing list:
The biggest case I can think of for excluding text from search is the
license information on Commons. Take that as an example; it may be the
only one, but I think it's a pretty important one.
1. The license information doesn't add a whole lot to the results. Try
searching Commons with Cirrus for "distribute", "transmit", or "following"
and you'll very quickly start to see the text of the CC license. The
searches find 14 million results. Heaven forbid you want to find
"distributed transmits" or something: you'll almost exclusively get the
license highlighted and you'll still find 14 million results. This isn't
_horrible_, because the top results all have "distribute" or "transmit" in
the title, but it isn't great.
2. Knock-on effect from #1: because relevance is calculated based on the
inverse of the number of documents that contain a word, every term in the
CC license is worth less than words not in the license. I can't point to
any example of why that is bad, but I feel it in my bones. Feel free to
ignore this; I'm probably paranoid.
3. Entirely self-serving: given #1, the contents of the license take up an
awful lot of space for very little benefit. If I had more space I could
make Cirrus a beta on more wikis. It's kind of a lame reason, and I'm
attacking the space issue from other angles, so maybe it'll be moot long
before we get this deployed and convince the community that it is worth
doing.
4. Really, really self-serving: if .nosearch is the right solution and is
useful, then it is super duper easy to implement. Like one line of code, a
few tests, and bam. It's already done, just waiting to be rebased and
merged. It was so easy it would have taken longer to estimate the effort
than to propose an implementation.
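The weighting effect described in #2 can be illustrated with the classic inverse document frequency formula. This is only a sketch: Lucene (which backs Cirrus) uses a smoothed variant of IDF, and the corpus numbers below are made up for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Classic inverse document frequency: a term found in more
    documents gets a smaller weight per match."""
    return math.log(total_docs / docs_with_term)

# Hypothetical corpus size, roughly Commons-scale.
total = 20_000_000

# A CC-license word like "distribute" appears on ~14M file pages...
license_word = idf(total, 14_000_000)

# ...while a genuinely distinctive word appears on a few thousand.
distinctive_word = idf(total, 5_000)

# Every match on the license word contributes far less to relevance.
print(round(license_word, 2), round(distinctive_word, 2))  # 0.36 8.29
```

So license boilerplate doesn't just clutter results; it also drags down the weight of every word it contains, everywhere that word appears.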
I really wouldn't be surprised if someone came up with a great reason why
#1 is silly and we just shouldn't do it.
The big problem with the nosearch class implementation is that it'd be
pretty simple to abuse and hard to catch the abuse, because the text is
still on the page. One of the nice things about the solution is that you
could use a web browser's debugger to highlight all the text excluded from
search by writing a simple CSS rule targeting the class.
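For what it's worth, the stripping in #4 amounts to something like the sketch below. The real change would live in CirrusSearch's PHP text extraction, so this Python stand-in is purely illustrative; the only thing taken from the proposal is the .nosearch class name.

```python
from html.parser import HTMLParser

class NoSearchStripper(HTMLParser):
    """Collect page text, skipping anything inside an element that
    carries class="nosearch". Illustrative only; a real implementation
    would also handle bare void tags like <br>, which would unbalance
    this simple depth counter."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a .nosearch subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "nosearch" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

html = '<p>Keep me.</p><div class="nosearch"><p>License boilerplate.</p></div>'
p = NoSearchStripper()
p.feed(html)
print("".join(p.parts).strip())  # Keep me.
```

The debugger trick mentioned above is the flip side of the same idea: because the markup is untouched, anyone can make the excluded regions visible in the browser with one CSS rule against the same class.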
I think that is all I have on the subject,
Nik/manybubbles
On Wed, Feb 19, 2014 at 1:29 AM, Chad <innocentkiller(a)gmail.com> wrote:
On Tue, Feb 18, 2014 at 9:50 PM, MZMcBride <z(a)mzmcbride.com> wrote:
Chad wrote:
I'm curious how people would go about hiding text from the internal
MediaWiki search engine (not external robots). Right now I'm thinking of
doing a rather naïve .nosearch class that would be stripped before
indexing. I can see potential for abuse though.
Does anyone have any bright ideas?
It's difficult to offer advice without knowing why you're trying to do
what it is you're trying to do. You've described a potential solution, but
I'm not sure what problem you're trying to solve. Are there some example
use-cases, or perhaps there's a relevant bug in Bugzilla?
Ah, here's the bug:
https://bugzilla.wikimedia.org/show_bug.cgi?id=60484
-Chad
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l