Responding in line:


On Sat, Jun 21, 2014 at 10:54 AM, billinghurst <billinghurst@gmail.com> wrote:
(in for a penny, in for a pound)
Is there a means to find out how users are using local search facilities
to find things?


1) Types of searches?  How much customisation is taking place between
simple and complex searches? 
2) Common searches?
3) Searching from the available search box, or utilising the full search
tools?


Sort of.  We log searches but we don't make an effort to link subsequent searches to know when the same user is searching again.  So we can't tell if they are refining the search.  We're loath to dig into this more deeply because we're just generally worried about privacy issues.  Also digging too much into this would be a ton of work.

Interesting point: when a search goes super duper long (>10 seconds) we do log the user/IP that performed the search so if someone is exploiting some bug we can ban them to save the infrastructure.  We've only ever had one such bug and no one exploited it but we collect the data anyway.
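
To make the slow-search logging concrete, here is a minimal sketch of the idea, not the actual CirrusSearch code.  The function name, the logger name and the injectable clock are all made up for illustration; only the 10 second threshold and the "log user/IP" behaviour come from the description above.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Threshold from the description above; everything else here is hypothetical.
SLOW_SEARCH_SECONDS = 10.0
logger = logging.getLogger("slow_search")


def timed_search(
    search: Callable[[], T],
    user: str,
    ip: str,
    threshold: float = SLOW_SEARCH_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Run a search and log who ran it only if it took longer than threshold."""
    start = clock()
    try:
        return search()
    finally:
        elapsed = clock() - start
        if elapsed > threshold:
            # Identity is recorded only for pathologically slow searches.
            logger.warning("slow search: %.1fs user=%s ip=%s", elapsed, user, ip)
```

The point of the design is that ordinary searches never have a user or IP attached to them in the log; only the outliers do.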

But we totally can dig through the search logs and figure out which operators are in use and we could determine common searches - but we don't do that automatically and we don't use it for much.  We're pretty concerned about that kind of data crossing the NDA barrier - more privacy concerns.
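
As a rough illustration of what that digging could look like, here is a sketch that counts operators and common queries.  It assumes the queries have already been pulled out of the logs as plain strings, and it assumes an operator is any `name:` prefix at the start of a word (like intitle: or incategory:).  The real log format and the real parsing are not shown here, and the function name is made up.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumption: an operator is a run of letters followed by ":" at the start
# of a word, optionally negated with a leading "-".
_OPERATOR = re.compile(r"(?<!\S)-?([a-z]+):")


def summarize_queries(queries: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (operator counts, query counts) for an iterable of query strings."""
    operators: Counter = Counter()
    common: Counter = Counter()
    for raw in queries:
        # Normalise whitespace and case so trivially different queries group together.
        query = " ".join(raw.split()).lower()
        if not query:
            continue
        common[query] += 1
        operators.update(_OPERATOR.findall(query))
    return operators, common
```

Calling `common.most_common(20)` on the result would give the most frequent queries.  Note that nothing in this sketch addresses the privacy concerns above; it only counts strings.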
 

Thanks. Regards, Billinghurst

On Sun, 22 Jun 2014 00:44:38 +1000, billinghurst <billinghurst@gmail.com>
wrote:
> And here I am with more questions. ;-)
>
> Can you explain how the lead paragraph section will work? For instance the
> Wikisources pages generally lead with a generic ns template, before
> starting on the body of the work.

The lead paragraph implementation that we use now is somewhat Wikipedia-centric, unfortunately.  We had to start somewhere though.  The lead paragraph is actually everything between the start of the article and the first heading. Other warts:
1.  This is the rendered document - templates are expanded.
2.  This is only article text; stuff that we've pulled out as "auxiliary", like the contents of tables and image captions, doesn't count.
3.  If there isn't a heading then we give up and assume there isn't a lead paragraph.
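
The heading rule can be sketched like this.  This is a simplified illustration rather than the real implementation: it assumes the rendered page is HTML with headings as h1 to h6 tags, it strips tags with a crude regex, and it does not attempt the "auxiliary" text exclusion from point 2.  The function name is made up.

```python
import re
from typing import Optional

# Assumption: headings in the rendered page are <h1>..<h6> elements.
_HEADING = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")


def lead_section(rendered_html: str) -> Optional[str]:
    """Return the text before the first heading, or None if there is no heading."""
    match = _HEADING.search(rendered_html)
    if match is None:
        # Point 3 above: no heading means we assume there is no lead paragraph.
        return None
    lead_html = rendered_html[: match.start()]
    # Crude tag stripping, good enough for a sketch; then collapse whitespace.
    text = _TAG.sub(" ", lead_html)
    return " ".join(text.split())
```

Because this works on rendered output (point 1), a template at the top of a Wikisource page would land in the lead under this rule.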

It's not perfect by any means, but it's a step.  The code itself allows for other implementations to be configured per wiki but I only have one other implementation: assume there isn't a lead at all.  Right now all wikis use the heading implementation.

I'm happy to work on this.  It might be useful to let the page mark its lead section itself rather than have us infer it.  Another option is to mark the contents of the template as "auxiliary text" so it'll be excluded from the lead.  I wrote some documentation here:
https://www.mediawiki.org/w/index.php?title=Help%3ACirrusSearch&diff=1045315&oldid=1043851

 
>
> Snippets ... with Wikisources poetry, is there a means to have the first
> line of a poem promoted to be a snippet? Thinking of this as much poetry is
> cited by its first line, so can see some synergy, especially if there was
> either some markup, or automatic recognition of the first line. Also
> knowing that depending on the work sometimes it has a title prior to the
> first line, sometimes not.
>
> Within which feature is the snippets stuff? Looking at
> http://git.wikimedia.org/tree/mediawiki%2Fextensions%2FCirrusSearch.git/master/tests%2Fbrowser%2Ffeatures
> it isn't obvious to me which it is.


highlighting.feature - highlighting is the name that Lucene uses for marking search matches in text.  It's a pretty common term for this but I tend to use "extracting snippets" when talking to non-Lucene users because it's less ambiguous.

We can certainly work on things like the first line of the poem.  We maintain our own highlighter separate from the rest of the Lucene/Elasticsearch code base specifically so we can iterate on it quickly.  Can you file a bug with an example page?  That'd be super helpful.  The highlighter is best viewed here if you are curious.  We maintain it in gerrit but it gets replicated to github and the documentation is more readable there.  This keeps it from being _too_ different from every other Elasticsearch plugin.

Nik