Hi,
Nik Everett added a lot of information about future CirrusSearch changes to the next edition of Tech News (#26, due to be sent on Monday). (Thank you, Nik :)
We can't include all of it in the newsletter, so below is the original text with all the details. The newsletter will point to this email for further information.
_______________________________________________________________
- CirrusSearch updates (with 1.24wmf10)
  - Categories will now be considered in result ranking, which should improve results.
    - We took a shortcut to get this deployed (much) more quickly, and the consequence is that the incategory operator won't work for up to 24 hours after the deployment. We'll make this time as short as possible. If this is going to be a horrible pain then file a bug <https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=CirrusSearch> against Cirrus with the wiki you work on. We can either prioritize your wiki so the outage is very small or, if it's a big enough deal, come up with a workaround.
  - Text from the lead paragraph in the article will be given a boost when ranking results, which should also improve results.
    - This will take some time to roll onto the wikis after wmf10 because the index will have to be rebuilt. Days, likely.
    - I don't imagine this'll have any impact on Wiktionary and Commons, but file a bug <https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=CirrusSearch> against Cirrus if it seems like it has a negative impact on results.
  - We're on track to add support for searching in article source, including regular expressions. See the documentation <https://www.mediawiki.org/wiki/Help:CirrusSearch#insource:> for more.
    - Like the lead paragraph, the article source will take some time to roll into the index after the deployment.
    - Right now we haven't implemented snippet extraction from article source searches. You'll only get snippets back from the regular search terms. If you don't have any regular search terms you'll get back a snippet from the beginning of the article. I know this isn't ideal at all, and it's on the list of things to fix.
  - We'll cut all wikis over to a new snippet extractor.
    - You should only notice improvements in the snippets generated, but if you see any trouble file a bug <https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=CirrusSearch> against Cirrus.
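For anyone who wants to try the article-source search from a script once it has reached their wiki, here is a minimal sketch that only builds the request URL. It assumes the standard MediaWiki search API (action=query with list=search); the host and the example pattern are placeholders picked for illustration, and the insource:/.../ form should be checked against the help page linked above before relying on it.

```python
from urllib.parse import urlencode


def build_search_url(api_base: str, query: str, limit: int = 10) -> str:
    """Return a MediaWiki API URL that runs a full-text search for `query`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,   # the search string, operators included
        "srlimit": limit,    # maximum number of results to return
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


# Placeholder host and pattern, chosen only for illustration.
url = build_search_url("https://en.wikipedia.org/w/api.php", "insource:/colou?r/")
print(url)
```

Nothing is sent over the network here; the URL can be pasted into a browser or handed to whatever HTTP client you prefer.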
Sorry for writing a novel!
On Fri, Jun 20, 2014 at 9:30 AM, Guillaume Paumier gpaumier@wikimedia.org wrote:
-- Guillaume Paumier Technical Communications Manager — Wikimedia Foundation
Wikitech-ambassadors mailing list Wikitech-ambassadors@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
On Fri, Jun 20, 2014 at 8:00 PM, Nikolas Everett neverett@wikimedia.org wrote:
> Sorry for writing a novel!
No problem at all; detailed updates are a good thing :)
++++1!
2014-06-20 20:06 GMT+02:00 Guillaume Paumier gpaumier@wikimedia.org:
> No problem at all; detailed updates are a good thing :)
And here I am with more questions. ;-)
Can you explain how the lead paragraph section will work? For instance, the Wikisources pages generally lead with a generic ns template before starting on the body of the work.
Snippets ... with Wikisources poetry, is there a means to have the first line of a poem promoted to be a snippet? I ask because much poetry is cited by its first line, so I can see some synergy, especially if there were either some markup or automatic recognition of the first line. Note also that, depending on the work, there is sometimes a title before the first line and sometimes not.
Within which feature is the snippets stuff? Looking at http://git.wikimedia.org/tree/mediawiki%2Fextensions%2FCirrusSearch.git/mast... it isn't obvious to me which it is.
Thanks. Regards, Billinghurst
(in for a penny, in for a pound) Is there a means to find out how users are using local search facilities to find things?
1) Types of searches? How much customisation is taking place between simple and complex searches?
2) Common searches?
3) Searching from the available search box, or utilising the full search tools?
Thanks. Regards, Billinghurst
Responding in line:
On Sat, Jun 21, 2014 at 10:54 AM, billinghurst billinghurst@gmail.com wrote:
> (in for a penny, in for a pound) Is there a means to find out how users are using local search facilities to find things?
> 1) Types of searches? How much customisation is taking place between simple and complex searches?
> 2) Common searches?
> 3) Searching from the available search box, or utilising the full search tools?
Sort of. We log searches, but we don't make an effort to link subsequent searches to know when the same user is searching again, so we can't tell if they are refining the search. We're loath to dig into this more deeply because we're generally worried about privacy issues. Digging too much into this would also be a ton of work.
Interesting point: when a search runs super duper long (>10 seconds) we do log the user/IP that performed the search, so if someone is exploiting some bug we can ban them to save the infrastructure. We've only ever had one such bug and no one exploited it, but we collect the data anyway.
But we totally can dig through the search logs and figure out which operators are in use, and we could determine common searches. We don't do that automatically, though, and we don't use it for much. We're pretty concerned about that kind of data crossing the NDA barrier, which is more privacy concerns.
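To make the "which operators are in use" idea concrete, here is a rough sketch of the kind of tally that could be run over logged query strings. The log format is an assumption (one raw query per entry, with no user or IP attached), the operator list is a small illustrative subset rather than everything Cirrus supports, and none of this is an existing tool.

```python
import re
from collections import Counter

# Illustrative subset of search keywords; not an exhaustive list.
OPERATORS = ("incategory", "insource", "intitle", "prefix")

# Match "operator:" at a word boundary, optionally negated with a leading "-".
OPERATOR_RE = re.compile(
    r"(?<!\w)-?(" + "|".join(OPERATORS) + r"):", re.IGNORECASE
)


def count_operators(queries):
    """Count how many queries use each operator (at most once per query)."""
    counts = Counter()
    for query in queries:
        used = {m.group(1).lower() for m in OPERATOR_RE.finditer(query)}
        counts.update(used)
    return counts


# Hypothetical sample of logged queries.
sample = [
    "incategory:Poetry first line",
    "insource:/colou?r/ intitle:paint",
    "plain words only",
    "intitle:paint intitle:brush",
]
print(count_operators(sample))
```

Because it only ever sees query strings and only emits aggregate counts, a tally like this says nothing about whether one user refined a search, which matches the limitation described above.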
> Thanks. Regards, Billinghurst
On Sun, 22 Jun 2014 00:44:38 +1000, billinghurst billinghurst@gmail.com wrote:
> And here I am with more questions. ;-)
> Can you explain how the lead paragraph section will work? For instance, the Wikisources pages generally lead with a generic ns template before starting on the body of the work.
The lead paragraph implementation that we use now is somewhat Wikipedia-centric, unfortunately. We had to start somewhere, though. The lead paragraph is actually everything between the start of the article and the first heading. Other warts:
1. This is the rendered document - templates are expanded.
2. This is only article text; stuff that we've pulled out as "auxiliary", like the contents of tables and image captions, doesn't count.
3. If there isn't a heading then we give up and assume there isn't a lead paragraph.
It's not perfect by any means, but it's a step. The code itself allows for other implementations to be configured per wiki, but I only have one other implementation: assume there isn't a lead at all. Right now all wikis use the heading implementation.
I'm happy to work on this. It might be useful to let the page mark its lead section itself rather than have us infer it. Another option is to mark the contents of the template as "auxiliary text" so it'll be excluded from the lead. I wrote some documentation here: https://www.mediawiki.org/w/index.php?title=Help%3ACirrusSearch&diff=104...
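As a rough illustration of the heading-based rule described above, here is a small sketch. It is not the CirrusSearch code: it assumes the rendered page is available as an HTML string, finds the first heading with a simple regular expression, and does not model the "auxiliary text" exclusion at all. The function name and sample pages are made up for the example.

```python
import re

# First opening heading tag (<h1> .. <h6>) in the rendered page.
HEADING_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)
# Crude tag stripper; good enough for a sketch, not for real HTML.
TAG_RE = re.compile(r"<[^>]+>")


def extract_lead(rendered_html: str) -> str:
    """Return the text before the first heading, or "" if there is no heading."""
    match = HEADING_RE.search(rendered_html)
    if match is None:
        # No heading at all: give up and treat the page as having no lead.
        return ""
    lead_html = rendered_html[: match.start()]
    # Drop tags and collapse whitespace so only the visible text remains.
    return " ".join(TAG_RE.sub(" ", lead_html).split())


page = "<p>A <b>poem</b> by someone.</p><h2>Text</h2><p>First line</p>"
print(extract_lead(page))
print(repr(extract_lead("<p>No headings here.</p>")))
```

On a page that opens with an expanded header template, that template's rendered text ends up in the lead under this rule, which is exactly the Wikisource problem raised in the question.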
> Snippets ... with Wikisources poetry, is there a means to have the first line of a poem promoted to be a snippet? I ask because much poetry is cited by its first line, so I can see some synergy, especially if there were either some markup or automatic recognition of the first line. Note also that, depending on the work, there is sometimes a title before the first line and sometimes not.
> Within which feature is the snippets stuff? Looking at http://git.wikimedia.org/tree/mediawiki%2Fextensions%2FCirrusSearch.git/mast... it isn't obvious to me which it is.
highlighting.feature - highlighting is the name that Lucene uses for marking search matches in text. It's a pretty common term for this, but I tend to use "extracting snippets" when talking to non-Lucene users because it's less ambiguous.
We can certainly work on things like the first line of the poem. We maintain our own highlighter separate from the rest of the Lucene/Elasticsearch code base specifically so we can iterate on it quickly. Can you file a bug with an example page? That'd be super helpful. The highlighter is best viewed here https://github.com/wikimedia/search-highlighter if you are curious. We maintain it in gerrit, but it gets replicated to github and the documentation is more readable there. This keeps it from being _too_ different from every other Elasticsearch plugin.
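For readers who have not met the term, the sketch below shows what "extracting a snippet" means in the simplest possible form: find the first search term in the text, return a window around it with the match marked, and fall back to the start of the text when nothing matches. This is a toy written for this thread and says nothing about how the search-highlighter plugin actually works; the function name, the bracket markers and the window size are all assumptions.

```python
import re


def extract_snippet(text, terms, width=40):
    """Return a short snippet around the first matching term, marking the match."""
    if terms:
        pattern = re.compile(
            "|".join(re.escape(term) for term in terms), re.IGNORECASE
        )
        match = pattern.search(text)
        if match:
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            return (
                text[start:match.start()]
                + "[" + match.group(0) + "]"
                + text[match.end():end]
            )
    # No terms, or no match: fall back to the beginning of the text.
    return text[: 2 * width]


poem = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."
print(extract_snippet(poem, ["lovely"], width=10))
print(extract_snippet(poem, [], width=10))
```

The fallback branch is the behaviour poetry pages would want to influence: a rule that prefers the first line of the poem over whatever text happens to open the page would slot in there.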
Nik