Cross-posting this to the Discovery mailing list with hopes that someone
from WMF Discovery can shed some light on this situation.
Pine
On Mon, May 15, 2017 at 2:08 PM, Tom <tom(a)hutch4.us> wrote:
> I actually think there is a drop in page content results too. For example,
> searching for pages that use a tag such as <FooBar>text</FooBar> used to
> report the content found in x pages. Now a search for <FooBar> reports no
> content found in pages. A search for <FooBar no > is found on 3 pages, but I
> expect 50.
>
> I do want to do more testing. Rebuilding the index now seems to be super fast,
> unlike before, when it would take up to a few minutes to complete.
>
> Tom
>
> > On May 15, 2017, at 10:02 AM, [[kgh]] <mediawiki(a)kghoffmeyer.de> wrote:
> >
> > Heiya,
> >
> > it's me again. :) Does somebody at least see the issue? Probably a bug
> > that should be reported?
> >
> > Thanks and cheers
> >
> > Karsten
> >
> >
> >> Am 09.05.2017 um 16:32 schrieb [[kgh]]:
> >> Heiya,
> >>
> >> I have upgraded from 1.23 to 1.27, which is now possible with the
> >> latest release.
> >>
> >> After the process I observe a changed behavior regarding the rudimentary
> >> full-text search MediaWiki provides out of the box, i.e. I am not
> >> talking about the Cirrus/Elastica duo available as an extra.
> >>
> >> When entering a search term like "Lorem ipsum" (note: including the ")
> >> into the search field on MW 1.27, only the page names of the matches are
> >> shown, and not the page names together with a text extract wrapping the
> >> searched term, as MW 1.23 did. When entering just Lorem ipsum (note:
> >> excluding the ") I get the page names and a text extract wrapping the
> >> searched term, as I did with 1.23. However, the results for Lorem ipsum
> >> are a much worse fit than for "Lorem ipsum", which is why I am here.
> >>
> >> Perhaps I missed some setting I now have to make, or perhaps there is
> >> some script I overlooked to get things running. I'd like to get the
> >> wrapping text back. Pointers highly appreciated.
> >>
> >> Thanks for your time
> >>
> >> Karsten
> >>
> >>
>
>
> _______________________________________________
> MediaWiki-l mailing list
> To unsubscribe, go to:
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>
Some of Andrew's report may be of interest to Discovery.
Pine
---------- Forwarded message ----------
From: Andrew Hall <hall1467(a)umn.edu>
Date: Wed, May 31, 2017 at 7:34 AM
Subject: [Wiki-research-l] Report/Reflection on CHI 2017
To: Research into Wikimedia content and communities <
wiki-research-l(a)lists.wikimedia.org>
Hello all,
I recently attended the 2017 Conference on Human Factors in Computing
Systems (CHI) and put together a small report/reflection for Aaron Halfaker
regarding some of the work that was presented there that I found
interesting. If you’d like to check the report out, it can be found here:
https://meta.wikimedia.org/wiki/User:Hall1467/CHI_2017_Report. CHI is a
yearly human-computer interaction conference and is a common venue for
studies on peer production communities such as Wikipedia.
Feel free to leave questions or comments in the talk page! Have a great
rest of the week.
Andrew
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Apologies for the cross-posting...
--
deb tankersley
irc: debt
Product Manager, Discovery
Wikimedia Foundation
---------- Forwarded message ---------
From: [[kgh]] <mediawiki(a)kghoffmeyer.de>
Date: Tue, May 9, 2017 at 9:33 AM
Subject: [MediaWiki-l] Search feature / Change in behaviour
To: MediaWiki admin list <mediawiki-l(a)lists.wikimedia.org>
Heiya,
I have upgraded from 1.23 to 1.27, which is now possible with the
latest release.
After the process I observe a changed behavior regarding the rudimentary
full-text search MediaWiki provides out of the box, i.e. I am not
talking about the Cirrus/Elastica duo available as an extra.
When entering a search term like "Lorem ipsum" (note: including the ")
into the search field on MW 1.27, only the page names of the matches are
shown, and not the page names together with a text extract wrapping the
searched term, as MW 1.23 did. When entering just Lorem ipsum (note:
excluding the ") I get the page names and a text extract wrapping the
searched term, as I did with 1.23. However, the results for Lorem ipsum
are a much worse fit than for "Lorem ipsum", which is why I am here.
Perhaps I missed some setting I now have to make, or perhaps there is
some script I overlooked to get things running. I'd like to get the
wrapping text back. Pointers highly appreciated.
Thanks for your time
Karsten
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
At our weekly relevance meeting, an interesting idea came up about how to
collect relevance judgements for the long tail of queries, which makes up
around 60% of search sessions.
We are pondering asking questions on the article pages themselves. Roughly,
we would manually curate a list of queries we want to collect relevance
judgements for. When a user has spent some threshold of time (60s?) on a
page we would, for some % of users, check whether we have any queries we want
labeled for that page, and then ask them if the page is a relevant result
for that query. In this way the amount of work asked of each individual is
relatively low and hopefully something they can answer without much effort.
We know that the average page receives a few thousand page views per
day, so even with a relatively low response rate we could probably collect
a reasonable number of judgements over some medium-length time period
(weeks?).
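
For concreteness, here is a minimal Python sketch of that per-page-view
decision. The curated mapping, the 1% sampling rate, and the function name
(CURATED_QUERIES, SAMPLING_RATE, maybe_pick_question) are all illustrative
assumptions on my part, not an actual design; only the 60s dwell threshold
comes from the idea above, and in practice this logic would run client-side.

import random
from typing import Optional

# Hypothetical curated mapping of page title -> queries we still want judged.
# In reality this list would be manually curated, not hard-coded.
CURATED_QUERIES = {
    "Lorem ipsum": ["lorem", "placeholder text"],
}

DWELL_THRESHOLD_S = 60   # the 60s threshold floated above
SAMPLING_RATE = 0.01     # assumed: only ~1% of eligible page views get asked

def maybe_pick_question(page_title: str, dwell_seconds: float) -> Optional[str]:
    """Return a query to ask about, or None if this page view is not surveyed."""
    if dwell_seconds < DWELL_THRESHOLD_S:
        return None
    if random.random() > SAMPLING_RATE:
        return None
    queries = CURATED_QUERIES.get(page_title)
    if not queries:
        return None
    # How to rotate or prioritise pending queries is left open here.
    return random.choice(queries)

# A reader has spent 75 seconds on "Lorem ipsum".
query = maybe_pick_question("Lorem ipsum", dwell_seconds=75)
if query is not None:
    print(f'Is this page a relevant result for the search "{query}"? (yes / no)')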
These labels would almost certainly be noisy; we would need to collect the
same judgement many times to get any kind of certainty about the label.
Additionally, we would not really be able to explain the nuances of a
grading scale with many points, so we would probably have to use either a
thumbs up/thumbs down approach, or maybe a happy/sad/indifferent smiley
face.
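
To illustrate what "collecting the same judgement many times" could look
like, here is a rough sketch that collapses repeated thumbs up/down votes
for one (query, page) pair into a conservative score. The 10-vote minimum
and the use of a Wilson lower bound are my assumptions, not a decided method.

import math
from typing import List, Optional

def aggregate_label(judgements: List[bool], min_votes: int = 10) -> Optional[float]:
    """Collapse repeated thumbs-up/down judgements for one (query, page) pair.

    Returns a conservative relevance score in [0, 1] (the Wilson lower bound
    on the thumbs-up proportion), or None while there are too few judgements
    to say anything. min_votes and the 95% confidence level are assumptions.
    """
    n = len(judgements)
    if n < min_votes:
        return None
    p = sum(judgements) / n
    z = 1.96  # ~95% confidence
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denom

# Example: 14 judgements for one pair, 11 of them thumbs up.
print(aggregate_label([True] * 11 + [False] * 3))  # ~0.52, despite a 0.79 raw rate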
Does this seem reasonable? Are there other ways we could go about
collecting the same data? How to design it in a non-intrusive manner that
gets results, but doesn't annoy users? Other thoughts?
For some background:
* We are currently generating labeled data using statistical analysis
(clickmodels) against historical click data. This analysis requires multiple
search sessions with the same query, presented with similar results, in order
to estimate the relevance of those results. A manual review of the results
showed that queries with clicks from at least 10 sessions had reasonable but
not great labels, queries with 35+ sessions looked pretty good, and queries
with hundreds of sessions were labeled really well. (A rough sketch of this
kind of session-count-gated aggregation appears after this list.)
* An analysis of 80 days' worth of search click logs showed that 35 to 40%
of search sessions are for queries that are repeated more than 10 times in
that 80-day period. Around 20% of search sessions are for queries that are
repeated more than 35 times in that 80-day period. (
https://phabricator.wikimedia.org/P5371)
* Our privacy policy prevents us from keeping more than 90 days' worth of
data from which to run these clickmodels. Practically, 80 days is probably a
reasonable cutoff, as we will want to re-use the data multiple times before
needing to delete it and generate a new set of labels.
* We currently collect human relevance judgements with Discernatron (
https://discernatron.wmflabs.org/). This is useful data for manual
evaluation of changes, but the data set is much too small (low hundreds of
queries, with an average of 50 documents per query) to integrate into
machine learning. Judging query/document pairs is quite tedious for the
community, and it doesn't seem like a great use of engineer time for us to
do it ourselves.
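
As referenced in the first bullet above, here is a deliberately simplified
sketch of that kind of aggregation: a plain click-through-rate proxy over
(query, page) pairs with a minimum-session cutoff. Real click models (e.g.
DBN) also correct for position bias, so treat this only as an illustration
of the session-count constraint; the names and the 10-session default are
assumptions.

from collections import defaultdict
from typing import Dict, List, Set, Tuple

# One observed search session: (query, results shown in order, clicked titles).
Session = Tuple[str, List[str], Set[str]]

def estimate_relevance(sessions: List[Session],
                       min_sessions: int = 10) -> Dict[Tuple[str, str], float]:
    """Crude click-through-rate proxy for relevance labels.

    Only (query, page) pairs whose query appears in at least min_sessions
    sessions are kept, mirroring the ~10-session cutoff mentioned above.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    sessions_per_query = defaultdict(int)
    for query, results, clicks in sessions:
        sessions_per_query[query] += 1
        for page in results:
            shown[(query, page)] += 1
            if page in clicks:
                clicked[(query, page)] += 1
    return {
        (query, page): clicked[(query, page)] / count
        for (query, page), count in shown.items()
        if sessions_per_query[query] >= min_sessions
    }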