Hey everybody,
TL;DR: I wanted to let you know about an upcoming experimental Reddit AMA
("ask me anything") chat we have planned. It will focus on artificial
intelligence on Wikipedia and how we're working to counteract vandalism
while also making life better for newcomers.
We plan to hold this chat on June 1st at 21:00 UTC / 14:00 PDT in the
/r/IAmA subreddit[1]. I'd love to answer any questions you have about
these topics, and I'll send a follow-up email to this thread shortly
before the AMA begins.
----
For those who don't know who I am, I create artificial intelligences[2]
that support the volunteers who edit Wikipedia[3]. For over ten years,
I've been fascinated by the ways that crowds of volunteers build massive,
high-quality information resources like Wikipedia.
For more background, I research and then design technologies that make it
easier to spot vandalism in Wikipedia—which helps support the hundreds of
thousands of editors who make productive contributions. I also think a lot
about the dynamics between communities and new users—and ways to make
communities inviting and welcoming to both long-time community members and
newcomers who may not be aware of community norms. For a quick sampling of
my work, check out my most impactful research paper about Wikipedia[3],
some recent coverage of my work in *Wired*[4], the master list of my
projects on my WMF staff user page[5], the documentation for the technology
team I run[9], or the home page of Wikimedia Research[8].
This AMA, which I'm doing with the Foundation's Communications
department, is somewhat of an experiment. The intended audience for this
chat is people who might not currently be a part of our community but have
questions about the way we work—as well as potential research collaborators
who might want to work with our data or tools. Many may be familiar with
Wikipedia but not with the work we do as a community behind the scenes.
I'll be talking about the work I'm doing on the ethics of AI, how we
think about artificial intelligence on Wikipedia, and ways we're working to
counteract vandalism on the world’s largest crowdsourced source of
knowledge—like the ORES extension[6], which you may have seen highlighting
possibly problematic edits on your watchlist and in RecentChanges.
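As a small technical aside: the scores behind that highlighting are also
available from ORES' public web API at https://ores.wikimedia.org. Here is a
minimal sketch of querying it from Python; the revision ID is just a
placeholder, and the response layout noted in the comments is worth
double-checking against the API documentation.

    import requests

    # Ask ORES how likely a single English Wikipedia revision is to be
    # damaging. The request follows the public v3 scores endpoint; the
    # revision ID below is only a placeholder.
    ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"
    rev_id = 123456

    response = requests.get(ORES_URL,
                            params={"models": "damaging", "revids": rev_id})
    response.raise_for_status()
    data = response.json()

    # Assumed response shape (verify against the API documentation):
    # {"enwiki": {"scores": {"123456": {"damaging": {"score": {...}}}}}}
    score = data["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
    print("prediction:", score["prediction"])
    print("P(damaging):", score["probability"]["true"])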
I'd love for you to join this chat and ask questions. If you don't use
Reddit, or would prefer not to, we will also be taking questions on ORES'
MediaWiki talk page[7] and posting answers to both threads.
1. https://www.reddit.com/r/IAmA/
2. https://en.wikipedia.org/wiki/Artificial_intelligence and
   https://www.mediawiki.org/wiki/ORES
3.
http://www-users.cs.umn.edu/~halfak/publications/The_Rise_and_Decline/halfa…
4.
https://www.wired.com/2015/12/wikipedia-is-using-ai-to-expand-the-ranks-of-…
5. https://en.wikipedia.org/wiki/User:Halfak_(WMF)
6. https://www.mediawiki.org/wiki/Extension:ORES
7. https://www.mediawiki.org/wiki/Talk:ORES
8. https://www.mediawiki.org/wiki/Wikimedia_Research
9. https://www.mediawiki.org/wiki/Wikimedia_Scoring_Platform_team
-Aaron
Principal Research Scientist @ WMF
User:EpochFail / User:Halfak (WMF)
Hey,
I'm excited to announce a new feature in Wiki Labels that lets people
check the progress of the campaigns they are labeling.
Wiki Labels [1] is a platform for gathering human-labeled data for use in
ORES [2]; its home page is at https://labels.wmflabs.org. These data feed a
variety of AI models, ranging from vandalism fighting to Wikidata item
quality to anti-harassment.
Until now, it was hard to see how many labels had been made in each
campaign or how much work was left. Now you can get these data by visiting
https://labels.wmflabs.org/stats and navigating to your wiki. For example,
https://labels.wmflabs.org/stats/enwiki/ shows the progress of each
campaign, how many labels are left before it can be considered done, and
the number of unique volunteers who are labeling.
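If you want to pull these numbers programmatically instead of reading the
page, here is a rough sketch in Python. The response shape in the comments
is only illustrative; check the actual output of the stats endpoint before
relying on it.

    import requests

    # Fetch campaign progress for English Wikipedia from the Wiki Labels
    # stats endpoint announced above. The JSON structure assumed below is
    # illustrative and should be checked against the real response.
    STATS_URL = "https://labels.wmflabs.org/stats/enwiki/"

    response = requests.get(STATS_URL, headers={"Accept": "application/json"})
    response.raise_for_status()
    stats = response.json()

    # Assumed shape: {campaign_name: {"labels_done": ..., "labels_left": ...,
    #                                 "labelers": ...}, ...}
    for campaign, counts in stats.items():
        print(campaign, counts)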
Note that we are overhauling the current Wiki Labels paths into something
completely new, because the current paths are a little confusing and jump
around between the GUI and the API. These URLs might therefore change in
the future [3], but we will announce any change properly beforehand and
make sure redirects are left from the old ones.
Any feedback about this feature is very welcome. Feel free to reach out to
us in #wikimedia-ai on irc://irc.freenode.net or on the AI mailing list [4].
[1]: https://meta.wikimedia.org/wiki/Wiki_labels
[2]: https://www.mediawiki.org/wiki/ORES
[3]: https://phabricator.wikimedia.org/T165046
[4]: https://lists.wikimedia.org/mailman/listinfo/ai
Best
--
Amir Sarabadani Tafreshi, on behalf of the Scoring Platform team
Software Engineer (contractor)
-------------------------------------
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
(Society for the Promotion of Free Knowledge). Registered in the register
of associations of the Amtsgericht Berlin-Charlottenburg under number
23855 B. Recognized as charitable by the Finanzamt für Körperschaften I
Berlin, tax number 27/681/51985.
Hi,
I was trying to set up revscoring[0] on Arch Linux. I'm using Python 3.6
in a virtual environment and installed all the dependencies from the
requirements file.
However, when I run nosetests in the revscoring directory, I get the error
"AttributeError: module 'importlib._bootstrap' has no attribute
'SourceFileLoader'"[1].
Could this be because I'm using Python 3.6, or is it due to something else?
[0] - https://github.com/wiki-ai/revscoring
[1] - https://dpaste.de/MErp
-Thanks,
Sumit Asthana,
IIT Patna
On Thu, May 4, 2017 at 11:44 AM, Jan Drewniak <jdrewniak(a)wikimedia.org>
wrote:
> Hi Erik
>
> From my understanding, it looks like you're looking to collect relevance
> data "in reverse". Typically, for this type of data collection, I would
> assume that you'd present a query with some search results, and ask users
> "which results are relevant to this query" (which is what discernatron
> does, at a very high effort level).
>
Indeed, this is looking to go in reverse. The problem with asking people
who perform a query whether the results are any good is that the specific
queries I'm interested in are not performed by very many people. These
queries see, on average, less than one instance per week. By doing it in
reverse we can sample from a (hopefully) much larger distribution. I still
need to do some analysis, though, to see if these long-tail queries also
return long-tail pages, as in ones that only receive a few tens of hits per
day. If the result pages are also rarely viewed, then this scheme will
likely not work. We do have a particularly large sample of queries (~10
million or so) to draw from, though, so we can likely find queries with
popular enough pages to get information about.
> What I think you're proposing instead is that when a user visits an article,
> we present them with a question that asks "would this search query be
> relevant to the article you are looking at".
>
> I can see this working, provided that the query is controlled and the
> question is *not* phrased like it is above.
>
> I think that for this to work, the question should be phrased in a way
> that elicits a simple "top-level" (maybe "yes" or "no") response. For
> example, the question "*is this page about*: 'hydrostone halifax nova
> scotia' " can be responded to with a thumbs up 👍 or thumbs down 👎, but a
> question like "is this article relevant to the following query: ..." seems
> more complicated 🤔 .
>
Indeed, wordsmithing will be important here. I'm not sure 'is this page
about' will be quite the right question, but I'm also not sure what the
right question is. Relevance is a little more nuanced than what the page is
about; some judgement needs to be made about the intent of the query and
whether the page can satisfy that intent.
>
> On Thu, May 4, 2017 at 6:29 PM, Erik Bernhardson <
> ebernhardson(a)wikimedia.org> wrote:
>
>> On Wed, May 3, 2017 at 12:44 PM, Jonathan Morgan <jmorgan(a)wikimedia.org>
>> wrote:
>>
>>> Hi Erik,
>>>
>>> I've been using some similar methods to evaluate Related Article
>>> recommendations
>>> <https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recomme…>
>>> and the source of the trending article card
>>> <https://meta.wikimedia.org/wiki/Research:Comparing_most_read_and_trending_e…>
>>> in the Explore feed on Android. Let me know if you'd like to sit down and
>>> chat about experimental design sometime.
>>>
>>> - J
>>>
>>>
>> This might be useful. I'll see if I can find a time on both our
>> calendars. I should note, though, that this is explicitly not about
>> experimental design. The data is not going to be used for experimental
>> purposes, but rather to feed into a machine learning pipeline that will
>> re-order search results to put the best results at the top of the list. To
>> ensure the long tail is represented in the training data for this model, I
>> would like to have a few tens of thousands of labels for (query, page)
>> combinations each month. The relevance of pages to a query does have some
>> temporal aspect, so we would likely want to only use the last N months
>> worth of data (TBD).
>>
>> On Wed, May 3, 2017 at 12:24 PM, Erik Bernhardson <
>>> ebernhardson(a)wikimedia.org> wrote:
>>>
>>>> At our weekly relevance meeting an interesting idea came up about how
>>>> to collect relevance judgements for the long tail of queries, which make up
>>>> around 60% of search sessions.
>>>>
>>>> We are pondering asking questions on the article pages themselves.
>>>> Roughly we would manually curate some list of queries we want to collect
>>>> relevance judgements for. When a user has spent some threshold of time
>>>> (60s?) on a page we would, for some % of users, check if we have any
>>>> queries we want labeled for this page, and then ask them if the page is a
>>>> relevant result for that query. In this way the amount of work asked of
>>>> individuals is relatively low and hopefully something they can answer
>>>> without too much work. We know that the average page receives a few
>>>> thousand page views per day, so even with a relatively low response rate we
>>>> could probably collect a reasonable number of judgements over some medium
>>>> length time period (weeks?)
>>>>
>>>> These labels would almost certainly be noisy; we would need to collect
>>>> the same judgement many times to get any kind of certainty about the
>>>> label. Additionally, we would not really be able to explain the nuances
>>>> of a grading scale with many points, so we would probably have to use
>>>> either a thumbs up/thumbs down approach, or maybe a
>>>> happy/sad/indifferent smiley face.
>>>>
>>>> Does this seem reasonable? Are there other ways we could go about
>>>> collecting the same data? How to design it in a non-intrusive manner that
>>>> gets results, but doesn't annoy users? Other thoughts?
>>>>
>>>>
>>>> For some background:
>>>>
>>>> * We are currently generating labeled data using statistical analysis
>>>> (clickmodels) against historical click data. This analysis requires there
>>>> to be multiple search sessions with the same query presented with similar
>>>> results to estimate the relevance of those results. A manual review of the
>>>> results showed queries with clicks from at least 10 sessions had reasonable
>>>> but not great labels, queries with 35+ sessions looked pretty good, and
>>>> queries with hundreds of sessions were labeled really well.
>>>>
>>>> * An analysis of 80 days worth of search click logs showed that 35 to
>>>> 40% of search sessions are for queries that are repeated more than 10
>>>> times in that 80-day period. Around 20% of search sessions are for
>>>> queries that are repeated more than 35 times in that 80-day period. (
>>>> https://phabricator.wikimedia.org/P5371)
>>>>
>>>> * Our privacy policy prevents us from keeping more than 90 days worth
>>>> of data from which to run these clickmodels. Practically 80 days is
>>>> probably a reasonable cutoff, as we will want to re-use the data multiple
>>>> times before needing to delete it and generate a new set of labels.
>>>>
>>>> * We currently collect human relevance judgements with Discernatron (
>>>> https://discernatron.wmflabs.org/). This is useful data for manual
>>>> evaluation of changes, but the data set is much too small (low hundreds of
>>>> queries, with an average of 50 documents per query) to integrate into
>>>> machine learning. The process of judging query/document pairs for the
>>>> community is quite tedious, and it doesn't seem like a great use of
>>>> engineer time for us to do this ourselves.
>>>>
>>>
>>>
>>> --
>>> Jonathan T. Morgan
>>> Senior Design Researcher
>>> Wikimedia Foundation
>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>
>>>
>>
>
>
> --
> Jan Drewniak
> UX Engineer, Discovery
> Wikimedia Foundation
>
At our weekly relevance meeting an interesting idea came up about how to
collect relevance judgements for the long tail of queries, which make up
around 60% of search sessions.
We are pondering asking questions on the article pages themselves.
Roughly, we would manually curate a list of queries for which we want to
collect relevance judgements. When a user has spent some threshold of time
(60s?) on a page we would, for some percentage of users, check whether we
have any queries we want labeled for that page, and then ask them if the
page is a relevant result for that query. In this way the amount of work
asked of individuals is relatively low, and hopefully it is something they
can answer without too much effort. We know that the average page receives
a few thousand page views per day, so even with a relatively low response
rate we could probably collect a reasonable number of judgements over some
medium-length time period (weeks?).
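To make the idea concrete, here is a rough sketch of what the
prompt-selection logic could look like. Everything in it is a placeholder:
the names are made up, and the dwell threshold and sampling rate are just
the rough numbers mentioned above.

    import random

    # Sketch of the proposed prompt logic. None of these names correspond
    # to real MediaWiki/WMF APIs; the threshold and sampling rate are the
    # rough numbers from the proposal above.
    DWELL_THRESHOLD_SECONDS = 60   # "some threshold of time (60s?)"
    SAMPLING_RATE = 0.01           # "some % of users"

    # Curated map of page title -> queries we still want judgements for.
    WANTED_QUERIES = {
        "Hydrostone": ["hydrostone halifax nova scotia"],
    }

    def maybe_pick_prompt(page_title, dwell_seconds):
        """Return a query to ask the reader about, or None."""
        if dwell_seconds < DWELL_THRESHOLD_SECONDS:
            return None
        if random.random() > SAMPLING_RATE:
            return None
        queries = WANTED_QUERIES.get(page_title)
        if not queries:
            return None
        # Ask about one of the queries that still needs labels for this page.
        return random.choice(queries)

    # Example: a reader has spent 75 seconds on the "Hydrostone" article.
    query = maybe_pick_prompt("Hydrostone", 75)
    if query is not None:
        print("Is this page about: '%s'?" % query)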
These labels would almost certainly be noisy; we would need to collect the
same judgement many times to get any kind of certainty about the label.
Additionally, we would not really be able to explain the nuances of a
grading scale with many points, so we would probably have to use either a
thumbs up/thumbs down approach, or maybe a happy/sad/indifferent smiley
face.
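One possible way to turn those repeated thumbs up/down judgements into a
usable label would be to require a minimum number of votes per (query,
page) pair and then take a lower confidence bound on the fraction of thumbs
up. This is only an illustration, not a settled design; the thresholds are
made up.

    import math

    # Illustrative aggregation of repeated thumbs up/down judgements for
    # one (query, page) pair. Thresholds are made up for the example.
    MIN_VOTES = 10         # ignore pairs with fewer judgements than this
    MIN_LOWER_BOUND = 0.6  # Wilson lower bound needed to accept a label

    def wilson_lower_bound(ups, total, z=1.96):
        """Lower bound of the 95% Wilson score interval for P(thumbs up)."""
        if total == 0:
            return 0.0
        p = ups / total
        denom = 1 + z * z / total
        centre = p + z * z / (2 * total)
        margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    def aggregate_label(ups, downs):
        """Return 'relevant', 'not relevant', or None if still uncertain."""
        total = ups + downs
        if total < MIN_VOTES:
            return None
        if wilson_lower_bound(ups, total) >= MIN_LOWER_BOUND:
            return "relevant"
        if wilson_lower_bound(downs, total) >= MIN_LOWER_BOUND:
            return "not relevant"
        return None

    print(aggregate_label(ups=18, downs=2))  # 'relevant'
    print(aggregate_label(ups=3, downs=1))   # None: too few judgements yet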
Does this seem reasonable? Are there other ways we could go about
collecting the same data? How to design it in a non-intrusive manner that
gets results, but doesn't annoy users? Other thoughts?
For some background:
* We are currently generating labeled data using statistical analysis
(clickmodels) against historical click data. This analysis requires there
to be multiple search sessions with the same query presented with similar
results to estimate the relevance of those results. A manual review of the
results showed queries with clicks from at least 10 sessions had reasonable
but not great labels, queries with 35+ sessions looked pretty good, and
queries with hundreds of sessions were labeled really well. (A toy sketch
of this general idea is included after these background points.)
* An analysis of 80 days worth of search click logs showed that 35 to 40%
of search sessions are for queries that are repeated more than 10 times in
that 80-day period. Around 20% of search sessions are for queries that are
repeated more than 35 times in that 80-day period. (
https://phabricator.wikimedia.org/P5371)
* Our privacy policy prevents us from keeping more than 90 days worth of
data from which to run these clickmodels. Practically 80 days is probably a
reasonable cutoff, as we will want to re-use the data multiple times before
needing to delete it and generate a new set of labels.
* We currently collect human relevance judgements with Discernatron (
https://discernatron.wmflabs.org/). This is useful data for manual
evaluation of changes, but the data set is much too small (low hundreds of
queries, with an average of 50 documents per query) to integrate into
machine learning. The process of judging query/document pairs for the
community is quite tedious, and it doesn't seem like a great use of
engineer time for us to do this ourselves.
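As mentioned above, here is a toy sketch of the click-model idea: estimate
a result's relevance for a query from its click-through rate across
sessions, keeping only queries seen in enough sessions. Real click models
correct for position bias and other effects, so this is only an
illustration, and the data shape is made up.

    from collections import defaultdict

    # Toy click-model-style aggregation. Each session records the query,
    # the pages shown, and the pages clicked. Data shape is made up.
    sessions = [
        {"query": "hydrostone", "shown": ["Hydrostone", "Halifax"],
         "clicked": ["Hydrostone"]},
        {"query": "hydrostone", "shown": ["Hydrostone", "Halifax"],
         "clicked": ["Hydrostone"]},
        {"query": "hydrostone", "shown": ["Hydrostone", "Halifax"],
         "clicked": []},
    ]

    # Per the review above, labels from fewer than ~10 sessions were not
    # reliable, so only keep queries with at least that many sessions.
    MIN_SESSIONS = 10

    impressions = defaultdict(int)     # (query, page) -> times shown
    clicks = defaultdict(int)          # (query, page) -> times clicked
    query_sessions = defaultdict(int)  # query -> number of sessions

    for s in sessions:
        query_sessions[s["query"]] += 1
        for page in s["shown"]:
            impressions[(s["query"], page)] += 1
        for page in s["clicked"]:
            clicks[(s["query"], page)] += 1

    labels = {
        (q, page): clicks[(q, page)] / shown
        for (q, page), shown in impressions.items()
        if query_sessions[q] >= MIN_SESSIONS
    }
    # Empty here: the toy query has only 3 sessions, below the cutoff.
    print(labels)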