Hey folks,
I hosted the AI Wishlist session at the Developer Summit[1]. At that
session, we brainstormed a set of AIs that we think would be interesting to
implement. Generally I asked people to do their best to follow template
that would help us remember why the AI was important, what it would help
with, and what resources might help get it implement.
Well, I've taken all of the notes and filed a large set of phab tasks under
a new "artificial-intelligence" tag. Please review all of the fun, new
proposals that are listed there and make sure you subscribe to those that
you're interested in.
See https://phabricator.wikimedia.org/tag/artificial-intelligence/
1. https://phabricator.wikimedia.org/T147710
-Aaron
Perhaps of interest to AI people.
Pine
---------- Forwarded message ----------
From: David Cuenca Tudela <dacuetu(a)gmail.com>
Date: Thu, Jan 19, 2017 at 3:25 AM
Subject: [Wikidata] IBM's Watson using Wikidata
To: "Discussion list for the Wikidata project." <
wikidata-l(a)lists.wikimedia.org>
Hi,
I didn't see it around here, but in a paper from Nov 2016 a group of
researchers from IBM used Wikidata to select which entities to feed Watson
for automatic QA generation:
"Training IBM Watson using Automatically Generated Question - Answer Pairs"
https://arxiv.org/abs/1611.03932
Cheers,
Micru
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
Hello,
Wikilabels [1] is the system for labeling edits for ORES. Until now, users
had to visit a page on Wikipedia, for example WP:Labels [2], install a
gadget, and then label edits for ORES. With the new version (0.4.0)
deployed today, you can go directly to the Wikilabels home page, for
example https://labels.wmflabs.org/ui/enwiki, and label edits from there.
If you installed the gadget, you can remove it now. We have also added
minification and bundling to improve performance.
Labeling edits helps ORES work more accurately. If the ORES review tool is
not enabled on your wiki, you can provide this data to us using Wikilabels
so we can enable it for your wiki as well!
[1] https://meta.wikimedia.org/wiki/Wiki_labels
[2] https://en.wikipedia.org/wiki/Wikipedia:Labels
Best
--
Amir Sarabadani Tafreshi
Software Engineer (contractor)
-------------------------------------
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
\o/ for a few days! Thanks for the quick response Jacob!
On Tue, Jan 3, 2017 at 3:05 PM, Jacob Rogers <jrogers(a)wikimedia.org> wrote:
> Hi all,
>
> I'm not sure that I'll be able to get to this in a timely manner, so I'm
> ccing Aeryn as well. I should be able to review in a few days if it's not
> done by then. Apologies for the delay.
>
> Best,
> Jacob
>
> On Tue, Jan 3, 2017 at 9:42 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
> wrote:
>
>> * fixed Jacob's email address
>>
>> On Tue, Jan 3, 2017 at 11:42 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com
>> > wrote:
>>
>>> Hey Erik,
>>>
>>> Sorry to be so late to respond, but I wanted to take the time to read
>>> your message and the holidays left me a bit scattered.
>>>
>>> Reviewing what you intend to release, I don't see any clear problems
>>> with the dataset. I think the primary concern is really finding PII in
>>> the search string itself. Generally this is due to people accidentally
>>> pasting PII into the query box and then having other queries linked to them
>>> through some sort of identifier.
>>>
>>> Is there any way (even very confounded) that you could positively
>>> identify any content from a query string? It seems like the
>>> LongestMatchSize_mean_query_x_heading could be a strong indicator of
>>> content from the query string. However, we don't have anything in article
>>> headers that couldn't be innocently placed in a query string. E.g., say
>>> we had a header with "Homer Simpson, 742 Evergreen Terrace,
>>> 636-555-1024"[1] and someone then searched for that and got a perfect
>>> match; that wouldn't strongly imply that the person searching was Homer
>>> Simpson, as they could very likely be searching for a legitimate
>>> (non-suppressed or deleted) header in Wikipedia.
>>>
>>> Either way, I've worked with Jacob Rogers (+CC) to review dataset
>>> publications like this in the past (see [2] for an example). I think he'll
>>> have some more specific concerns and advice.
>>>
>>> 1. https://en.wikipedia.org/wiki/The_Simpsons_house#Address_and_phone_number
>>> -- just so you know I'm not outing someone who might live at this
>>> address and phone number :)
>>> 2. https://figshare.com/articles/Deleted_Wikipedia_articles_spam_vandalism_attack_/4245035
>>>
>>> -Aaron
>>>
>>> On Thu, Dec 22, 2016 at 6:00 PM, Erik Bernhardson <
>>> ebernhardson(a)wikimedia.org> wrote:
>>>
>>>> tl/dr: Can feature vectors about relevance of (query, page_id) pairs be
>>>> released to the public if the final dataset only represents queries with
>>>> numeric IDs?
>>>>
>>>> Over the past 2 months I've been spending free time investigating
>>>> machine learning for ranking. One of the earlier things I tried, to get
>>>> some semblance of proof it had the ability to improve our search
>>>> results, was to port a set of features for text ranking from an open
>>>> source Kaggle competitor to a dataset I could create from our own data.
>>>> For relevance targets I took queries that had clicks from at least 50
>>>> unique sessions over a 60-day period and ran them through a click model
>>>> (DBN). Perhaps not as useful as human judgements, but I'm working with
>>>> what I have available.
>>>>
>>>> This actually showed some promise, and I've been moving further along.
>>>> An idea was suggested to me, though, about releasing the feature vectors
>>>> from my initial investigation in an open format that might be useful for
>>>> others. Each feature vector is for a (query, hit_page_id) pair that was
>>>> displayed to at least 50 users.
>>>>
>>>> I don't have my original data, but I have all the code and just ran
>>>> through it with 100 normalized queries to get a count, and there are
>>>> 4852 features. Lots of them are probably useless, but choosing which
>>>> ones is probably half the battle. These are ~230MB in pickle format,
>>>> which stores the floats in binary. This can then be compressed to ~20MB
>>>> with gzip, so the data size isn't particularly insane. In a released
>>>> dataset I would probably use 10k normalized queries, meaning about 100x
>>>> this size. We could plausibly release CSVs instead of pickled numpy
>>>> arrays. That would probably increase the data size further, but since we
>>>> are only talking ~2GB after compression, it could go either way.
>>>>
>>>> The list of feature names is in https://phabricator.wikimedia.org/P4677
>>>> Here are a few example feature names and their meanings, which hopefully
>>>> is enough to understand the rest of the feature names:
>>>>
>>>> DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
>>>> - Dice distance of bigrams in the normalized (stemmed) query string
>>>> versus outgoing links. Outgoing links are an array field, so the Dice
>>>> distance is calculated per item and this feature takes the max value.
>>>>
>>>> DigitCount_query_1D.pkl
>>>> - Number of digits in the raw user query.
>>>>
>>>> ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
>>>> - Cosine similarity of the top 50 terms, as reported by the
>>>> Elasticsearch termvectors API, of the normalized query vs the
>>>> category.plain field of the matching document. More terms would perhaps
>>>> have been nice, but doing this all offline in Python made that a bit of
>>>> a time+space tradeoff.
>>>>
>>>> Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
>>>> - Log base 10 of the score from the Elasticsearch termvectors API on the
>>>> raw user query applied to the opening_text field analysis chain.
>>>>
>>>> LongestMatchSize_mean_query_x_heading_1D.pkl
>>>> - Mean longest match, in number of characters, of the query vs the list
>>>> of headings for the page.
>>>>
>>>>
>>>> The main question here, I think, is: is this still PII? The exact
>>>> queries would be normalized into IDs and not released. We could leave
>>>> the page_id in or out of the dataset. With it left in, people using the
>>>> dataset could plausibly come up with their own query-independent
>>>> features to add. With a large enough feature vector for (query_id,
>>>> page_id) the query could theoretically be reverse engineered, but from a
>>>> more practical standpoint I'm not sure that's really a valid concern.
>>>>
>>>> Thoughts? Concerns? Questions?
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AI mailing list
>>>> AI(a)lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/ai
>>>>
>>>>
>>>
>>
>
>
> --
>
> Jacob Rogers
> Legal Counsel
> Wikimedia Foundation
>
> NOTICE: This message might have confidential or legally privileged
> information in it. If you have received this message by accident, please
> delete it and let us know about the mistake. As an attorney for the
> Wikimedia Foundation, for legal/ethical reasons I cannot give legal advice
> to, or serve as a lawyer for, community members, volunteers, or staff
> members in their personal capacity. For more on what this means, please see
> our legal disclaimer
> <https://meta.wikimedia.org/wiki/Wikimedia_Legal_Disclaimer>.
>
>
tl/dr: Can feature vectors about relevance of (query, page_id) pairs be
released to the public if the final dataset only represents queries with
numeric IDs?
Over the past 2 months I've been spending free time investigating machine
learning for ranking. One of the earlier things I tried, to get some
semblance of proof it had the ability to improve our search results, was to
port a set of features for text ranking from an open source Kaggle
competitor to a dataset I could create from our own data. For relevance
targets I took queries that had clicks from at least 50 unique sessions
over a 60-day period and ran them through a click model (DBN). Perhaps not
as useful as human judgements, but I'm working with what I have available.
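Purely as an illustration of that filter (this is not the pipeline code;
the input file and column names are made up), the 50-session cut looks
roughly like:

    # Sketch only: keep queries clicked in >= 50 unique sessions over the
    # 60-day window. File name and columns are hypothetical, not our schema.
    import pandas as pd

    clicks = pd.read_csv("click_log_60d.csv")  # query, session_id, hit_page_id, clicked
    sessions_per_query = clicks.groupby("query")["session_id"].nunique()
    popular = sessions_per_query[sessions_per_query >= 50].index
    training_clicks = clicks[clicks["query"].isin(popular)]
    # training_clicks is what would then be fed to the click model (DBN) to
    # estimate per-(query, hit_page_id) relevance targets.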
This actually showed some promise, and I've been moving further along. An
idea was suggested to me, though, about releasing the feature vectors from
my initial investigation in an open format that might be useful for others.
Each feature vector is for a (query, hit_page_id) pair that was displayed
to at least 50 users.
I don't have my original data, but I have all the code and just ran through
it with 100 normalized queries to get a count, and there are 4852 features.
Lots of them are probably useless, but choosing which ones is probably half
the battle. These are ~230MB in pickle format, which stores the floats in
binary. This can then be compressed to ~20MB with gzip, so the data size
isn't particularly insane. In a released dataset I would probably use 10k
normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled numpy arrays. That would probably increase
the data size further, but since we are only talking ~2GB after
compression, it could go either way.
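For a concrete sense of the two storage options above, a minimal sketch
(the array here is random placeholder data, not the real feature matrix):

    # Placeholder-data sketch of the serialization options mentioned above.
    import gzip
    import pickle
    import numpy as np

    features = np.random.rand(100, 4852)  # stand-in for the real matrix

    # Pickled numpy array, then gzipped (roughly the ~230MB -> ~20MB step).
    with gzip.open("features.pkl.gz", "wb") as f:
        pickle.dump(features, f)

    # CSV alternative: larger on disk before compression, but readable
    # without Python/pickle.
    np.savetxt("features.csv", features, delimiter=",")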
The list of feature names is in https://phabricator.wikimedia.org/P4677
Here are a few example feature names and their meanings, which hopefully is
enough to understand the rest of the feature names (a rough sketch of how a
couple of them might be computed follows the list):
DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the Dice distance is
calculated per item and this feature takes the max value.
DigitCount_query_1D.pkl
- Number of digits in the raw user query.
ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
- Cosine similarity of the top 50 terms, as reported by the Elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in Python made that a bit of a time+space tradeoff.
Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
- Log base 10 of the score from the Elasticsearch termvectors API on the
raw user query applied to the opening_text field analysis chain.
LongestMatchSize_mean_query_x_heading_1D.pkl
- Mean longest match, in number of characters, of the query vs the list of
headings for the page.
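And the promised sketch of how the Dice distance and longest-match features
above might be computed. This is my own illustration, not the actual
feature code; the real pipeline may tokenize differently (e.g. word bigrams
rather than the character bigrams used here):

    # Illustrative only: two of the feature families described above.

    def bigrams(text):
        # character bigrams; an assumption, the pipeline may use word bigrams
        return {text[i:i + 2] for i in range(len(text) - 1)}

    def dice_distance(a, b):
        ab, bb = bigrams(a), bigrams(b)
        if not ab and not bb:
            return 0.0
        return 1.0 - 2.0 * len(ab & bb) / (len(ab) + len(bb))

    def max_dice_distance(norm_query, outgoing_links):
        # DiceDistance_Bigram_max_...: per-link distance, keep the max
        if not outgoing_links:
            return 0.0
        return max(dice_distance(norm_query, link) for link in outgoing_links)

    def longest_common_substring(a, b):
        # classic DP over character positions
        best = 0
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    best = max(best, cur[j])
            prev = cur
        return best

    def mean_longest_match(query, headings):
        # LongestMatchSize_mean_query_x_heading: mean over the page headings
        if not headings:
            return 0.0
        return sum(longest_common_substring(query, h) for h in headings) / len(headings)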
The main question here, I think, is: is this still PII? The exact queries
would be normalized into IDs and not released. We could leave the page_id
in or out of the dataset. With it left in, people using the dataset could
plausibly come up with their own query-independent features to add. With a
large enough feature vector for (query_id, page_id) the query could
theoretically be reverse engineered, but from a more practical standpoint
I'm not sure that's really a valid concern.
Thoughts? Concerns? Questions?