Hi everyone,
As we keep coming up with more ways to try to rescue unsuccessful
queries ("Did you mean" suggestions, language detection, quote stripping,
wrong keyboard detection, etc.), we need a plan for how they interact
with each other.
I've put together a straw-man proposal for how to deal with all of this,
so we can have a more coordinated conversation:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/So_Many_Search_Optio…
Comments and questions here or on the talk page are welcome!
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
tl;dr: Can feature vectors about the relevance of (query, page_id) pairs be
released to the public if the final dataset represents queries only by
numeric IDs?
Over the past two months I've been spending free time investigating
machine learning for ranking. One of the earlier things I tried, to get
some semblance of proof that it could improve our search results, was to
port a set of features for text ranking from an open source Kaggle
competitor to a dataset I could create from our own data. For relevance
targets I took queries that had clicks from at least 50 unique sessions
over a 60-day period and ran them through a click model (DBN). Perhaps not
as useful as human judgements, but I'm working with what I have available.
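For concreteness, here's a minimal sketch of that session-threshold
filtering (the DBN click model itself is out of scope here, and the
click-log row layout is hypothetical):

```python
from collections import defaultdict

def filter_queries(click_rows, min_sessions=50):
    """Keep queries whose clicks come from at least min_sessions unique sessions.

    click_rows is assumed to be an iterable of
    (query, session_id, page_id, clicked) tuples from a click log.
    """
    sessions_per_query = defaultdict(set)
    for query, session_id, page_id, clicked in click_rows:
        if clicked:
            sessions_per_query[query].add(session_id)
    return {q for q, sessions in sessions_per_query.items()
            if len(sessions) >= min_sessions}
```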
This actually showed some promise, and I've been moving further along. An
idea was suggested to me, though, about releasing the feature vectors from
my initial investigation in an open format that might be useful to others.
Each feature vector is for a (query, hit_page_id) pair that was displayed
to at least 50 users.
I don't have my original data, but I have all the code and just ran through
it with 100 normalized queries to get a count: there are 4852 features.
Lots of them are probably useless, but choosing which ones is probably half
the battle. These are ~230MB in pickle format, which stores the floats in
binary. That can then be compressed to ~20MB with gzip, so the data size
isn't particularly insane. In a released dataset I would probably use 10k
normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled NumPy arrays. That would probably increase
the data size further, but since we are only talking ~2GB after
compression, it could go either way.
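As a rough sanity check on those numbers, a sketch of the pickle-plus-gzip
math (real feature matrices contain many repeated and zero values, so they
compress far better than the random data here; the sizes below are
illustrative only):

```python
import gzip
import pickle

import numpy as np

# ~100 queries x ~50 hits each x 4852 float64 features is on the order
# of a couple hundred MB raw, in line with the ~230MB figure above.
features = np.random.rand(100 * 50, 4852)
raw = pickle.dumps(features, protocol=pickle.HIGHEST_PROTOCOL)
compressed = gzip.compress(raw)
print(f"pickle: {len(raw) / 1e6:.0f} MB, gzipped: {len(compressed) / 1e6:.0f} MB")
```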
The full list of feature names is in https://phabricator.wikimedia.org/P4677
A few example feature names and their meanings follow, which are hopefully
enough to understand the rest; sketched reconstructions of a few of them
appear after the examples:
DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the dice distance is
calculated per item and this feature takes the max value.
DigitCount_query_1D.pkl
- Number of digits in the raw user query
ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
- Cosine similarity of the top 50 terms, as reported by the Elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in Python made that a bit of a time/space tradeoff.
Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
- Log base 10 of the score from the Elasticsearch termvectors API for the
raw user query applied to the opening_text field's analysis chain.
LongestMatchSize_mean_query_x_heading_1D.pkl
- Mean longest match, in number of characters, of the query vs the list of
headings for the page.
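Here are minimal sketches of a few of these features to make the naming
scheme concrete. These are my reconstructions from the names and
descriptions above, not the actual pipeline code:

```python
def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))

def dice_distance(a, b):
    """1 minus the Dice coefficient of two sets (0 = identical, 1 = disjoint)."""
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

def dice_bigram_max(norm_query, outgoing_links):
    """DiceDistance_Bigram_max_...: per-item distance over an array field, max."""
    q = bigrams(norm_query.split())
    return max((dice_distance(q, bigrams(link.split())) for link in outgoing_links),
               default=0.0)

def digit_count(query):
    """DigitCount_query: number of digit characters in the raw user query."""
    return sum(ch.isdigit() for ch in query)

def longest_match_size(query, heading):
    """Length in characters of the longest common substring of query and heading."""
    best = 0
    # Dynamic programming over substring end positions.
    prev = [0] * (len(heading) + 1)
    for qc in query:
        cur = [0]
        for j, hc in enumerate(heading, 1):
            cur.append(prev[j - 1] + 1 if qc == hc else 0)
        best = max(best, max(cur))
        prev = cur
    return best

def longest_match_mean(query, headings):
    """LongestMatchSize_mean_query_x_heading: mean over the page's headings."""
    if not headings:
        return 0.0
    return sum(longest_match_size(query, h) for h in headings) / len(headings)
```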
The main question here, I think, is whether this is still PII. The exact
queries would be normalized into IDs and not released. We could leave the
page_id in or out of the dataset. With it left in, people using the dataset
could plausibly come up with their own query-independent features to add.
With a large enough feature vector per (query_id, page_id) pair the query
could theoretically be reverse-engineered, but from a more practical side
I'm not sure that's really a valid concern.
Thoughts? Concerns? Questions?
Season's Greetings,
A few updates from the Discovery department this week.
This is the last weekly update from the Discovery department for the year.
We'll be skipping next week due to the holidays and will see you all in
January with a fresh 2017 edition.
== Highlights ==
* Secondary result functionality will be available over the search API in
early January! Currently, this allows consumers of the search API to
benefit from automated language detection ([[TextCat]]) and forwarding of
search queries. [0] [1]
== Discussions ==
=== Search ===
* Secondary result functionality will be available over the search API in
early January! Currently, this allows consumers of the search API to
benefit from automated language detection ([[TextCat]]) and forwarding of
search queries. [0] [1]
* Corrected an error on Hebrew wikis where searches without diacritics
sometimes failed to find appropriate results that contained diacritics. [3]
Feedback and suggestions on this weekly update are welcome.
[0] https://www.mediawiki.org/wiki/TextCat
[1] https://phabricator.wikimedia.org/T142795
[3] https://phabricator.wikimedia.org/T3836
----
The full update, and archive of all past updates, can be found on
MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" [4] or
"Volunteer needed" [5] in Phabricator.
[4] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[5] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
Hello all,
Yesterday, an announcement ("Now live: Shared structured data") incorrectly
stated that Structured Data had been launched on Commons.
The feature, which was inaccurately named "Structured Data", enables users
to add tabular data to the Data namespace on Commons via the regular page
editor and to display and/or visualize that data from other wikis.
This work is unrelated to an ongoing project called Structured Data on
Commons. For more on the newly launched feature, see the Tabular Data [1]
and Map Data [2] help pages on MediaWiki.org.
For information on the Structured Data on Commons project, designed to
associate structured data with media files on Commons to improve their
discoverability, please visit the project page on Commons. [3]
Thank you,
-Katie
[1] - https://www.mediawiki.org/wiki/Help:Tabular_Data
[2] - https://www.mediawiki.org/wiki/Help:Map_Data
[3] - https://commons.wikimedia.org/wiki/Commons:Structured_data
Micru, thanks, I think Datasets sounds like a good name too!
On Thu, Dec 22, 2016 at 2:44 PM David Cuenca Tudela <dacuetu(a)gmail.com>
wrote:
> On Thu, Dec 22, 2016 at 8:38 PM, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org> wrote:
>
> > On Thu, Dec 22, 2016 at 2:30 PM, Yuri Astrakhan <yastrakhan(a)wikimedia.org> wrote:
> >
> > > Gift season! We have launched structured data on Commons, available
> > > from all wikis.
> > >
> >
> > I was momentarily excited, then I read a little farther and discovered
> > this isn't about https://commons.wikimedia.org/wiki/Commons:Structured_data.
> >
>
> Same here, I think it needs a better name...
>
> What about calling it datasets or structured datasets?
>
> Cheers,
> Micru
Yes, there seems to have been a bit of a naming collision. Tabular data and
map data have been jointly known as structured data, but there is also the
Structured Data project, which IMO should be called the Structured Metadata
project :) Naming suggestions are welcome!
P.S. Brad, I'm sorry tabular and map data did not excite you :(
On Thu, Dec 22, 2016 at 2:38 PM Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org>
wrote:
> On Thu, Dec 22, 2016 at 2:30 PM, Yuri Astrakhan <yastrakhan(a)wikimedia.org>
> wrote:
>
> > Gift season! We have launched structured data on Commons, available from
> > all wikis.
> >
>
> I was momentarily excited, then I read a little farther and discovered this
> isn't about https://commons.wikimedia.org/wiki/Commons:Structured_data.
>
>
> --
> Brad Jorsch (Anomie)
> Senior Software Engineer
> Wikimedia Foundation
Gift season! We have launched structured data on Commons, available from
all wikis.
TL;DR: One data store, used everywhere. Upload tabular data to Commons,
with localization, and use it to create wiki tables and lists, or use it
directly in graphs. Works for GeoJSON maps too. Data must be licensed as
CC0. Try this per-state GDP map demo, and select multiple years. More demos
at the bottom.
Data can now be stored as *.tab and *.map pages in the Data namespace on
Commons. That data may contain localization, so a table cell can exist in
multiple languages. And that data is accessible from any wiki, by Lua
scripts, Graphs, and Maps.
Lua lets you generate wiki tables from the data by filtering, converting,
mixing, and formatting the raw data. Lua also lets you generate lists, or
any other wiki markup.
Graphs can use both .tab and .map pages directly to visualize the data and
let users interact with it. The GDP demo above uses a map from Commons and
colors each segment based on values from a data table.
Kartographer (<maplink>/<mapframe>) can use the .map data as an extra layer
on top of the base map. This way we can show, for example, an endangered
species' habitat.
== Demos ==
* Raw data example
<https://commons.wikimedia.org/wiki/Data:Weather/New_York_City.tab>
* Interactive Weather data
<https://en.wikipedia.org/wiki/Template:Graph:Weather_monthly_history>
* Same data in Weather template
<https://en.wikipedia.org/wiki/User:Yurik/WeatherDemo>
* Interactive GDP map
<https://en.wikipedia.org/wiki/Template:Graph:US_Map_state_highlight>
* Endangered Jemez Mountains salamander - habitat
<https://en.wikipedia.org/wiki/Jemez_Mountains_salamander#/maplink/0>
* Population history
<https://en.wikipedia.org/wiki/Template:Graph:Population_history>
* Line chart <https://en.wikipedia.org/wiki/Template:Graph:Lines>
== Getting started ==
* Try creating a page at data:Sandbox/<user>.tab on Commons. Don't forget
the .tab extension, or it won't work. (A sketch of the page format appears
below.)
* Try using some data with the Line chart graph template
A thorough guide is needed, help is welcome!
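For the curious, here's a minimal sketch of what a .tab page body contains,
built in Python for clarity. The key names follow my reading of the Tabular
help page linked below; double-check there before relying on them:

```python
import json

# Hypothetical minimal .tab page body: the license must be CC0, and each
# schema field has a name, a type, and an optional localized title.
page = {
    "license": "CC0-1.0",
    "description": {"en": "Sandbox example table"},
    "schema": {
        "fields": [
            {"name": "year", "type": "number", "title": {"en": "Year"}},
            {"name": "value", "type": "number", "title": {"en": "Value"}},
        ]
    },
    "data": [
        [2015, 1.25],
        [2016, 1.5],
    ],
}
print(json.dumps(page, indent=2))
```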
== Documentation links ==
* Tabular help <https://www.mediawiki.org/wiki/Help:Tabular_Data>
* Map help <https://www.mediawiki.org/wiki/Help:Map_Data>
If you find a bug, create a Phabricator ticket with the #tabular-data tag,
or comment on the documentation talk pages.
== FAQ ==
* Relation to Wikidata: Wikidata is about "facts" (small pieces of
information). Structured data is about "blobs": large amounts of data, like
historical weather records or the outline of the state of New York.
== TODOs ==
* Add a nice "table editor" - editing JSON by hand is cruel. T134618
* "What links here" should track data usage across wikis. Will allow
quicker auto-refresh of the pages too. T153966
* Support data redirects. T153598
* Mega epic: Support external data feeds.