Browser extension for unsourced Wikipedia articles

Guilherme Gonçalves

29 Sep 2017

10:34 a.m.

Hi everyone,

I've been hacking on a new tool and I thought I'd share what (little) I have so far to get some comments and learn of related approaches from the community.

The basic idea would be to have a browser extension that tells the user if the current page they're viewing looks like a good reference for a Wikipedia article, for some whitelisted domains like news websites. This would hopefully prompt casual/opportunistic edits, especially for articles that may be overlooked normally.

As a proof of concept for a backend, I built a simple bag-of-words model of the TextExtracts of enwiki's Category:All_articles_needing_additional_references. I then set up a tool [1] to receive HTML input and retrieve the 5 most similar articles to that input. You can try it out in your browser [2], or on the command line [3]. The results could definitely be better, but having tried it on a few different articles over the past few days, I think there's some potential there.
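For illustration only (this is not the code from the linked repository), a bag-of-words lookup of this kind could be sketched in Python roughly as follows. The function names, the simple regex tokenizer, and the use of cosine similarity for scoring are all assumptions, and stripping HTML from the input is left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, articles, k=5):
    """Return the titles of the k articles most similar to the query text.

    `articles` maps an article title to its plain-text extract.
    """
    query_vec = bag_of_words(query)
    scored = [
        (cosine(query_vec, bag_of_words(text)), title)
        for title, text in articles.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _score, title in scored[:k]]
```

Calling `most_similar(page_text, extracts)` with a dictionary of article extracts would return up to five titles, best match first. A real backend would precompute the article vectors rather than rebuild them on every request.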

I'd be interested in hearing your thoughts on this. Specifically:

* If such a backend/API were available, would you be interested in using it for other tools? If so, what functionality would you expect from it?
* I'm thinking of just throwing away the above proof of concept and using ElasticSearch, though I don't know a lot about it. Is anyone aware of a similar dataset that already exists there, by any chance? Or any reasons not to go that way?
* Any other comments on the overall idea or implementation?

Thanks!

1- https://github.com/eggpi/similarity
2- https://tools.wmflabs.org/similarity/
3- Example: curl https://www.nytimes.com/2017/09/22/opinion/sunday/portugal-drug-decriminaliz... | curl -X POST http://tools.wmflabs.org/similarity/search --form "text=<-"

-- Guilherme P. Gonçalves

Mukunda Modell

1 Oct

9:36 p.m.

I think this is a really cool idea. I don't know of other similar tools but it does sound like something that should be a good fit for elasticsearch.

Morten Wang

6 Oct

2:17 p.m.

In my experience, the problem you're trying to solve boils down to finding articles similar to a given search query that are in the given category. Trying to outsmart Lucene on that kind of a problem is going to be challenging given that it's for example used as a benchmark in research[1], so switching over to ElasticSearch is arguably the way to go.

There's a specific feature in Lucene called "MoreLikeThis", and it's also exposed in WP's search API to find articles similar to other articles. The documentation[2] of that feature provides a fairly good explanation of how it works, making it a possible starting point on how to filter a given document to improve the search results.
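As a rough illustration of the search API route, the sketch below builds a request URL that uses the `morelike:` search keyword through the standard `action=query&list=search` endpoint. The helper name, the result limit, and the example title are assumptions. Note that this finds articles similar to an existing article, not to arbitrary input text.

```python
from urllib.parse import urlencode

# English Wikipedia's API endpoint.
API = "https://en.wikipedia.org/w/api.php"


def morelike_url(title, limit=5):
    """Build a search API URL asking for articles similar to `title`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "morelike:" + title,  # search keyword backed by MoreLikeThis
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)
```

Fetching the resulting URL should return a JSON document whose `query.search` list holds the matching titles.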

If I remember correctly there are a couple of research papers that study how to recommend sources for articles (or articles for a given source), but I'd have to go look for them to find them. You might want to consider searching the Research Newsletter archives and Google Scholar as that might give you a couple of existing approaches.

Footnotes:
1: A paper I reviewed for the Research Newsletter used it: https://meta.wikimedia.org/wiki/Research:Newsletter/2016/May#Evaluating_link...
2: https://lucene.apache.org/core/3_0_3/api/contrib-queries/org/apache/lucene/s...

Cheers, Morten

Guilherme Gonçalves

8 Oct

8:34 a.m.

This is great, thank you all for your input!

It does seem like ElasticSearch (and likely MoreLikeThis) are the way to go, and I'm very happy to hear that this could be integrated with other use cases relatively easily. I'll definitely keep those in mind and I hope to come back to this in a few weeks.

Thanks again!

-- Guilherme P. Gonçalves

Nitin Gadia

10:06 a.m.

Why am I being sent emails? Where did this come from?

Fæ

10:22 a.m.

See https://wikitech.wikimedia.org/wiki/User:BryanDavis/Rebranding_Cloud_Service...

The WMF marketing/rebranding exercise included renaming the email list you were subscribed to. It may be less confusing if the footer of the email list mentioned the previous name; probably for at least 12 months.

Fae

-- faewik@gmail.com https://commons.wikimedia.org/wiki/User:Fae

Nitin Gadia

16 Oct

7:43 p.m.

How do I stop being sent these emails? Would rather not mute or block, would rather end it at its source.

Nitin Gadia

7:44 p.m.

... oh right, the Cloud email list. Done :)

Guilherme Gonçalves

24 Dec

7:40 a.m.

Hi everyone,

Apologies for resurrecting this old thread, but I finally got around to making this (mostly) work so I thought I'd come back with an update. You can install the extension for either Chrome or Firefox below:

https://chrome.google.com/webstore/detail/wikipedia-needs-reference/michclig... https://addons.mozilla.org/en-GB/firefox/addon/wikipedia-needs-references/

The full code for the extension, the server, and the script that populates ElasticSearch is on GitHub (http://github.com/eggpi/similarity/), and the backend is hosted on Toolforge.
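For readers unfamiliar with the ElasticSearch side, a query of the kind discussed earlier in the thread might look roughly like the sketch below. This is not taken from the repository: the helper name, the `text` field, and the tuning values are assumptions, and only the `more_like_this` query type itself is part of ElasticSearch's query DSL.

```python
import json


def more_like_this_query(page_text, field="text", size=5):
    """Build a search body that ranks indexed articles by similarity to page_text."""
    return {
        "size": size,  # number of similar articles to return
        "query": {
            "more_like_this": {
                "fields": [field],     # hypothetical field holding article text
                "like": page_text,     # free text to compare against
                "min_term_freq": 1,    # assumed tuning value
                "max_query_terms": 25, # assumed tuning value
            }
        },
    }


# The body would be serialized to JSON and sent to an index's _search endpoint.
body = json.dumps(more_like_this_query("portugal drug policy"))
```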

It's definitely experimental and lacking in various ways (there's not even a proper icon yet!), but I've used it for a few weeks and managed to make some edits through it. If this sounds interesting, please give it a try and feel free to file issues.

Thanks!

-- Guilherme P. Gonçalves
