List of users who have access to certain references


Huji Lee

27 Dec 2018

12:41 p.m.

This is an idea that came up on fawiki, and there is some merit to it. I just want to figure out the best approach to implement it and would love your input.

TL;DR: We want to sweep through the recent edits in articles, look at each diff, see whether it adds a "{{cite book}}" template, and if so, set it aside for later processing by a separate script.

I wonder if there are already scripts in pywikibot that would help get this started. If not, I wonder what the best strategy is for implementing this with the MediaWiki API.

Thanks, Huji

------------

Long version:

The idea is to identify users who probably have access to certain offline sources, so that if another user needs something to be checked in that source and they don't have access to it, they know who to ask. For instance, if I have access to a physical copy of Encyclopedia Britannica (let's say it is a book and is not available digitally), and you want me to check if it has an entry for Sir Isaac Newton, it would be great if instead of or in addition to asking on the village pump (which I might not follow), you would ask me directly.

The assumption is that if the same user keeps adding the same "{{cite book}}" template in many articles (e.g. if I add the {{cite book | title = Encyclopedia Britannica | ... }} in several edits across several articles), then that user most likely has access to that source. And if these edits are relatively recent and the user is still active, then chances are the user can still access that source if another user asks them to.

So if we find all such edits, we can probably aggregate them into a table showing that "Huji" added a {{cite book}} for a book titled "Encyclopedia Britannica" 17 times, and so on. Sorting it by the frequency column, we might get a good list of user-source pairs.
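The aggregation step described above could be sketched roughly as follows. This is only an illustration: the sample rows and the function name `rank_user_source_pairs` are made up, and the real input would come from whatever scan collects the {{cite book}} additions.

```python
from collections import Counter

# Hypothetical sample data: one (username, book title) pair per edit that
# added a {{cite book}} template. Real rows would come from an earlier scan.
additions = [
    ("Huji", "Encyclopedia Britannica"),
    ("Huji", "Encyclopedia Britannica"),
    ("Example user", "Some other book"),
    ("Huji", "Encyclopedia Britannica"),
]


def rank_user_source_pairs(pairs):
    """Count each (user, title) pair and return them most frequent first."""
    return Counter(pairs).most_common()


# Print a tab-separated table sorted by the frequency column.
for (user, title), count in rank_user_source_pairs(additions):
    print(f"{user}\t{title}\t{count}")
```

A plain counter is enough for a one-off report; a database table keyed on user and title would serve the same purpose if the counts need to be updated incrementally.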


John

27 Dec 2018

12:51 p.m.

Using a combination of pywikibot and mwparserfromhell, this shouldn’t be too much of an issue for a single wiki. It might be hard for such a bot to keep up on, say, enwiki, but slower wikis shouldn’t be a problem. Pair that with a database backend, and you should be able to do it without too many issues.
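One way to decide whether an edit added a {{cite book}} is to compare the templates found in the old and new revision text. The sketch below makes several assumptions: it uses only the standard library and a simple regex instead of mwparserfromhell (a real parser would handle nested templates better), it takes the two revision texts as plain strings without showing how they are fetched, and the names `CITE_BOOK` and `added_cite_books` are hypothetical.

```python
import re
from collections import Counter

# Matches a {{cite book | ...}} call that contains no nested templates.
# This is a simplification; nested templates inside the citation are missed.
CITE_BOOK = re.compile(r"\{\{\s*cite book\s*\|[^{}]*\}\}", re.IGNORECASE)


def _normalize(template_text):
    """Collapse whitespace so purely cosmetic changes do not count as new."""
    return " ".join(template_text.split())


def added_cite_books(old_text, new_text):
    """Return the cite book templates present in new_text but not old_text.

    A template whose parameters were modified also shows up here, because
    its text differs from every template in the old revision.
    """
    old = Counter(_normalize(t) for t in CITE_BOOK.findall(old_text))
    new = Counter(_normalize(t) for t in CITE_BOOK.findall(new_text))
    # Counter subtraction keeps only positive counts, i.e. net additions.
    return list((new - old).elements())


old_revision = "Text. {{cite book | title = A }}"
new_revision = "Text. {{cite book | title = A }} More. {{cite book | title = B }}"
print(added_cite_books(old_revision, new_revision))
```

Comparing the multisets of templates avoids parsing a textual diff, at the cost of treating an edited citation as a new one.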


Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

T Paris

1:23 p.m.

Could I ask that you guys make this an “opt-in” feature? That would speed up the bot, and it matters because once you start identifying which books people own, you start to develop a profile on people.

v/r, TP


Huji Lee

1:37 p.m.

We will never know who "owns" which book. We only know that they have used it as a source a number of times. It could very well be that they can simply borrow that book from a library (as is my case with a lot of the books and journals I have used as sources on Wikipedia).

The profiling issue is beyond this discussion, and I will make sure to mention it on fawiki. But one can already "profile" users using their edits: it is quite easy for people to look at my edits on fawiki and realize that I read and write about Persian music. Knowing that I also use some of the books on this topic as my sources wouldn't add much to the picture. Of note, my real-world life and identity are unrelated to Persian music or music in general, so profiles are not always that revealing anyway.

@John: I had not heard of mwparserfromhell and it is really cool! But how exactly does it come into play? The issue is less about being able to parse wikicode (what we really need is pretty much a regex search for the Persian equivalent of the {{cite book}} template, and a second regex pattern that looks for the "name" parameter inside matches for the first one). Frankly, I am less worried about the steps *after* we have found a "cite book" instance, and more about the steps leading to it (running many, many diffs).

Perhaps I am not fully understanding your thoughts, so please elaborate.

Thank you both!
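The two-pattern regex search mentioned in this message might look something like the sketch below. It is written under assumptions: the English template name "cite book" stands in for the Persian equivalent, the parameter name is passed in as an argument (defaulting to "title", as in the example from the first message), and `TEMPLATE` and `extract_param` are hypothetical names.

```python
import re

# First pattern: a {{cite book | ...}} call without nested templates.
# The captured group is the parameter list, starting at the first pipe.
TEMPLATE = re.compile(r"\{\{\s*cite book\s*(\|[^{}]*)\}\}", re.IGNORECASE)


def extract_param(wikitext, param="title"):
    """Return the value of `param` for each cite book template in wikitext."""
    # Second pattern: "| param = value", where the value runs to the next
    # pipe. Values containing a pipe (e.g. piped wikilinks) get truncated.
    param_re = re.compile(r"\|\s*" + re.escape(param) + r"\s*=\s*([^|]*)")
    values = []
    for match in TEMPLATE.finditer(wikitext):
        found = param_re.search(match.group(1))
        if found:
            values.append(found.group(1).strip())
    return values


sample = "Intro. {{cite book | title = Encyclopedia Britannica | page = 1 }}"
print(extract_param(sample))
```

Templates that lack the requested parameter are skipped rather than reported with an empty value, so the result only contains usable titles.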


John

1:57 p.m.

What’s fawiki’s edit rate? Processing a diff shouldn’t take more than 1-2 seconds, especially if you optimize the logic. I’m just spitballing ideas at this point, but the logic should be easy.


Huji Lee

3:40 p.m.

Got it. I am also looking for rough ideas at this point. The edit rate of fawiki is not that high: some 5-6K per day (https://quarry.wmflabs.org/query/32328), and I am guessing 3-4K if restricted to the article namespace. But note that we only care about edits in the last 6-12 months, by users who have been active in the last 1-2 months. Older edits may point to a resource the user no longer has access to, and inactive users are not particularly helpful for our purposes either.
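As a rough sketch of the two constraints above, the snippet below filters edits by age and by user activity, and estimates how long a one-off scan might take from the numbers quoted in this thread (3-4K article edits per day, 1-2 seconds per diff). Both figures are guesses from the discussion, and the function names, the default windows of 365 and 60 days, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def filter_relevant(edits, last_active, now,
                    max_edit_age_days=365, max_inactive_days=60):
    """Keep (user, timestamp) edits that are recent and by active users."""
    edit_cutoff = now - timedelta(days=max_edit_age_days)
    active_cutoff = now - timedelta(days=max_inactive_days)
    return [
        (user, ts) for user, ts in edits
        if ts >= edit_cutoff
        and last_active.get(user, datetime.min) >= active_cutoff
    ]


def estimate_scan_days(edits_per_day, days_covered, seconds_per_diff):
    """Estimate wall-clock days needed to process every diff sequentially."""
    return edits_per_day * days_covered * seconds_per_diff / 86400


# Hypothetical sample data.
now = datetime(2018, 12, 27)
edits = [
    ("A", datetime(2018, 10, 1)),   # recent edit, active user: kept
    ("A", datetime(2017, 1, 1)),    # too old: dropped
    ("B", datetime(2018, 11, 1)),   # recent edit, inactive user: dropped
]
last_active = {"A": datetime(2018, 12, 20), "B": datetime(2018, 6, 1)}

print(filter_relevant(edits, last_active, now))
print(round(estimate_scan_days(3000, 365, 1), 1))   # optimistic end
print(round(estimate_scan_days(4000, 365, 2), 1))   # pessimistic end
```

Under these guesses, a year of article edits works out to somewhere between roughly two weeks and a month of sequential processing, which suggests filtering to active users before fetching any diffs.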

