Thanks to all for a great discussion on this. I appreciated all the suggestions, including people discussing their own practices. Fantastic. I just never would have thought of things like checking the bug tracker to learn about practices :)
Just wanted to update you on discussions I had with the author of the watchers tool on toolserver, MZMcBride. I include the full thread below (with his permission), but the short story is that even Wikipedia doesn't collect data on when people add a page to or remove a page from their watchlists; they only maintain who has which pages on their lists at any one time (data which they don't release). So the only way to build up a true empirical picture of this would be lots of rapid snapshots of the watchlist table. It's an interesting example of the challenges in doing research on systems designed to run a busy website rather than to collect research data!
OTOH if you are just interested in watcher counts below the minimum of 30 watchers, then it seems there is scope for that data to be released to researchers.
Most likely I will run a sensitivity analysis looking at the robustness of our results to different assumptions.
Thanks again, James
On Jul 5, 2010, at 20:40, MZMcBride wrote:
Jameshowison wrote:
I'm working on a study for which I'd like to know more about editors' watchlisting practices.
Apologies for the delayed response. The study certainly sounds interesting.
Thanks.
Currently my plan is to assume that anyone who has edited an article in the past 6 months has it on their watchlist (or is otherwise informed about new edits). Obviously a very coarse assumption.
Very coarse, for a variety of reasons. The biggest reason, I imagine, is that unregistered users make up a decent-sized percentage of contributions, and unregistered users cannot have watchlists. A lot of people also use automated or semi-automated tools to do "batch editing" (vandalism reversion, code fixes, etc.) in which they'll edit a page, but never return to it or have any interest in watching it.
Yes, that's helpful. Will probably run the analysis without the unregistered users. Ah, the tradeoffs of research.
What would be fantastic would be to know the empirical distribution (e.g. editors put the page they edit on their watchlist at some % chance, perhaps changing based on their number/tenure of edits to that page, especially if they had never edited the page at that point).
The watchlist table is made up of a few columns.[1] The two columns that are exposed to Toolserver users are wl_namespace and wl_title. In the public Wikimedia dumps[2] the watchlist table is excluded entirely.
If you are able to share more specific information, I'm happy to do the required computations, only ever releasing aggregate information. What were you able to share with the other researcher?
The other researcher wanted aggregate information about all page titles in all namespaces and their corresponding categories. The public "watcher" tool[3] has an imposed restriction to only show the number of watchers if the number is 30 or greater. The data I released to her had this restriction lifted.
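To make the shape of that aggregate concrete, here is a minimal sketch in Python. It assumes watchlist rows arrive as hypothetical (user, namespace, title) tuples; the function names and the use of None for suppressed counts are illustrative and are not how the actual watcher tool is written.

```python
from collections import Counter

MIN_WATCHERS = 30  # the threshold mentioned for the public watcher tool


def watcher_counts(watchlist_rows):
    """Count distinct watchers per (namespace, title).

    watchlist_rows: iterable of (user, namespace, title) tuples
    (a hypothetical layout chosen for this sketch).
    """
    unique_rows = set(watchlist_rows)  # guard against duplicate rows
    return Counter((ns, title) for _user, ns, title in unique_rows)


def thresholded_view(counts, minimum=MIN_WATCHERS):
    """Hide counts below `minimum` by replacing them with None."""
    return {page: (n if n >= minimum else None) for page, n in counts.items()}
```

Calling thresholded_view(counts, minimum=0) corresponds to the restriction being lifted: every page keeps its count, but no individual watcher is ever exposed.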
Right, got it. So this would give a count of watchers for each page, at a single time. Never the actual watchers.
FWIW I've been using the pages associated with WikiProject Oregon as my prototype. A list of dated watch/unwatch events for those pages that I could link to revisions (edits) would be ideal; not sure if that's the format of the data or if it can be changed to such a format (seems to require the full history of the watchlist table?).
When a page was added to a watchlist is not stored, and whose watchlist it was added to is not available.
Yeah, that makes sense. So the only way to get what I was interested in would be to frequently track the list of editors and their watched pages (which is collected but not available). Comparing successive lists would let you approximate when they added and removed pages. I do understand the reasons why this data is not available, so I'm not angling for that.
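As a rough sketch of that snapshot idea, assuming each snapshot is a set of hypothetical (user, page) pairs taken at a known time: differencing consecutive snapshots gives watch/unwatch events, each dated only to the interval between two snapshots. The names below are illustrative, and no such snapshots are actually available.

```python
from datetime import datetime


def diff_snapshots(earlier, later):
    """Return (added, removed) pairs between two sets of (user, page)."""
    return later - earlier, earlier - later


def infer_events(snapshots):
    """Approximate watch/unwatch events from time-ordered snapshots.

    snapshots: list of (timestamp, set of (user, page)), sorted by time.
    Returns tuples (interval_start, interval_end, user, page, kind),
    where kind is "watch" or "unwatch".
    """
    events = []
    for (t0, s0), (t1, s1) in zip(snapshots, snapshots[1:]):
        added, removed = diff_snapshots(s0, s1)
        events += [(t0, t1, u, p, "watch") for u, p in sorted(added)]
        events += [(t0, t1, u, p, "unwatch") for u, p in sorted(removed)]
    return events
```

A page that is watched and then unwatched between two snapshots leaves no trace, which is why the snapshots would need to be frequent.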
Perhaps relatively frequent snapshots of watcher numbers (not thresholded at 30) would provide some data, although not the type that could be correlated with editing actions.
Most likely I will simply vary the length of my coarse assumption and see if the results are sensitive to that. It's not ideal, but at least I'm sure I haven't missed something obvious.
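A minimal sketch of that sensitivity check, assuming edits are available as hypothetical (user, page, timestamp) tuples with user set to None for unregistered editors. The window lengths and function names are illustrative choices, not a settled design.

```python
from datetime import datetime, timedelta


def assumed_watchers(edits, page, at, window_days):
    """Registered users who edited `page` in the `window_days` before `at`.

    edits: iterable of (user, page, timestamp); user is None for
    unregistered editors, who cannot have watchlists.
    """
    cutoff = at - timedelta(days=window_days)
    return {
        user
        for user, edited_page, ts in edits
        if edited_page == page and user is not None and cutoff <= ts <= at
    }


def sensitivity(edits, page, at, windows=(30, 90, 180, 365)):
    """Assumed watcher count for each candidate window length (in days)."""
    edits = list(edits)  # allow several passes over the data
    return {w: len(assumed_watchers(edits, page, at, w)) for w in windows}
```

If downstream results change sharply between, say, the 90-day and 180-day windows, that suggests they depend heavily on the watchlisting assumption.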
ps. the wl_notificationtimestamp is intriguing. I checked the manual; it seems to hold only the last notification, not a history of notifications, right?
This is explained at the manual page for the watchlist table.[1] Essentially this column is only used on wikis where e-mail notification (ENotif) is enabled. Sites like the English Wikipedia do not have this feature enabled for performance reasons.
Thanks a million for your help.
On Jul 6, 2010, at 17:58, MZMcBride wrote:
James Howison wrote:
There was some interest in this question on the wiki-research-l@lists.wikimedia.org list; would you mind if I quoted parts of your email in my summary to that list? Happy to run a draft by you to ensure I'm not misrepresenting or otherwise stepping on toes.
Please feel free to pass along anything you found helpful. No need to run it by me first.
MZMcBride