Thanks to all for a great discussion on this. I appreciated all the suggestions,
including people discussing their own practices.
Fantastic. I just never would have thought of things like checking the bug tracker to
learn about practices :)
Just wanted to update you on discussions I had with MZMcBride, the author of the watchers
tool on toolserver. I include the full thread below (with his permission), but the
short story is that even Wikipedia doesn't record when people add a page to, or remove a
page from, their watchlists. They only maintain who has what pages on their lists
at any one time (data which they don't release). So the only way to build up a true
empirical picture of this would be to take lots of rapid snapshots of the watchlist table.
It's an interesting example of the challenges in doing research on systems designed
to run a busy website as opposed to collect research data!
OTOH if you are just interested in watcher counts below the minimum of 30 watchers, then
it seems there is scope for that data to be released to researchers.
Most likely I will run a sensitivity analysis looking at the robustness of our results to
different assumptions.
Thanks again,
James
On Jul 5, 2010, at 20:40, MZMcBride wrote:
Jameshowison wrote:
I'm working on a study for which I'd like to know more about editors'
watchlisting practices.
Apologies for the delayed response. The study certainly sounds interesting.
Thanks.
Currently my plan is to assume that anyone who has edited an article in the
past 6 months has it on their watchlist (or is otherwise informed about new
edits). Obviously a very coarse assumption.
Very coarse, for a variety of reasons. The biggest reason, I imagine, is
that unregistered users make up a decent-sized percentage of contributions,
and unregistered users cannot have watchlists. A lot of people also use
automated or semi-automated tools to do "batch editing" (vandalism
reversion, code fixes, etc.) in which they'll edit a page, but never return
to it or have any interest in watching it.
Yes, that's helpful. Will probably run the analysis without the unregistered users.
Ah, the tradeoffs of research.
What would be fantastic would be to know the empirical distribution (e.g.
editors put the page they edit on their watchlist at some % chance, perhaps
changing based on their number/tenure of edits to that page, especially if they
had never edited the page at that point).
The watchlist table is made up of a few columns.[1] The two columns that are
exposed to Toolserver users are wl_namespace and wl_title. In the public
Wikimedia dumps[2] the watchlist table is excluded entirely.
If you are able to share more specific information, I'm happy to do the
required computations, only ever releasing aggregate information. What were
you able to share with the other researcher?
The other researcher wanted aggregate information about all page titles in
all namespaces and their corresponding categories. The public "watcher"
tool[3] has an imposed restriction to only show the number of watchers if
the number is 30 or greater. The data I released to her had this restriction
lifted.
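
To make the difference between the thresholded and unthresholded counts concrete, here is a minimal Python sketch. It assumes (my assumption, not something stated in the thread) that each row of (wl_namespace, wl_title) corresponds to one user watching that page, so counting rows gives the watcher count. The function names and sample data are hypothetical.

```python
from collections import Counter

# The minimum mentioned above for the public watcher tool.
PUBLIC_THRESHOLD = 30


def watcher_counts(rows):
    """Count watchers per page.

    rows: iterable of (wl_namespace, wl_title) tuples, assumed to be
    one row per user watching that page.
    """
    return Counter((namespace, title) for namespace, title in rows)


def public_view(counts, threshold=PUBLIC_THRESHOLD):
    """Keep only pages whose watcher count is at or above the threshold."""
    return {page: n for page, n in counts.items() if n >= threshold}


# Hypothetical sample data: 31 watchers of one page, 2 of another.
rows = [(0, "Oregon")] * 31 + [(0, "Portland")] * 2
counts = watcher_counts(rows)
visible = public_view(counts)
```

With the restriction lifted, `counts` holds both pages; `visible` holds only the page with 30 or more watchers.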
Right, got it. So this would give a count of watchers for each page, at a single time.
Never the actual watchers.
FWIW I've been using the pages associated with WikiProject Oregon as my
prototype. A list of dated watch/unwatch events for those pages that I could
link to revisions (edits) would be ideal; not sure if that's the format of the
data or if it can be changed to such a format (seems to require the full
history of the watchlist table?).
When a page is added to a watchlist, or whose watchlist it was added to, is
not stored or not available, respectively.
Yeah, that makes sense. So the only way to get what I was interested in would be to
track the list of editors and their watched pages (which is collected but not available)
frequently; comparing successive lists would let you approximate when they added and
removed pages. I do understand the reasons why this data is not available, so I'm not
angling for that.
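
For what it's worth, the snapshot comparison could be sketched roughly as below. This is only an illustration under stated assumptions: it supposes each snapshot is available as a set of (user, page) pairs, which is exactly the data that is not released, and the function names are hypothetical. Each inferred event can only be placed between two snapshot times, and a page added and removed between snapshots would be missed.

```python
def diff_snapshots(earlier, later):
    """Return (added, removed) sets of (user, page) pairs."""
    return later - earlier, earlier - later


def watch_events(snapshots):
    """Infer watch/unwatch events from successive snapshots.

    snapshots: list of (timestamp, set of (user, page)) sorted by timestamp.
    Returns a list of (kind, user, page, t_prev, t_next); the event happened
    at some unknown time between t_prev and t_next.
    """
    events = []
    for (t0, s0), (t1, s1) in zip(snapshots, snapshots[1:]):
        added, removed = diff_snapshots(s0, s1)
        events += [("watch", u, p, t0, t1) for u, p in sorted(added)]
        events += [("unwatch", u, p, t0, t1) for u, p in sorted(removed)]
    return events


# Hypothetical snapshots taken at times 1, 2 and 3.
snapshots = [
    (1, {("A", "P1")}),
    (2, {("A", "P1"), ("B", "P1")}),
    (3, {("B", "P1")}),
]
events = watch_events(snapshots)
```

The more frequent the snapshots, the tighter the interval around each inferred event.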
Perhaps relatively frequent snapshots of watcher numbers (not thresholded at 30) would
provide some data, although not the type that could be correlated with editing actions.
Most likely I will simply vary the length of my coarse assumption and see if the results
are sensitive to that. It's not ideal, but at least I'm sure I haven't missed
something obvious.
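
A rough sketch of what varying the window might look like, in Python. The data layout (tuples of user, page, timestamp for registered users only), the function names, and the particular window lengths are all assumptions for illustration, not the actual analysis.

```python
from datetime import datetime, timedelta


def assumed_watchers(edits, page, as_of, window_days):
    """Users assumed to watch `page` at time `as_of`.

    edits: iterable of (user, page, timestamp) for registered users.
    A user counts as a watcher if they edited the page within the last
    `window_days` days (the coarse assumption discussed above).
    """
    cutoff = as_of - timedelta(days=window_days)
    return {u for u, p, ts in edits if p == page and cutoff <= ts <= as_of}


def sensitivity(edits, page, as_of, windows=(90, 180, 365)):
    """Number of assumed watchers for each candidate window length (days)."""
    return {w: len(assumed_watchers(edits, page, as_of, w)) for w in windows}


# Hypothetical edit history.
edits = [
    ("A", "P", datetime(2010, 6, 1)),
    ("B", "P", datetime(2010, 2, 1)),
    ("C", "P", datetime(2009, 8, 1)),
    ("D", "Q", datetime(2010, 6, 1)),
]
result = sensitivity(edits, "P", as_of=datetime(2010, 7, 1))
```

If downstream results change little as the window moves between these lengths, the coarse assumption is less of a worry.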
PS: the wl_notificationtimestamp column is intriguing. I checked the manual, and it seems to
only hold the last notification, not a history of notifications, right?
This is explained at the manual page for the watchlist table.[1] Essentially
this column is only used on wikis where e-mail notification (ENotif) is
enabled. Sites like the English Wikipedia do not have this feature enabled
for performance reasons.
Thanks a million for your help.
James Howison wrote:
There was some interest in this question on the
wiki-research-l(a)lists.wikimedia.org list; would you mind if I quoted parts of
your email in my summary to that list? Happy to run a draft by you to ensure
I'm not misrepresenting or otherwise stepping on toes.
Please feel free to pass along anything you found helpful. No need to run it
by me first.
MZMcBride