[cc'ed from thread on wikien-l to wikitech-l]
On wikien-l, Neil Harris wrote:
charles matthews wrote:
But I think we know all this. To return to Toynbee, it is more a question of how to get Wikipedians to 'feel challenged', on the specifics. Right now, with the site running slow, the main practical challenge seems to be hardware/developers/cash. _I_ mostly feel challenged by the sheer breadth of approach needed.
At the risk of being tedious, article rating article rating article rating article rating article rating
Brion, what would it take to get article rating switched on? Is there any such feature you would allow in, or is it basically off the agenda and I should stop asking?
- d.
David Gerard wrote:
Brion, what would it take to get article rating switched on? Is there any such feature you would allow in, or is it basically off the agenda and I should stop asking?
Well, it needs to be shown to work correctly on pages with thousands of revisions without bogging the server or otherwise exploding in interesting ways.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Well, it needs to be shown to work correctly on pages with thousands of revisions without bogging the server or otherwise exploding in interesting ways.
-- brion vibber (brion @ pobox.com)
Brion,
Am I being naive here, or would a super-dumb implementation with a single table with the columns shown below be enough to work in the short term?
Page_ID | Revision_ID | User_ID | Rating_ID | Rating value | Timestamp
If a user rates five parameters, five entries go into the table. Even if users were to rate articles at the same rate that they are currently editing them (about once a second), and each rating had five rating dimensions, the table would grow at a rate of about half a million entries per day. If each entry took (say) twenty bytes, that would be a growth of about 10 megabytes per day, or about four gigabytes a year.
Since the table would grow by simply appending records to its end, and would not otherwise change, it would not add much load to the database when being written, as adding five records at once in a single InnoDB transaction would result in only a single disk write.
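A minimal sketch of the single-table idea and the growth arithmetic above. SQLite stands in for the MySQL/InnoDB setup mentioned in the message purely so the example is self-contained; the lower-case column names, the integer column types, and the helper names `record_ratings` and `table_growth` are assumptions for illustration, not part of any existing implementation.

```python
import sqlite3
import time

# Hypothetical schema mirroring the six columns listed above.
SCHEMA = """
CREATE TABLE ratings (
    page_id      INTEGER NOT NULL,
    revision_id  INTEGER NOT NULL,
    user_id      INTEGER NOT NULL,
    rating_id    INTEGER NOT NULL,  -- which dimension is being rated
    rating_value INTEGER NOT NULL,
    timestamp    INTEGER NOT NULL   -- Unix time
)
"""

SECONDS_PER_DAY = 24 * 60 * 60


def record_ratings(conn, page_id, revision_id, user_id, values, now=None):
    """Append one row per rated dimension, all inside one transaction."""
    now = int(time.time()) if now is None else now
    rows = [(page_id, revision_id, user_id, rating_id, value, now)
            for rating_id, value in values.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?, ?, ?)", rows)
    return len(rows)


def table_growth(ratings_per_second=1.0, dimensions=5, bytes_per_row=20):
    """Return (rows_per_day, bytes_per_day, bytes_per_year)."""
    rows_per_day = ratings_per_second * SECONDS_PER_DAY * dimensions
    bytes_per_day = rows_per_day * bytes_per_row
    return rows_per_day, bytes_per_day, bytes_per_day * 365


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # One user rating five dimensions of one revision gives five rows.
    record_ratings(conn, page_id=1, revision_id=10, user_id=7,
                   values={1: 4, 2: 5, 3: 3, 4: 2, 5: 5})
    rows, per_day, per_year = table_growth()
    print(f"{rows:,.0f} rows/day, {per_day / 1e6:.1f} MB/day, "
          f"{per_year / 1e9:.1f} GB/year")
```

With the figures from the message (one rating per second, five dimensions, twenty bytes per row) this gives 432,000 rows and about 8.6 MB per day, or about 3.2 GB per year, a little under the rounded estimates above.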
To throttle back the load from rating, it might be reasonable to restrict rating only to logged-on users, and, if that's not enough,
* to throttle the rate at which they could rate articles?
* to restrict rating only to times when the server load was low?
* to restrict rating only to users with a certain number of edits and/or time since account was registered?
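The throttling options above could be combined into a single eligibility check, roughly as sketched here. The `Rater` fields, the `may_rate` function, and every threshold (one rating per minute, a load ceiling of 0.8, fifty edits, thirty days) are hypothetical placeholders chosen for illustration; the thread does not specify any of them.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 60 * 60


@dataclass
class Rater:
    """Hypothetical view of the user data the check needs."""
    logged_in: bool
    edit_count: int
    registered_at: float                    # Unix time
    last_rating_at: Optional[float] = None  # None if the user never rated


def may_rate(user, now, server_load, *,
             min_interval=60.0,         # seconds between ratings (assumed)
             max_load=0.8,              # load ceiling (assumed)
             min_edits=50,              # edit-count floor (assumed)
             min_account_age=30 * DAY):
    """Return True only if every throttling condition is satisfied."""
    if not user.logged_in:
        return False
    if user.last_rating_at is not None and now - user.last_rating_at < min_interval:
        return False
    if server_load > max_load:
        return False
    if user.edit_count < min_edits:
        return False
    if now - user.registered_at < min_account_age:
        return False
    return True
```

Each condition is independent, so any of them could be dropped or loosened without touching the others.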
Now, this could clearly be made much, much more efficient in a number of obvious ways, but it would be simple enough to implement in the short term to get things going.
Also in the short term, the only output method needed would be the ability to dump an XML or CSV file showing all the rating records for a given article in a given time period, and this could be restricted to admins for the time being if the random access required would be a significant load on the database. This would be enough to allow users to start experimenting with ratings analysis schemes.
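A rough sketch of the CSV variant of such a dump, again using SQLite as a self-contained stand-in. The table layout follows the column list earlier in this message; the names `dump_ratings_csv` and `demo_connection`, the half-open time interval, and the sample rows are all assumptions made for the example.

```python
import csv
import io
import sqlite3

COLUMNS = ["page_id", "revision_id", "user_id",
           "rating_id", "rating_value", "timestamp"]


def dump_ratings_csv(conn, page_id, start, end):
    """Return CSV text of rating rows for page_id with start <= timestamp < end."""
    cursor = conn.execute(
        "SELECT page_id, revision_id, user_id, rating_id, rating_value, timestamp "
        "FROM ratings WHERE page_id = ? AND timestamp >= ? AND timestamp < ? "
        "ORDER BY timestamp, user_id, rating_id",
        (page_id, start, end),
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)   # header row
    writer.writerows(cursor)   # one CSV line per rating record
    return out.getvalue()


def demo_connection():
    """Build an in-memory table with a few made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (page_id, revision_id, user_id, "
                 "rating_id, rating_value, timestamp)")
    with conn:
        conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?, ?, ?)", [
            (1, 10, 7, 1, 4, 100),
            (1, 10, 7, 2, 5, 100),
            (1, 11, 8, 1, 3, 250),
            (2, 5, 7, 1, 2, 120),
        ])
    return conn


if __name__ == "__main__":
    print(dump_ratings_csv(demo_connection(), page_id=1, start=0, end=200))
```

Restricting the query to one page and one time window keeps each dump small, which matches the concern above about random-access load.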
To prevent the unlimited growth of the table, it would be quite reasonable to archive rating records more than (say) a year old into a compressed XML dump.
Finally, in any case, since ratings would only be an experiment during this phase, the whole ratings system could be turned off at any time if it presented a serious problem.
-- Neil
Neil Harris wrote:
Am I being naive here, or would a super-dumb implementation with a single table with the columns shown below be enough to work in the short term?
<snip>
Have you looked at the existing implementation?
-- brion vibber (brion @ pobox.com)
Neil Harris wrote:
Am I being naive here, or would a super-dumb implementation with a single table with the columns shown below be enough to work in the short term?
Page_ID | Revision_ID | User_ID | Rating_ID | Rating value | Timestamp
This is what I did; no timestamp, but a varchar for comments. Topics to rate and their range (e.g., 1-5) are encoded here as well for user #0. That's about as dumb as it gets ;-)
Magnus
Magnus Manske wrote:
This is what I did; no timestamp, but a varchar for comments. Topics to rate and their range (e.g., 1-5) are encoded here as well for user #0. That's about as dumb as it gets ;-)
I still prefer a 0-10 range of ratings. I think a decimal normalization would be easier to work with in any subsequent analysis of results.
Ec
Ray Saintonge wrote:
I still prefer a 0-10 range of ratings. I think a decimal normalization would be easier to work with in any subsequent analysis of results.
One can set the range for each topic individually.
BTW, with values 0-10, you'd have eleven values...
Magnus
Magnus Manske wrote:
One can set the range for each topic individually.
BTW, with values 0-10, you'd have eleven values...
Magnus
11 values, 10 histogram bins. It'd be dreadfully easy to make a page like Special:Pages_Rated_0_to_1 with that kind of an approach, and to make a table of links to each of these ten pages.
--Chris
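One way to read "11 values, 10 histogram bins" is sketched below: averaged ratings on a 0-10 scale fall into ten unit-wide bins, with an average of exactly 10 placed in the top bin. Only the page-name pattern Special:Pages_Rated_0_to_1 comes from the message; the function names and the treatment of the upper edge are assumptions.

```python
def rating_bin(average, top=10):
    """Map an average in [0, top] to a unit-wide bin index 0 .. top-1."""
    if not 0 <= average <= top:
        raise ValueError(f"average {average!r} is outside 0..{top}")
    # An average of exactly `top` goes into the last bin rather than a new one.
    return min(int(average), top - 1)


def special_page_name(bin_index):
    """Name of the listing page for one bin, following the pattern above."""
    return f"Special:Pages_Rated_{bin_index}_to_{bin_index + 1}"


def pages_by_bin(averages, top=10):
    """Group page titles by bin; `averages` maps title -> average rating."""
    bins = {i: [] for i in range(top)}
    for title, average in averages.items():
        bins[rating_bin(average, top)].append(title)
    return bins


if __name__ == "__main__":
    sample = {"Page A": 9.4, "Page B": 10, "Page C": 0.5}  # made-up averages
    for index, titles in pages_by_bin(sample).items():
        if titles:
            print(special_page_name(index), sorted(titles))
```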
Magnus Manske wrote:
One can set the range for each topic individually.
BTW, with values 0-10, you'd have eleven values...
Yes. That's a problem??
Probabilities are based on a continuum between 0.0 and 1.0, but I think we want to limit people to strictly integral input with an apparent middle value of 5. The net averaged results will be probabilities times ten.
I think there are algorithms that can be applied to simple votes that will give us a single result. (Since I'm not a developer/programmer, I try to avoid being a thorn in the side of that lot through persistent POV pushing. :-) ) The net rating of an article can be a simple weighted average of individuals' ratings over some number of edits. A completely unrated article would have a rating of 0.0 over 0 edits.
The weighting of the average would depend on the age of the edit. Ratings since the last edit would receive full weight, those before the last edit would have a 0.9 weight, those before the second last edit would have a 0.8 weight, and so on. The net rating would be recalculated whenever an edit is made or a new rating added. Safeguards can be added to ensure that only a user's most recent rating is considered.
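The weighting scheme described above might look roughly like this. The input format (user, value, number of edits made since the rating), the function name `net_rating`, and the decision to let weights bottom out at zero after ten edits are assumptions; the message only gives the 1.0, 0.9, 0.8 progression "and so on".

```python
def net_rating(ratings, step=0.1):
    """Weighted average of each user's most recent rating.

    `ratings` is an iterable of (user_id, value, edits_since) tuples in
    chronological order, where `edits_since` counts the edits made to the
    article after that rating was cast (0 means since the last edit).
    """
    latest = {}
    for user_id, value, edits_since in ratings:
        # Later entries overwrite earlier ones, so only the most recent
        # rating per user survives.
        latest[user_id] = (value, edits_since)

    total = 0.0
    weight_sum = 0.0
    for value, edits_since in latest.values():
        # 1.0 for the current revision, 0.9 one edit back, 0.8 two back, ...
        # Assumed here: weights never go below zero.
        weight = max(0.0, 1.0 - step * edits_since)
        total += weight * value
        weight_sum += weight

    # An unrated article (or one with only zero-weight ratings) scores 0.0.
    return total / weight_sum if weight_sum else 0.0
```

For example, a rating of 8 cast since the last edit and a rating of 6 cast one edit earlier combine to (8 × 1.0 + 6 × 0.9) / 1.9, roughly 7.05.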
Ec
Ray Saintonge wrote:
Yes. That's a problem??
Not at all. I was just wondering, eleven values vs. "decimal normalization" (which seems to be based on ten). But, what do I know about statistics?
Anyway, the range will not be determined by the software. Bureaucrats can set a range of their choice when they create a topic.
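If each topic carries its own range, results can still be compared on a common 0-10 scale by rescaling at analysis time. This is only a sketch of that idea; the function name `to_ten_point` and the linear rescaling are assumptions, not a description of how the existing code treats ranges.

```python
def to_ten_point(value, low, high):
    """Linearly rescale a rating from the topic's [low, high] range to 0-10."""
    if high <= low:
        raise ValueError("range must satisfy low < high")
    if not low <= value <= high:
        raise ValueError(f"rating {value!r} is outside {low}..{high}")
    return (value - low) / (high - low) * 10


if __name__ == "__main__":
    # A middle rating of 3 on a 1-5 topic lands on the 0-10 midpoint.
    print(to_ten_point(3, 1, 5))
    # Ratings already on a 0-10 topic are unchanged.
    print(to_ten_point(7, 0, 10))
```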
Magnus
wikitech-l@lists.wikimedia.org