This is Peng Wan. I have submitted my application to Wikimedia for GSoC. My project title is "Figuring out the most popular pages".
Here is the project's short description: the feature aims to identify the most popular and favorite pages on Wikimedia. The most popular pages would be determined from page clicks: each click event would send a record to the database, and we could then work out the most popular pages by querying the destination URLs. As for the favorite pages, I want to add a "like" or "+1" link to every page; if a user likes the content of a page, s/he just needs to click the "like" link to increment that page's "like" count in the database.
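As a rough illustration, here is a minimal sketch of what such a "like" counter and popularity query might look like; the table and column names (page_likes, page_title, likes) are hypothetical and not part of any existing MediaWiki schema, and the same pattern would apply to logging click events:

import sqlite3

conn = sqlite3.connect("page_feedback.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page_likes (
        page_title TEXT PRIMARY KEY,
        likes      INTEGER NOT NULL DEFAULT 0
    )
""")

def record_like(title):
    # Called when a user clicks the "like" / "+1" link on a page.
    conn.execute("INSERT OR IGNORE INTO page_likes (page_title) VALUES (?)", (title,))
    conn.execute("UPDATE page_likes SET likes = likes + 1 WHERE page_title = ?", (title,))
    conn.commit()

def most_liked(n=10):
    # Return the n pages with the highest "like" counts.
    return conn.execute(
        "SELECT page_title, likes FROM page_likes ORDER BY likes DESC LIMIT ?",
        (n,)).fetchall()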
Here is my proposal link: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/buaajacks...
I would appreciate your advice on my proposal.
Thanks Peng Wan
On 6 April 2011 16:02, Peng Wan buaajackson@gmail.com wrote:
This is Peng Wan. I have submitted my application to Wikimedia for GSoC. My project title is "Figuring out the most popular pages".
Does it do anything http://stats.grok.se/ doesn't?
- d.
See also: http://dammit.lt/wikistats/
I've parsed every one of these files (at hour granularity; grok.se aggregates at day-level, I believe) since Jan. 2010 into a DB structure indexed by page title. It takes up about 400GB of space, at the moment.
While a comprehensive measurement study over this data would be interesting (long term trends, traffic spikes during cultural events, etc.) -- the technical infrastructure is already in place. I doubt a measurement study meets GSoC requirements.
Thanks, -AW
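For anyone curious what that parsing involves, here is a minimal sketch, assuming the usual per-line format of these hourly dump files ("project page_title view_count bytes", gzipped); the file names, paths and table layout are illustrative only, not the actual database described above:

import gzip
import sqlite3

conn = sqlite3.connect("pageviews.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS hourly_views (
        page_title TEXT,
        hour       TEXT,     -- e.g. '2010-01-15T07'
        views      INTEGER,
        PRIMARY KEY (page_title, hour)
    )
""")

def load_hour(path, hour, project="en"):
    # Parse one gzipped hourly pagecounts file, keeping only one project.
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) != 4 or parts[0] != project or not parts[2].isdigit():
                continue
            rows.append((parts[1], hour, int(parts[2])))
    conn.executemany("INSERT OR REPLACE INTO hourly_views VALUES (?, ?, ?)", rows)
    conn.commit()

# Example: load_hour("pagecounts-20100115-070000.gz", "2010-01-15T07")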
On 04/06/2011 11:05 AM, David Gerard wrote:
On 6 April 2011 16:02, Peng Wan buaajackson@gmail.com wrote:
This is Peng Wan. I have submitted my application to Wikimedia for GSoC. My project title is "Figuring out the most popular pages".
Does it do anything http://stats.grok.se/ doesn't?
- d.
Hoi, While it is interesting to know which articles are popular, that serves no real purpose beyond curiosity. If the information instead showed which articles are missed most often, you would provide functionality that we do not have and that has a really practical application.
When people in India, Pakistan and Sri Lanka all have a bout of cricket fever at the same time, it will show in the traffic numbers from those areas. When a Pakistani cricketer is really popular on Wikipedia but missing from the Tamil or Sinhala Wikipedia, that is the kind of intelligence that points to an article likely to be really popular.
There is a big need for statistics that point to the articles that are missing, and there are several approaches to such data. In my opinion, given the right take on this issue, it is definitely a GSoC-worthy project. Thanks, GerardM
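A minimal sketch of the kind of cross-wiki comparison this suggests; the inputs (a view-count mapping for the source wiki and the set of source titles that already have a counterpart on the target-language wiki, e.g. derived from interlanguage links) are assumed to be available from elsewhere:

def popular_but_missing(view_counts, titles_on_target, top_n=100):
    # view_counts: {title: views} for the source wiki.
    # titles_on_target: source-wiki titles that already have an article
    # (via an interlanguage link) on the target-language wiki.
    ranked = sorted(view_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(title, views) for title, views in ranked[:top_n]
            if title not in titles_on_target]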
On 6 April 2011 17:46, Andrew G. West westand@cis.upenn.edu wrote:
See also: http://dammit.lt/wikistats/
I've parsed every one of these files (at hour granularity; grok.se aggregates at day-level, I believe) since Jan. 2010 into a DB structure indexed by page title. It takes up about 400GB of space, at the moment.
While a comprehensive measurement study over this data would be interesting (long term trends, traffic spikes during cultural events, etc.) -- the technical infrastructure is already in place. I doubt a measurement study meets GSoC requirements.
Thanks, -AW
-- Andrew G. West, Doctoral Student Dept. of Computer and Information Science University of Pennsylvania, Philadelphia PA Website: http://www.cis.upenn.edu/~westand
Andrew G. West wrote:
I've parsed every one of these files (at hour granularity; grok.se aggregates at day-level, I believe) since Jan. 2010 into a DB structure indexed by page title. It takes up about 400GB of space, at the moment.
Is your database available to the public? The Toolserver folks have been talking about getting the page view stats into usable form for quite some time, but nothing's happened yet. If you have an API or something similar, that would be fantastic. (stats.grok.se has a rudimentary API that I don't imagine many people are aware of.)
MZMcBride
Not sure I want to throw the API open to the public (the grok.se folks, and others, have a fine service for casual experimentation).
However, I am willing to share the data with interested researchers who need to do some serious crunching (I have a Java API and could distribute database credentials on a per-case basis).
I'll note that I only parse English Wikipedia at this time. I've found it useful in my anti-vandalism research (i.e., "given that an edit survived between time [w] and [x] on article [y], we estimate it received [z] views"). Thanks, -AW
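Using the hypothetical hourly_views table from the earlier sketch, such an estimate might be no more than summing the hourly counts for the article between the two timestamps:

def estimated_views(conn, title, start_hour, end_hour):
    # Sum the hourly view counts for `title` over the interval [w, x],
    # with hours stored as sortable strings like '2010-01-15T07'.
    row = conn.execute(
        "SELECT COALESCE(SUM(views), 0) FROM hourly_views "
        "WHERE page_title = ? AND hour BETWEEN ? AND ?",
        (title, start_hour, end_hour)).fetchone()
    return row[0]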
On 04/06/2011 08:44 PM, MZMcBride wrote:
Andrew G. West wrote:
I've parsed every one of these files (at hour granularity; grok.se aggregates at day-level, I believe) since Jan. 2010 into a DB structure indexed by page title. It takes up about 400GB of space, at the moment.
Is your database available to the public? The Toolserver folks have been talking about getting the page view stats into usable form for quite some time, but nothing's happened yet. If you have an API or something similar, that would be fantastic. (stats.grok.se has a rudimentary API that I don't imagine many people are aware of.)
MZMcBride