Hello,
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time? There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general? I'd imagine there would be a trade-off between processing time and the speed of network traffic. Let's say we determined that the code that creates a diff between two pages is a resource hog and could be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool, and add the diff process to it as the server load gets heavy. BOINC itself is geared toward research tasks, so a MediaWiki version would need to be different; I only use it here as an example to keep this message short and get your feedback.
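To make the first step concrete, here is a minimal sketch of the kind of log analysis I mean. It assumes a hypothetical tab-separated profiling log ("profiling.log", one line per request: action, elapsed milliseconds); the real server logs have a different format, so the parsing would need to be adapted.

from collections import defaultdict

def summarize(log_path):
    totals = defaultdict(float)   # action -> total milliseconds
    counts = defaultdict(int)     # action -> number of requests
    with open(log_path) as log:
        for line in log:
            try:
                action, elapsed_ms = line.rstrip("\n").split("\t")
                totals[action] += float(elapsed_ms)
                counts[action] += 1
            except ValueError:
                continue  # skip malformed lines
    # Rank actions by total time consumed, not per-request cost, since a
    # cheap but very frequent action can still dominate the servers.
    for action in sorted(totals, key=totals.get, reverse=True):
        avg = totals[action] / counts[action]
        print(f"{action}: {totals[action]:.0f} ms total, "
              f"{counts[action]} requests, {avg:.1f} ms avg")

if __name__ == "__main__":
    summarize("profiling.log")  # hypothetical log file name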
Thanks
Jonathan
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time? There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general? I'd imagine there
This sounds like a nice theory, but what you need first are the numbers. There are so many more (100 times? 1,000 times?) normal page views than diffs, history views, or edits. And the normal page views are already taken care of by the caching proxies (Squid).
I don't know what the numbers are today, or what the hit-miss-ratio of the Squid cache is. It would be interesting to know. Are these statistics documented anywhere?
Page requests per day haven't been documented since October 2004: http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
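The hit/miss ratio could at least be estimated directly from the proxy logs. A minimal sketch, assuming Squid's native access.log format in which the fourth whitespace-separated field carries the cache result code (e.g. TCP_HIT/200, TCP_MISS/200); a custom logformat would need a different field index.

def squid_hit_ratio(log_path):
    hits = misses = 0
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            result = fields[3].split("/")[0]  # e.g. "TCP_HIT"
            if "HIT" in result:
                hits += 1
            elif "MISS" in result:
                misses += 1
    total = hits + misses
    if total:
        print(f"hits: {hits}, misses: {misses}, "
              f"hit ratio: {hits / total:.1%}")

squid_hit_ratio("access.log")  # path is an assumption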
Lars Aronsson wrote:
This sounds like a nice theory, but what you need first are the numbers. There are so many more (100 times? 1,000 times?) normal page views than diffs, history views, or edits. And the normal page views are already taken care of by the caching proxies (Squid).
I don't know what the numbers are today, or what the hit-miss-ratio of the Squid cache is. It would be interesting to know. Are these statistics documented anywhere?
Page requests per day haven't been documented since October 2004: http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
The diff was just an example. If we could push the bots onto the pool, that would be another example. I don't know if that would be helpful.
Along with the numbers, we need to consider server load. Squid is one way to off-load work a tier away from the servers.
Anything that can be fetched from the database once, packaged together, and passed to a node for processing can be off-loaded. The diff seems like an easy target to use as an example.
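Roughly along these lines, here is a minimal sketch of the "fetch once, package, hand to a node" idea. A local multiprocessing pool stands in for the distributed (BOINC-style) pool, fetch_revision() is a hypothetical placeholder, and difflib stands in for MediaWiki's real diff code.

import difflib
from multiprocessing import Pool

def fetch_revision(rev_id):
    # Placeholder: pretend each revision is a small page of text.
    return f"revision {rev_id}\nline one\nline two (edited in {rev_id})\n"

def diff_job(job):
    # The job is self-contained: both texts are packaged with it, so the
    # worker never has to reach back to the database.
    old_id, new_id, old_text, new_text = job
    diff = "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=str(old_id), tofile=str(new_id)))
    return (old_id, new_id, diff)

if __name__ == "__main__":
    # Fetch from the database once, package, then farm out the CPU work.
    jobs = [(a, b, fetch_revision(a), fetch_revision(b))
            for a, b in [(100, 101), (200, 205)]]
    with Pool(processes=2) as pool:
        for old_id, new_id, diff in pool.map(diff_job, jobs):
            print(f"diff {old_id} -> {new_id}:\n{diff}")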
Jonathan
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
Domas and Tim have more recently been playing with kcachegrind and similar tools to get even more fine-grained recordings and pretty visualizations.
There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general?
Well, that's basically what the apache cluster is... :)
Let's say we determined that the code that creates a diff between two pages is a resource hog and could be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool, and add the diff process to it as the server load gets heavy.
Well, diffs for instance are very rare compared to other tasks; they've never been an overall drain on resources, but "pathological" diffs on very large pages can be slow, which is frustrating for interactive performance. When the diff you're looking at takes forever to return (or times out and doesn't return at all) it's rather a pain.
(Note that diff has already been hugely sped up between caching and rewriting it as a C++ plugin.)
Offloading a potentially heavy task like diffing or thumbnail generation to yet another set of servers could have plusses or minuses; a big minus is that such queuing could negatively impact interactive performance. You don't just want that diff some day; you want it right when you click on it.
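To illustrate the caching side of this, a minimal sketch: the expensive diff is computed at most once per revision pair and served from a cache afterwards, so even a pathological diff is only slow the first time someone clicks on it. compute_diff() here is a difflib stand-in for MediaWiki's real diff engine, and the dictionary stands in for whatever cache backend is actually used.

import difflib

_diff_cache = {}  # (old_id, new_id) -> rendered diff

def compute_diff(old_text, new_text):
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True)))

def cached_diff(old_id, new_id, old_text, new_text):
    key = (old_id, new_id)
    if key not in _diff_cache:
        _diff_cache[key] = compute_diff(old_text, new_text)  # slow path, once
    return _diff_cache[key]  # fast path on every later request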
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
Domas and Tim have more recently been playing with kcachegrind and similar tools to get even more fine-grained recordings and pretty visualizations.
That is helpful.
I've written a virtual machine myself, and it appears to be much faster than Java and PHP. However, until a program like MediaWiki is directly translated and put under the same load, it's hard to back up that claim with numbers. I do know from experience with VM development that there is room to advance the technology in use; Java and PHP are popular but still behind the leading edge.
Jonathan
Hi, excuse me, but I am not an English speaker.