Hello,
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time? There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general? I'd imagine there would be a trade-off between processing time and the speed of network traffic. Let's say we determined that the code that creates a diff between two pages is a resource hog and could be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool, and add the diff process to it as the server load gets heavy. BOINC itself is geared toward research tasks, so a MediaWiki version would need to be different; I only use it here as an example to keep this message short and get your feedback.
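To make the first step concrete, here is a minimal sketch of the kind of log analysis I mean. It assumes a hypothetical tab-separated profiling log ("profiling.log", one line per request: action, elapsed milliseconds); the real server logs have a different format, so the parsing would need to be adapted.

from collections import defaultdict

def summarize(log_path):
    totals = defaultdict(float)   # action -> total milliseconds
    counts = defaultdict(int)     # action -> number of requests
    with open(log_path) as log:
        for line in log:
            try:
                action, elapsed_ms = line.rstrip("\n").split("\t")
                totals[action] += float(elapsed_ms)
                counts[action] += 1
            except ValueError:
                continue  # skip malformed lines
    # Rank actions by total time consumed, not per-request cost, since a
    # cheap but very frequent action can still dominate the servers.
    for action in sorted(totals, key=totals.get, reverse=True):
        avg = totals[action] / counts[action]
        print(f"{action}: {totals[action]:.0f} ms total, "
              f"{counts[action]} requests, {avg:.1f} ms avg")

if __name__ == "__main__":
    summarize("profiling.log")  # hypothetical log file name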
Thanks
Jonathan
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time? There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general? I'd imagine there
This sounds like a nice theory, but what you need first are the numbers. There are so many more (100 times? 1,000 times?) normal page views than diffs, history views, or edits. And the normal page views are already taken care of by the caching proxies (Squid).
I don't know what the numbers are today, or what the hit-miss-ratio of the Squid cache is. It would be interesting to know. Are these statistics documented anywhere?
Page requests per day haven't been documented since October 2004: http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
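The hit/miss ratio could at least be estimated directly from the proxy logs. A minimal sketch, assuming Squid's native access.log format in which the fourth whitespace-separated field carries the cache result code (e.g. TCP_HIT/200, TCP_MISS/200); a custom logformat would need a different field index.

def squid_hit_ratio(log_path):
    hits = misses = 0
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            result = fields[3].split("/")[0]  # e.g. "TCP_HIT"
            if "HIT" in result:
                hits += 1
            elif "MISS" in result:
                misses += 1
    total = hits + misses
    if total:
        print(f"hits: {hits}, misses: {misses}, "
              f"hit ratio: {hits / total:.1%}")

squid_hit_ratio("access.log")  # path is an assumption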
Lars Aronsson wrote:
This sounds like a nice theory, but what you need first are the numbers. There are so many more (100 times? 1,000 times?) normal page views than diffs, history views, or edits. And the normal page views are already taken care of by the caching proxies (Squid).
I don't know what the numbers are today, or what the hit-miss-ratio of the Squid cache is. It would be interesting to know. Are these statistics documented anywhere?
Page requests per day haven't been documented since October 2004: http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
The diff was just an example. If we could push the bots onto the pool, that would be another example. I don't know if that would be helpful.
Along with the numbers, we need to consider server load. Squid is one way to off-load work a tier away from the servers.
Anything that can be fetched from the database once, packaged together, and passed to a node for processing can be off-loaded. The diff seems like an easy target to use as an example.
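Roughly along these lines, here is a minimal sketch of the "fetch once, package, hand to a node" idea. A local multiprocessing pool stands in for the distributed (BOINC-style) pool, fetch_revision() is a hypothetical placeholder, and difflib stands in for MediaWiki's real diff code.

import difflib
from multiprocessing import Pool

def fetch_revision(rev_id):
    # Placeholder: pretend each revision is a small page of text.
    return f"revision {rev_id}\nline one\nline two (edited in {rev_id})\n"

def diff_job(job):
    # The job is self-contained: both texts are packaged with it, so the
    # worker never has to reach back to the database.
    old_id, new_id, old_text, new_text = job
    diff = "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=str(old_id), tofile=str(new_id)))
    return (old_id, new_id, diff)

if __name__ == "__main__":
    # Fetch from the database once, package, then farm out the CPU work.
    jobs = [(a, b, fetch_revision(a), fetch_revision(b))
            for a, b in [(100, 101), (200, 205)]]
    with Pool(processes=2) as pool:
        for old_id, new_id, diff in pool.map(diff_job, jobs):
            print(f"diff {old_id} -> {new_id}:\n{diff}")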
Jonathan
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
Domas and Tim have more recently been playing with kcachegrind and similar tools to get even more fine-grained recordings and pretty visualizations.
There is no immediate need, but I wanted to float an idea for consideration. If we could identify the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general?
Well, that's basically what the apache cluster is... :)
Let's say we determined that the code that creates a diff between two pages is a resource hog and could be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool, and add the diff process to it as the server load gets heavy.
Well, diffs for instance are very rare compared to other tasks; they've never been an overall drain on resources, but "pathological" diffs on very large pages can be slow, which is frustrating for interactive performance. When the diff you're looking at takes forever to return (or times out and doesn't return at all) it's rather a pain.
(Note that diff has already been hugely sped up between caching and rewriting it as a C++ plugin.)
Offloading a potentially heavy task like diffing or thumbnail generation to yet another set of servers could have plusses or minuses; a big minus is that such queuing could negatively impact interactive performance. You don't just want that diff some day; you want it right when you click on it.
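To illustrate the caching side of this, a minimal sketch: the expensive diff is computed at most once per revision pair and served from a cache afterwards, so even a pathological diff is only slow the first time someone clicks on it. compute_diff() here is a difflib stand-in for MediaWiki's real diff engine, and the dictionary stands in for whatever cache backend is actually used.

import difflib

_diff_cache = {}  # (old_id, new_id) -> rendered diff

def compute_diff(old_text, new_text):
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True)))

def cached_diff(old_id, new_id, old_text, new_text):
    key = (old_id, new_id)
    if key not in _diff_cache:
        _diff_cache[key] = compute_diff(old_text, new_text)  # slow path, once
    return _diff_cache[key]  # fast path on every later request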
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Jonathan wrote:
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time?
For some past stuff, see: http://www.google.com/search?q=site%3Ameta.wikimedia.org+profiling
Domas and Tim have more recently been playing with kcachegrind and similar tools to get even more fine-grained recordings and pretty visualizations.
That is helpful.
I've written a virtual machine myself, and it appears to be much faster than Java and PHP. However, until a program like MediaWiki is directly translated and put under the same load, it's hard to back up that claim with numbers. I do know from experience with VM development that there is room to advance the technology in use; Java and PHP are popular but still behind the leading edge.
Jonathan
Hi, excuse me, but I am not an English speaker.