Date: Mar 17, 2006 10:24 AM
Subject: [Wikitech-l] Distributed process pool
To: wikitech-l@wikimedia.org
Hello,
Is there a way to analyze the Wikipedia logs to figure out which processes take the most time? There's no immediate need, but I wanted to shoot off an idea to consider. If we could capture the processes that put a heavy load on the servers and push them onto a distributed process pool, would that help Wikipedia, or MediaWiki in general? I imagine there would be a trade-off between the processing time saved and the speed of the network traffic.

Let's say we determined that the code that creates a diff between two pages is a hog and could be put into the pool. We could use something like BOINC, http://boinc.berkeley.edu/, to standardize the pool, and add the diff process to it as the server load gets heavy. BOINC is aimed more at research tasks, so it would need to work differently for MediaWiki; I just used it as an example to keep this message short and get your feedback.
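To make that a bit more concrete, here is a rough sketch in Python (MediaWiki itself is PHP, and the names LOAD_THRESHOLD, pool_queue, and request_diff are invented for illustration) of what I mean: compute the diff locally while the load is light, and hand the job to a pool of workers once the load average crosses a threshold. A thread stands in for a remote pool node here; in reality the queue would be the network boundary.

    import os
    import difflib
    import queue
    import threading

    LOAD_THRESHOLD = 4.0          # hypothetical 1-minute load average cutoff
    pool_queue = queue.Queue()    # stands in for the distributed pool's job queue

    def compute_diff(old_text, new_text):
        """The expensive work: a unified diff between two revisions."""
        return "\n".join(difflib.unified_diff(old_text.splitlines(),
                                              new_text.splitlines(),
                                              lineterm=""))

    def pool_worker():
        """A stand-in pool node: pull jobs, compute, report back via callback."""
        while True:
            old_text, new_text, callback = pool_queue.get()
            callback(compute_diff(old_text, new_text))
            pool_queue.task_done()

    def request_diff(old_text, new_text, callback):
        """Run locally when the server is idle, offload when it is loaded."""
        load_1min = os.getloadavg()[0]          # Unix-only load average
        if load_1min < LOAD_THRESHOLD:
            callback(compute_diff(old_text, new_text))      # cheap path: do it here
        else:
            pool_queue.put((old_text, new_text, callback))  # hand off to the pool

    # Start one stand-in "pool node" thread for the sketch.
    threading.Thread(target=pool_worker, daemon=True).start()

The point is only that the decision to offload can be a one-line check on current load, with the pool hidden behind a job queue.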
Jonathan,
Jared and I are actually working on a project to research and build a peer-based distributed hosting framework for large free-content sites like Wiki{m,p}edia. In a practical sense, we're using Wikipedia as a starting point for the inquiry.
If we can get accurate statistics on what kinds of processes these are and how many resources they consume, we might be able to fold those considerations into the simulation environment we'll be using to evaluate various architectures for distributed hosting.
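For what it's worth, the kind of numbers I have in mind could come from something as simple as timing each class of operation and logging the totals, which would give the simulation per-operation cost inputs. A hypothetical Python sketch (the timed/report names and the "diff" label are made up for illustration):

    import time
    from collections import defaultdict

    # Accumulate wall-clock time per operation type (parse, diff, search, ...).
    timings = defaultdict(lambda: {"count": 0, "seconds": 0.0})

    class timed:
        """Context manager that charges elapsed time to a named operation."""
        def __init__(self, name):
            self.name = name
        def __enter__(self):
            self.start = time.perf_counter()
        def __exit__(self, *exc):
            stats = timings[self.name]
            stats["count"] += 1
            stats["seconds"] += time.perf_counter() - self.start

    # Usage inside request-handling code:
    # with timed("diff"):
    #     render_diff(old_rev, new_rev)

    def report():
        """Print operations ordered by total time consumed."""
        for name, s in sorted(timings.items(), key=lambda kv: -kv[1]["seconds"]):
            print(f"{name}: {s['count']} calls, {s['seconds']:.2f}s total")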
I'm finally subscribing to the wikitech list, and I'd be interested in hearing about anything related that doesn't come up there.
-Erik