Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
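As a rough illustration of the chunk-and-sum pattern described above, here is a minimal Python sketch. It is not Erik Zachte's code. The record layout (a list of (title, text) pairs standing in for parsed dump revisions) and the names split_into_chunks, count_chunk and parallel_stats are assumptions made up for the example. A real dump would need to be parsed and split on page boundaries first.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_chunk(records):
    """Per-chunk statistics: revision count and total bytes per page title.

    `records` is a list of (title, text) pairs, a hypothetical stand-in
    for revisions parsed out of one chunk of a dump.
    """
    stats = Counter()
    for title, text in records:
        stats[("revisions", title)] += 1
        stats[("bytes", title)] += len(text.encode("utf-8"))
    return stats


def parallel_stats(records, n_chunks=4, executor_cls=ProcessPoolExecutor):
    """Run count_chunk on each chunk in parallel, then sum the partial results."""
    total = Counter()
    with executor_cls(max_workers=n_chunks) as pool:
        for partial in pool.map(count_chunk, split_into_chunks(records, n_chunks)):
            total.update(partial)  # Counter.update adds counts key by key
    return total
```

This only works because the statistics are plain sums, so partial results can be combined in any order. When calling parallel_stats from a script with the default process pool, the call should sit under an `if __name__ == "__main__":` guard.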
Another option might be to look at the methods for compressing old revisions (is [1] still current?).
I make heavy use of parallel processing in my professional work (not related to wikis), but I can't really think of any projects I have at hand that would be accessible and completable in a month.
-Robert Rohde
[1] http://www.mediawiki.org/wiki/Manual:CompressOld.php
On Sun, Oct 24, 2010 at 5:42 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
This term I'm taking a course in high-performance computing <http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html>, and I have to pick a topic for a final project. According to the assignment <http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf>, "The only real requirement is that it be something in parallel." In the class, we covered:
- Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
- Multithreaded programming using OpenMP
- GPU programming using OpenCL
and will probably briefly cover distributed computing over multiple machines with MPI. I will have access to a high-performance cluster at NYU, including lots of CPU nodes and some high-end GPUs. Unlike most of the other people in the class, I don't have any interesting science projects I'm working on, so something useful to MediaWiki/Wikimedia/Wikipedia is my first thought. If anyone has any suggestions, please share. (If you have non-Wikimedia-related ones, I'd also be interested in hearing about them offlist.) They shouldn't be too ambitious, since I have to finish them in about a month, while doing work for three other courses and a bunch of other stuff.
My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l