This term I'm taking a course in high-performance computing http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have to pick a topic for a final project. According to the assignment http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf, "The only real requirement is that it be something in parallel." In the class, we covered
* Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
* Multithreaded programming using OpenMP
* GPU programming using OpenCL
and will probably briefly cover distributed computing over multiple machines with MPI. I will have access to a high-performance cluster at NYU, including lots of CPU nodes and some high-end GPUs. Unlike most of the other people in the class, I don't have any interesting science projects I'm working on, so something useful to MediaWiki/Wikimedia/Wikipedia is my first thought. If anyone has any suggestions, please share. (If you have non-Wikimedia-related ones, I'd also be interested in hearing about them offlist.) They shouldn't be too ambitious, since I have to finish them in about a month, while doing work for three other courses and a bunch of other stuff.
My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
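For reference, the target would be MediaWiki's legacy salted hash, which (if I remember the scheme correctly) is ':B:<salt>:' followed by md5(salt . '-' . md5(password)); check User::crypt() in the MediaWiki source before trusting this. A rough Python sketch of the hash and of the brute-force loop an OpenCL MD5 kernel would replace:

import hashlib

def mediawiki_b_hash(password, salt):
    # Assumed legacy scheme; verify against User::crypt() before relying on it.
    inner = hashlib.md5(password.encode("utf-8")).hexdigest()
    outer = hashlib.md5((salt + "-" + inner).encode("utf-8")).hexdigest()
    return ":B:%s:%s" % (salt, outer)

def crack(target, salt, candidates):
    # CPU reference loop; the GPU version would test millions of
    # candidates per kernel launch instead of iterating here.
    for pw in candidates:
        if mediawiki_b_hash(pw, salt) == target:
            return pw
    return None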
On 10/24/2010 8:42 PM, Aryeh Gregor wrote:
My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
Boring.
I want Wikipedia converted into facts in a representation system that supports modal, temporal, and "microtheory" reasoning. You know, in the "real" world, :James_T_Kirk is a :Fictional_Character, but in the Star Trek universe, he's a :Person.
Of course, you'd have to pick some chunk out of that big task that's doable. One thing I'd like is something that extracts the "meaning" of hyperlinks. For instance, if we look at
http://en.wikipedia.org/wiki/Bruce_Lee
We see a link to :Wong_Jack_Man, and in dbpedia right now, this is represented as a unidirectional hyperlink w/o semantics. Now, a smarter system could say
:Bruce_Lee :Had_A_Fight_With :Wong_Jack_Man.
Although Wikipedia is a relatively difficult text to work with using typical BOW and NLP methods, it's got enough semantic structure that hybrid semantic-BOW/NLP methods ought to be able to work miracles. I think that the way hyperlinks are used in text could be used to learn templates for detecting named entity references. I think it also ought to be possible to build linguistic models for classification. For instance, if you're having trouble telling your Jaguars apart,
http://en.wikipedia.org/wiki/Jaguar_(disambiguation)
and related documents might help you make a filter that can tell the difference between "jaguar the cat" and "jaguar the car".
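As a minimal sketch of the link-context idea (Python, purely illustrative; the regex and window size are arbitrary), you could pair each outgoing wiki link with the text around it as raw material for guessing the link's "meaning":

import re

LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")

def link_contexts(title, wikitext, window=60):
    # Yield (source, target, context) triples for each wiki link.  The
    # context window is a crude stand-in for the sentence a real NLP
    # pipeline would analyse to propose a predicate like :Had_A_Fight_With.
    for m in LINK_RE.finditer(wikitext):
        target = m.group(1).split("#")[0].strip().replace(" ", "_")
        start, end = m.span()
        yield title, target, wikitext[max(0, start - window):end + window]

Run over the Bruce Lee article, one of the triples would pair :Bruce_Lee with :Wong_Jack_Man plus the sentence describing their fight, which is exactly the raw material the smarter system above would need.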
On Mon, Oct 25, 2010 at 12:38 PM, Paul Houle paul@ontology2.com wrote:
I want Wikipedia converted into facts in a representation system that supports modal, temporal, and "microtheory" reasoning. You know, in the "real" world, :James_T_Kirk is a :Fictional_Character, but in the Star Trek universe, he's a :Person.
This sounds like it would take far more work to actually write the program in the first place than to parallelize it.
On Mon, Oct 25, 2010 at 4:47 PM, Platonides Platonides@gmail.com wrote:
Make the best dump compressor ever? :)
The page http://www.mediawiki.org/wiki/Dbzip2 is worth looking at just for the available options. Continuing dbzip2 is the first idea but not the only one. I'm sure many things can be dug up from there. Also worth noting, Ariel has been doing the last en dumps in page batches.
Possible. It looks like dbzip2 has had a lot of optimization put into it already, so I don't know that there would be much low-hanging fruit for me to get. I'm not sure if that would be acceptable for a final project ("do lots of benchmarking and probably not end up improving it much at all"). Whereas rewriting it to use the GPU sounds like a suspiciously large project . . . plus I'm not sure GPUs would even be suited for it.
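For what it's worth, the basic pbzip2/dbzip2 trick is simple enough to sketch (Python with multiprocessing rather than a GPU, illustrative only): compress fixed-size chunks independently and concatenate the resulting bzip2 streams, which standard bunzip2 accepts, at a small cost in compression ratio.

import bz2
from multiprocessing import Pool

def _compress_chunk(chunk):
    return bz2.compress(chunk, 9)

def parallel_bzip2(data, chunk_size=900 * 1024, workers=4):
    # Each chunk becomes its own bzip2 stream; concatenated streams are
    # still decompressible by bunzip2 (and by Python's bz2.decompress).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(workers) as pool:
        return b"".join(pool.map(_compress_chunk, chunks))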
On Mon, 25-10-2010, at 22:47 +0200, Platonides wrote:
Make the best dump compressor ever? :)
The page http://www.mediawiki.org/wiki/Dbzip2 is worth looking at just for the available options. Continuing dbzip2 is the first idea but not the only one. I'm sure many things can be dug up from there. Also worth noting, Ariel has been doing the last en dumps in page batches.
The last dump. (Just to be accurate.) As soon as I can get the "split by number of revisions" stuff tested I will kick off another round. Hopefully next week.
I am following this discussion and happy to bat around ideas if there's something that might be appropriate for your course, Aryeh.
Ariel
On Mon, Oct 25, 2010 at 6:09 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
I am following this discussion and happy to bat around ideas if there's something that might be appropriate for your course, Aryeh.
Well, is there? Anything that needs to be parallelized, or that's already parallelized but needs significant optimization work (e.g., scales much worse than linearly)?
On Mon, Oct 25, 2010 at 4:04 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Mon, Oct 25, 2010 at 6:09 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
I am following this discussion and happy to bat around ideas if there's something that might be appropriate for your course, Aryeh.
Well, is there? Anything that needs to be parallelized, or that's already parallelized but needs significant optimization work (e.g., scales much worse than linearly)?
(suppressing grumbles about diff engine for edit conflicts, probably an algorithm rather than speed issue)
I suspect that the MySQL engine is the one place where parallelism is most applicable, but I am not a MySQL internals guru. However, there is one associated with the project, and I believe on the list. Domas?
On Mon, Oct 25, 2010 at 7:15 PM, George Herbert george.herbert@gmail.com wrote:
(suppressing grumbles about diff engine for edit conflicts, probably an algorithm rather than speed issue)
Diffs are fast enough if you use wikidiff2, no?
I suspect that the MySQL engine is the one place where parallelism is most applicable, but I am not a MySQL internals guru. However, there is one associated with the project, and I believe on the list. Domas?
Yes, parallelism is very applicable to databases, which is why they've already had tons of effort invested by people who know a lot more about the topic than me, so I probably can't do much there.
To clarify, the subject needs to 1) be reasonably doable in a short timeframe, 2) not build on top of something that's already too optimized. It should probably either be a new project; or an effort to parallelize something that already exists, isn't parallel yet, and isn't too complicated. So far I have the password-cracking thing, maybe dbzip2, and maybe some unspecified thing involving dumps.
On Mon, Oct 25, 2010 at 4:24 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Mon, Oct 25, 2010 at 7:15 PM, George Herbert george.herbert@gmail.com wrote:
(suppressing grumbles about diff engine for edit conflicts, probably an algorithm rather than speed issue)
Diffs are fast enough if you use wikidiff2, no?
Yeah; what I really want is a diff algorithm that properly resolves more of the edit conflicts we currently hit. It's not speed, it's smarts in the algorithm.
I suspect that the MySQL engine is the one place where parallelism is most applicable, but I am not a MySQL internals guru. However, there is one associated with the project, and I believe on the list. Domas?
Yes, parallelism is very applicable to databases, which is why they've already had tons of effort invested by people who know a lot more about the topic than me, so I probably can't do much there.
To clarify, the subject needs to 1) be reasonably doable in a short timeframe, 2) not build on top of something that's already too optimized. It should probably either be a new project; or an effort to parallelize something that already exists, isn't parallel yet, and isn't too complicated. So far I have the password-cracking thing, maybe dbzip2, and maybe some unspecified thing involving dumps.
I understand the context of the class and scope. I just couldn't think of an MW-related topic, other than those, that would seem to need serious parallelism work.
Good luck.
Aryeh Gregor Simetrical+wikilist@gmail.com writes:
To clarify, the subject needs to 1) be reasonably doable in a short timeframe, 2) not build on top of something that's already too optimized. It should probably either be a new project; or an effort to parallelize something that already exists, isn't parallel yet, and isn't too complicated. So far I have the password-cracking thing, maybe dbzip2, and maybe some unspecified thing involving dumps.
Some PageRank-like metric to approximate Wikipedia article importance/quality? Parallelizing eigenvalue calculations has a rich literature.
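For instance, a toy power-iteration sketch (Python/NumPy, dense and illustrative only; a real run over the Wikipedia link graph would need a sparse, distributed matrix). The per-iteration matrix-vector product is the part that parallelizes, either with a threaded BLAS on one node or row-partitioned over MPI:

import numpy as np

def pagerank(adj, d=0.85, tol=1e-10, max_iter=200):
    # adj[i, j] = 1 if page i links to page j (toy dense matrix).
    n = adj.shape[0]
    out = adj.sum(axis=1)
    # Row-stochastic transition matrix; dangling pages link uniformly everywhere.
    P = np.where(out[:, None] > 0, adj / np.maximum(out, 1)[:, None], 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = d * (P.T @ r) + (1 - d) / n   # the parallelizable step
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r_new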
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
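As a concrete (hypothetical) shape for that, assuming the dump has already been split into per-page-range chunk files and decompressed, the pattern is just map-and-sum (Python sketch; the counted statistics are placeholders):

from collections import Counter
from multiprocessing import Pool

def count_chunk(path):
    # Placeholder per-chunk pass: count pages and revisions in one
    # uncompressed XML chunk; real Wikistats metrics would go here.
    c = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                c["pages"] += 1
            elif "<revision>" in line:
                c["revisions"] += 1
    return c

def gather(chunk_paths, workers=8):
    totals = Counter()
    with Pool(workers) as pool:
        for partial in pool.map(count_chunk, chunk_paths):
            totals.update(partial)
    return totals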
Another option might be to look at the methods for compressing old revisions (is [1] still current?).
I make heavy use of parallel processing in my professional work (not related to wikis), but I can't really think of any projects I have at hand that would be accessible and completable in a month.
-Robert Rohde
[1] http://www.mediawiki.org/wiki/Manual:CompressOld.php
On Sun, Oct 24, 2010 at 5:42 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
This term I'm taking a course in high-performance computing http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have to pick a topic for a final project. According to the assignment http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf, "The only real requirement is that it be something in parallel." In the class, we covered
- Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
- Multithreaded programming using OpenMP
- GPU programming using OpenCL
and will probably briefly cover distributed computing over multiple machines with MPI. I will have access to a high-performance cluster at NYU, including lots of CPU nodes and some high-end GPUs. Unlike most of the other people in the class, I don't have any interesting science projects I'm working on, so something useful to MediaWiki/Wikimedia/Wikipedia is my first thought. If anyone has any suggestions, please share. (If you have non-Wikimedia-related ones, I'd also be interested in hearing about them offlist.) They shouldn't be too ambitious, since I have to finish them in about a month, while doing work for three other courses and a bunch of other stuff.
My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
Robert Rohde wrote:
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
I don't think it's a good candidate since you are presumably using compressed files, and its decompression linearises it (and is most likely the bottleneck, too).
Another option might be to look at the methods for compressing old revisions (is [1] still current?).
I make heavy use of parallel processing in my professional work (not related to wikis), but I can't really think of any projects I have at hand that would be accessible and completable in a month.
-Robert Rohde
It can be used; I am unsure whether it is used by WMF.
Another thing that would be nice to have parallelised is the parser tests. That would need adding cotasks to PHP, or something similar. The most similar extension I know of is runkit, which is the other way around: several PHP scopes instead of several threads in one scope.
On Tue, 26-10-2010, at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
I don't think it's a good candidate since you are presumably using compressed files, and its decompression linearises it (and is most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs each decompressing only their small piece then becomes feasible. I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
Ariel
On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
On Tue, 26-10-2010, at 16:25 +0200, Platonides wrote:
Robert Rohde wrote:
Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
I don't think it's a good candidate since you are presumably using compressed files, and its decompression linearises it (and is most likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs each decompressing only their small piece then becomes feasible. I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
Actually the LZMA used by default in 7z can be partially parallelized, with some strong limitations:

1) The location of block N can generally only be found by locating the end of block N-1, so the file has to be read serially.
2) The ability to decompress block N may or may not depend on already having decompressed blocks N-1, N-2, N-3, etc., depending on the details of the data stream.

Point 2 in particular tends to lead to a lot of conflicts that prevent parallelization. If block N happens to be independent of block N-1 then they can be done in parallel, but in general this will not be the case. The frequency of such conflicts depends a lot on the data stream and the options given to the compressor.

Last year LZMA2 was introduced in 7z with the primary intent of improving parallelization. It actually produces slightly worse compression in general, but can be operated to guarantee that block N is independent of blocks N-1 ... N-k for a specified k, meaning that k+1 blocks can always be considered in parallel.

I believe that gzip has similar constraints to LZMA that make parallelization problematic, but I'm not sure about that.
Getting back to Wikimedia, it appears correct that the Wikistats code is designed to run from the compressed files (source linked from [1]). As you suggest, one could use the properties of the .bz2 format to parallelize that. I would also observe that parsers tend to be relatively slow, while decompressors tend to be relatively fast, so I wouldn't necessarily assume that the decompression is the main bottleneck. I've run analyses on dumps that took longer to execute than it took to decompress the files. However, they probably didn't take that many times longer (i.e. if the processing were parallelized into 2 to 4 simultaneous chunks, then the decompression would become the primary bottleneck again).
So it is probably true that if one wants to see a large increase in the speed of stats processing one needs to consider parallelizing both the decompression and the stats gathering.
-Robert Rohde
[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14
Ariel T. Glenn wrote:
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs each decompressing only their small piece then becomes feasible. I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
Ariel
The bzip2recover approach? I am not sure how much gain there will be after so much bit shuffling. Also, I was unable to continue from a flushed point; it may not be so easy. OTOH, if you already have an index and the blocks end at page boundaries (which is what I was doing), it becomes trivial. Remember that the previous block MUST continue up to the point where the next reader started processing inside the next block. And unlike what ttsiod said, you do encounter tags split between blocks in normal compression.
On Wed, 27-10-2010, at 00:05 +0200, Ángel González wrote:
Ariel T. Glenn wrote:
If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs each decompressing only their small piece then becomes feasible. I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
Ariel
The bzip2recover approach? I am not sure how much gain there will be after so much bit shuffling. Also, I was unable to continue from a flushed point; it may not be so easy. OTOH, if you already have an index and the blocks end at page boundaries (which is what I was doing), it becomes trivial. Remember that the previous block MUST continue up to the point where the next reader started processing inside the next block. And unlike what ttsiod said, you do encounter tags split between blocks in normal compression.
I am able (using python bindings to the bzip2 library and some fiddling) to seek to an arbitrary point, find the first block after the seek point, and uncompress it and the following blocks in sequence. That is sufficient for our work, when we are talking about 250GB size compressed files.
We process everything by pages, so we ensure that any reader reads only specified page ranges from the file. This avoids overlaps.
We don't build an index; we're only talking about parallelizing 10-20 jobs at once, not all 21 million pages, so building an index would not be worth it.
Ariel
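For concreteness, a rough sketch (Python, illustrative only) of the "seek and find the next block" step described above. It only locates the 48-bit block magic; actually decompressing from that point still requires re-aligning the block to a byte boundary and prepending a synthesized stream header, much as bzip2recover does, which is presumably what the bindings mentioned above take care of.

BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 compressed-block header

def find_next_block(path, byte_offset, window=4 * 1024 * 1024):
    # Return the absolute bit offset of the first block header found
    # after byte_offset, or None if none occurs within the window.
    with open(path, "rb") as f:
        f.seek(byte_offset)
        data = f.read(window)
    acc = 0
    for i, byte in enumerate(data):
        for shift in range(7, -1, -1):
            acc = ((acc << 1) | ((byte >> shift) & 1)) & ((1 << 48) - 1)
            if acc == BLOCK_MAGIC:
                end_bit = i * 8 + (7 - shift)   # bit index where the magic ends
                return byte_offset * 8 + end_bit - 47
    return None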
Develop a new bot framework (maybe interwiki processing to start with) for a high-performance GPU cluster (NVIDIA or AMD), similar to what BOINC-based projects do. NVIDIA is more popular, while AMD has more cores for the same price
:)
Regards, Jyothis.
http://ml.wikipedia.org/wiki/User:Jyothis http://meta.wikimedia.org/wiki/User:Jyothis I am the first customer of http://www.netdotnet.com
woods are lovely dark and deep, but i have promises to keep and miles to go before i sleep and lines to go before I press sleep
completion date = (start date + ((estimated effort x 3.1415926) / resources) + ((total coffee breaks x 0.25) / 24)) + Effort in meetings
On Sun, Oct 24, 2010 at 8:42 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
This term I'm taking a course in high-performance computing http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have to pick a topic for a final project. According to the assignment http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf, "The only real requirement is that it be something in parallel." In the class, we covered
- Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
- Multithreaded programming using OpenMP
- GPU programming using OpenCL
and will probably briefly cover distributed computing over multiple machines with MPI. I will have access to a high-performance cluster at NYU, including lots of CPU nodes and some high-end GPUs. Unlike most of the other people in the class, I don't have any interesting science projects I'm working on, so something useful to MediaWiki/Wikimedia/Wikipedia is my first thought. If anyone has any suggestions, please share. (If you have non-Wikimedia-related ones, I'd also be interested in hearing about them offlist.) They shouldn't be too ambitious, since I have to finish them in about a month, while doing work for three other courses and a bunch of other stuff.
My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
On 24/10/10 17:42, Aryeh Gregor wrote:
This term I'm taking a course in high-performance computing http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have to pick a topic for a final project. According to the assignment http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf, "The only real requirement is that it be something in parallel." In the class, we covered
- Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
- Multithreaded programming using OpenMP
- GPU programming using OpenCL
I've occasionally wondered how hard it would be to parallelize a parser. It's generally not done, despite the fact that parsers are so slow and so useful.
Some file formats can certainly be parsed in a parallel way, if you partition them in the right way. For example, if you were parsing a CSV file, you could partition on the line breaks. You can't do that by scanning the whole file in O(N), since that would defeat the purpose, but you can seek ahead to a suitable byte position and then scan forwards for the next line break to partition at.
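Something along these lines, say (Python, just a sketch): pick k rough byte offsets, snap each one forward to the next line break, and hand the resulting ranges to worker processes.

import os
from multiprocessing import Pool

def partition(path, parts):
    # Rough byte offsets, each snapped forward to the start of the next line.
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, parts):
            f.seek(k * size // parts)
            f.readline()              # skip the partial line we landed in
            bounds.append(f.tell())
    bounds.append(size)
    return [(path, s, e) for s, e in zip(bounds[:-1], bounds[1:])]

def parse_range(task):
    path, start, end = task
    rows = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            rows += 1                 # a real parser would process the line here
    return rows

def parallel_parse(path, parts=4):
    with Pool(parts) as pool:
        return sum(pool.map(parse_range, partition(path, parts)))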
For more complex file formats, there are various approaches. Googling tells me that this is a well-studied problem for XML.
Obviously for an assessable project, you don't want to dig yourself into a hole too big to get out of. If you chose XML you could just follow the previous work. JavaScript might be tractable. Attempting to parse wikitext would be insane.
-- Tim Starling