On 24/10/10 17:42, Aryeh Gregor wrote:
This term I'm taking a course in high-performance computing (http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html), and I have to pick a topic for a final project. According to the assignment (http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf), "The only real requirement is that it be something in parallel." In the class, we covered:
- Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
- Multithreaded programming using OpenMP
- GPU programming using OpenCL
I've occasionally wondered how hard it would be to parallelize a parser. It's generally not done, despite the fact that parsers are both slow and widely used.
Some file formats can certainly be parsed in parallel, if you partition them in the right way. For example, if you were parsing a CSV file, you could partition on the line breaks. You can't find those breaks by scanning the whole file, since an O(N) serial scan would defeat the purpose, but you can seek to an approximate byte position, and then scan forwards for the next line break to partition at.
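A minimal sketch of that partitioning scheme in Python (the function names and chunk count are mine, just for illustration; in the course's setting you'd do the same thing in C with OpenMP):

```python
import os

def find_boundaries(path, nchunks):
    """Split a file into roughly equal chunks aligned to line breaks.

    Seek to each approximate offset, then scan forward to the next
    newline so no record straddles a chunk boundary. Cost is
    proportional to nchunks, not to the file size.
    """
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, nchunks):
            f.seek(i * size // nchunks)
            f.readline()  # skip the (possibly partial) current line
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return boundaries

def parse_chunk(path, start, end):
    """Parse one chunk serially; each worker runs this independently."""
    rows = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            rows.append(line.rstrip(b"\n").split(b","))
    return rows
```

Each (start, end) pair can then be handed to its own thread or process; since the chunks cover the file exactly once and every boundary sits at the start of a line, concatenating the per-chunk results reproduces what a serial parse would give.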
For more complex file formats, there are various approaches. Googling tells me that this is a well-studied problem for XML.
Obviously for an assessable project, you don't want to dig yourself into a hole too big to get out of. If you chose XML, you could just follow the previous work. JavaScript might be tractable. Attempting to parse wikitext would be insane.
-- Tim Starling