Fri, 04 Mar 2011 20:17:19 +0100, Platonides platonides@gmail.com wrote:
Seb35 wrote:
Krinkle wrote:
How much is "too much memory" ?
We needed to transform and crop TIFF images, read an XML file containing the OCRized text of the digitized book, and create a DjVu from the images and the text layer.
For that we rented a server; I cannot remember exactly the hardware we chose, but it was probably a 4-core (or 8-core) with 4GB (or 8GB) of RAM and 200-300GB of disk (plus server-grade bandwidth, useful for downloading the files from the BnF's FTP: about 500 files per book (1 XML per page + a multipage TIFF + some others) x 1416 books = 2-3 days of download on the server because of the many small files).
From what I remember, "too much memory" means my laptop (2-core 2.8GHz, 3GB of RAM), on which I developed the (Python) program, had difficulties loading the whole XML file (with DOM). Then I tried SAX and the work was done in a few seconds without much memory (I hadn't used SAX before, but I ♥ SAX now :-)
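The DOM-versus-SAX difference comes down to streaming: SAX fires callbacks as the parser reads, so memory stays constant regardless of file size. A minimal sketch with Python's standard xml.sax module (the "page" element name here is hypothetical, not the actual BnF schema):

```python
# Stream a large XML file with SAX instead of loading a DOM tree.
# Element name "page" is a made-up example, not the real BnF format.
import xml.sax

class TextExtractor(xml.sax.ContentHandler):
    """Collect character data per <page> element without building a tree."""
    def __init__(self):
        super().__init__()
        self.pages = []        # extracted text, one entry per page
        self._in_page = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "page":
            self._in_page = True
            self._chunks = []

    def characters(self, content):
        if self._in_page:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "page":
            self.pages.append("".join(self._chunks).strip())
            self._in_page = False

handler = TextExtractor()
# xml.sax.parse(open("book.xml", "rb"), handler) would stream a real file;
# a tiny inline document keeps the sketch self-contained:
xml.sax.parseString(b"<book><page>Hello</page><page>World</page></book>", handler)
print(handler.pages)  # -> ['Hello', 'World']
```

The handler never holds more than one page's text at a time, which is why a 3GB-RAM laptop can process an XML file that DOM cannot fit in memory.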
We wrote a technical report about that, but haven't published it yet (perhaps one day, I hope); you can see http://commons.wikimedia.org/wiki/Commons:Biblioth%E8que_nationale_de_France for an "outreach" document and https://fisheye.toolserver.org/browse/Seb35/BnF_import for the Python program.
Seb35
It is important to use the right tools. As you mention, such big XML files need to be processed on the fly, not by loading them into memory. You mention a server with 4 or 8 cores. Was your program multithreaded (or otherwise running several instances)? Were those 24 hours single-threaded?
Also, those tasks happened once and are quite different, so it's probably better to ask about the needed resources once you know what you will need next. What you mention doesn't seem too much for the toolserver. You should be able to use enough disk space, and the task could be run in the background, so CPU wouldn't need to affect other users (especially given that there are no fixed time constraints). Memory could be a problem, though, depending on the amount used and for how long. SGE can probably show some memory usage graphs from which to deduce the amount available for this kind of project.
Thanks for all these responses; we will ask next time before renting a server for such a purpose.
We used multiple threads (easy with Python; 4 threads, according to the program on FishEye, so it was probably a 4-core server), but most of the time was spent on disk access, so the equivalent single-threaded time would be about 2x or 2.5x our 24-hour figure.
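For an I/O-bound job like this, Python threads work well despite the GIL, since threads blocked on disk or network release it. A sketch of the worker-pool pattern with the standard queue and threading modules (process_book is a hypothetical stand-in for the real per-book work, not the actual BnF_import code):

```python
# I/O-bound worker pool: 4 threads draining a shared task queue.
# process_book is a placeholder for real work (download, convert, upload).
import queue
import threading

def process_book(book_id):
    return f"book-{book_id} done"   # placeholder for the real per-book work

def worker(tasks, results):
    while True:
        book_id = tasks.get()
        if book_id is None:          # sentinel: no more work for this thread
            tasks.task_done()
            break
        results.append(process_book(book_id))   # list.append is atomic under the GIL
        tasks.task_done()

tasks = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for book_id in range(8):             # enqueue the work items
    tasks.put(book_id)
for _ in threads:
    tasks.put(None)                  # one sentinel per worker thread
tasks.join()                         # wait until every item is processed
for t in threads:
    t.join()
print(sorted(results))               # -> ['book-0 done', ..., 'book-7 done']
```

When the work is dominated by disk waits, the threads overlap those waits rather than the computation, which is consistent with seeing only a ~2-2.5x speedup from 4 threads instead of a full 4x.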
Seb35