Hello,
I'm from the French chapter, and for some projects we sometimes need a lot of CPU power and/or a lot of memory. So far this has happened twice:
- the partnership with the BnF, where we needed CPU and disk space (for image and DjVu processing, we rented a server dedicated to that);
- the processing of the videos from our GLAM meeting (splitting the videos and converting them to OGV, done on a personal computer over a long time).
So is it possible to use the toolserver in these cases? And/or to ask for a particular arrangement when significant resources are required? I haven't seen anything about resources in the rules.
Thanks, ~ Seb35 [^_^]
Seb35 wrote:
So is it possible to use the toolserver in these cases? And/or to ask for a particular arrangement when significant resources are required? [...]
It's difficult to know what "a lot" of CPU power or memory is from your post. Toolserver accounts have account limits (https://wiki.toolserver.org/view/Account_limits), so if you're staying within those limits, there's generally no problem. If you want to exceed those limits, you should talk to the Toolserver roots first (https://wiki.toolserver.org/view/System_administrators). There are places like /mnt/user-store that can be used for large media storage as well.
As always, the Toolserver resources that you use need to relate to Wikimedia in some way, but it sounds like both of your projects do. :-)
MZMcBride
OK, thank you, I hadn't found that page.
For the BnF project we in fact needed about one day of computation (most of the time was spent on disk access), though we had expected it to take more (we also optimized by using SAX instead of DOM to read the big XML files; DOM used too much memory as well). For the video encoding to OGV (I wasn't the one who did that), it was 4-5 hours for a single video, though some of that time was spent swapping (and there are 100 videos, corresponding to the conference talks).
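For reference, the batch encoding amounts to something like the following minimal sketch (assuming ffmpeg2theora as the encoder; I don't know exactly which tool was used, and the paths are hypothetical):

    # Batch OGV conversion sketch -- the encoder choice and the paths are
    # assumptions, not the actual setup that was used.
    import subprocess
    from pathlib import Path

    def convert_all(src_dir, dst_dir):
        Path(dst_dir).mkdir(exist_ok=True)
        for video in sorted(Path(src_dir).glob("*.avi")):
            out = Path(dst_dir) / (video.stem + ".ogv")
            # One encode at a time; each took 4-5 hours on the machine used.
            subprocess.check_call(["ffmpeg2theora", str(video), "-o", str(out)])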
Thank you for the response. Seb35
On March 4 2011, Seb35 wrote:
For the BnF project we in fact needed about one day of computation [...] it was 4-5 hours for a single video [...]
Hi Seb35,
"One day" or "4-5 hours" still don't mean a lot in terms of technical requirements. One day of computing with what equipment ? With 24 hours of runtime a small difference can make a big difference. What kind of server server/setup did this run on ?
How much is "too much memory" ?
-- Krinkle
Fri, 04 Mar 2011 14:53:50 +0100, Krinkle krinklemail@gmail.com wrote:
"One day" or "4-5 hours" still doesn't say much about the technical requirements. [...] How much is "too much memory"?
We needed to transform and crop TIFF images, read an XML file for each book containing the OCRed text of the digitized book, and create a DjVu from the images and the text layer.
For that we rented a server; I can't remember exactly which hardware we chose, but it was probably a 4-core (or 8-core) machine with 4 GB (or 8 GB) of RAM and 200-300 GB of disk (plus server bandwidth, useful for downloading the files from the BnF's FTP: about 500 files per book (1 XML per page + a multipage TIFF + some others) x 1416 books = 2-3 days of download on the server, because of the many small files).
From what I remember, "too much memory" means that my laptop (2-core 2.8 GHz, 3 GB of RAM), on which I developed the (Python) program, had difficulties loading the whole XML file with DOM. Then I tried SAX and the work was done in a few seconds without much memory (I hadn't used SAX before, but I ♥ SAX now :-)
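The approach looks roughly like this minimal sketch (not the actual BnF script, which is on FishEye; the element names here are made up):

    # Streaming a large OCR XML file with SAX instead of DOM.
    # The element names ("page", "text") are hypothetical, not the real schema.
    import xml.sax

    class PageHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.in_text = False
            self.pages = 0

        def startElement(self, name, attrs):
            if name == "text":
                self.in_text = True

        def endElement(self, name):
            if name == "text":
                self.in_text = False
            elif name == "page":
                self.pages += 1  # one page done; nothing is kept in memory

        def characters(self, content):
            if self.in_text:
                pass  # feed the OCR text into the DjVu text layer here

    xml.sax.parse("book.xml", PageHandler())  # memory stays flat, unlike DOM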
We wrote a technical report about all this but haven't published it yet (perhaps one day, I hope); you can see http://commons.wikimedia.org/wiki/Commons:Bibliothèque_nationale_de_France for an "outreach" document and https://fisheye.toolserver.org/browse/Seb35/BnF_import for the Python program.
Seb35
Seb35 wrote:
From what I remember, "too much memory" means that my laptop (2-core 2.8 GHz, 3 GB of RAM) had difficulties loading the whole XML file with DOM. [...]
It is important to use the right tools. As you mention, such big XML files need to be processed on the fly, not loaded into memory. You mention a server with 4 or 8 cores: was your program multithreaded (or otherwise running as several instances)? Or is that 24 hours single-threaded?
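For example, a constant-memory pass over such a file can also be done with the standard library's iterparse (a sketch; the element name is made up):

    # On-the-fly processing of a big XML file with ElementTree's iterparse.
    import xml.etree.ElementTree as ET

    def count_pages(path):
        pages = 0
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "page":  # hypothetical element name
                pages += 1
            elem.clear()  # free the finished subtree so memory stays flat
        return pages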
Also, those two cases were one-offs and quite different from each other, so it's probably better to ask about the resources once you know what you will need next. What you describe doesn't seem like too much for the toolserver. You should be able to get enough disk space, and the task could run in the background, so the CPU use wouldn't need to affect other users (especially given that there are no fixed time constraints). Memory could be a problem, though, depending on how much is used and for how long. SGE can probably show some memory-usage graphs from which to deduce how much is available for this kind of project.
Fri, 04 Mar 2011 20:17:19 +0100, Platonides platonides@gmail.com wrote:
You mention a server with 4 or 8 cores: was your program multithreaded (or otherwise running as several instances)? [...]
Thanks for all these responses; next time we'll ask before renting a server for such a purpose.
We used multiple threads (easy with Python; the program on FishEye uses 4 threads, so it was probably a 4-core server), but most of the time went to disk access, so the equivalent single-threaded time would be about 2x or 2.5x our 24 hours.
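The setup was roughly like this (a simplified sketch, not the actual FishEye code; process_book is a hypothetical placeholder):

    # Simplified 4-thread worker pool; threads are enough despite Python's
    # GIL because the bottleneck was disk I/O, not CPU.
    import threading
    from queue import Queue

    def process_book(book_id):
        pass  # hypothetical: download, crop the TIFFs, build the DjVu

    def worker(q):
        while True:
            book_id = q.get()
            if book_id is None:  # sentinel: no more work
                break
            process_book(book_id)

    q = Queue()
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
    for t in threads:
        t.start()
    for book_id in range(1, 1417):  # the 1416 BnF books
        q.put(book_id)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()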
Seb35
On Fri, Mar 4, 2011 at 2:37 PM, Seb35 seb35wikipedia@gmail.com wrote:
Thanks for all these responses; next time we'll ask before renting a server for such a purpose.
Account approval can sometimes take a while, often weeks. If you're thinking you'll likely use the toolserver in the future, you might want to apply for an account now. It doesn't sound like the resources you need would be any problem at all -- we might offer considerably better hardware than whoever you were renting from, too.