Hi everybody!
I created the following doc: https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_Nod...
It contains two FAQs:
- How do I ensure that there is enough space on disk before storing big datasets/files?
- How do I check the space used by my files/data on stat/notebook hosts?
Please read them and let me know if anything is unclear or missing. We have plenty of space on the stat100X hosts, but for some reason we tend to cluster on single machines like stat1007 and end up fighting for resources.
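As a rough illustration of the two checks those FAQs ask about (this is not taken from the wiki page itself), a minimal Python sketch might look like the following. The function names and the 10% headroom margin are hypothetical choices made for the example; only the Python standard library is assumed.

```python
import os
import shutil


def has_room_for(path, needed_bytes, headroom=0.10):
    """Return True if the filesystem holding `path` can take `needed_bytes`
    while still leaving `headroom` (a fraction of total size) free.
    The 10% default is an arbitrary assumption, not a site policy."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free - needed_bytes >= usage.total * headroom


def dir_size_bytes(root):
    """Sum the apparent sizes of regular files under `root`.
    Similar in spirit to `du`, but it counts file sizes, not disk blocks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # do not follow or count symlinks
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total
```

For instance, `has_room_for(os.path.expanduser("~"), 50 * 1024**3)` would report whether a 50 GiB dataset fits on the filesystem holding your home directory under these assumptions, and `dir_size_bytes(os.path.expanduser("~"))` would give a rough figure for what your files already occupy there.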
On a related note, we are going to work on unifying stat/notebook puppet configs in https://phabricator.wikimedia.org/T243934, so eventually all Analytics clients will be exactly the same.
Thanks!
Luca (on behalf of the Analytics team)
Looks great Luca! Handy commands...
On Tue, Feb 18, 2020 at 8:53 AM Luca Toscano ltoscano@wikimedia.org wrote:
Research-Internal mailing list Research-Internal@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/research-internal
Thanks for this Luca.
I tend to use stat1007 because I know that machine has a lot of RAM/CPU and HDFS access. For the other stat hosts, I'm not sure which resources each one has (I know at least one of them doesn't have HDFS access). Is there a table where I can look at a summary of resources per machine?
Thanks again.
Hey Diego,
I added a section at the end of the page with the requested info; let me know if anything is missing :)
Luca
On Tue, Feb 18, 2020 at 17:37 Diego Saez-Trumper <diego@wikimedia.org> wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I added a 'GPU?' column too. :) THANKS LUCA!
On Tue, Feb 18, 2020 at 11:51 AM Luca Toscano ltoscano@wikimedia.org wrote:
Thanks for pulling together these directions Luca! I did a little clean-up and will try to remember to do so more routinely.
Adding to what Diego said, I also started using stat1007 because it has the most access to resources (dumps, Hadoop, MariaDB), and then my virtual environments, config files, etc. are there and so I tend to do all of my work on stat1007 even when the other stat machines might work for other projects. Putting the GPU on stat1005 helped me diversify a little but I'm very excited to hear that the stat machines will be more standardized so it matters less which machine I choose. While I have no desire to be spread out across the machines (a few projects on stat1004, a few on stat1005, etc.) because then I'll certainly lose track of where different projects are, I would be open to trying to choose another host as my "main" workspace.
Best, Isaac
On Tue, Feb 18, 2020 at 10:53 AM Andrew Otto otto@wikimedia.org wrote:
Thank you very much, Luca!
To make this nice documentation easier to discover, I moved it to Analytics/Systems/Clients https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients along with the other information on the clients from Analytics/Data access.
On Tue, 18 Feb 2020 at 17:11, Isaac Johnson isaac@wikimedia.org wrote:
-- Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
Great job Luca. Thank you very much.
I have started to spread all WMDE Analytics jobs (mainly Wikidata-related things) across the stat100* machines. While I still mainly use stat1007, two modules of the WDCM https://wikitech.wikimedia.org/wiki/Wikidata_Concepts_Monitor system have already been migrated to stat1004.
Best, Goran
Goran S. Milovanović, PhD
Data Scientist, Software Department
Wikimedia Deutschland
"It's not the size of the dog in the fight, it's the size of the fight in the dog." - Mark Twain
On Wed, Feb 19, 2020 at 4:33 AM Neil Shah-Quinn nshahquinn@wikimedia.org wrote:
This is awesome! Thank you team!
On Tue, Feb 25, 2020 at 7:35 AM Goran Milovanovic < goran.milovanovic_ext@wikimedia.de> wrote: