FYI
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel@wikimedia.org>
Date: Mon, Sep 12, 2016 at 9:07 AM
Subject: [Research-Internal] Fwd: Dumps Rewrite getting underway (help needed!)
To: research-internal@lists.wikimedia.org

---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel@wikimedia.org>
Date: Mon, Sep 5, 2016 at 2:35 PM
Subject: Dumps Rewrite getting underway (help needed!)
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l@lists.wikimedia.org>
Hello folks,
I know a number of you have subscribed to the Dumps Rewrite project (https://phabricator.wikimedia.org/tag/dumps-rewrite/) but I bet none of you actually watch it or any of its tasks. So here's a heads up.
I'm getting started on the job scheduler/workflow manager piece. This would accept lists of dump tasks (in the current setup, "dump stubs for el wikipedia"), call a callback to turn each of them into small jobs that can be completed in less than an hour, submit and monitor these jobs with retries, dependencies, etc., call a callback to recombine the outputs of the jobs, and notify some caller on success of the whole operation.
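To make that flow concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: run_task and its split, run_job, recombine and notify parameters are hypothetical names, not part of any existing dumps code or of any package under evaluation. Jobs run one after another here, and job dependencies are left out; a real manager would submit jobs to workers and monitor them.

```python
from typing import Callable, List, Sequence


def run_task(
    task: str,
    split: Callable[[str], Sequence[str]],
    run_job: Callable[[str], str],
    recombine: Callable[[List[str]], str],
    notify: Callable[[str, str], None],
    max_retries: int = 3,
) -> str:
    """Hypothetical driver: split a dump task into small jobs, run each
    job with retries, recombine the outputs, notify the caller on success."""
    outputs: List[str] = []
    for job in split(task):  # callback: task -> small jobs
        for attempt in range(1, max_retries + 1):
            try:
                outputs.append(run_job(job))
                break  # job succeeded, move on to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: the whole task fails
    result = recombine(outputs)  # callback: job outputs -> task output
    notify(task, result)  # reached only if every job succeeded
    return result


if __name__ == "__main__":
    # Toy callbacks standing in for real dump logic (all assumptions).
    run_task(
        "dump stubs for el wikipedia",
        split=lambda task: [f"{task} [part {i}]" for i in range(1, 4)],
        run_job=lambda job: f"output of {job}",
        recombine=lambda outs: "\n".join(outs),
        notify=lambda task, result: print(f"done: {task}"),
    )
```

The two callbacks keep the manager itself ignorant of dump specifics: it only needs to know how to run a job, retry it, and hand the collected outputs back.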
First up is evaluating existing packages and choosing one to use as a foundation. Please contribute! See the following tasks:
https://phabricator.wikimedia.org/T143205: Draft usage scenarios for job/workflow manager
https://phabricator.wikimedia.org/T143206: List requirements needed for task/job/workflow manager
https://phabricator.wikimedia.org/T143207: Evaluate software packages for job/task/workflow management
Also, can someone please forward this on to analytics-l and research-l? I'm not on those lists but they will no doubt have a lot of useful expertise here.
Thanks!
Ariel
_______________________________________________ Research-Internal mailing list Research-Internal@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/research-internal
Ariel:
I have updated a couple of tickets. I think much of this work can be avoided by using infrastructure we already have deployed and working at scale, namely Hadoop and Oozie (a scheduling platform).
Moreover, we just wrapped up our first round of code to reconstruct the edit history from the MediaWiki database, taking advantage of Scala and parallel computation in Spark/YARN. All this work can be applied to shortcut much of the work you have tasked on Phabricator for dumps. Using the same methodology, you can rebuild not only dumps going forward but probably all dumps existing to date.
Thanks,
Nuria
On Tue, Sep 13, 2016 at 5:00 AM, Leila Zia leila@wikimedia.org wrote:
FYI
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics