Occasionally, "toolserver people" (both programmers and users) talk about joining up tools. Wouldn't it be great if we could use one or several toolserver tools, and "mash-up" their output to create something new and useful? And wouldn't it be even better if the users could do this directly, across tools, without programmers hard-connecting tools?
Some tools already support machine-readable output, and some tools already use others to perform a specific function. But these are hardcoded, output formats are often crude (tabbed text, not that there's anything wrong with that in principle), runtimes add up, and so on.
So, I went ahead and, as a first step towards a pipeline setup called wpipe ("w" for "wiki", as you no doubt have guessed), implemented an asset tracker. Here, an asset is a dataset, ideally a JSON file (I also created a simple structure to hold lists of wiki pages, along with arbitrary metadata). Each asset is tracked in a database, accessible via a unique numeric identifier, and its data is stored in a file. Assets can be created and queried via the toolserver command line, as well as via a web interface.
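To make that concrete, here is a rough PHP sketch of what such a page-list structure might look like; the key names are my own illustration, not the documented wpipe format, so check the documentation page linked below for the real layout:

<?php
// Hypothetical page-list asset payload; all key names are illustrative only.
$asset_data = array(
	'meta' => array(
		'language' => 'en',
		'project'  => 'wikipedia',
		'tool'     => 'catscan_rewrite',   // which tool produced this asset
	),
	'pages' => array(
		array( 'title' => 'Example',      'namespace' => 0 ),
		array( 'title' => 'Another page', 'namespace' => 0 ),
	),
);
echo json_encode( $asset_data );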
The usual steps in asset creation involve:
1. reserve (gets a new, unique ID)
2. start (set a flag that the asset data creation has begun)
3. done (store data, either by creating a file directly, or passing the data to be stored)
4. fail (if there was an error during data creation)
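As a minimal sketch of that cycle against the web API: only action=fail is confirmed by the code mentioned later in this thread, so the other action names and the parameter names here are assumptions of mine; the documentation page below has the actual API.

<?php
// Sketch of the reserve/start/done/fail cycle against the wpipe web API.
// Action and parameter names other than "fail" are assumptions.
$api = 'http://toolserver.org/~magnus/wpipe/index.php';

function wpipe_call( $api, $params ) {
	return file_get_contents( $api . '?' . http_build_query( $params ) );
}

// 1. reserve: get a new, unique asset ID
$id = trim( wpipe_call( $api, array( 'action' => 'reserve' ) ) );

// 2. start: flag that data creation has begun
wpipe_call( $api, array( 'action' => 'start', 'asset' => $id ) );

// 3. done: store the generated data (here, a JSON string)
$data = json_encode( array( 'pages' => array() ) );
wpipe_call( $api, array( 'action' => 'done', 'asset' => $id, 'data' => $data ) );

// 4. fail: alternatively, mark the asset as failed on error
// wpipe_call( $api, array( 'action' => 'fail', 'asset' => $id ) );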
Some points:
* All assets and associated data are public.
* Creation and last-access times (writing or reading the actual data) are tracked, so unused assets can be removed to conserve storage.
* Currently, data creation is limited to toolserver IP addresses (yes, I know, it can be gamed. Like the rest of the wiki world.)
* The suggested JSON format should be flexible enough for most tools dealing with lists of wiki pages, but any text-based format will work for specialist uses.
* The asset system can be used by command-line tools and web tools alike.
* Existing tools should be simple to adapt; if a tool takes a list of page names and language/project information, then using asset IDs as an alternative source should be straightforward (see the sketch after this list).
* A pipeline of tools could be started asynchronously, and their progress could be tracked via JavaScript; once a tool has finished, the next one in the chain could be run, all from the user's browser.
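To illustrate the adaptation point: a tool could accept an asset ID as an alternative to a pasted page list, roughly like this. The "asset" and "pagelist" parameter names, the "get" action, and the JSON layout are all assumptions for illustration, not the documented interface.

<?php
// Hypothetical adaptation: accept either a pasted page list or an asset ID.
if ( isset( $_REQUEST['asset'] ) ) {
	$url = 'http://toolserver.org/~magnus/wpipe/index.php?' .
		http_build_query( array( 'action' => 'get', 'asset' => $_REQUEST['asset'] ) );
	$json  = json_decode( file_get_contents( $url ), true );
	$pages = $json['pages'];   // assumed structure, see the sketch above
} else {
	$pages = explode( "\n", trim( $_REQUEST['pagelist'] ) );
}
// ... run the tool on $pages as before ...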
The main web API, and documentation page, is here:
http://toolserver.org/~magnus/wpipe/
That page also links to a generic "asset browser": http://toolserver.org/~magnus/wpipe/asset_info.php
As an appetizer, and feasibility demo, I adapted my own "CatScan rewrite" to use assets as an optional output. This can be done by copying the normal "CatScan rewrite" URL parameters and pasting them here:
http://toolserver.org/~magnus/wpipe/toolwrap.php
This is intended as a generic starter page for command-line-enabled tools. At the moment, only "CatScan rewrite" is available, but I plan to add others. If you have tools you would like to add, I'll be happy to help you set that up. Right now, a tool is started by this page via "nohup &"; that could change to the job submission system, if that's possible from the web servers, but right now it seems overly complicated (runtime estimation? memory estimation? SQL server access? whatnot). The web page then returns the reserved output asset ID while the actual tool is running; another tool could thus be "watching" asynchronously, by pulling the status every few seconds.
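Here is one way such a watcher might look, as a server-side PHP sketch; the "status" action and its return values are assumptions of mine, not the documented API:

<?php
// Poll an asset every few seconds until it is done or has failed.
$api = 'http://toolserver.org/~magnus/wpipe/index.php';
$id  = 12345; // asset ID returned by toolwrap.php (illustrative value)

do {
	sleep( 5 );
	$status = trim( file_get_contents(
		$api . '?' . http_build_query( array( 'action' => 'status', 'asset' => $id ) )
	) );
} while ( $status != 'done' && $status != 'fail' );

if ( $status == 'done' ) {
	// fetch the asset data and run the next tool in the chain
}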
Of course, this whole shebang doesn't make sense unless others are willing to join in, with work on this core, or at least by enabling some of their tools; so please, if you are even slightly interested in a generic data exchange mechanism between tools, potentially leading to a pipeline-able ecosystem, by all means step forward!
Cheers, Magnus
- Currently, data creation is limited to toolserver IP addresses (yes,
I know, it can be gamed. Like the rest of the wiki world.)
This doesn't seem to be working. I was able to create asset 12 directly from my home computer.
Also, action=fail doesn't work: "Error : Unknown action fail". It seems it's not listed in the "if" statement in index.php starting at line 23.
Petr Onderka [[en:User:Svick]]
On Thu, Jul 19, 2012 at 4:44 PM, Petr Onderka gsvick@gmail.com wrote:
- Currently, data creation is limited to toolserver IP addresses (yes,
I know, it can be gamed. Like the rest of the wiki world.)
This doesn't seem to be working. I was able to create asset 12 directly from my home computer.
I'll look into that. Not a big issue, I just don't want to make it too easy for the script kiddies to store warez here...
Also, action=fail doesn't work: "Error : Unknown action fail". It seems it's not listed in the "if" statement in index.php starting at line 23.
Thanks, should be fixed now.
Cheers, Magnus
I'm not convinced about its utility. What tools would need combining? If I just need the results of a SQL query, it may be easier for me than using this system. Maybe a better interface would help.
The use case I see as more interesting is taking a tool which outputs a list of pages and providing it as input to another tool. Some page/user to work with seems to be the most common input. Maybe we should just standardize the input parameters and let tools chain to one another, simply using a special format parameter.
For instance, I usually use names like art, lang, project, such as:
$_REQUEST += array( 'art' => '', 'lang' => 'en', 'project' => 'wikipedia' );
Magnus Manske wrote:
Right now, a tool is started by this page via "nohup &"; that could change to the job submission system, if that's possible from the web servers, but right now it seems overly complicated (runtime estimation? memory estimation? sql server access? whatnot) The web page then returns the reserved output asset ID, while the actual tool is running; another tool could thus be "watching" asynchronously, by pulling the status every few seconds.
Yes, it can be called from the web servers. I use it in a script for scheduling cleanup of the created temporary files.
The relevant code:
$dt = new DateTime( "now", new DateTimeZone( "UTC" ) );
$tmpdir = dirname( __FILE__ ) . "/tmp";
@mkdir( $tmpdir, 0711 );
$shell = "mktemp -d --tmpdir=" . escapeshellarg( $tmpdir ) . " catdown.XXXXXXXX";
$tmpdir2 = trim( `$shell` );

// Program the folder destruction
// Note that qsub is 'slow' to return, so we perform it in the background
$dt->add( new DateInterval( "PT1H" ) );
exec( "SGE_ROOT=/sge/GE qsub -a " . $dt->format( "YmdHi.s" ) . " -wd " . escapeshellarg( $tmpdir ) . " -j y -b y /bin/rm -r " . escapeshellarg( $tmpdir2 ) . " 2>&1 &" );
On Thu, Jul 19, 2012 at 8:06 PM, Platonides platonides@gmail.com wrote:
I'm not convinced about its utility. What tools would need combining? If I just need the results of a SQL query, it may be easier for me than using this system. Maybe a better interface would help.
The use case I see more interesting is for taking a tool which outputs a list of pages and provide for input of another tool. Some page/user to work with seems to be the most common input. Maybe we should just standarize the input parameters and let some tools chain to another, simply using a special format parameter.
For instance I usually use names like: art, lang, project such as: $_REQUEST += array('art'=>'', 'lang'=>'en', 'project'=>'wikipedia' );
I believe we mean the same thing; maybe I didn't describe the asset thing very well.
It's not for a "single page run" of some tool; one reason I chose my CatScan rewrite as demo ist that it can run for a long time (two-digit number of minutes), and generate a vast list of results (tens of thousands of pages), depending on the query. The idea is that (a) you're not "blocking" while waiting for that to finish, before you can do something else; (b) you can access the results of the run again, maybe if the subsequent tool fails, or you want to try a different filter or subset, or a different subsequent tool altogether; (c) you can define new data sources, maybe a tool where you just paste in page titles, or another tool that gets the newest 1.000 articles, or 1.000 random articles, or the last 1.000 articles you edited, or /insert crazy idea here/, and all subsequent tools will just run with it.
And you can chain tools together via a single number; no file path that the other guy doesn't have access to, no sql query that runs for a few minutes every time (that is, /if/ your tool can be reduced to that...), no massive paste orgy, no loss of meta-data between tools.
I also envision longer chains: give me all articles that are in both of these two category trees; remove the ones that have images (except template symbol icons, if possible); remove the ones that have language links; remove the ones that had an edit less than a month ago; render that as wikitext. That's a subject-specific "needs work" finder built from simple components. UNIX philosophy at its finest :-)
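Purely as a sketch of how such a chain could be driven (every tool URL and parameter below is invented for illustration; each hypothetical tool is assumed to take an input asset ID and print the ID of its output asset):

<?php
// Hypothetical pipeline driver chaining tools via asset IDs.
$steps = array(
	'http://toolserver.org/~example/category_intersect.php',
	'http://toolserver.org/~example/remove_with_images.php',
	'http://toolserver.org/~example/remove_with_langlinks.php',
	'http://toolserver.org/~example/remove_recently_edited.php',
	'http://toolserver.org/~example/render_wikitext.php',
);
$asset = 42; // ID of the initial page-list asset (illustrative)
foreach ( $steps as $tool ) {
	$asset = trim( file_get_contents(
		$tool . '?' . http_build_query( array( 'asset' => $asset ) )
	) );
	// in practice one would poll for completion here, as sketched earlier
}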
Thanks, your qsub scheduling snippet looks interesting. I'll play with it, though I still face the problem of estimating resource requirements for a tool from a generic wrapper. /Shudder/
Cheers, Magnus
To me, as someone always struggling with SQL queries (since I am not an expert at all), this sounds somewhat promising. Maybe there will be a comprehensive set of queries feeding into pipes one day. Then I would simply have to pick up the pipe and connect it further. (I like this flavour of UNIX philosophy ;)
Greetings DrTrigon
You can access the asset metadata in the "u_magnus_wpipe_p" database, table "asset", if you wish. The files are stored in /mnt/user-store/wpipe/
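If you prefer to poke at it directly from a toolserver account, something along these lines should work; the SQL host and credentials below are placeholders, and I'm only assuming the table can be read this way, so have a look at the real table definition first:

<?php
// Inspect asset metadata directly; no column names are assumed here.
// Replace host/user/password with your own toolserver SQL settings.
$db  = new mysqli( 'REPLACE-WITH-SQL-HOST', 'REPLACE-WITH-USER', 'REPLACE-WITH-PASSWORD', 'u_magnus_wpipe_p' );
$res = $db->query( 'SELECT * FROM asset LIMIT 10' );
while ( $row = $res->fetch_assoc() ) {
	print_r( $row );
}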
Cheers, Magnus