Occasionally, "toolserver people" (both programmers and users) talk
about joining up tools. Wouldn't it be great if we could use one or
several toolserver tools, and "mash-up" their output to create
something new and useful? And wouldn't it be even better if the users
could do this directly, across tools, without programmers
hard-connecting tools?
Some tools already support machine-readable output, and some tools
already use others to perform a specific function. But these
connections are hardcoded, output formats are often crude (tabbed
text, not that there's anything wrong with that in principle),
runtimes add up, and so on.
So, I went ahead and, as a first step towards a pipeline setup called
wpipe ("w" for "wiki", as you no doubt have guessed), implemented an
asset tracker. Here, an asset is a dataset, ideally a JSON file (I
also created a simple structure to hold lists of wiki pages, along
with arbitrary metadata). Each asset is tracked in a database,
accessible via a unique numeric identifier, and its data is stored in
a file. Assets can be created and queried via the toolserver command
line, as well as via a web interface.
The usual steps in asset creation involve:
1. reserve (gets a new, unique ID)
2. start (set a flag that the asset data creation has begun)
3. done (store data, either by creating a file directly, or passing
the data to be stored)
4. fail (instead of done, if there was an error during data creation)
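To make the lifecycle above concrete, here is a minimal in-memory sketch in Python. It is not the actual wpipe code: the class name, the method names, and the status strings are assumptions chosen to mirror the four steps, and a plain dict stands in for the real database and file storage.

```python
from enum import Enum


class Status(Enum):
    RESERVED = "reserved"
    STARTED = "started"
    DONE = "done"
    FAILED = "failed"


class AssetTracker:
    """Toy in-memory stand-in for the database-backed tracker (hypothetical)."""

    def __init__(self):
        self._next_id = 1
        self._assets = {}  # asset ID -> {"status": Status, "data": str or None}

    def reserve(self):
        """Step 1: hand out a new, unique numeric ID."""
        asset_id = self._next_id
        self._next_id += 1
        self._assets[asset_id] = {"status": Status.RESERVED, "data": None}
        return asset_id

    def start(self, asset_id):
        """Step 2: flag that data creation has begun."""
        self._require(asset_id, Status.RESERVED)
        self._assets[asset_id]["status"] = Status.STARTED

    def done(self, asset_id, data):
        """Step 3: store the data and mark the asset as finished."""
        self._require(asset_id, Status.STARTED)
        self._assets[asset_id].update(status=Status.DONE, data=data)

    def fail(self, asset_id):
        """Step 4 (instead of done): record that data creation failed."""
        self._require(asset_id, Status.STARTED)
        self._assets[asset_id]["status"] = Status.FAILED

    def status(self, asset_id):
        return self._assets[asset_id]["status"]

    def data(self, asset_id):
        return self._assets[asset_id]["data"]

    def _require(self, asset_id, expected):
        # Refuse transitions that skip a step, e.g. done() before start().
        actual = self._assets[asset_id]["status"]
        if actual is not expected:
            raise ValueError(
                f"asset {asset_id} is {actual.value}, expected {expected.value}"
            )


# Usage: one asset going through the successful path.
tracker = AssetTracker()
asset_id = tracker.reserve()
tracker.start(asset_id)
tracker.done(asset_id, '{"pages": ["Example"]}')
```

Whether the real tracker rejects out-of-order calls like this sketch does is not something I have checked; the strict ordering here is only for illustration.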
Some points:
* All assets and associated data are public.
* Creation time and last-access time (any read or write of the actual
data) are tracked, so unused assets can be removed to conserve storage.
* Currently, data creation is limited to toolserver IP addresses (yes,
I know, it can be gamed. Like the rest of the wiki world.)
* The suggested JSON format should be flexible enough for most tools
dealing with lists of wiki pages, but any text-based format will work
for specialist uses.
* The asset system can be used by command-line tools and web tools alike.
* Existing tools should be simple to adapt; if a tool takes a list of
page names, and language/project information, then using asset IDs as
an alternative source should be straightforward.
* A pipeline of tools could be started asynchronously, and their
progress could be tracked via JavaScript; once a tool has finished,
the next one in the chain could be run, all from the user's browser.
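As a rough illustration of the "simple to adapt" point, here is a Python sketch of a tool that accepts either an explicit page list or an asset ID. The JSON layout used here (top-level "language", "project", and a "pages" list of objects with a "title") is an assumption made for this example, not the documented wpipe format, and load_asset is a hypothetical caller-supplied function that returns the raw JSON text for an asset ID.

```python
import json


def pages_from_asset(raw_json):
    """Extract (language, project, titles) from an asset in the assumed layout."""
    asset = json.loads(raw_json)
    titles = [page["title"] for page in asset["pages"]]
    return asset["language"], asset["project"], titles


def resolve_input(language=None, project=None, titles=None,
                  asset_id=None, load_asset=None):
    """Return (language, project, titles) from explicit arguments or an asset.

    load_asset is a hypothetical function mapping an asset ID to its raw
    JSON text, for example by reading the asset's file.
    """
    if asset_id is not None:
        return pages_from_asset(load_asset(asset_id))
    return language, project, list(titles or [])


# Usage: a dict stands in for the asset storage.
fake_store = {
    7: json.dumps({
        "language": "en",
        "project": "wikipedia",
        "pages": [{"title": "Berlin"}, {"title": "Hamburg"}],
    })
}
from_asset = resolve_input(asset_id=7, load_asset=fake_store.__getitem__)
from_args = resolve_input(language="de", project="wikipedia", titles=["Köln"])
```

The rest of the tool only ever sees the (language, project, titles) triple, so it does not need to know which of the two sources was used.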
The main web API, and documentation page, is here:
http://toolserver.org/~magnus/wpipe/
That page also links to a generic "asset browser":
http://toolserver.org/~magnus/wpipe/asset_info.php
As an appetizer, and feasibility demo, I adapted my own "CatScan
rewrite" to use assets as an optional output. This can be done by
copying the normal "CatScan rewrite" URL parameters and pasting them
here:
http://toolserver.org/~magnus/wpipe/toolwrap.php
This is intended as a generic starter page for command-line-enabled
tools. At the moment, only "CatScan rewrite" is available, but I plan
to add others. If you have tools you would like to add, I'll be happy
to help you set that up.
Right now, a tool is started by this page via "nohup &"; that could
change to the job submission system, if that's possible from the web
servers, but for now that seems overly complicated (runtime
estimation? memory estimation? SQL server access? whatnot).
The web page then returns the reserved output asset ID, while the
actual tool is running; another tool could thus be "watching"
asynchronously, by polling the status every few seconds.
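The "watching" part could look roughly like the Python sketch below (a browser-side version would do the same thing in JavaScript). The get_status function is a hypothetical caller-supplied callback that asks the tracker for an asset's current status, and the status strings "done" and "failed" are assumed names based on the creation steps, not the actual API values.

```python
import time


def wait_for_asset(asset_id, get_status, interval=3.0, timeout=300.0,
                   sleep=time.sleep):
    """Poll get_status(asset_id) until the asset is finished or failed.

    Returns the final status string. Raises TimeoutError if the asset is
    still unfinished after roughly `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(asset_id)
        if status in ("done", "failed"):  # assumed terminal status names
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"asset {asset_id} still '{status}' after {timeout}s")
        sleep(interval)


# Usage: a canned sequence of statuses stands in for real status queries,
# and a no-op sleep keeps the example instantaneous.
canned = iter(["reserved", "started", "started", "done"])
final = wait_for_asset(42, lambda _id: next(canned), sleep=lambda s: None)
```

Once this returns "done", the next tool in the chain can be started with the same asset ID as its input.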
Of course, this whole shebang doesn't make sense unless others are
willing to join in, with work on this core, or at least by enabling
some of their tools; so please, if you are even slightly interested in
a generic data exchange mechanism between tools, potentially leading
to a pipeline-able ecosystem, by all means step forward!
Cheers,
Magnus