<<O>>  Difference Topic DAKS (r1.5 - 14 Sep 2006 - ShawnBowers)

META TOPICPARENT ParticipatingTeams

Data and Knowledge Systems

Line: 28 to 28

Kepler implementation of the Challenge Workflow. We implemented the Challenge workflow in Kepler as shown below. Actors labeled AlignWarp, ResliceWarp, SoftMean, Slicer, and Convert correspond to the five stages of the Challenge workflow. The actors labeled CollectionReader and CollectionWriter import data into the workflow and save the output/trace of the workflow, respectively. The actor ReplicationCollection creates two additional copies of the products of SoftMean so that downstream actors will execute three times, once for each desired slice of the average image.

Changed:
<
<
>
>

How the collection-oriented actors (coactors) work. Our collection-oriented workflow framework provides generic support for operating over nested collections (i.e., trees) of scientific data. Coactors differ from conventional Kepler actors (such as those used in the RWS solution to this Provenance Challenge) in that rather than operating on flat, homogeneous streams of tokens, coactors operate on trees of heterogeneous data. A coactor is invoked whenever a subtree of the input stream matching certain criteria (e.g., the declared scope of the coactor) is received. During an invocation, the coactor may optionally add or delete nodes within the subtree upon which it was invoked. The figure below illustrates how the AlignWarp actor operates on an AnatomyImage collection, adding a WarpParamSet to this collection. All data received by AlignWarp outside of its scope passes through the coactor transparently.
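The scope-based invocation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kepler implementation: the `Node` class, the `run_coactor` function, and the `ReferenceImage` collection are hypothetical names introduced here, and the `align_warp` function merely stands in for the AlignWarp coactor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a nested collection (hypothetical model): a type plus children."""
    type: str
    children: list = field(default_factory=list)


def run_coactor(node, scope, action):
    """Invoke `action` on every subtree whose type matches `scope`.

    Nodes outside the scope are not modified; they simply pass through.
    Returns the number of invocations that took place.
    """
    if node.type == scope:
        action(node)  # the coactor may add or delete nodes inside its scope
        return 1
    return sum(run_coactor(child, scope, action) for child in node.children)


def align_warp(anatomy_image):
    # Stand-in for AlignWarp: add a WarpParamSet to the collection it was invoked on.
    anatomy_image.children.append(Node("WarpParamSet"))


tree = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ImageHeader")]),
    Node("AnatomyImage", [Node("Image"), Node("ImageHeader")]),
    Node("ReferenceImage", [Node("Image"), Node("ImageHeader")]),  # outside the scope
])
invocations = run_coactor(tree, "AnatomyImage", align_warp)
```

In this sketch each AnatomyImage collection triggers one invocation and gains a WarpParamSet, while the out-of-scope collection is left unchanged.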

Changed:
<
<
>
>

In Kepler, collections are serialized and streamed through coactors. Because actor execution is pipelined based on each actor’s scope, this approach enables concurrent processing of nested data collections as shown below. The figure illustrates how delimiter tokens (in blue and green) are used to bracket nested collections of associated data (in white), metadata (in red) and actor parameters (not shown).
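To make the role of delimiter tokens concrete, the following Python sketch flattens a nested collection into a token stream and rebuilds it again. The token layout (`"open"`, `"close"`, `"data"` tuples) and the file names are assumptions made for illustration; the actual Kepler token types are not shown in this page.

```python
def serialize(collection):
    """Flatten a nested collection into a stream of delimiter and data tokens.

    A collection is modeled as a (name, children) tuple; any other child is data.
    """
    name, children = collection
    yield ("open", name)
    for child in children:
        if isinstance(child, tuple):
            yield from serialize(child)
        else:
            yield ("data", child)
    yield ("close", name)


def deserialize(tokens):
    """Rebuild the nested collection from a delimited token stream."""
    stack = [[]]   # stack[-1] gathers the children of the innermost open collection
    names = []     # names of the collections that are currently open
    for kind, value in tokens:
        if kind == "open":
            names.append(value)
            stack.append([])
        elif kind == "close":
            children = stack.pop()
            stack[-1].append((names.pop(), children))
        else:
            stack[-1].append(value)
    return stack[0][0]


collection = ("ImageCollection", [
    ("AnatomyImage", ["anatomy1.img", "anatomy1.hdr"]),  # hypothetical file names
    ("AnatomyImage", ["anatomy2.img", "anatomy2.hdr"]),
])
stream = list(serialize(collection))
```

Because each collection is bracketed by its own delimiters, a downstream consumer can start working on the first AnatomyImage as soon as its closing token arrives, without waiting for the rest of the stream.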

Changed:
<
<
>
>

Input collections drive workflow execution. The collection-oriented implementation of the Challenge workflow may be configured to operate on different numbers of input anatomy images, not by modifying the workflow definition, but by customizing the input to the workflow. We tested our provenance system using two different input data sets represented by two XML files. The first input file, input1.xml, corresponds exactly to the Challenge workflow and contains four AnatomyImage collections within a single ImageCollection collection (see tree representation below). The second input file, input2.xml, contains three ImageCollections comprising four, three, and two AnatomyImage collections respectively. In other words, our implementation can operate on varying numbers of anatomy images within a single run of the workflow. Moreover, parameter values for particular actors also may be embedded within the workflow input to override default parameter values for particular sub-collections of data (note Parameter elements in the two input XML files).
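The idea of embedding parameter overrides in the input can be illustrated with a small Python sketch. The XML layout below is hypothetical: the element names ImageCollection, AnatomyImage, and Parameter come from the description above, but the attribute names (`actor`, `name`, `value`), the `-m 6` override, and the `effective_params` helper are assumptions, and the real input1.xml and input2.xml may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document: the second AnatomyImage overrides one parameter.
INPUT_XML = """
<ImageCollection>
  <AnatomyImage name="a1"/>
  <AnatomyImage name="a2">
    <Parameter actor="AlignWarp" name="warpParameters" value="-m 6"/>
  </AnatomyImage>
</ImageCollection>
""".strip()

# Default parameter values, as they might be configured on the actors themselves.
DEFAULTS = {("AlignWarp", "warpParameters"): "-m 12"}


def effective_params(collection, inherited):
    """Parameters in force inside `collection`: inherited values plus local overrides."""
    params = dict(inherited)
    for p in collection.findall("Parameter"):
        params[(p.get("actor"), p.get("name"))] = p.get("value")
    return params


root = ET.fromstring(INPUT_XML)
top_level = effective_params(root, DEFAULTS)
per_image = {
    image.get("name"): effective_params(image, top_level)[("AlignWarp", "warpParameters")]
    for image in root.findall("AnatomyImage")
}
```

Here the workflow definition stays fixed; adding AnatomyImage elements or Parameter elements to the input is what changes the run.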

Changed:
<
<
>
>

Provenance Trace

The results of each run of the Challenge workflow (including input data, intermediate and final data products, as well as provenance) were recorded in a trace file by the CollectionWriter actor and may be downloaded here: trace1.xml, trace2.xml. Trace files are implemented in XML using the same schema used for workflow input files read by CollectionReader. (An execution of a collection-oriented workflow may be thought of as a process of incrementally elaborating the input XML document.) The figure below shows the beginning of such a trace, highlighting how little additional information must be added to the trace file to record data lineage and invocation dependencies.

Changed:
<
<
>
>

As illustrated above, data and invocation dependencies are represented in the trace as special XML elements describing the provenance of other elements. Insertion and deletion elements record the actor, actor invocation count, and direct data dependencies associated with the event that created or removed the element following it in the document. InvocationDependency elements record which invocations of preceding actors created data or modified collections used in the current actor invocation. Insertion, deletion, and invocation dependency information is passed through the workflow as special tokens during workflow execution. Coactors declare data dependencies explicitly during execution, whereas invocation dependencies are inferred and inserted into the token stream by the framework automatically. The figure below illustrates two data dependencies graphically.

Changed:
<
<
>
>

From collection-oriented execution traces we can construct data-lineage graphs. Vertices in a data-lineage graph represent input, output, and intermediate data and collection items. Edges denote item dependencies, which are further labeled with the actor invocations involved in the creation or modification of the item. In general, collection-oriented traces, and their corresponding data-lineage graphs, enable a wide range of queries over both process and data dependencies.
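As a rough illustration of this construction, the Python sketch below turns a list of insertion records into vertices and labeled edges. The `Insertion` record and the item ids are hypothetical; the four-part edge shape (item, dependency, actor, invocation) is chosen to mirror the edge tuples that appear in the Prolog queries on this page.

```python
from typing import NamedTuple


class Insertion(NamedTuple):
    """Hypothetical in-memory form of an insertion element from a trace."""
    item: str           # id of the item that was created
    actor: str          # actor that created it
    invocation: int     # invocation count of that actor
    depends_on: tuple   # ids of the items it was directly derived from


def lineage_graph(insertions, inputs):
    """Build (vertices, edges); each edge is (item, dependency, actor, invocation)."""
    vertices = set(inputs)
    edges = set()
    for ins in insertions:
        vertices.add(ins.item)
        for dep in ins.depends_on:
            edges.add((ins.item, dep, ins.actor, ins.invocation))
    return vertices, edges


trace = [
    Insertion("warp1", "AlignWarp", 1, ("img1", "hdr1")),
    Insertion("resliced1", "ResliceWarp", 1, ("warp1",)),
]
vertices, edges = lineage_graph(trace, inputs=["img1", "hdr1"])
```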

Line: 77 to 77

Each of the predicates above is implemented as a primitive operation within the provenance query engine. The graph that results from running this query is shown below.

Changed:
<
<
>
>

The following query computes and draws the corresponding data-lineage graph for the second trace, described above. Since there are three separate image collections used in the execution (i.e., the three collections are pipelined through the workflow), the result consists of three independent Atlas X Graphic objects.

Line: 92 to 92

The graph that results from running this query is shown below. Note that in this example, only a subset of the input images is used to derive the corresponding output graphics.

Changed:
<
<
>
>

2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean. This query is similar to query 1 above, but we filter out edges of the data-lineage graph that correspond to invocations occurring prior to SoftMean computations. The following query computes and draws the corresponding data-lineage graph for the first trace.

Line: 107 to 107

The filtering step is performed using the filterBeforeActor operation provided by the query engine. The graph that results from running this query is shown below.

Changed:
<
<
>
>

3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic. Note that the result of this query is identical to query 2. Here we show an alternative method for computing the result. Instead of filtering out data-lineage edges that correspond to invocations prior to SoftMean, we select those edges denoting invocations after ResliceWarp.

 <<O>>  Difference Topic DAKS (r1.4 - 13 Sep 2006 - LucMoreau)
Changed:
<
<
META TOPICPARENT WebHome
>
>
META TOPICPARENT ParticipatingTeams

Data and Knowledge Systems

 <<O>>  Difference Topic DAKS (r1.3 - 13 Sep 2006 - ShawnBowers)

META TOPICPARENT WebHome

Data and Knowledge Systems

Line: 7 to 7

  • Short team name: DAKS

Changed:
<
<
  • Participant names: Shawn Bowers, Tim McPhillips, Bertram Ludaescher in collaboration with Norbert Podhorszki and Ilkay Altintas.
>
>
  • Participant names: Shawn Bowers, Tim McPhillips, and Bertram Ludaescher, in collaboration with Norbert Podhorszki and Ilkay Altintas.

  • Project Overview: The Data and Knowledge Systems (DAKS) group at UC Davis is developing the Collection-Oriented Workflow paradigm and implementing this approach in the Kepler Workflow System. See McPhillips & Bowers (2005) and McPhillips et al (2006) listed below for more information.

Line: 32 to 32

How the collection-oriented actors (coactors) work. Our collection-oriented workflow framework provides generic support for operating over nested collections (i.e., trees) of scientific data. Coactors differ from conventional Kepler actors (such as those used in the RWS solution to this Provenance Challenge) in that rather than operating on flat, homogeneous streams of tokens, coactors operate on trees of heterogeneous data. A coactor is invoked whenever a subtree of the input stream matching certain criteria (e.g., the declared scope of the coactor) is received. During an invocation, the coactor may optionally add or delete nodes within the subtree upon which it was invoked. The figure below illustrates how the AlignWarp actor operates on an AnatomyImage collection, adding a WarpParamSet to this collection. All data received by AlignWarp outside of its scope passes through the coactor transparently.

Changed:
<
<
>
>

In Kepler, collections are serialized and streamed through coactors. Because actor execution is pipelined based on each actor’s scope, this approach enables concurrent processing of nested data collections as shown below. The figure illustrates how delimiter tokens (in blue and green) are used to bracket nested collections of associated data (in white), metadata (in red) and actor parameters (not shown).

Changed:
<
<
Input collections drive workflow execution. The collection-oriented implementation of the Challenge workflow may be configured to operate on different numbers of input anatomy images, not by modifying the workflow definition, but by customizing the input to the workflow. We tested our provenance system using two different input data sets represented by two XML files. The first input file, input1.xml, corresponds exactly to the Challenge workflow and contains four AnatomyImage collections within a single ImageCollection collection (see tree representation below). The second input file, input2.xml, contains three ImageCollections comprising four, three, and two AnatomyImage collections respectively. In other words, our implementation can operate on varying numbers of anatomy images within a single run of the workflow. Moreover, parameter values for particular actors also may be embedded within the workflow input in order to override default parameter values for particular sub-collections of data (note Parameter elements in the two input XML files).
>
>
Input collections drive workflow execution. The collection-oriented implementation of the Challenge workflow may be configured to operate on different numbers of input anatomy images, not by modifying the workflow definition, but by customizing the input to the workflow. We tested our provenance system using two different input data sets represented by two XML files. The first input file, input1.xml, corresponds exactly to the Challenge workflow and contains four AnatomyImage collections within a single ImageCollection collection (see tree representation below). The second input file, input2.xml, contains three ImageCollections comprising four, three, and two AnatomyImage collections respectively. In other words, our implementation can operate on varying numbers of anatomy images within a single run of the workflow. Moreover, parameter values for particular actors also may be embedded within the workflow input to override default parameter values for particular sub-collections of data (note Parameter elements in the two input XML files).

Provenance Trace

Changed:
<
<
The results of each run of the Challenge workflow (including input data, intermediate and final data products, as well as provenance) were recorded in a trace file by the CollectionWriter actor and may be downloaded here: trace1.xml, trace2.xml. Traces files are implemented in XML using the same schema used for workflow input files read by CollectionReader. (An execution of a collection-oriented workflow may be thought of as a process of incrementally elaborating the input XML document.) The figure below shows the beginning of such a trace, highlighting how little additional information must be added to the trace file in order to record data lineage and invocation dependencies.
>
>
The results of each run of the Challenge workflow (including input data, intermediate and final data products, as well as provenance) were recorded in a trace file by the CollectionWriter actor and may be downloaded here: trace1.xml, trace2.xml. Trace files are implemented in XML using the same schema used for workflow input files read by CollectionReader. (An execution of a collection-oriented workflow may be thought of as a process of incrementally elaborating the input XML document.) The figure below shows the beginning of such a trace, highlighting how little additional information must be added to the trace file to record data lineage and invocation dependencies.

Changed:
<
<
As illustrated above, data and invocation dependencies are represented in the trace as special XML elements describing the provenance of other elements. Insertion elements record the actor, actor invocation count, and direct data dependencies associated with event that created the element following it in the document. InvocationDependency elements record which invocations of preceding actors created data or modified collections used in the current actor invocation. Insertion and invocation dependency information is passed through the workflow as special tokens during workflow execution. Coactors declare data dependencies explicitly during execution, whereas invocation dependencies are inferred and inserted into the token stream by the framework automatically. The figure below illustrates two data dependencies graphically.
>
>
As illustrated above, data and invocation dependencies are represented in the trace as special XML elements describing the provenance of other elements. Insertion and deletion elements record the actor, actor invocation count, and direct data dependencies associated with the event that created or removed the element following it in the document. InvocationDependency elements record which invocations of preceding actors created data or modified collections used in the current actor invocation. Insertion, deletion, and invocation dependency information is passed through the workflow as special tokens during workflow execution. Coactors declare data dependencies explicitly during execution, whereas invocation dependencies are inferred and inserted into the token stream by the framework automatically. The figure below illustrates two data dependencies graphically.

Changed:
<
<
>
>

Added:
>
>
From collection-oriented execution traces we can construct data-lineage graphs. Vertices in a data-lineage graph represent input, output, and intermediate data and collection items. Edges denote item dependencies, which are further labeled with the actor invocations involved in the creation or modification of the item. In general, collection-oriented traces, and their corresponding data-lineage graphs, enable a wide range of queries over both process and data dependencies.

Provenance Queries

Added:
>
>
We have implemented a prototype system for querying collection-oriented execution traces. The system is written in Prolog and can manage and query multiple execution traces. The system provides a number of primitive operations for accessing and querying execution traces, some of which are demonstrated below.

Changed:
<
<
We have implemented a query engine prototype for collection-oriented traces in Prolog. The engine can accept multiple traces (traces can be "added" and "dropped" from the engine), and provides various operations for accessing traces and inferred provenance dependency graphs.
>
>
Core Provenance Queries

Added:
>
>
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed, etc. To answer this query, we return the subset of edges of the data-lineage graph that correspond to paths beginning with the desired Atlas X Graphic data node. Given the Atlas X Graphic data node (AtlasXGraphic) and the trace (Trace), the following expression gives the corresponding edges of the data-lineage graph.

Changed:
<
<
Core Provenance Queries
>
>
   lineageEdges(Trace, [AtlasXGraphic], Edges)

Added:
>
>
The lineageEdges predicate, a primitive query operator provided by our system, computes the set of edges that define paths starting from each of the given set of nodes. The following query (1) obtains the first trace (with the trace id '1'), (2) obtains the desired Atlas X Graphic output node of the trace (with the node id of '341'), (3) computes the corresponding portion of the data-lineage graph, and (4) draws the resulting graph edges.

Changed:
<
<
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
>
>
   ?- traceId('1', Trace), 
      nodeForId(Trace, '341', Node),
      lineageEdges(Trace, [Node], Edges),
      drawTraceEdges(Edges, 'pq1', gif).

Changed:
<
<
For this question, we return the portion of the provenance lineage graph that "ends" at the Atlas X Graphic node of the trace. Given the Atlas X Graphic node Node and the trace Trace, the following expression returns the corresponding portion of the graph:
>
>
Each of the predicates above is implemented as a primitive operation within the provenance query engine. The graph that results from running this query is shown below.
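For readers unfamiliar with the lineageEdges operation, its described behavior (collecting every edge on a dependency path that starts at one of the given nodes) amounts to a graph reachability computation. The Python sketch below is an assumption about that behavior, not the Prolog engine's code; the `lineage_edges` function and the toy node ids are hypothetical.

```python
def lineage_edges(edges, start_nodes):
    """Return every edge lying on a dependency path that begins at a start node.

    Each edge is a tuple (item, dependency, actor, invocation).
    """
    by_item = {}
    for edge in edges:
        by_item.setdefault(edge[0], []).append(edge)

    result, seen, todo = set(), set(), list(start_nodes)
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in by_item.get(node, []):
            result.add(edge)
            todo.append(edge[1])  # continue from the item this one depends on
    return result


# Toy data-lineage graph (hypothetical ids), including one unrelated output.
edges = {
    ("atlas_x", "slice_x", "Convert", 1),
    ("slice_x", "average", "Slicer", 1),
    ("average", "resliced1", "SoftMean", 1),
    ("average", "resliced2", "SoftMean", 1),
    ("resliced1", "warp1", "ResliceWarp", 1),
    ("warp1", "img1", "AlignWarp", 1),
    ("atlas_y", "slice_y", "Convert", 2),
}
result = lineage_edges(edges, ["atlas_x"])
```

Passing several start nodes, as the query over the second trace does, simply yields the union of the individual lineages.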

The following query computes and draws the corresponding data-lineage graph for the second trace, described above. Since there are three separate image collections used in the execution (i.e., the three collections are pipelined through the workflow), the result consists of three independent Atlas X Graphic objects.



Changed:
<
<
lineage_edges(Trace, [Node], Edges)
>
>
   ?- traceId('2', Trace), 
      nodeForId(Trace, '973', Node1),
      nodeForId(Trace, '1093', Node2),
      nodeForId(Trace, '1193', Node3),
      lineageEdges(Trace, [Node3, Node2, Node1], Edges),
      drawTraceEdges(Edges, 'pq1_trace2', gif).

Changed:
<
<
where lineage_edges is a provided operation by the provenance query engine that computes the set of edges from the given set of nodes.
>
>
The graph that results from running this query is shown below. Note that in this example, only a subset of the input images is used to derive the corresponding output graphics.


Changed:
<
<
The following selects the first trace and the Atlas X Graphic node of the trace, computes the corresponding provenance graph, and draws the edges of the graph.
>
>
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean. This query is similar to query 1 above, but we filter out edges of the data-lineage graph that correspond to invocations occurring prior to SoftMean computations. The following query computes and draws the corresponding data-lineage graph for the first trace.


Changed:
<
<
q1 :- trace_id('1', Trace), node_for_id(Trace, '341', Node), lineage_edges(Trace, [Node], Edges), draw_trace_edges(Edges, 'pq1', gif).
>
>
   ?- traceId('1', Trace), 
      nodeForId(Trace, '341', Node),
      lineageEdges(Trace, [Node], Edges),
      filterBeforeActor(Trace, Edges, 'SoftMean', FilteredEdges),
      drawTraceEdges(FilteredEdges, 'pq2', gif).

Changed:
<
<
The resulting graph is shown below.
>
>
The filtering step is performed using the filterBeforeActor operation provided by the query engine. The graph that results from running this query is shown below.

Changed:
<
<
>
>

Changed:
<
<
The second trace we experimented with consists of three atlas graphic sets. The following query computes and draws the corresponding provenance graph for the second trace.
>
>
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic. Note that the result of this query is identical to query 2. Here we show an alternative method for computing the result. Instead of filtering out data-lineage edges that correspond to invocations prior to SoftMean, we select those edges denoting invocations after ResliceWarp.


Changed:
<
<
pq1_trace2 :- trace_id('2', Trace), node_for_id(Trace, '973', Node1), node_for_id(Trace, '1093', Node2), node_for_id(Trace, '1193', Node3), lineage_edges(Trace, [Node3, Node2, Node1], Edges), draw_trace_edges(Edges, 'pq1_trace2', gif).
>
>
   ?- traceId('1', Trace), 
      nodeForId(Trace, '341', Node),
      lineageEdges(Trace, [Node], Edges),
      selectAfterActor(Trace, Edges, 'ResliceWarp', FilteredEdges),
      drawTraceEdges(FilteredEdges, 'pq3', gif).

Changed:
<
<
The resulting graph is shown below.
>
>
The selection step is performed using the selectAfterActor operation provided by the query engine. The graph that results from running this query is identical to the one for query 2.
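The equivalence of queries 2 and 3 can be seen in a small Python sketch. It assumes that "before" and "after" refer to the position of an actor in the five-stage pipeline; the real filterBeforeActor and selectAfterActor operations may instead rely on dependency order within the trace. The function names and node ids below are hypothetical.

```python
# Assumed stage order of the Challenge workflow actors.
STAGES = ["AlignWarp", "ResliceWarp", "SoftMean", "Slicer", "Convert"]


def filter_before_actor(edges, actor):
    """Drop edges whose invocation belongs to a stage before `actor`."""
    cutoff = STAGES.index(actor)
    return {e for e in edges if STAGES.index(e[2]) >= cutoff}


def select_after_actor(edges, actor):
    """Keep only edges whose invocation belongs to a stage after `actor`."""
    cutoff = STAGES.index(actor)
    return {e for e in edges if STAGES.index(e[2]) > cutoff}


# Edges are (item, dependency, actor, invocation) tuples with hypothetical ids.
edges = {
    ("warp1", "img1", "AlignWarp", 1),
    ("resliced1", "warp1", "ResliceWarp", 1),
    ("average", "resliced1", "SoftMean", 1),
    ("slice_x", "average", "Slicer", 1),
    ("atlas_x", "slice_x", "Convert", 1),
}
query2 = filter_before_actor(edges, "SoftMean")
query3 = select_after_actor(edges, "ResliceWarp")
```

Because ResliceWarp immediately precedes SoftMean in this ordering, both calls keep exactly the SoftMean, Slicer, and Convert edges.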

Changed:
<
<
>
>
4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday. The following query returns the set of invocations of AlignWarp having the given parameter.

Changed:
<
<

>
>
   ?- traceId(TraceId, Trace), 
      traceInvocParam(Trace, 'warpParameters', '-m 12', 'AlignWarp', Invoc).

Changed:
<
<
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
>
>
The query uses the traceInvocParam primitive operation. This operation uses embedded parameter tokens in the trace (i.e., input stream) to reconstruct the parameters applied to particular actor invocations. Our prototype does not currently assign metadata (such as the day on which a run took place) to traces, so the query does not apply the "ran on a Monday" condition; such metadata would be simple to add. The result of running this query on our two traces is:

Changed:
<
<
For this question, we again return a portion of the provenance lineage graph. Here, we filter out the edges that occur prior to the SoftMean computation using the filter_before_actor operation.
>
>
   TRACE = 1   ACTOR = AlignWarp   INVOCATION = 1
   TRACE = 1   ACTOR = AlignWarp   INVOCATION = 2
   TRACE = 1   ACTOR = AlignWarp   INVOCATION = 3
   TRACE = 1   ACTOR = AlignWarp   INVOCATION = 4
   TRACE = 2   ACTOR = AlignWarp   INVOCATION = 5
   TRACE = 2   ACTOR = AlignWarp   INVOCATION = 6
   TRACE = 2   ACTOR = AlignWarp   INVOCATION = 7
   TRACE = 2   ACTOR = AlignWarp   INVOCATION = 8

Note that only two of the image collections of the second trace use the given parameter. In addition, the parameter is used for only two of the anatomy images in one of the input image collections.
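If traces did carry run metadata, the "ran on a Monday" condition of query 4 could be applied as in the Python sketch below. Everything here is hypothetical: the run dates, the invocation records, and the `monday_invocations` function are invented for illustration, since the prototype described above stores no such metadata.

```python
from datetime import date


def monday_invocations(invocations, run_dates, actor, param, value):
    """Invocations of `actor` with `param == value` in traces that ran on a Monday."""
    return [
        (inv["trace"], inv["invocation"])
        for inv in invocations
        if inv["actor"] == actor
        and inv["params"].get(param) == value
        and run_dates[inv["trace"]].weekday() == 0  # weekday() returns 0 for Monday
    ]


# Hypothetical per-trace metadata: the date on which each run was executed.
run_dates = {"1": date(2006, 9, 11), "2": date(2006, 9, 13)}

# Hypothetical invocation records with their reconstructed parameters.
invocations = [
    {"trace": "1", "actor": "AlignWarp", "invocation": 1,
     "params": {"warpParameters": "-m 12"}},
    {"trace": "1", "actor": "AlignWarp", "invocation": 2,
     "params": {"warpParameters": "-m 6"}},
    {"trace": "2", "actor": "AlignWarp", "invocation": 1,
     "params": {"warpParameters": "-m 12"}},
]
matches = monday_invocations(invocations, run_dates,
                             "AlignWarp", "warpParameters", "-m 12")
```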

5. Find all Atlas Graphic images output from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility. The following query (1) selects input nodes of the trace, (2) checks that the input node is of type ImageHeader, (3) checks that the header meets the given criteria, and (4) obtains the output nodes of type AtlasGraphic of the trace.



Changed:
<
<
pq2 :- trace_id('1', Trace), node_for_id(Trace, '341', Node), lineage_edges(Trace, [Node], Edges), filter_before_actor(Trace, Edges, 'SoftMean', FilteredEdges?), draw_trace_edges(FilteredEdges?, 'pq2', gif).
>
>
   ?- traceId(TraceId, Trace), 
      traceInputNode(Trace, X), 
      nodeType(X, 'ImageHeader'), 
      headerQuery(X), 
      traceOutputNode(Trace, AtlasGraphic), 
      nodeType(AtlasGraphic, 'AtlasGraphic').

Changed:
<
<
The result of this query is shown below.
>
>
The traceInputNode, nodeType, and traceOutputNode predicates are primitives of the query engine. Here, we assume that headerQuery is a user-supplied predicate that applies the global maximum check. Note that we did not add the capability of calling external applications to our current prototype. We envision the ability to call such external functions as part of a broader data management facility (e.g., within Kepler), as opposed to a provenance task. In our prototype, we wrote headerQuery to succeed for one header from each trace. The result of running this query on both traces is:

Changed:
<
<
>
>
   TRACE = 1   TYPE = AtlasGraphic   TOKEN = 341   OBJECT = 68
   TRACE = 1   TYPE = AtlasGraphic   TOKEN = 349   OBJECT = 70
   TRACE = 1   TYPE = AtlasGraphic   TOKEN = 357   OBJECT = 72
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1093   OBJECT = 225
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1101   OBJECT = 227
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1109   OBJECT = 229
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1193   OBJECT = 242
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1202   OBJECT = 244
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 1210   OBJECT = 246
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 973   OBJECT = 199
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 981   OBJECT = 201
   TRACE = 2   TYPE = AtlasGraphic   TOKEN = 989   OBJECT = 203

Changed:
<
<

>
>
We note that the particular wording of this query assumes that all output graphics depend on all input images and headers. For our second trace, one can easily verify that this assumption is incorrect. Alternatively, it is possible to rewrite this query using primitive operations of the query engine so that only proper derivations are returned.
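One way to restrict the answer to proper derivations is to test each output's own lineage for a matching header, rather than testing the trace as a whole. The Python sketch below illustrates the idea on toy data; the dependency map, the header values, and the function names are hypothetical and do not reproduce the query engine's primitives.

```python
def ancestors(deps, node):
    """All items that `node` was derived from, directly or indirectly."""
    found = set()
    todo = [node]
    while todo:
        for dep in deps.get(todo.pop(), ()):
            if dep not in found:
                found.add(dep)
                todo.append(dep)
    return found


def outputs_derived_from(deps, outputs, matching_inputs):
    """Outputs whose lineage contains at least one of the matching inputs."""
    return [out for out in outputs if ancestors(deps, out) & matching_inputs]


# Toy trace with two independent image collections (hypothetical ids and values).
deps = {
    "atlas_a": ["avg_a"], "avg_a": ["img1", "hdr1"],
    "atlas_b": ["avg_b"], "avg_b": ["img2", "hdr2"],
}
header_global_max = {"hdr1": 4095, "hdr2": 1024}

matching = {hdr for hdr, gmax in header_global_max.items() if gmax == 4095}
result = outputs_derived_from(deps, ["atlas_a", "atlas_b"], matching)
```

Only the graphic that actually derives from the matching header is returned, which is the behavior the original wording of the query fails to guarantee for the second trace.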

Changed:
<
<
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
>
>
6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12." The following query (1) obtains output averaged images of SoftMean invocations, (2) obtains the set of lineage edges leading from the averaged images, and (3) ensures that at least one edge corresponds to an AlignWarp invocation with the appropriate parameter model.

Changed:
<
<
Note that this question is identical to question 2. However, here we provide a different method for computing the result. In particular, instead of filtering out edges prior to SoftMean, we select only edges that occur after to the ResliceWarp computation using the select_after_actor operation.
>
>
   ?- traceId(TraceId, Trace), 
      actorInvocation(Trace, 'SoftMean', _, _, AveragedImage), 
      nodeType(AveragedImage, 'Image'),
      lineageEdges(Trace, [AveragedImage], Edges), 
      member((_, _, 'AlignWarp', Invoc), Edges), 
      traceInvocParam(Trace, 'warpParameters', '-m 12', 'AlignWarp', Invoc).

Added:
>
>
This query uses the actorInvocation operation, which returns input nodes, output nodes, and invocation counts for a given actor within a trace. The result of running this query over both traces is:


Changed:
<
<
pq3 :- trace_id('1', Trace), node_for_id(Trace, '341', Node), % AtlasXGraphic? lineage_edges(Trace, [Node], Edges), select_after_actor(Trace, Edges, 'ResliceWarp', FilteredEdges?), draw_trace_edges(FilteredEdges?, 'pq3', gif).
>
>
   TRACE = 1   TYPE = Image   TOKEN = 311   OBJECT = 65
   TRACE = 2   TYPE = Image   TOKEN = 1065   OBJECT = 222
   TRACE = 2   TYPE = Image   TOKEN = 1165   OBJECT = 239

Deleted:
<
<
The result of the query is identical to query 2 above.

Changed:
<
<

>
>
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant. We left this query open due to the ambiguity of what a proper result should be. It is not clear, at least for this example, what type of result would be useful for a user (i.e., a scientist). For example, to gauge the difference between workflow executions, one may simply want to perform a "diff" on run outputs (which is partially supported in our current prototype for underlying data objects). Alternatively, we can imagine that users may want to compare particular data derivation paths across the runs (which is again possible at the object-level within our system), or compare different actor invocation patterns.
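As one possible reading of "compare different actor invocation patterns", the Python sketch below counts invocations per actor in each run and reports what is unique to each side. This is an assumption about what such a comparison could look like, not a feature of the prototype; the `invocation_diff` function and the per-run invocation lists are hypothetical.

```python
from collections import Counter


def invocation_diff(run_a, run_b):
    """Return (only_in_a, only_in_b): per-actor invocation counts unique to each run."""
    a, b = Counter(run_a), Counter(run_b)
    # Counter subtraction keeps only the positive differences.
    return dict(a - b), dict(b - a)


# Hypothetical invocation lists: one actor name per invocation.
common = ["AlignWarp"] * 4 + ["ResliceWarp"] * 4 + ["SoftMean"] + ["Slicer"] * 3
run1 = common + ["Convert"] * 3
run2 = common + ["pgmtoppm"] * 3 + ["pnmtojpeg"] * 3

only_run1, only_run2 = invocation_diff(run1, run2)
```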

Added:
>
>
8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago. The following query leverages collection-oriented metadata to select the anatomy images with the given key-value pair. The query (1) obtains invocations of AlignWarp, (2) obtains input image nodes and corresponding output nodes of AlignWarp invocations, and (3) checks that the given input image has the appropriate metadata.

Changed:
<
<
For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.
>
>
   ?- traceId(TraceId, Trace), 
      traceInvoc(Trace, 'AlignWarp', Invoc),
      actorInvocation(Trace, 'AlignWarp', Invoc, InputNode, OutputNode), 
      nodeType(InputNode, 'Image'), 
      nodeMetadata(Trace, 'center', 'UChicago', InputNode).

Changed:
<
<
Also, make sure you complete the ProvenanceQueriesMatrix.
>
>
This query uses the nodeMetadata primitive operation to check that the given node has the correct key-value pair metadata. The result of running this query on the two traces is:

Added:
>
>
   TRACE = 1   ACTOR = AlignWarp   INVOCATION = 1   TYPE = WarpParamSet   TOKEN = 245   OBJECT = 53
   TRACE = 2   ACTOR = AlignWarp   INVOCATION = 1   TYPE = WarpParamSet   TOKEN = 851   OBJECT = 176

Changed:
<
<

Suggested Workflow Variants

>
>
9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files. For this query, we assume that the user-annotation is given as collection-oriented metadata within the trace (i.e., the metadata is available in the trace as opposed to only being available via a data management subsystem). We also assume that a Graphic Atlas "set" consists of all Atlas Graphics derived from an invocation of Softmean. Thus, the Atlas X, Y, and Z Graphics generated from Softmean correspond to a single set, and there are three such sets generated by our second example trace.

The following query computes the sets of Atlas Graphics, where at least one Atlas Graphic has the desired metadata annotation.

   ?- traceId(TraceId, Trace), 
      traceInvoc(Trace, 'SoftMean', Invoc),
      graphicAtlasSet(Trace, Invoc, GraphicSet). 

The graphicAtlasSet is defined specifically for this query as follows:

   graphicAtlasSet(Trace, Invoc, GraphicSet) :-
      setof(G, graphicAtlas(Trace, G, Invoc), GraphicSet),
      member(Graphic, GraphicSet), 
      member(Modality, ['speech', 'visual', 'audio']), 
      nodeMetadata(Trace, 'studyModality', Modality, Graphic).

   graphicAtlas(Trace, AtlasGraphic, SoftMeanInvoc) :-
      traceOutputNode(Trace, AtlasGraphic), 
      nodeType(AtlasGraphic, 'AtlasGraphic'), 
      lineageEdges(Trace, [AtlasGraphic], Edges), 
      member((_, _, 'SoftMean', SoftMeanInvoc), Edges).

The complexity of this query is due to the generation of sets of Atlas Graphics, which is similar to performing a group-by operation in SQL, and then further filtering groups by corresponding metadata values. Note that above we do not further return the given metadata values of the graphics (although one easily could). The result of running the above query on the two traces is:

   TRACE = 1   TOKEN SET = {341, 349, 357}
   TRACE = 2   TOKEN SET = {1093, 1101, 1109}
   TRACE = 2   TOKEN SET = {1193, 1202, 1210}

Changed:
<
<
Suggest variants of the workflow that can exhibit capabilities that your system support.
>
>
Note that as shown above, only two of the input image collections for the second trace have matching Atlas Graphic sets.
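
The group-then-filter shape of this query can also be sketched outside Prolog. The following Python is only an illustration of the idea, not the query engine's implementation: the record layout, the function name atlas_sets_with_modality, and all token and annotation values are hypothetical and do not come from the actual traces.

```python
from collections import defaultdict

WANTED = {"speech", "visual", "audio"}

def atlas_sets_with_modality(graphics, wanted=WANTED):
    """Group graphics by (trace, SoftMean invocation), then keep only the
    groups in which at least one member has a wanted studyModality value."""
    groups = defaultdict(list)
    for g in graphics:
        groups[(g["trace"], g["softmean_invoc"])].append(g)
    return {
        key: sorted(m["token"] for m in members)
        for key, members in groups.items()
        if any(m.get("studyModality") in wanted for m in members)
    }

# Hypothetical records: one per Atlas Graphic, annotation optional.
graphics = [
    {"trace": 1, "softmean_invoc": 1, "token": 10, "studyModality": "speech"},
    {"trace": 1, "softmean_invoc": 1, "token": 11},
    {"trace": 1, "softmean_invoc": 1, "token": 12},
    {"trace": 2, "softmean_invoc": 1, "token": 20, "studyModality": "mri"},
    {"trace": 2, "softmean_invoc": 1, "token": 21},
    {"trace": 2, "softmean_invoc": 2, "token": 30, "studyModality": "audio"},
    {"trace": 2, "softmean_invoc": 2, "token": 31},
]
matching = atlas_sets_with_modality(graphics)
```

As in the Prolog version, a whole set is returned as soon as one of its members matches, and sets with no matching annotation are dropped.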

Suggested Queries

Changed:
<
<
Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.
>
>
One of the benefits of our approach (shared with the RWS approach) is its ability to support various data-lineage queries (as opposed to process-oriented queries). The following two examples demonstrate more "scientist-oriented" queries over data lineage.

Added:
>
>
10. Find all of the intermediate (not input or output) Images used to derive the Atlas X Graphic. (A variant is to find the "closest" Image on the derivation path from the given output.) The following query (1) obtains one of the Atlas X Graphic outputs for the second trace, (2) obtains the lineage edges from the Atlas X Graphic, (3) selects an Image that was used to derive the Atlas X Graphic, and (4) checks that the Image was not an input to the workflow.

   ?- traceId('2', Trace), 
      nodeForId(Trace, '1093', Node), 
      lineageEdges(Trace, [Node], Edges), 
      member((_, DepNode, _, _), Edges), 
      nodeType(DepNode, 'Image'), 
      setof(N, traceInputNode(Trace, N), InputNodes), 
      \+ member(DepNode, InputNodes).

The result of running this query is:

   TYPE = Image   TOKEN = 1001   OBJECT = 205
   TYPE = Image   TOKEN = 1018   OBJECT = 208
   TYPE = Image   TOKEN = 1065   OBJECT = 222
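
The selection step of this query amounts to a filter over the dependency column of the lineage edges. A minimal Python sketch follows, assuming edges are (node, dependency, actor, invocation) tuples as in the member/2 patterns above; the node names, the type table, and the function name intermediate_images are hypothetical.

```python
def intermediate_images(edges, node_type, input_nodes):
    """Return dependency nodes of type 'Image' that are not workflow inputs.

    edges: iterable of (node, dep_node, actor, invocation) tuples.
    node_type: dict mapping node id to its type name.
    input_nodes: set of node ids that were inputs to the workflow.
    """
    found = {
        dep
        for (_node, dep, _actor, _invocation) in edges
        if node_type.get(dep) == "Image" and dep not in input_nodes
    }
    return sorted(found)

# Hypothetical lineage of one Atlas X Graphic.
edges = [
    ("atlas_x", "slice_x", "Convert", 1),
    ("slice_x", "atlas_img", "Slicer", 1),
    ("atlas_img", "resliced_1", "SoftMean", 1),
    ("resliced_1", "anatomy_1", "ResliceWarp", 1),
    ("resliced_1", "warp_1", "ResliceWarp", 1),
    ("warp_1", "anatomy_1", "AlignWarp", 1),
]
node_type = {
    "atlas_x": "AtlasGraphic", "slice_x": "AtlasSlice", "atlas_img": "Image",
    "resliced_1": "Image", "anatomy_1": "Image", "warp_1": "WarpParamSet",
}
input_nodes = {"anatomy_1"}
result = intermediate_images(edges, node_type, input_nodes)
```

Replacing `not in` with `in` turns the same filter into the input-image variant used in query 11.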

Deleted:
<
<

Categorisation of queries


Changed:
<
<
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.
>
>
11. Find all of the input Images used to derive the Atlas X Graphic. Note that this query is of particular importance for the second trace, where not all input images were used to derive output graphics. The following query (1) obtains an Atlas X Graphic for the second trace, (2) obtains the lineage edges from the Atlas X Graphic, (3) selects an Image that was used to derive the Atlas X Graphic, and (4) checks that the Image was an input to the workflow.

Added:
>
>
   ?- traceId('2', Trace), 
      nodeForId(Trace, '1093', Node), 
      lineageEdges(Trace, [Node], Edges), 
      member((_, DepNode, _, _), Edges), 
      nodeType(DepNode, 'Image'), 
      setof(D, traceInputNode(Trace, D), InputNodes), 
      member(DepNode, InputNodes).

Changed:
<
<

Conclusions

>
>
The result of running this query is:

Changed:
<
<
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
>
>
   TYPE = Image   TOKEN = 927   OBJECT = 190
   TYPE = Image   TOKEN = 932   OBJECT = 192
   TYPE = Image   TOKEN = 937   OBJECT = 194
   TYPE = Image   TOKEN = 942   OBJECT = 196

We note that a number of similar types of data-lineage queries are defined in (Bowers et al, 2006).

Suggested Workflow Variants


Changed:
<
<
-- LucMoreau - 31 May 2006
>
>
Our approach can support a variety of workflow constructs, including pipelining and partial data dependencies (e.g., as illustrated in the second example trace), as well as concurrent actor execution and cyclic workflow graphs (looping and iteration). We have found that workflows in bioinformatics typically exhibit some or all of these features, which we would like to see in future scientific-workflow provenance challenges.

 <<O>>  Difference Topic DAKS (r1.2 - 13 Sep 2006 - TimothyMcPhillips)

META TOPICPARENT WebHome
Changed:
<
<

Provenance Challenge Template

>
>

Data and Knowledge Systems



Participating Team

Changed:
<
<
  • Short team name: DAKS

  • Participant names: Shawn Bowers, Tim McPhillips, Bertram Ludaescher in collaboration with Norbert Podhorszki and Ilkay Altintas.
>
>
  • Short team name: DAKS

Changed:
<
<
  • Project URL:
>
>
  • Participant names: Shawn Bowers, Tim McPhillips, Bertram Ludaescher in collaboration with Norbert Podhorszki and Ilkay Altintas.

Changed:
<
<
  • Project Overview: The Data and Knowledge Systems (DAKS) group at UC Davis is developing the Collection-Oriented Workflow paradigm and implementing this approach in the Kepler Workflow System. Our framework provides support for workflows that operate over nested collections (trees) of data. Actors in a collection-oriented workflow receive input trees
>
>
  • Project Overview: The Data and Knowledge Systems (DAKS) group at UC Davis is developing the Collection-Oriented Workflow paradigm and implementing this approach in the Kepler Workflow System. See McPhillips & Bowers (2005) and McPhillips et al (2006) listed below for more information.

Changed:
<
<
See publications (McPhillips & Bowers, 2005) and (McPhillips et al, 2006) listed below for more information.
>
>
  • Provenance-specific Overview: Among other benefits, collection-oriented workflows enable comprehensive data and process lineage information to be recorded and passed through the workflow along with data. We demonstrate this capability in this challenge. Our approach is an adaptation of the RWS provenance model described in Bowers et al (2006) listed below. Our approach takes advantage of the collection-oriented workflow framework to:
    • Automatically infer state-reset events based on the declared scope of actors.
    • Minimize the number of provenance-relevant events that must be recorded.
    • Simplify association of workflow runs with data provenance by storing workflow inputs, outputs, and dependency information in a single, self-contained trace file.
    • Support science-oriented provenance queries, emphasizing data dependencies (lineage) as well as process details.
    • Decouple provenance representation from particular scientific workflow technologies (e.g., Kepler).

Changed:
<
<
  • Provenance-specific Overview: Among other benefits, collection-oriented workflows enable comprehensive data and process lineage information to be recorded and passed through the workflow along with data. We demonstrate this capability in this challenge. Our approach is an adaptation of the RWS provenance model described in the publication (Bowers et al, 2006) listed below.

  • Relevant Publications:
>
>
  • Relevant Publications:

Changed:
<
<
>
>

Workflow Representation

Changed:
<
<
Provide here a description of how you have encoded the Challenge workflow.
>
>
Kepler implementation of the Challenge Workflow. We implemented the Challenge workflow in Kepler as shown below. Actors labeled AlignWarp, ResliceWarp, SoftMean, Slicer, and Convert correspond to the five stages of the Challenge workflow. The actors labeled CollectionReader and CollectionWriter import data into the workflow and save the output/trace of the workflow, respectively. The actor ReplicationCollection creates two additional copies of the products of SoftMean so that downstream actors will execute three times, once for each desired slice of the average image.

How the collection-oriented actors (coactors) work. Our collection-oriented workflow framework provides generic support for operating over nested collections (i.e., trees) of scientific data. Coactors differ from conventional Kepler actors (such as those used in the RWS solution to this Provenance Challenge) in that rather than operating on flat, homogeneous streams of tokens, coactors operate on trees of heterogeneous data. A coactor is invoked whenever a subtree of the input stream matching certain criteria (e.g., the declared scope of the coactor) is received. During an invocation, the coactor may optionally add or delete nodes within the subtree upon which it was invoked. The figure below illustrates how the AlignWarp actor operates on an AnatomyImage collection, adding a WarpParamSet to this collection. All data received by AlignWarp outside of its scope passes through the coactor transparently.
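
The scope-based invocation described above can be sketched in a few lines of Python. This is an illustrative model only, not Kepler code: the dict-based tree representation, the helper names node and run_coactor, and the ReferenceImage collection are assumptions made for the example.

```python
def node(kind, *children):
    """Build a tree node with a type name and a list of child nodes."""
    return {"type": kind, "children": list(children)}

def run_coactor(tree, scope, action):
    """Invoke `action` on every subtree whose type equals `scope`;
    everything outside the scope passes through unchanged."""
    if tree["type"] == scope:
        return action(tree)
    return node(tree["type"],
                *(run_coactor(c, scope, action) for c in tree["children"]))

def align_warp(collection):
    """Model of AlignWarp: add a WarpParamSet to the collection in scope."""
    return node(collection["type"], *collection["children"], node("WarpParamSet"))

tree = node("ImageCollection",
            node("AnatomyImage", node("Image"), node("Header")),
            node("ReferenceImage", node("Image"), node("Header")))
out = run_coactor(tree, "AnatomyImage", align_warp)
```

Here only the AnatomyImage collection gains a WarpParamSet child; the ReferenceImage collection is copied through untouched, and the input tree itself is not modified.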

In Kepler, collections are serialized and streamed through coactors. Because actor execution is pipelined based on each actor’s scope, this approach enables concurrent processing of nested data collections as shown below. The figure illustrates how delimiter tokens (in blue and green) are used to bracket nested collections of associated data (in white), metadata (in red) and actor parameters (not shown).
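
To make the delimiter idea concrete, the following Python sketch flattens a nested collection into a token stream bracketed by opening and closing delimiters. The token shapes ("open", "close", "data") and the dict layout are assumptions for illustration; they are not the token classes used by the Kepler implementation.

```python
def serialize(tree):
    """Yield a flat token stream for a nested collection.

    A collection is a dict {"type": name, "items": [...]}; anything else
    is treated as a leaf data token.
    """
    if isinstance(tree, dict):
        yield ("open", tree["type"])       # opening delimiter
        for item in tree["items"]:
            yield from serialize(item)
        yield ("close", tree["type"])      # closing delimiter
    else:
        yield ("data", tree)

def max_depth(tokens):
    """Recover the nesting depth of a stream from its delimiters alone."""
    depth = deepest = 0
    for kind, _value in tokens:
        if kind == "open":
            depth += 1
            deepest = max(deepest, depth)
        elif kind == "close":
            depth -= 1
    return deepest

tree = {"type": "ImageCollection", "items": [
    {"type": "AnatomyImage", "items": ["image1", "header1"]},
    {"type": "AnatomyImage", "items": ["image2", "header2"]},
]}
tokens = list(serialize(tree))
```

Because serialize is a generator, a downstream consumer can start working on the first AnatomyImage collection before the second one has been emitted, which is the property that makes pipelined execution possible.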

Input collections drive workflow execution. The collection-oriented implementation of the Challenge workflow may be configured to operate on different numbers of input anatomy images, not by modifying the workflow definition, but by customizing the input to the workflow. We tested our provenance system using two different input data sets represented by two XML files. The first input file, input1.xml, corresponds exactly to the Challenge workflow and contains four AnatomyImage collections within a single ImageCollection collection (see tree representation below). The second input file, input2.xml, contains three ImageCollections comprising four, three, and two AnatomyImage collections respectively. In other words, our implementation can operate on varying numbers of anatomy images within a single run of the workflow. Moreover, parameter values for particular actors also may be embedded within the workflow input in order to override default parameter values for particular sub-collections of data (note Parameter elements in the two input XML files).


Provenance Trace

Changed:
<
<
Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)
>
>
The results of each run of the Challenge workflow (including input data, intermediate and final data products, as well as provenance) were recorded in a trace file by the CollectionWriter actor and may be downloaded here: trace1.xml, trace2.xml. Trace files are implemented in XML using the same schema used for workflow input files read by CollectionReader. (An execution of a collection-oriented workflow may be thought of as a process of incrementally elaborating the input XML document.) The figure below shows the beginning of such a trace, highlighting how little additional information must be added to the trace file in order to record data lineage and invocation dependencies.

As illustrated above, data and invocation dependencies are represented in the trace as special XML elements describing the provenance of other elements. Insertion elements record the actor, actor invocation count, and direct data dependencies associated with the event that created the element following it in the document. InvocationDependency elements record which invocations of preceding actors created data or modified collections used in the current actor invocation. Insertion and invocation dependency information is passed through the workflow as special tokens during workflow execution. Coactors declare data dependencies explicitly during execution, whereas invocation dependencies are inferred and inserted into the token stream by the framework automatically. The figure below illustrates two data dependencies graphically.
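
The "Insertion describes the element that follows it" convention can be illustrated with a small Python sketch. The XML fragment below is invented for the example: apart from the element name Insertion and the collection type names, every attribute name (id, actor, invocation, dependsOn) is an assumption and should not be read as the actual trace schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical trace fragment; attribute names are assumptions.
TRACE = """
<ImageCollection id="0">
  <AnatomyImage id="1">
    <Image id="2"/>
    <Insertion actor="AlignWarp" invocation="1" dependsOn="2"/>
    <WarpParamSet id="3"/>
  </AnatomyImage>
</ImageCollection>
"""

def direct_dependencies(xml_text):
    """Return (created_id, dependency_id, actor, invocation) tuples, pairing
    each Insertion element with the sibling element that follows it."""
    root = ET.fromstring(xml_text.strip())
    deps = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children[:-1]):
            if child.tag != "Insertion":
                continue
            created = children[i + 1]
            for dep_id in child.get("dependsOn", "").split():
                deps.append((created.get("id"), dep_id,
                             child.get("actor"), int(child.get("invocation"))))
    return deps

deps = direct_dependencies(TRACE)
```

Each tuple has the same (node, dependency, actor, invocation) shape as the lineage edges used by the queries, which is why so little extra markup is needed to reconstruct the dependency graph.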


Provenance Queries

 <<O>>  Difference Topic DAKS (r1.1 - 12 Sep 2006 - ShawnBowers)
Line: 1 to 1
Added:
>
>
META TOPICPARENT WebHome

Provenance Challenge Template


Participating Team

  • Short team name: DAKS

  • Participant names: Shawn Bowers, Tim McPhillips, Bertram Ludaescher in collaboration with Norbert Podhorszki and Ilkay Altintas.

  • Project URL:

  • Project Overview: The Data and Knowledge Systems (DAKS) group at UC Davis is developing the Collection-Oriented Workflow paradigm and implementing this approach in the Kepler Workflow System. Our framework provides support for workflows that operate over nested collections (trees) of data. Actors in a collection-oriented workflow receive input trees

See publications (McPhillips & Bowers, 2005) and (McPhillips et al, 2006) listed below for more information.

  • Provenance-specific Overview: Among other benefits, collection-oriented workflows enable comprehensive data and process lineage information to be recorded and passed through the workflow along with data. We demonstrate this capability in this challenge. Our approach is an adaptation of the RWS provenance model described in the publication (Bowers et al, 2006) listed below.

Workflow Representation

Provide here a description of how you have encoded the Challenge workflow.

Provenance Trace

Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)

Provenance Queries

We have implemented a query engine prototype for collection-oriented traces in Prolog. The engine can accept multiple traces (traces can be "added" and "dropped" from the engine), and provides various operations for accessing traces and inferred provenance dependency graphs.

Core Provenance Queries

1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

For this question, we return the portion of the provenance lineage graph that "ends" at the Atlas X Graphic node of the trace. Given the Atlas X Graphic node Node and the trace Trace, the following expression returns the corresponding portion of the graph:

   lineage_edges(Trace, [Node], Edges)

where lineage_edges is an operation provided by the provenance query engine that computes the set of lineage edges reachable from the given set of nodes.
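
Conceptually, this operation is a backward reachability traversal over the direct dependency edges. The Python sketch below shows one way to compute it under stated assumptions: edges are (node, dependency, actor, invocation) tuples, the Trace argument is replaced by an explicit edge list, and all node names are hypothetical. It is not the engine's Prolog implementation.

```python
def lineage_edges(direct_edges, start_nodes):
    """Collect every dependency edge reachable backwards from start_nodes.

    direct_edges: iterable of (node, dep_node, actor, invocation) tuples.
    """
    by_node = {}
    for edge in direct_edges:
        by_node.setdefault(edge[0], []).append(edge)

    collected, visited, stack = set(), set(), list(start_nodes)
    while stack:
        current = stack.pop()
        if current in visited:
            continue                      # guards against shared or cyclic paths
        visited.add(current)
        for edge in by_node.get(current, []):
            collected.add(edge)
            stack.append(edge[1])         # continue from the dependency
    return collected

direct = [
    ("atlas_x", "slice_x", "Convert", 1),
    ("slice_x", "atlas_img", "Slicer", 1),
    ("atlas_img", "resliced_1", "SoftMean", 1),
    ("atlas_y", "slice_y", "Convert", 2),   # unrelated to atlas_x
]
edges = lineage_edges(direct, ["atlas_x"])
```

Passing several start nodes returns the union of their lineage graphs.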

The following selects the first trace and the Atlas X Graphic node of the trace, computes the corresponding provenance graph, and draws the edges of the graph.

   pq1 :-
      trace_id('1', Trace), 
      node_for_id(Trace, '341', Node),
      lineage_edges(Trace, [Node], Edges), 
      draw_trace_edges(Edges, 'pq1', gif).

The resulting graph is shown below.

The second trace we experimented with consists of three atlas graphic sets. The following query computes and draws the corresponding provenance graph for the second trace.

   pq1_trace2 :-
      trace_id('2', Trace), 
      node_for_id(Trace, '973', Node1),   
      node_for_id(Trace, '1093', Node2),   
      node_for_id(Trace, '1193', Node3),   
      lineage_edges(Trace, [Node3, Node2, Node1], Edges), 
      draw_trace_edges(Edges, 'pq1_trace2', gif).

The resulting graph is shown below.


2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

For this question, we again return a portion of the provenance lineage graph. Here, we filter out the edges that occur prior to the SoftMean computation using the filter_before_actor operation.

   pq2 :- 
      trace_id('1', Trace), 
      node_for_id(Trace, '341', Node),
      lineage_edges(Trace, [Node], Edges), 
      filter_before_actor(Trace, Edges, 'SoftMean', FilteredEdges), 
      draw_trace_edges(FilteredEdges, 'pq2', gif).

The result of this query is shown below.


3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

Note that this question is equivalent to question 2. However, here we provide a different method for computing the result. In particular, instead of filtering out edges prior to SoftMean, we select only the edges that occur after the ResliceWarp computation, using the select_after_actor operation.

   pq3 :- 
      trace_id('1', Trace), 
      node_for_id(Trace, '341', Node),   % AtlasXGraphic
      lineage_edges(Trace, [Node], Edges), 
      select_after_actor(Trace, Edges, 'ResliceWarp', FilteredEdges), 
      draw_trace_edges(FilteredEdges, 'pq3', gif).

The result of the query is identical to query 2 above.
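
Why the two formulations agree can be seen in a small Python model. It assumes a strictly linear stage order and edges of the form (node, dependency, actor, invocation); the function bodies are a guess at the semantics of filter_before_actor and select_after_actor (without the Trace argument), not their actual definitions, and the node names are hypothetical.

```python
# The five stages of the Challenge workflow, in execution order.
STAGES = ["AlignWarp", "ResliceWarp", "SoftMean", "Slicer", "Convert"]

def filter_before_actor(edges, actor, stages=STAGES):
    """Drop edges created by stages that come before `actor`."""
    cutoff = stages.index(actor)
    return [e for e in edges if stages.index(e[2]) >= cutoff]

def select_after_actor(edges, actor, stages=STAGES):
    """Keep only edges created by stages that come after `actor`."""
    cutoff = stages.index(actor)
    return [e for e in edges if stages.index(e[2]) > cutoff]

edges = [
    ("atlas_x", "slice_x", "Convert", 1),
    ("slice_x", "atlas_img", "Slicer", 1),
    ("atlas_img", "resliced_1", "SoftMean", 1),
    ("resliced_1", "anatomy_1", "ResliceWarp", 1),
    ("resliced_1", "warp_1", "ResliceWarp", 1),
    ("warp_1", "anatomy_1", "AlignWarp", 1),
]
from_softmean = filter_before_actor(edges, "SoftMean")
after_reslice = select_after_actor(edges, "ResliceWarp")
```

Because SoftMean immediately follows ResliceWarp in the stage order, "not before SoftMean" and "after ResliceWarp" select the same edges.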


For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.

Also, make sure you complete the ProvenanceQueriesMatrix.

Suggested Workflow Variants

Suggest variants of the workflow that can exhibit capabilities that your system support.

Suggested Queries

Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.

Categorisation of queries

According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

-- LucMoreau - 31 May 2006

Revision r1.1 - 12 Sep 2006 - 20:43 - ShawnBowers
Revision r1.5 - 14 Sep 2006 - 18:32 - ShawnBowers