<<O>>  Difference Topic UPenn (r1.14 - 20 Oct 2006 - SarahCohenBoulakia)

META TOPICPARENT ParticipatingTeams

Provenance Challenge Template

Participating Team

Changed:
<
<
  • Short team name: Database Group, University of Pennsylvania
>
>
  • Short team name: ZOOM, Database Group, University of Pennsylvania

  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Olivier Biton and Thunyarat (Bam) Amornpetchkul
Changed:
<
<
>
>

  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is both simple and general (i.e. can be used with existing workflow systems, such as Ptolemy/Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, the proposed model not only takes into account the chained and complex structures of scientific workflows, but also allows for reasoning about provenance at different levels of abstraction through user views.
  • Provenance-specific Overview: See above.
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences, Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.
 <<O>>  Difference Topic UPenn (r1.13 - 03 Oct 2006 - SarahCohenBoulakia)

META TOPICPARENT ParticipatingTeams

Provenance Challenge Template

 <<O>>  Difference Topic UPenn (r1.12 - 13 Sep 2006 - LucMoreau)
Changed:
<
<
META TOPICPARENT WebHome
>
>
META TOPICPARENT ParticipatingTeams

Provenance Challenge Template

Participating Team

 <<O>>  Difference Topic UPenn (r1.11 - 11 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 17 to 17

Provenance Trace

Changed:
<
<
Our model is composed of base tables which provide minimal information about the workflow execution, the workflow specification, and the relationships between the specification and execution. The model is represented though the E/R diagram below. The tables IMMCONTAINS and USERVIEW will be defined in the section Worflow Variation. The other tables and their schemas are listed below. We list the type of each column (VARCHAR are strings) and the primary keys (in bold). For each table, we give an example based on the Challenge workflow.
>
>
Our model is composed of base tables which provide minimal information about the workflow execution, the workflow specification, and the relationships between the specification and execution. The model is represented though the E/R diagram below. The tables IMMCONTAINS, IMMSTEPCONTAINS and USERVIEW will be defined in the section Worflow Variants. The other tables and their schemas are listed below. We list the type of each column (VARCHAR are strings) and the primary keys (in bold). For each table, we give an example based on the Challenge workflow.

er.jpg

Figure 1 – E/R schema for the model of Provenance

Changed:
<
<
Note: The content of the tables is provided at the end of this page (see images tables1.png to tables4.png).
>
>
Note: The content of the tables is provided at the end of this page (see images tablesData-1.jpg to tablesData-4.jpg).

  • Data
Data (ID int, name string, type string) relates the unique identifier of a data item to its name and its type.
Line: 515 to 515

box3 reslice
Changed:
<
<
  • virtualInstanceOf
instanceOf does not contain information about composite steps. virtualInstanceOf computes instanceOf for composite steps and contains information about both base and composite steps.
>
>
  • instanceOf
The table instanceOf contains information for composite steps too. For example stepBox1-1 is the first instance of the box box1; it will contain the steps 1 and 5 (cf. table immStepContains and view containsStep).

  • immStepContains
This table defines direct steps containment, accordingly with immContains but for instances.

  • containsStep
For each step class, contains gives all the step classes it transitively contains.


Changed:
<
<
CREATE VIEW virtualInstanceOf
>
>
CREATE VIEW

AS
Changed:
<
<
SELECT step step, stepClass, ts FROM instanceOf UNION select min(step) || '-' || compoStepClass, compoStepClass, min(ts) from contains, instanceOf where contains.stepClass = instanceOf.stepClass group By compoStepClass
STEP      STEPCLASS   TS
-------------------------------         
1         align_warp  8/7/2006   
1-box1    box1        8/7/2006   
<nop>
>
>
SELECT DISTINCT CONNECT_BY_ROOT compoStep compoStep, step FROM immStepContains CONNECT BY PRIOR step=compoStep ORDER BY compoStep

Added:
>
>
COMPOSTEP            STEP                 
---------------------------------
stepBox1-1           1                    
stepBox1-1           5                    
stepBox1-2           2                    
stepBox1-2           6                    
stepBox1-3           3                    
stepBox1-3           7                    
stepBox1-4           4                    
stepBox1-4           8                    
stepBox2-1           10                   
stepBox2-1           13                   
stepBox2-2           11                   
stepBox2-2           14                   
stepBox2-3           12                   
stepBox2-3           15                   
stepBox3-1           1                    
stepBox3-1           10                   
stepBox3-1           13                   
stepBox3-1           5                    
stepBox3-1           9                    
stepBox3-1           stepBox1-1           
stepBox3-1           stepBox2-1           

  • userView
userView(usr string, stepClass string) indicates the lowest step-class level that a user can see. Remember that the union of all the step-classes a user can see must cover the whole workflow (user views are valid).
Line: 565 to 588

-- First union operand: "For each containment, the input is an input of -- the higher level step class if and only if this input was not an output -- from another of its sub steps
Changed:
<
<
SELECT inst2.step, input.dataId, min(input.ts) ts
>
>
SELECT containsStep.compoStep step, input.dataId, min(input.ts) ts

FROM input,
Changed:
<
<
instanceOf inst1, contains, virtualInstanceOf inst2
>
>
containsStep

WHERE
Changed:
<
<
input.step = inst1.step AND inst1.stepClass = contains.stepClass AND contains.compoStepClass = inst2.stepClass
>
>
input.step = containsStep.step

AND NOT EXISTS ( SELECT o.step
Changed:
<
<
FROM output o, instanceOf inst3, contains c2
>
>
FROM output o, containsStep c2

WHERE o.dataId = input.dataId
Changed:
<
<
AND o.step = inst3.step AND inst3.stepClass = c2.stepClass AND c2.compoStepClass = contains.compoStepClass
>
>
AND o.step = c2.step AND c2.compoStep = containsStep.compoStep

)
Changed:
<
<
GROUP BY inst2.step, input.dataId
>
>
GROUP BY containsStep.compoStep, input.dataId

-- Second union operand -- We want to keep the direct input relations UNION SELECT step, dataId, ts
Changed:
<
<
FROM input
>
>
FROM input;

Changed:
<
<
As an example, the step 10-box2 takes DATAID 23 (Atlas Image) and DATAID 24 (Atlas Header) as inputs.
>
>
As an example, the step stepBox2-1 takes DATAID 23 (Atlas Image) and DATAID 24 (Atlas Header) as inputs.

STEP    DATAID  TS    
Changed:
<
<

10-box2 23 8/17/2006 10-box2 24 8/17/2006
>
>

stepBox2-1 23 8/17/2006 stepBox2-1 24 8/17/2006

  • Coutput
Line: 605 to 623

AS -- First union operand: "For each containment, the output is an output of -- the higher level step class if and only if this output is used as an
Changed:
<
<
-- input for another step class which is not contained in the same step class
>
>
-- input for another step which is not contained in the same step

-- or if this is a final output of the workflow
Changed:
<
<
SELECT inst2.step, output.dataId, output.ts
>
>
SELECT containsStep.compoStep step, output.dataId, output.ts

FROM output,
Changed:
<
<
instanceOf inst1, contains, virtualInstanceOf inst2
>
>
containsStep

WHERE
Changed:
<
<
output.step = inst1.step AND inst1.stepClass = contains.stepClass AND contains.compoStepClass = inst2.stepClass
>
>
output.step = containsStep.step

AND EXISTS (
Changed:
<
<
/* Step class is used as an entry of * at least one step * class which is not a substep of this step class */
>
>
/* Step class is used as an entry of at least one step * which is not a substep of this step class */

SELECT i.step
Changed:
<
<
FROM input i, instanceOf inst3
>
>
FROM input i

WHERE i.dataId = output.dataId
Deleted:
<
<
AND i.step = inst3.step

AND NOT EXISTS (
Changed:
<
<
SELECT contains.compoStepClass FROM contains WHERE contains.stepClass = inst3.stepClass AND contains.compoStepClass = inst2.stepClass
>
>
SELECT cont2.compoStep FROM containsStep cont2 WHERE cont2.step = i.step AND cont2.compoStep = containsStep.compoStep

) UNION -- Step class is a final output?
Line: 649 to 661

SELECT step, dataId, ts FROM output
Changed:
<
<
As an example, step 10-box2 produced data ID 28 to 30 (Atlas Graphic X, Y and Z) on 08/20/2006, 08/21/2006, and 08/22/2006 respectively.
>
>
As an example, step stepBox2-1 produced data ID 28 (Atlas Graphic X) on 08/20/2006.

STEP         DATAID  TS  
---------------------------- 
Changed:
<
<
10-box2 28 8/20/2006 10-box2 29 8/21/2006 10-box2 30 8/22/2006
>
>
stepBox2-1 28 8/20/2006

  • UProcess
Line: 674 to 684

input.ts time FROM userView,
Changed:
<
<
virtualInstanceOf instanceOf,
>
>
instanceOf,

cinput input, data dataInput, coutput output,
Line: 686 to 696

AND input.step = output.step AND output.dataId = dataOutput.dataId AND input.ts <= output.ts
Deleted:
<
<
--ORDER BY usr, step, input

As an example, let us consider three user views and their perspectives on stage 1 of the workflow (align_warp).
Changed:
<
<
The user ‘uAdmin’ can see all the steps of the workflow. First, he can see the wrap param images produced (DATAID 11 to 14). Then, he can see that only data ID 1, 2, 9, and 10 are used as inputs to step 1.
>
>
The user ‘uAdmin’ can see all the steps of the workflow; and thus can see all data.

Changed:
<
<
The user ‘uBio’ sees stages 1 and 2 of the workflow as a single black box. First, he cannot see data whose DATAID are 11 to 13. Second, for him DATAIDs 1 through 10 are used to produce DATAIDs 15 through 22 as outputs.
>
>
The user ‘uBio’ sees align_warp and reslice calls as single boxes (step class box1), thus he cannot see data whose ids are 11 to 14 (Warp Params 1 to 4). He also sees slicer and convert as single boxes (step class box2), thus he cannot see data with ids 25 to 27 (Atlas X Slice, Atlas Y Slice and Atlas Z Slice).

Changed:
<
<
The user ‘uBlackBox’ cannot see the details of step 1 either. He is in the same case as UBio except that he cannot see data whose DATAIDs are 15 through 22. The only outputs of the workflow that he can see are the final outputs that is, DATAID 28 through 30.
>
>
The user ‘uBlackBox’ cannot see any detail of the workflow. He sees everything as a big box which entries are data id 1 to 10 and outputs are data id 28 to 30.

The implications of user views when it comes to query answering, are discussed in the following section.

Suggested Queries

Changed:
<
<
In this section, we answer query 1 considering user views and we introduce two other queries to emphasize the benefit of our approach. We will show that user views allow
  • (i) to consider security levels (when the user is not allowed (a) to see the detail of a box and (b) to know which are the inputs used to produce a given output), and
  • (ii) to focus on high-level information (when the user (a) is able to know which are the inputs used to produce a given output but (b) is not interested in knowing the details about the intermediate steps and data items).

The next query and the first version of new query will deal with (i) while the second version of new query will deal with (ii).

>
>
In this section, we answer query 1 considering user views and we introduce another query to emphasize the benefit of our approach.

  • Query1 for each user
    • Query for user “uBlackbox”
Line: 722 to 728

      • Result (SQL):
STEP       STEPCLASS   INPUT    INPUTNAME            OUTPUT   OUTPUTNAME        
Changed:
<
<

1-box3 box3 1 Anatomy Image1 28 Atlas X Graphic 1-box3 box3 10 Reference Header 28 Atlas X Graphic 1-box3 box3 2 Anatomy Header1 28 Atlas X Graphic 1-box3 box3 3 Anatomy Image2 28 Atlas X Graphic 1-box3 box3 4 Anatomy Header2 28 Atlas X Graphic 1-box3 box3 5 Anatomy Image3 28 Atlas X Graphic 1-box3 box3 6 Anatomy Header3 28 Atlas X Graphic 1-box3 box3 7 Anatomy Image4 28 Atlas X Graphic 1-box3 box3 8 Anatomy Header4 28 Atlas X Graphic 1-box3 box3 9 Reference Image 28 Atlas X Graphic
>
>
stepBox3-1 box3 1 Anatomy Image1 23 Atlas Image stepBox3-1 box3 1 Anatomy Image1 24 Atlas Header stepBox3-1 box3 1 Anatomy Image1 28 Atlas X Graphic stepBox3-1 box3 10 Reference Header 23 Atlas Image stepBox3-1 box3 10 Reference Header 24 Atlas Header stepBox3-1 box3 10 Reference Header 28 Atlas X Graphic stepBox3-1 box3 17 Resliced Image2 23 Atlas Image stepBox3-1 box3 17 Resliced Image2 24 Atlas Header stepBox3-1 box3 17 Resliced Image2 28 Atlas X Graphic stepBox3-1 box3 18 Resliced Header2 23 Atlas Image stepBox3-1 box3 18 Resliced Header2 24 Atlas Header stepBox3-1 box3 18 Resliced Header2 28 Atlas X Graphic stepBox3-1 box3 19 Resliced Image3 23 Atlas Image stepBox3-1 box3 19 Resliced Image3 24 Atlas Header stepBox3-1 box3 19 Resliced Image3 28 Atlas X Graphic stepBox3-1 box3 2 Anatomy Header1 23 Atlas Image stepBox3-1 box3 2 Anatomy Header1 24 Atlas Header stepBox3-1 box3 2 Anatomy Header1 28 Atlas X Graphic stepBox3-1 box3 20 Resliced Header3 23 Atlas Image stepBox3-1 box3 20 Resliced Header3 24 Atlas Header stepBox3-1 box3 20 Resliced Header3 28 Atlas X Graphic stepBox3-1 box3 21 Resliced Image4 23 Atlas Image stepBox3-1 box3 21 Resliced Image4 24 Atlas Header stepBox3-1 box3 21 Resliced Image4 28 Atlas X Graphic stepBox3-1 box3 22 Resliced Header4 23 Atlas Image stepBox3-1 box3 22 Resliced Header4 24 Atlas Header stepBox3-1 box3 22 Resliced Header4 28 Atlas X Graphic stepBox3-1 box3 9 Reference Image 23 Atlas Image stepBox3-1 box3 9 Reference Image 24 Atlas Header stepBox3-1 box3 9 Reference Image 28 Atlas X Graphic

  • Results obtained in a user friendly format:
Query1_UBlack.jpg
Line: 743 to 768

      • Result:
STEP   STEPCLASS INPUT INPUTNAME        OUTPUT OUTPUTNAME      
Deleted:
<
<
1-box1 box1 1 Anatomy Image1 15 Resliced Image1 1-box1 box1 1 Anatomy Image1 16 Resliced Header1 1-box1 box1 1 Anatomy Image1 17 Resliced Image2 1-box1 box1 1 Anatomy Image1 18 Resliced Header2 1-box1 box1 1 Anatomy Image1 19 Resliced Image3 1-box1 box1 1 Anatomy Image1 20 Resliced Header3 1-box1 box1 1 Anatomy Image1 21 Resliced Image4 1-box1 box1 1 Anatomy Image1 22 Resliced Header4 1-box1 box1 10 Reference Header 15 Resliced Image1 1-box1 box1 10 Reference Header 16 Resliced Header1 1-box1 box1 10 Reference Header 17 Resliced Image2 1-box1 box1 10 Reference Header 18 Resliced Header2 1-box1 box1 10 Reference Header 19 Resliced Image3 1-box1 box1 10 Reference Header 20 Resliced Header3 1-box1 box1 10 Reference Header 21 Resliced Image4 1-box1 box1 10 Reference Header 22 Resliced Header4 1-box1 box1 2 Anatomy Header1 15 Resliced Image1 1-box1 box1 2 Anatomy Header1 16 Resliced Header1 1-box1 box1 2 Anatomy Header1 17 Resliced Image2 1-box1 box1 2 Anatomy Header1 18 Resliced Header2 1-box1 box1 2 Anatomy Header1 19 Resliced Image3 1-box1 box1 2 Anatomy Header1 20 Resliced Header3 1-box1 box1 2 Anatomy Header1 21 Resliced Image4 1-box1 box1 2 Anatomy Header1 22 Resliced Header4 1-box1 box1 3 Anatomy Image2 15 Resliced Image1 1-box1 box1 3 Anatomy Image2 16 Resliced Header1 1-box1 box1 3 Anatomy Image2 17 Resliced Image2 1-box1 box1 3 Anatomy Image2 18 Resliced Header2 1-box1 box1 3 Anatomy Image2 19 Resliced Image3 1-box1 box1 3 Anatomy Image2 20 Resliced Header3 1-box1 box1 3 Anatomy Image2 21 Resliced Image4 1-box1 box1 3 Anatomy Image2 22 Resliced Header4 1-box1 box1 4 Anatomy Header2 15 Resliced Image1 1-box1 box1 4 Anatomy Header2 16 Resliced Header1 1-box1 box1 4 Anatomy Header2 17 Resliced Image2 1-box1 box1 4 Anatomy Header2 18 Resliced Header2 1-box1 box1 4 Anatomy Header2 19 Resliced Image3 1-box1 box1 4 Anatomy Header2 20 Resliced Header3 1-box1 box1 4 Anatomy Header2 21 Resliced Image4 1-box1 box1 4 Anatomy Header2 22 Resliced Header4 1-box1 box1 5 Anatomy Image3 15 Resliced Image1 1-box1 box1 5 Anatomy Image3 16 Resliced Header1 1-box1 box1 5 Anatomy Image3 17 Resliced Image2 1-box1 box1 5 Anatomy Image3 18 Resliced Header2 1-box1 box1 5 Anatomy Image3 19 Resliced Image3 1-box1 box1 5 Anatomy Image3 20 Resliced Header3 1-box1 box1 5 Anatomy Image3 21 Resliced Image4 1-box1 box1 5 Anatomy Image3 22 Resliced Header4 1-box1 box1 6 Anatomy Header3 15 Resliced Image1 1-box1 box1 6 Anatomy Header3 16 Resliced Header1 1-box1 box1 6 Anatomy Header3 17 Resliced Image2 1-box1 box1 6 Anatomy Header3 18 Resliced Header2 1-box1 box1 6 Anatomy Header3 19 Resliced Image3 1-box1 box1 6 Anatomy Header3 20 Resliced Header3 1-box1 box1 6 Anatomy Header3 21 Resliced Image4 1-box1 box1 6 Anatomy Header3 22 Resliced Header4 1-box1 box1 7 Anatomy Image4 15 Resliced Image1 1-box1 box1 7 Anatomy Image4 16 Resliced Header1 1-box1 box1 7 Anatomy Image4 17 Resliced Image2 1-box1 box1 7 Anatomy Image4 18 Resliced Header2 1-box1 box1 7 Anatomy Image4 19 Resliced Image3 1-box1 box1 7 Anatomy Image4 20 Resliced Header3 1-box1 box1 7 Anatomy Image4 21 Resliced Image4 1-box1 box1 7 Anatomy Image4 22 Resliced Header4 1-box1 box1 8 Anatomy Header4 15 Resliced Image1 1-box1 box1 8 Anatomy Header4 16 Resliced Header1 1-box1 box1 8 Anatomy Header4 17 Resliced Image2 1-box1 box1 8 Anatomy Header4 18 Resliced Header2 1-box1 box1 8 Anatomy Header4 19 Resliced Image3 1-box1 box1 8 Anatomy Header4 20 Resliced Header3 1-box1 box1 8 Anatomy Header4 21 Resliced Image4 1-box1 box1 8 Anatomy Header4 22 Resliced Header4 1-box1 box1 9 Reference Image 15 Resliced Image1 1-box1 box1 9 Reference Image 16 Resliced Header1 1-box1 box1 9 Reference Image 17 Resliced Image2 1-box1 box1 9 Reference Image 18 Resliced Header2 1-box1 box1 9 Reference Image 19 Resliced Image3 1-box1 box1 9 Reference Image 20 Resliced Header3 1-box1 box1 9 Reference Image 21 Resliced Image4 1-box1 box1 9 Reference Image 22 Resliced Header4 10-box2box2 23 Atlas Image 28 Atlas X Graphic 10-box2box2 24 Atlas Header 28 Atlas X Graphic

9 softmean 15 Resliced Image1 23 Atlas Image 9 softmean 15 Resliced Image1 24 Atlas Header 9 softmean 16 Resliced Header1 23 Atlas Image
Line: 841 to 784

9 softmean 21 Resliced Image4 24 Atlas Header 9 softmean 22 Resliced Header4 23 Atlas Image 9 softmean 22 Resliced Header4 24 Atlas Header
Added:
>
>
stepBox1-1 box1 1 Anatomy Image1 15 Resliced Image1 stepBox1-1 box1 1 Anatomy Image1 16 Resliced Header1 stepBox1-1 box1 10 Reference Header 15 Resliced Image1 stepBox1-1 box1 10 Reference Header 16 Resliced Header1 stepBox1-1 box1 2 Anatomy Header1 15 Resliced Image1 stepBox1-1 box1 2 Anatomy Header1 16 Resliced Header1 stepBox1-1 box1 9 Reference Image 15 Resliced Image1 stepBox1-1 box1 9 Reference Image 16 Resliced Header1 stepBox1-2 box1 10 Reference Header 17 Resliced Image2 stepBox1-2 box1 10 Reference Header 18 Resliced Header2 stepBox1-2 box1 3 Anatomy Image2 17 Resliced Image2 stepBox1-2 box1 3 Anatomy Image2 18 Resliced Header2 stepBox1-2 box1 4 Anatomy Header2 17 Resliced Image2 stepBox1-2 box1 4 Anatomy Header2 18 Resliced Header2 stepBox1-2 box1 9 Reference Image 17 Resliced Image2 stepBox1-2 box1 9 Reference Image 18 Resliced Header2 stepBox1-3 box1 10 Reference Header 19 Resliced Image3 stepBox1-3 box1 10 Reference Header 20 Resliced Header3 stepBox1-3 box1 5 Anatomy Image3 19 Resliced Image3 stepBox1-3 box1 5 Anatomy Image3 20 Resliced Header3 stepBox1-3 box1 6 Anatomy Header3 19 Resliced Image3 stepBox1-3 box1 6 Anatomy Header3 20 Resliced Header3 stepBox1-3 box1 9 Reference Image 19 Resliced Image3 stepBox1-3 box1 9 Reference Image 20 Resliced Header3 stepBox1-4 box1 10 Reference Header 21 Resliced Image4 stepBox1-4 box1 10 Reference Header 22 Resliced Header4 stepBox1-4 box1 7 Anatomy Image4 21 Resliced Image4 stepBox1-4 box1 7 Anatomy Image4 22 Resliced Header4 stepBox1-4 box1 8 Anatomy Header4 21 Resliced Image4 stepBox1-4 box1 8 Anatomy Header4 22 Resliced Header4 stepBox1-4 box1 9 Reference Image 21 Resliced Image4 stepBox1-4 box1 9 Reference Image 22 Resliced Header4 stepBox2-1 box2 23 Atlas Image 28 Atlas X Graphic stepBox2-1 box2 24 Atlas Header 28 Atlas X Graphic

      • Results obtained in a user friendly format:
Query1_UBio.jpg
Changed:
<
<
      • Note: For security reasons, uBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_wrap.
>
>
      • Note: For security reasons, uBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_warp.

He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
    • Query for user “uAdmin”
      • Query: As previously with WHERE users = ‘uAdmin’
Line: 898 to 876

      • Note: All the data are obtained (same results than for the very first version of query1).
Changed:
<
<
  • New Query: What are the inputs that “Resliced Image1” (ID 15) depends on?
We will consider two versions of this query. The first version (security) considers that for security reasons, the users are not allowed to know the details about the inputs used to produced a given output. The second version (high-level) considers that the users are allowed to know the details about the inputs used to produced a given output. However, in both situations intermediate data and steps will not be known by the users according to their user view.
>
>
  • New Query: Find everything that caused “Resliced Image1” (ID 15) to be as it is?

Changed:
<
<
    • First version (security) of the query for ‘uBlackBox’
>
>
    • Query for ‘uBlackBox’

      • Query:

Changed:
<
<
SELECT DISTINCT input
>
>
SELECT *

FROM uProcess upc WHERE usr = 'uBlackBox' START WITH outputName = 'Resliced Image1'
Changed:
<
<
CONNECT BY PRIOR upc.output = upc.input;
>
>
CONNECT BY PRIOR upc.input = upc.output;

Changed:
<
<
      • Result: no rows selected
(uBlackBox can’t even see ‘Resliced Image1’)
>
>
      • Result: no rows selected (uBlackBox cannot even see ‘Resliced Image1’)

Changed:
<
<
    • First version of the query for ‘uBio’
>
>
    • Query for ‘uBio’

      • Query: Idem with ‘uBio’
      • Result:

Changed:
<
<
INPUT
      1. 10
>
>
USR STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME uBio stepBox1-1 box1 1 Anatomy Image1 15 Resliced Image1 8/7/2006 uBio stepBox1-1 box1 10 Reference Header 15 Resliced Image1 8/7/2006 uBio stepBox1-1 box1 2 Anatomy Header1 15 Resliced Image1 8/7/2006 uBio stepBox1-1 box1 9 Reference Image 15 Resliced Image1 8/7/2006

Changed:
<
<
Note that as expected, a lot of inputs are provided since this user sees the two first stages as a black box. Consequently, he cannot distinguish which inputs are used to produce a given output. This can be interesting for security reason.
>
>
      • Results obtained in a user friendly format:
extraQuery_UBio.jpg

Changed:
<
<
    • First version (security) of the query for ‘uAdmin’
>
>
    • Query for ‘uAdmin’

      • Query: Idem with ‘uAdmin’
      • Result:

Changed:
<
<
INPUT
      1. 10 11
>
>
USR STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME uAdmin 5 reslice 11 Warp Parameters1 15 Resliced Image1 8/12/2006 uAdmin 1 align_warp 1 Anatomy Image1 11 Warp Parameters1 8/7/2006 uAdmin 1 align_warp 10 Reference Header 11 Warp Parameters1 8/7/2006 uAdmin 1 align_warp 2 Anatomy Header1 11 Warp Parameters1 8/7/2006 uAdmin 1 align_warp 9 Reference Image 11 Warp Parameters1 8/7/2006

Changed:
<
<
Since uAdmin can see everything, he knows that only DATAIDs 1, 2 and 9 through 11 have been used as inputs.
>
>
      • Results obtained in a user friendly format:
extraQuery_UAdmin.jpg

Note that uAdmin can see that data id 11 (Warp Params 1) is part of the provenance while uBio cannot see it.


Deleted:
<
<
    • Second version (high-level) of the query for ‘uBio’
      • Query:
SELECT DISTINCT input
     FROM uProcess
       WHERE usr='uAdmin'
     START WITH output = (
               SELECT dataId FROM data WHERE name = 'Resliced Image1'
               )
     CONNECT BY PRIOR input = output
INTERSECT
SELECT DISTINCT input, inputName
     FROM uProcess
       WHERE usr='uBio'
     START WITH output = (
               SELECT dataId FROM data WHERE name = 'Resliced Image1'
               )
     CONNECT BY PRIOR input = output
      • Result:
INPUT
----------
         1
         2
         9
        10
Note that as expected, only the right inputs are provided and none intermediate results are provided.

Categorisation of queries

Line: 1004 to 947

-- SarahCohenBoulakia - 11 Sep 2006

Changed:
<
<
  • tables1.png:
    tables1.png

  • tables2.png:
    tables2.png
>
>
  • tablesData-1.jpg:
    tablesData-1.jpg

Changed:
<
<
  • tables3.png:
    tables3.png
>
>
  • tablesData-2.jpg:
    tablesData-2.jpg

Changed:
<
<
  • tables4.png:
    tables4.png
>
>
  • tablesData-3.jpg:
    tablesData-3.jpg

META FILEATTACHMENT wf.png attr="" comment="" date="1157908776" path="C:\Documents and Settings\Sarah\Mes documents\wf.png" size="84256" user="SarahCohenBoulakia" version="1.1"
Changed:
<
<
META FILEATTACHMENT er.jpg attr="" comment="" date="1157909001" path="C:\Documents and Settings\Sarah\Mes documents\er.jpg" size="163515" user="SarahCohenBoulakia" version="1.1"
>
>
META FILEATTACHMENT er.jpg attr="" comment="" date="1158007331" path="D:\Penn\provenance\propOlivier_challenge\dbDoc\er.jpg" size="159937" user="SarahCohenBoulakia" version="1.2"

META FILEATTACHMENT tables1.png attr="h" comment="" date="1157910807" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables1.png" size="1178542" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables2.png attr="" comment="" date="1157911122" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables2.png" size="547280" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables3.png attr="" comment="" date="1157911188" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables3.png" size="1178125" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables4.png attr="" comment="" date="1157911255" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables4.png" size="1087598" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query1.jpg attr="" comment="" date="1157938018" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query1.jpg" size="221199" user="SarahCohenBoulakia" version="1.1"
Changed:
<
<
META FILEATTACHMENT Query1_UBio.jpg attr="" comment="" date="1157938098" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\Query1_UBio.jpg" size="180867" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT Query1_UBlack.jpg attr="" comment="" date="1157938124" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\Query1_UBlack.jpg" size="113527" user="SarahCohenBoulakia" version="1.1"
>
>
META FILEATTACHMENT Query1_UBio.jpg attr="" comment="" date="1158011728" path="D:\Penn\provenance\propOlivier_challenge\res\Query1_UBio.jpg" size="198964" user="SarahCohenBoulakia" version="1.2"
META FILEATTACHMENT Query1_UBlack.jpg attr="" comment="" date="1158011755" path="D:\Penn\provenance\propOlivier_challenge\res\Query1_UBlack.jpg" size="145191" user="SarahCohenBoulakia" version="1.2"

META FILEATTACHMENT query2.jpg attr="" comment="" date="1157938207" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query2.jpg" size="145721" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query4.jpg attr="" comment="" date="1157938238" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query4.jpg" size="101298" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query3.jpg attr="" comment="" date="1157938270" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query3.jpg" size="144123" user="SarahCohenBoulakia" version="1.1"
Added:
>
>
META FILEATTACHMENT extraQuery_UBio.jpg attr="" comment="" date="1158012715" path="D:\Penn\provenance\propOlivier_challenge\res\extraQuery_UBio.jpg" size="79355" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT extraQuery_UAdmin.jpg attr="" comment="" date="1158012735" path="D:\Penn\provenance\propOlivier_challenge\res\extraQuery_UAdmin.jpg" size="97942" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tablesData-1.jpg attr="" comment="" date="1158013866" path="D:\Penn\provenance\propOlivier_challenge\res\tablesData-1.jpg" size="213555" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tablesData-2.jpg attr="" comment="" date="1158013888" path="D:\Penn\provenance\propOlivier_challenge\res\tablesData-2.jpg" size="204399" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tablesData-3.jpg attr="" comment="" date="1158013906" path="D:\Penn\provenance\propOlivier_challenge\res\tablesData-3.jpg" size="159136" user="SarahCohenBoulakia" version="1.1"
 <<O>>  Difference Topic UPenn (r1.10 - 11 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 904 to 904

    • First version (security) of the query for ‘uBlackBox’
      • Query:

Changed:
<
<
SELECT *
>
>
SELECT DISTINCT input

FROM uProcess upc WHERE usr = 'uBlackBox' START WITH outputName = 'Resliced Image1'
Line: 949 to 949

    • Second version (high-level) of the query for ‘uBio’
      • Query:

Changed:
<
<
SELECT DISTINCT input, inputName
>
>
SELECT DISTINCT input

FROM uProcess WHERE usr='uAdmin' START WITH output = (
 <<O>>  Difference Topic UPenn (r1.9 - 11 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 13 to 13

Workflow Representation

Changed:
<
<
Our aim is to provide a framework to represent minimal information necessary such that we can reason about provenance in scientific workflows. With this in mind, we distinguish between the specification of the workflow and the execution of the workflow. On the one hand, the workflow specification is composed of step-classes (alternatively called procedures, tasks, actors, processes, boxes) such as reslice in the challenge workflow. On the other hand, an execution of a workflow generates a partial order of steps, each of which has a set of input and output data objects. Each step is an instance of a step-class, and the input-output flow of data and class associated with each step must conform to the workflow specification. An example of step in the challenge workflow is “8.reslice” which is an instance of the step-class “reslice”. The proposed model of provenance for workflows can represent the relationship between a step-class (specification) and a step (execution). This is necessary for keeping track of the data and parameters used by each step of a workflow execution as well as to trace the data produced. The model is represented within a relational framework extended with transitive closure. It is implemented under Oracle 10.i and Java. An object layer together with a user interface (JDBC component) is provided. More details on our model and implementation of the Challenge workflow are provided in the following section.
>
>
Our aim is to provide a framework to represent minimal information necessary such that we can reason about provenance in scientific workflows. With this in mind, we distinguish between the specification of the workflow and the execution of the workflow. On the one hand, the workflow specification is composed of step-classes (alternatively called procedures, tasks, actors, processes, boxes) such as reslice in the challenge workflow. On the other hand, an execution of a workflow generates a partial order of steps, each of which has a set of input and output data objects. Each step is an instance of a step-class, and the input-output flow of data and class associated with each step must conform to the workflow specification. An example of step in the challenge workflow is “8.reslice” which is an instance of the step-class “reslice”. The proposed model of provenance for workflows can represent the relationship between a step-class (specification) and a step (execution). This is necessary for keeping track of the data and parameters used by each step of a workflow execution as well as to trace the data produced. The model is represented within a relational framework extended with transitive closure. It is implemented under Oracle 10.g and Java. An object layer together with a user interface (JDBC component) is provided. More details on our model and implementation of the Challenge workflow are provided in the following section.

Provenance Trace

Line: 95 to 95

  • stageInstance
Changed:
<
<
Note: The notion of stage was not initially in our model and we think it needs to be better defined. However, we added it to be able to answer some of the queries in the challenge.
>
>
Note: The notion of stage was not initially in our model and we think it needs to be better defined. We will go back to this point. However, we added it to be able to answer some of the queries in the challenge.

stageInstance(step string, stage string) stores the relationship between a stage and a step. As an example, steps 1 through 4 are in stage 1.


Line: 144 to 144

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
UPenn team thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up smile thumbs up thumbs up
Changed:
<
<
For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our JAVA user interface).
>
>
For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our Java user interface).

  • Query1
    • Query: Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
Line: 158 to 158

CONNECT BY PRIOR input = output ORDER BY step
Changed:
<
<
    • Note: This query exploits the Connect by prior operator of Oracle which implements the transitive closure function. Transitive closure is necessary for returning all the data (recursively) that have been used to compute Atlas X Graphic.
>
>
    • Note: This query exploits the Connect by prior operator of Oracle which implements the transitive closure function. Transitive closure is necessary for returning all the data (recursively) that have been used to compute Atlas X Graphic. Basically, we use the Process table and search reccursively for the data used (output produced and input used, by mean of "connect by prior input=output").

    • Result (SQL):
STEP STEPCLASS   INPUT    INPUTNAME            OUTPUT OUTPUTNAME         
Line: 229 to 229

CONNECT BY PRIOR input = output )
Added:
>
>
    • Note: We follow the same process as previously but we remove (Minus operator) all the data used by "softmean".

    • Result (SQL):
STEP    STEPCLASS   INPUT INPUTNAME            OUTPUT OUTPUTNAME   
Line: 272 to 273

) CONNECT BY PRIOR input = output
Added:
>
>
    • Note: We follow the same idea than in query1 while focusing on stages 3, 4, and 5.

    • Result (SQL):
STEP  STEPCLASS INPUT INPUTNAME         OUTPUT  OUTPUTNAME     
Line: 298 to 300

    • Result (User friendly view):
Query3.jpg
Added:
>
>
    • Note: The same results as query2 have been found. Query 2 and 3 are synonyms!

  • Query4
    • Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
Line: 323 to 326

) AND rtrim(to_char(time, 'DAY')) = 'MONDAY'
Added:
>
>
    • Note: We search for information about the step-class "align-warp" (first "Where" clause) such that there is a step whose parameter is "order=12" and "model=1365".

    • Result (SQL):
STEP  STEPCLASS    INPUT  INPUTNAME         OUTPUT  OUTPUTNAME        TIME 
Line: 357 to 361

) AND type = 'Atlas Graphic'
Added:
>
>
  • Note: The dataAttributes table is used to find all Anatomy header data whose entry 'global maxiamum' is equal to 4095. Then, the same precess as previously is used (i.e. looking recursively for data).

  • Result (SQL):
NAME 
Line: 398 to 403

AND pred.orig = output.step AND output.dataId = data.dataId
Added:
>
>
Note that we understand "-m12" as a search for parameter whose order is equal to 12. We could also have chosen to search for a parameter whose description was like "-m 12".

    • Result (SQL):
ORIG                 NAME  
Line: 421 to 428

AND dataAttributes.value = 'UChicago' AND process.output = data.dataId
Added:
>
>
    • Note: This query basically consists in a join between dataAttributes (which contains information about annotations), Process, and Data.

    • Result:
NAME     
Line: 440 to 448

AND dataAttributes.attribute = 'studyModality' AND dataAttributes.value IN ('speech', 'visual', 'audio')
Added:
>
>
    • Note: Once again, this query consists in finding annotations in the dataAttributes table and to make joins to get other information.

    • Result:
DATAID               NAME                 ATTRIBUTE            VALUE   
Line: 476 to 485

  • immContains
immContains(compoStepClass string, stepClass string) indicates the immediate containment between step-classes.
Changed:
<
<
As an example, box 1 covers all the step-class align_warp and reslice
>
>
As an example, box 1 covers all the step-class align_warp and reslice, and box3 contains box1.

COMPOSTEPCLASS       STEPCLASS 
---------------------------------           
Line: 985 to 994

We have shown how our model is able to represent the challenge workflow and to answer the proposed queries. We have also shown how the notion of a user view can allow the user to manage the complexity of a workflow through higher levels of abstraction and layering.

Changed:
<
<
We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a biologist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).
>
>
We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a scientist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).

Aknowledgments

This work is supported by NSF grants 0513778, 0415810, and 0612177*.
 <<O>>  Difference Topic UPenn (r1.8 - 11 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Changed:
<
<

Participating Team (Under construction)

>
>

Participating Team


  • Short team name: Database Group, University of Pennsylvania
Changed:
<
<
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Thunyarat (Bam) Amornpetchkul, and Olivier Biton
>
>
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Olivier Biton and Thunyarat (Bam) Amornpetchkul

  • Project URL: http://db.cis.upenn.edu
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is both simple and general (i.e. can be used with existing workflow systems, such as Ptolemy/Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, the proposed model not only takes into account the chained and complex structures of scientific workflows, but also allows for reasoning about provenance at different levels of abstraction through user views.
  • Provenance-specific Overview: See above.
Line: 146 to 146

For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our JAVA user interface).

Changed:
<
<
  • Query1
>
>
  • Query1

    • Query: Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
Line: 211 to 211

    • Result (user friendly view): query1.jpg
Changed:
<
<
* Query2
>
>
* Query2

    • Query: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName            
Line: 256 to 256

    • Result (User friendly view): query2.jpg
Changed:
<
<
  • Query3
>
>
  • Query3

    • Query: Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
Line: 299 to 299

    • Result (User friendly view):
Query3.jpg
Changed:
<
<
  • Query4
>
>
  • Query4

    • Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
SELECT step, stepClass, input, inputName, output, outputName, time    
Line: 332 to 332

1 align_warp 10 Reference Header 11 Warp Parameters1 8/7/2006 1 align_warp 9 Reference Image 11 Warp Parameters1 8/7/2006
Changed:
<
<
    • Result (User friendly view): query4.jpg
>
>
    • Result (User friendly view):

Added:
>
>
query4.jpg

Changed:
<
<
  • Query5
>
>
  • Query5

    • Query: Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
SELECT name                                                             
Line: 365 to 366

Atlas Z Graphic
Changed:
<
<
  • Query6
    • _Query: Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align-warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align-warp procedure with argument -m 12." _
>
>
  • Query6
    • Query: _Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align-warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align-warp procedure with argument -m 12." _

SELECT distinct pred.orig,  data.name               
  FROM                                              
Line: 405 to 406

9 Atlas Image
Changed:
<
<
  • Query7
>
>
  • Query7

    • Query:_A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant. _
Deleted:
<
<

    • Note: Computing the difference between two workflows or two workflow executions will be possible in our next implementation. Query 7 is a very simple example of the problem. However, when the difference between the data used by a step or produced by a step is considered, the problem becomes much harder. We would be very interested in talking about this problem with the workshop attendees.
Changed:
<
<
  • Query8
>
>
  • Query8

    • Query: _A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align-warp where the inputs are annotated with center=UChicago. _
SELECT distinct data.name                      
Line: 429 to 429

Warp Parameters1
Changed:
<
<
  • Query9
>
>
  • Query9

    • Query: A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
SELECT distinct data.dataId, data.name, attribute, value                      
Line: 481 to 482


box1 align_warp box1 reslice
Added:
>
>
box3 box1

  • contains
Line: 494 to 496

ORDER BY compoStepClass
Added:
>
>
COMPOSTEPCLASS       STEPCLASS 
---------------------------------  
box1                 align_warp           
box1                 reslice
box3                 box1         
box3                 align_warp           
box3                 reslice

  • virtualInstanceOf
instanceOf does not contain information about composite steps. virtualInstanceOf computes instanceOf for composite steps and contains information about both base and composite steps.

Line: 673 to 685

The user ‘uBio’ sees stages 1 and 2 of the workflow as a single black box. First, he cannot see data whose DATAID are 11 to 13. Second, for him DATAIDs 1 through 10 are used to produce DATAIDs 15 through 22 as outputs.

Changed:
<
<
The user ‘uBlackBox’ cannot see the details of step 1 either. He is in the same case as UBio except that he cannot see data whose DATAIDs are 15 through 22. The only output that he can see are the final outputs that is, DATAID 28 through 30.
>
>
The user ‘uBlackBox’ cannot see the details of step 1 either. He is in the same case as UBio except that he cannot see data whose DATAIDs are 15 through 22. The only outputs of the workflow that he can see are the final outputs that is, DATAID 28 through 30.

The implications of user views when it comes to query answering, are discussed in the following section.

Suggested Queries

In this section, we answer query 1 considering user views and we introduce two other queries to emphasize the benefit of our approach. We will show that user views allow

Changed:
<
<
  • (i) to consider security levels (when the user is not allowed to see the details of a box and which are the inputs used to produced a given output), see the next query and the first version of new query, and
  • (ii) to focus on high-level information (when the user is able to know which are the inputs used to produced a given output but is not interested in knowing the details about the intermediate steps), see the second version of new query.
>
>
  • (i) to consider security levels (when the user is not allowed (a) to see the detail of a box and (b) to know which are the inputs used to produce a given output), and
  • (ii) to focus on high-level information (when the user (a) is able to know which are the inputs used to produce a given output but (b) is not interested in knowing the details about the intermediate steps and data items).

The next query and the first version of new query will deal with (i) while the second version of new query will deal with (ii).


* Query1 for each user

    • Query for user “uBlackbox”
Line: 713 to 728

  • Results obtained in a user friendly format:
Query1_UBlack.jpg

Changed:
<
<
  • Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by UBlackBox?.
>
>
  • Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by uBlackBox.

    • Query for user “uBio”
      • Query: As previously with WHERE users = ‘uBio’
      • Result:
Line: 820 to 835

      • Results obtained in a user friendly format:
Query1_UBio.jpg
Changed:
<
<
      • Note: UBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_wrap. He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
>
>
      • Note: For security reasons, uBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_wrap.
He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.

    • Query for user “uAdmin”
      • Query: As previously with WHERE users = ‘uAdmin’
      • Result:
Line: 871 to 887

9 softmean 22 Resliced Header4 23 Atlas Image 9 softmean 22 Resliced Header4 24 Atlas Header
Changed:
<
<
      • Note: All the data are obtained.
>
>
      • Note: All the data are obtained (same results than for the very first version of query1).

  • New Query: What are the inputs that “Resliced Image1” (ID 15) depends on?
Changed:
<
<
We will consider two versions of this query. The first version (security) considers that for security reasons, the users are not allowed to know the details about the inputs used to produced a given output. The second version (high-level) considers that the users are allowed to know the details about the inputs used to produced a given output. In both situations, intermediate data and steps will not be known by the users according to their user view.
>
>
We will consider two versions of this query. The first version (security) considers that for security reasons, the users are not allowed to know the details about the inputs used to produced a given output. The second version (high-level) considers that the users are allowed to know the details about the inputs used to produced a given output. However, in both situations intermediate data and steps will not be known by the users according to their user view.

Changed:
<
<
    • First version (security) of the query for ‘uBlack’
>
>
    • First version (security) of the query for ‘uBlackBox’

      • Query:
SELECT *
  FROM uProcess  upc 
Changed:
<
<
WHERE usr = 'uBlack'
>
>
WHERE usr = 'uBlackBox'

START WITH outputName = 'Resliced Image1' CONNECT BY PRIOR upc.output = upc.input;
Line: 905 to 921

      1. 10
Changed:
<
<
Note that as expected, a lot of inputs are provided since this user sees the two first stages as a black box. Consequently, he cannot distinguish which input is used to produce which output. This can be interesting for security reason.
>
>
Note that as expected, a lot of inputs are provided since this user sees the two first stages as a black box. Consequently, he cannot distinguish which inputs are used to produce a given output. This can be interesting for security reason.

    • First version (security) of the query for ‘uAdmin’
      • Query: Idem with ‘uAdmin’
Line: 949 to 965

      1. 10
Changed:
<
<
Note that as expected, only the right inputs are provided.
>
>
Note that as expected, only the right inputs are provided and none intermediate results are provided.

Categorisation of queries

Deleted:
<
<
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale

Generally speaking, our provenance queries can be categorized as such:
  • step vs data provenance: depending on whether the user is interested in knowing the step(s) or the data that have been used to produce a given data item.
  • immediate vs deep provenance: depending on whether the user is interested in knowing only the previous step (and possibly the data used by this step) or the whole sequence of steps (and possibly the data used by these steps) used to produced a given data item.
Line: 979 to 993

(* Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.)

Changed:
<
<
-- SarahCohenBoulakia - 10 Sep 2006
>
>
-- SarahCohenBoulakia - 11 Sep 2006

  • tables1.png:
    tables1.png
 <<O>>  Difference Topic UPenn (r1.7 - 11 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 207 to 207

9 softmean 22 Resliced Header4 23 Atlas Image 9 softmean 22 Resliced Header4 24 Atlas Header
Changed:
<
<
The results may also be visualized with a user-friendly interface, where the data is grouped by steps. The data can be retrieved either by name or by id (the id option may be useful when a lot of data is retuned by a query).
>
>
The results can also be visualized within the user-friendly interface. On the top, the query is given by the user, and on the bottom the result (provenance) is represented through a graph. The user can access to the characteristics of a given step or data item by clicking on it (visualized on the right hand side).

    • Result (user friendly view):
Changed:
<
<
To be included soon
>
>
query1.jpg

* Query2

    • Query: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
Line: 253 to 253

9 softmean 22 Resliced Header4 23 Atlas Image 9 softmean 22 Resliced Header4 24 Atlas Header
Changed:
<
<
    • Result (User friendly view): To be included soon
>
>
    • Result (User friendly view): query2.jpg

  • Query3
    • Query: Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
Line: 295 to 296

10 slicer 24 Atlas Header 25 Atlas X Slice 9 softmean 21 Resliced Image4 24 Atlas Header
Changed:
<
<
    • Result (User friendly view): To be included soon
>
>
    • Result (User friendly view):
Query3.jpg

* Query4

    • Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
Line: 330 to 332

1 align_warp 10 Reference Header 11 Warp Parameters1 8/7/2006 1 align_warp 9 Reference Image 11 Warp Parameters1 8/7/2006
Changed:
<
<
    • Result (User friendly view): To be included soon
>
>
    • Result (User friendly view): query4.jpg

  • Query5
    • Query: Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
Line: 680 to 683

  • (ii) to focus on high-level information (when the user is able to know which are the inputs used to produced a given output but is not interested in knowing the details about the intermediate steps), see the second version of new query.

* Query1 for each user

Deleted:
<
<
    • Results obtained in a user friendly format (to be included soon)

    • Query for user “uBlackbox”
      • Query:

Line: 693 to 695

CONNECT BY PRIOR input = output ORDER BY step;
Changed:
<
<
      • Result:
>
>
      • Result (SQL):

STEP       STEPCLASS   INPUT    INPUTNAME            OUTPUT   OUTPUTNAME        
-----------------------------------------------------------------------------
Line: 708 to 710

1-box3 box3 8 Anatomy Header4 28 Atlas X Graphic 1-box3 box3 9 Reference Image 28 Atlas X Graphic
Added:
>
>
  • Results obtained in a user friendly format:
Query1_UBlack.jpg


      • Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by UBlackBox?.
    • Query for user “uBio”
      • Query: As previously with WHERE users = ‘uBio’
Line: 813 to 818

9 softmean 22 Resliced Header4 23 Atlas Image 9 softmean 22 Resliced Header4 24 Atlas Header
Added:
>
>
      • Results obtained in a user friendly format:
Query1_UBio.jpg

      • Note: UBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_wrap. He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
    • Query for user “uAdmin”
      • Query: As previously with WHERE users = ‘uAdmin’
Line: 914 to 921

Since uAdmin can see everything, he knows that only DATAIDs 1, 2 and 9 through 11 have been used as inputs.
Changed:
<
<
    • Second version (high-level) of the query for ‘uBlack’
>
>
    • Second version (high-level) of the query for ‘uBio’

      • Query:

Added:
>
>
SELECT DISTINCT input, inputName FROM uProcess WHERE usr='uAdmin' START WITH output = ( SELECT dataId FROM data WHERE name = 'Resliced Image1' ) CONNECT BY PRIOR input = output INTERSECT SELECT DISTINCT input, inputName FROM uProcess WHERE usr='uBio' START WITH output = ( SELECT dataId FROM data WHERE name = 'Resliced Image1' ) CONNECT BY PRIOR input = output

Deleted:
<
<
      • Result: no rows selected
(As previously, uBlackBox can’t even see ‘Resliced Image1’)

    • Second version (high-level) of the query for ‘uBio’
      • Query: Idem with ‘uBio’

      • Result:
INPUT
Line: 931 to 948

      1. 10
Deleted:
<
<
11

Note that as expected, only the right inputs are provided.
Deleted:
<
<
    • Second version (high-level) of the query for ‘uAdmin’
      • Query: Idem with ‘uAdmin’
      • Result:
INPUT
----------
         1
         2
         9
        10
        11
Same results as previously.

Categorisation of queries

According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale

Line: 997 to 999

META FILEATTACHMENT tables2.png attr="" comment="" date="1157911122" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables2.png" size="547280" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables3.png attr="" comment="" date="1157911188" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables3.png" size="1178125" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables4.png attr="" comment="" date="1157911255" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables4.png" size="1087598" user="SarahCohenBoulakia" version="1.1"
Added:
>
>
META FILEATTACHMENT query1.jpg attr="" comment="" date="1157938018" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query1.jpg" size="221199" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT Query1_UBio.jpg attr="" comment="" date="1157938098" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\Query1_UBio.jpg" size="180867" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT Query1_UBlack.jpg attr="" comment="" date="1157938124" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\Query1_UBlack.jpg" size="113527" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query2.jpg attr="" comment="" date="1157938207" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query2.jpg" size="145721" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query4.jpg attr="" comment="" date="1157938238" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query4.jpg" size="101298" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT query3.jpg attr="" comment="" date="1157938270" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\query3.jpg" size="144123" user="SarahCohenBoulakia" version="1.1"
 <<O>>  Difference Topic UPenn (r1.6 - 10 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 971 to 971

We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a biologist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).

Added:
>
>

Aknowledgments

This work is supported by NSF grants 0513778, 0415810, and 0612177*.

(* Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.)


-- SarahCohenBoulakia - 10 Sep 2006

 <<O>>  Difference Topic UPenn (r1.5 - 10 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 140 to 140

Provenance Queries

Changed:
<
<
Also, make sure you complete the ProvenanceQueriesMatrix.
>
>
Teams Queries
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
UPenn team thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up smile thumbs up thumbs up

For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our JAVA user interface).

 <<O>>  Difference Topic UPenn (r1.4 - 10 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Participating Team (Under construction)

  • Short team name: Database Group, University of Pennsylvania
Changed:
<
<
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Thunyarat (Bam) Amornpetchkul
>
>
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Thunyarat (Bam) Amornpetchkul, and Olivier Biton

Changed:
<
<
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is simple and general (i.e. can be used with existing workflow systems, such as Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, our model not only takes into account the chained and complex structure of scientific workflows, but allows asks for provenance at different levels of abstraction using user views.
>
>
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is both simple and general (i.e. can be used with existing workflow systems, such as Ptolemy/Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, the proposed model not only takes into account the chained and complex structures of scientific workflows, but also allows for reasoning about provenance at different levels of abstraction through user views.

  • Provenance-specific Overview: See above.
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences, Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.
Deleted:
<
<
The following will be updated soon!

Workflow Representation

Changed:
<
<
Provide here a description of how you have encoded the Challenge workflow.
>
>
Our aim is to provide a framework to represent minimal information necessary such that we can reason about provenance in scientific workflows. With this in mind, we distinguish between the specification of the workflow and the execution of the workflow. On the one hand, the workflow specification is composed of step-classes (alternatively called procedures, tasks, actors, processes, boxes) such as reslice in the challenge workflow. On the other hand, an execution of a workflow generates a partial order of steps, each of which has a set of input and output data objects. Each step is an instance of a step-class, and the input-output flow of data and class associated with each step must conform to the workflow specification. An example of step in the challenge workflow is “8.reslice” which is an instance of the step-class “reslice”. The proposed model of provenance for workflows can represent the relationship between a step-class (specification) and a step (execution). This is necessary for keeping track of the data and parameters used by each step of a workflow execution as well as to trace the data produced. The model is represented within a relational framework extended with transitive closure. It is implemented under Oracle 10.i and Java. An object layer together with a user interface (JDBC component) is provided. More details on our model and implementation of the Challenge workflow are provided in the following section.

Provenance Trace

Changed:
<
<
Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)
>
>
Our model is composed of base tables which provide minimal information about the workflow execution, the workflow specification, and the relationships between the specification and execution. The model is represented though the E/R diagram below. The tables IMMCONTAINS and USERVIEW will be defined in the section Worflow Variation. The other tables and their schemas are listed below. We list the type of each column (VARCHAR are strings) and the primary keys (in bold). For each table, we give an example based on the Challenge workflow.

Changed:
<
<

Provenance Queries

>
>
er.jpg Figure 1 – E/R schema for the model of Provenance

Note: The content of the tables is provided at the end of this page (see images tables1.png to tables4.png).


Changed:
<
<
For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.
>
>

  • Data
Data (ID int, name string, type string) relates the unique identifier of a data item to its name and its type.

As an example, the data whose ID is 1 has the name “Anatomy Image1” and is of type Anatomy Image.

       DATAID               NAME                 TYPE   
-----------------------------------------------------              
1                    Anatomy Image1       Anatomy Image

  • dataAttributes
dataAttributes (dataId string, attribute string, value string) contains the annotations of each data item. There can be several pairs of keys and values annotated with each data item. As an example, Anatomy Image1, whose ID is 1, is annotated with a key word “center” and value “UChicago.” In other words, center= UChicago associated with DATAID 1, is represented in our model in the following way:
      DATAID               ATTRIBUTE            VALUE   
-----------------------------------------------------  
1                    center               UChicago

  • stepParam
stepParam(step int, attribute string, value string) stores information about the parameters used by a step. step is the id of the step, attribute represents the category of the step, and value represents the description of the category.

As an example, the table below expresses that the step whose id is 1 takes a nonlinear (linear=false), 12th order (order=12) parameter, and whose model is 1365 (model=1365). This parameter can be invoked by using “-m 12 –q” (description=-m 12 –q).

      STEP                 ATTRIBUTE            VALUE    
-----------------------------------------------------             
1                    linear               false                
1                    order                12                   
1                    model                1365                 
1                    description          -m 12 -q     

  • input
input(step int, dataId int, ts Date) stores information about an input to a step and the timestamp it occurred.

As an example, Step 2 (2.align_warp) takes as inputs data whose ID is 3 (Anatomy Image2), 4 (Anatomy Header2), 9 (Reference Image), and 10 (Reference Header). Each input in the example was taken on the same date, 8/8/2006.

STEP                 DATAID               TS   
-----------------------------------------------------                
2                    3                    8/8/2006            
2                    4                    8/8/2006            
2                    9                    8/8/2006            
2                    10                   8/8/2006

* output output(step int, dataId int, ts Date) stores information about an output from a step and the timestamp it occurred.

As an example, Step 10 (10. slicer) produced one output with a DATAID of 25 (Atlas X Slice) on 08/17/2006.

STEP                 DATAID               TS                  
-----------------------------------------------------  
10                   25                   8/17/2006   

  • instanceOf
instanceOf (step int, step-class string, ts) defines the relationship between a step-class and a step of which it is an execution, and contains the execution time of the step.

As an example, steps 1, 2, 3, and 4 are executions of the step-class “align_warp.”

    STEP                 STEPCLASS            TS    
-----------------------------------------------------               
1                    align_warp           8/7/2006            
2                    align_warp           8/8/2006            
3                    align_warp           8/10/2006           
4                    align_warp           8/11/2006           

  • stageInstance
Note: The notion of stage was not initially in our model and we think it needs to be better defined. However, we added it to be able to answer some of the queries in the challenge.

stageInstance(step string, stage string) stores the relationship between a stage and a step. As an example, steps 1 through 4 are in stage 1.

STEP                 STAGE
-----------------------------------------------------       
1                    1          
2                    1          
3                    1          
4                    1       
  • Process
This table is automatically generated from the previous tables. It represents the details of each execution including the step, step-class, input, output, parameter, and starting time.
CREATE VIEW Process 
AS 
    SELECT DISTINCT 
          instanceOf.step      step, 
          instanceOf.stepClass stepClass, 
          input.dataId         input, 
          dataInput.name       inputName,
          output.dataId        output, 
          dataOutput.name      outputName,
          input.ts             time 
     FROM 
           instanceOf, 
           input, 
           data dataInput,
           output,
           data dataOutput
     WHERE instanceOf.step = input.step 
       AND input.dataId = dataInput.dataId
       AND input.step = output.step 
       AND output.dataId = dataOutput.dataId
       AND input.ts <= output.ts
As an example, step 13 is an execution of the step-class convert. It takes an input of DATAID 25 (Atlas X Slice) and was executed on 08/20, 2006. It does not take any parameters. It produces the output of DATAID 28 (Atlas X Graphic).
   STEP STEPCLASS  INPUT  INPUTNAME OUTPUT OUTPUTNAME  TIME 
---------------------------------------------------------------- 
  13   convert    25     Atlas X Slice 28  Atlas X Graphic 8/20/2006 

Provenance Queries


Also, make sure you complete the ProvenanceQueriesMatrix.

Added:
>
>
For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our JAVA user interface).

  • Query1
    • Query: Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
     FROM process 
     START WITH output = 
              (
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
               ) 
     CONNECT BY PRIOR input = output 
     ORDER BY step
  • Note: This query exploits the Connect by prior operator of Oracle which implements the transitive closure function. Transitive closure is necessary for returning all the data (recursively) that have been used to compute Atlas X Graphic.
    • Result (SQL):
STEP STEPCLASS   INPUT    INPUTNAME            OUTPUT OUTPUTNAME         
----------------------------------------------------------------  
1    align_warp  1        Anatomy Image1       11     Warp Parameters1    
1    align_warp  10       Reference Header     11     Warp Parameters1    
1    align_warp  2        Anatomy Header1      11     Warp Parameters1    
1    align_warp  9        Reference Image      11     Warp Parameters1    
10   slicer      23       Atlas Image          25     Atlas X Slice       
10   slicer      24       Atlas Header         25     Atlas X Slice       
13   convert     25       Atlas X Slice        28     Atlas X Graphic     
2    align_warp  10       Reference Header     12     Warp Parameters2    
2    align_warp  3        Anatomy Image2       12     Warp Parameters2    
2    align_warp  4        Anatomy Header2      12     Warp Parameters2    
2    align_warp  9        Reference Image      12     Warp Parameters2    
3    align_warp  10       Reference Header     13     Warp Parameters3    
3    align_warp  5        Anatomy Image3       13     Warp Parameters3    
3    align_warp  6        Anatomy Header3      13     Warp Parameters3    
3    align_warp  9        Reference Image      13     Warp Parameters3    
4    align_warp  10       Reference Header     14     Warp Parameters4    
4    align_warp  7        Anatomy Image4       14     Warp Parameters4    
4    align_warp  8        Anatomy Header4      14     Warp Parameters4    
4    align_warp  9        Reference Image      14     Warp Parameters4    
5    reslice     11       Warp Parameters1     15     Resliced Image1     
5    reslice     11       Warp Parameters1     16     Resliced Header1    
6    reslice     12       Warp Parameters2     17     Resliced Image2     
6    reslice     12       Warp Parameters2     18     Resliced Header2    
7    reslice     13       Warp Parameters3     19     Resliced Image3     
7    reslice     13       Warp Parameters3     20     Resliced Header3    
8    reslice     14       Warp Parameters4     21     Resliced Image4     
8    reslice     14       Warp Parameters4     22     Resliced Header4    
9    softmean    15       Resliced Image1      23     Atlas Image         
9    softmean    15       Resliced Image1      24     Atlas Header        
9    softmean    16       Resliced Header1     23     Atlas Image         
9    softmean    16       Resliced Header1     24     Atlas Header        
9    softmean    17       Resliced Image2      23     Atlas Image         
9    softmean    17       Resliced Image2      24     Atlas Header        
9    softmean    18       Resliced Header2     23     Atlas Image         
9    softmean    18       Resliced Header2     24     Atlas Header        
9    softmean    19       Resliced Image3      23     Atlas Image         
9    softmean    19       Resliced Image3      24     Atlas Header        
9    softmean    20       Resliced Header3     23     Atlas Image         
9    softmean    20       Resliced Header3     24     Atlas Header        
9    softmean    21       Resliced Image4      23     Atlas Image         
9    softmean    21       Resliced Image4      24     Atlas Header        
9    softmean    22       Resliced Header4     23     Atlas Image         
9    softmean    22       Resliced Header4     24     Atlas Header    
The results may also be visualized with a user-friendly interface, where the data is grouped by steps. The data can be retrieved either by name or by id (the id option may be useful when a lot of data is retuned by a query).
    • Result (user friendly view):
To be included soon

* Query2

    • Query: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName            
     FROM Process                                                                
     START WITH output = (                                                       
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'            
          )                                                                      
     CONNECT BY PRIOR input = output                                             
 MINUS (                                                                         
     SELECT DISTINCT step, stepClass, input, inputName, output, outputName       
          FROM Process                                                           
               START WITH output IN (                                            
                    SELECT input FROM Process WHERE stepClass = 'softmean'       
               )                                                                 
     CONNECT BY PRIOR input = output                                             
 )                                                                              
    • Result (SQL):
STEP    STEPCLASS   INPUT INPUTNAME            OUTPUT OUTPUTNAME   
----------------------------------------------------------------     
10      slicer      23    Atlas Image          25     Atlas X Slice   
10      slicer      24    Atlas Header         25     Atlas X Slice   
13      convert     25    Atlas X Slice        28     Atlas X Graphic 
9       softmean    15    Resliced Image1      23     Atlas Image     
9       softmean    15    Resliced Image1      24     Atlas Header    
9       softmean    16    Resliced Header1     23     Atlas Image     
9       softmean    16    Resliced Header1     24     Atlas Header    
9       softmean    17    Resliced Image2      23     Atlas Image     
9       softmean    17    Resliced Image2      24     Atlas Header    
9       softmean    18    Resliced Header2     23     Atlas Image     
9       softmean    18    Resliced Header2     24     Atlas Header    
9       softmean    19    Resliced Image3      23     Atlas Image     
9       softmean    19    Resliced Image3      24     Atlas Header    
9       softmean    20    Resliced Header3     23     Atlas Image     
9       softmean    20    Resliced Header3     24     Atlas Header    
9       softmean    21    Resliced Image4      23     Atlas Image     
9       softmean    21    Resliced Image4      24     Atlas Header    
9       softmean    22    Resliced Header4     23     Atlas Image     
9       softmean    22    Resliced Header4     24     Atlas Header    
    • Result (User friendly view): To be included soon

  • Query3
    • Query: Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName 
     FROM process                                                     
     WHERE step IN (                                                  
          SELECT step FROM StageInstance                              
          WHERE (                                                     
               stage in (3,4,5)                                       
          )                                                           
     )                                                                
     START WITH output = (                                            
           SELECT dataId FROM data WHERE name = 'Atlas X Graphic'     
     )                                                                
     CONNECT BY PRIOR input = output                                 
    • Result (SQL):
STEP  STEPCLASS INPUT INPUTNAME         OUTPUT  OUTPUTNAME     
----------------------------------------------------------------       
9     softmean  19    Resliced Image3   23      Atlas Image          
9     softmean  21    Resliced Image4   23      Atlas Image          
9     softmean  17    Resliced Image2   24      Atlas Header         
9     softmean  20    Resliced Header3  24      Atlas Header         
9     softmean  15    Resliced Image1   24      Atlas Header         
9     softmean  19    Resliced Image3   24      Atlas Header         
13    convert   25    Atlas X Slice     28      Atlas X Graphic      
9     softmean  15    Resliced Image1   23      Atlas Image          
9     softmean  18    Resliced Header2  23      Atlas Image          
9     softmean  16    Resliced Header1  23      Atlas Image          
9     softmean  18    Resliced Header2  24      Atlas Header         
9     softmean  17    Resliced Image2   23      Atlas Image          
10    slicer    23    Atlas Image       25      Atlas X Slice        
9     softmean  22    Resliced Header4  23      Atlas Image          
9     softmean  16    Resliced Header1  24      Atlas Header         
9     softmean  22    Resliced Header4  24      Atlas Header         
9     softmean  20    Resliced Header3  23      Atlas Image          
10    slicer    24    Atlas Header      25      Atlas X Slice        
9     softmean  21    Resliced Image4   24      Atlas Header         
    • Result (User friendly view): To be included soon

* Query4

    • Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
SELECT step, stepClass, input, inputName, output, outputName, time    
     FROM process                                                     
     WHERE stepClass = 'align_warp'                                   
     AND EXISTS                                                       
        (                                                             
       SELECT value                                                   
         FROM stepParams                                              
       WHERE stepParams.step = process.step                           
         AND attribute='order'                                        
         AND value='12'                                               
       )                                                              
     AND EXISTS                                                       
        (                                                             
       SELECT value                                                   
         FROM stepParams                                              
       WHERE stepParams.step = process.step                           
         AND attribute='model'                                        
         AND value='1365'                                             
       )                                                              
    AND rtrim(to_char(time, 'DAY')) = 'MONDAY'                       
    • Result (SQL):
STEP  STEPCLASS    INPUT  INPUTNAME         OUTPUT  OUTPUTNAME        TIME 
----------------------------------------------------------------      
1     align_warp   1      Anatomy Image1    11      Warp Parameters1  8/7/2006  
1     align_warp   2      Anatomy Header1   11      Warp Parameters1  8/7/2006  
1     align_warp   10     Reference Header  11      Warp Parameters1  8/7/2006  
1     align_warp   9      Reference Image   11      Warp Parameters1  8/7/2006  
    • Result (User friendly view): To be included soon

  • Query5
    • Query: Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
SELECT name                                                             
  FROM data                                                             
  WHERE dataId IN                                                       
          (                                                             
        SELECT output                                                   
          FROM process                                                  
          START WITH input IN (                                         
                    SELECT data.dataId                                  
                      FROM data, dataAttributes                         
                      WHERE data.dataId = dataAttributes.dataId         
                        AND data.type = 'Anatomy Header'                
                      AND dataAttributes.attribute = 'global maximum'   
                      AND dataAttributes.value = '4095'                 
                   )                                                    
          CONNECT BY PRIOR output=input                                 
         )                                                              
  AND type = 'Atlas Graphic'            
  • Result (SQL):
NAME 
---------------------             
Atlas X Graphic      
Atlas Y Graphic      
Atlas Z Graphic

  • Query6
    • _Query: Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align-warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align-warp procedure with argument -m 12." _
SELECT distinct pred.orig,  data.name               
  FROM                                              
       (                                            
    SELECT distinct CONNECT_BY_ROOT step orig, step 
      FROM process                                  
      START WITH stepClass='softmean'               
      CONNECT BY PRIOR input=output                 
     ) pred,                                        
     output,                                        
     data                                           
  WHERE                                             
    EXISTS (                                        
              SELECT *                              
          FROM stepParams                           
                WHERE attribute = 'order'           
                AND value = 12                      
              AND stepParams.step = pred.step       
           )                                        
     AND EXISTS                                     
         (                                          
              SELECT *                              
          FROM stepParams                           
                WHERE attribute = 'model'           
          AND value = 1365                          
          AND stepParams.step = pred.step           
           )                                        
  AND pred.orig = output.step                       
  AND output.dataId = data.dataId                  
    • Result (SQL):
ORIG                 NAME  
---------------------------------               
9                    Atlas Header         
9                    Atlas Image          

  • Query7
    • Query:_A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant. _

    • Note: Computing the difference between two workflows or two workflow executions will be possible in our next implementation. Query 7 is a very simple example of the problem. However, when the difference between the data used by a step or produced by a step is considered, the problem becomes much harder. We would be very interested in talking about this problem with the workshop attendees.

  • Query8
    • Query: _A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align-warp where the inputs are annotated with center=UChicago. _
SELECT distinct data.name                      
  FROM process, dataAttributes, data           
  WHERE process.stepClass='align_warp'         
    AND process.input = dataAttributes.dataId  
  AND dataAttributes.attribute = 'center'      
  AND dataAttributes.value = 'UChicago'        
  AND process.output = data.dataId            
    • Result:
NAME     
------------------            
Warp Parameters2     
Warp Parameters1  

  • Query9
    • Query: A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
SELECT distinct data.dataId, data.name, attribute, value                      
  FROM data, dataAttributes, attribute, value                                 
  WHERE                                                     
    data.type='Atlas Graphic'                               
  AND data.dataId = dataAttributes.dataId                   
  AND dataAttributes.attribute = 'studyModality'            
  AND dataAttributes.value IN ('speech', 'visual', 'audio')
    • Result:
DATAID               NAME                 ATTRIBUTE            VALUE   
---------------------------------------------------------------------             
29                   Atlas Y Graphic      studyModality        audio                
29                   Atlas Y Graphic      studyModality        visual               
30                   Atlas Z Graphic      studyModality        speech  

Suggested Workflow Variants

Changed:
<
<
Suggest variants of the workflow that can exhibit capabilities that your system support.
>
>
The model we propose takes into account the capabilities of many of the current scientific workflow systems; one such capability is the idea of composite step-classes, that is, step-classes which are composed of step-classes. There are several reasons why composite step classes are useful in workflows. First, they help to manage the complexity of a workflow; users may wish to focus on a certain level of abstraction and ignore the lower levels of detail. Second, composite step-classes can also represent levels of ``authorization''; users without the appropriate clearance level would not be allowed to see the lower level executions of a step-class.

Added:
>
>
Typically, an input to a composite step-class is also an input to one or more of its substep classes while an output of a substep class is either an input to another substep class or becomes the output of a composite step class. Also, composite step-classes define a partition of the step-classes of the workflow (there is no step-class which belongs to multiple composite step-classes).

Changed:
<
<

Suggested Queries

>
>
We have therefore defined the following notion of user views:

Given a workflow specification, the user view of a user (or class of users) U, userView(U), is the set of lowest-level step classes that U is entitled to see.

Note that a user view cannot contain two step classes such that one is contained in the other. We also assume that the user view is valid, i.e. that each of the highest-level step classes in the workflow specification is either in the view, or that at some lower-level all of its contained substeps are in the user view.

To answer questions of provenance, we must take the user view into account and reason about the input and output to steps which are instances of step-classes that are in the user view. We must know the containment relationship between step-classes and the relationship between each user and the step-classes he is entitled to see.

In the following, we introduce the part of our model which deals with user views and we will give examples based on a variation of the provenance workflow in the challenge. In the following discussion, we refer to composite steps as “boxes”.

wf.png Figure 2 – Workflow for the challenge with user views (boxes 1 to 3)

UserView Application

In this section, we introduce some additional tables which are needed for reasoning about provenance in the context of user views:

  • immContains
immContains(compoStepClass string, stepClass string) indicates the immediate containment between step-classes.

As an example, box 1 covers all the step-class align_warp and reslice

COMPOSTEPCLASS       STEPCLASS 
---------------------------------           
box1                 align_warp           
box1                 reslice

  • contains
For each step class, contains gives all the step classes it transitively contains.
CREATE VIEW
AS  
   SELECT DISTINCT CONNECT_BY_ROOT compoStepClass as compoStepClass, stepClass
     FROM immcontains
     CONNECT BY PRIOR stepClass=compoStepClass
     ORDER BY compoStepClass

  • virtualInstanceOf
instanceOf does not contain information about composite steps. virtualInstanceOf computes instanceOf for composite steps and contains information about both base and composite steps.
CREATE VIEW virtualInstanceOf
AS
  SELECT step step, stepClass, ts
    FROM instanceOf   
  UNION  
  select min(step) || '-' || compoStepClass, compoStepClass, min(ts)
    from contains, instanceOf
   where contains.stepClass = instanceOf.stepClass
    group By compoStepClass                                     
For example, step 1 is an instance of align_wrap (as previously). Step 1-box1 denotes the first step contained in box1, it is an instance of the composite step-class box1.
STEP      STEPCLASS   TS
-------------------------------         
1         align_warp  8/7/2006   
1-box1    box1        8/7/2006   

  • userView
userView(usr string, stepClass string) indicates the lowest step-class level that a user can see. Remember that the union of all the step-classes a user can see must cover the whole workflow (user views are valid).

As an example, the user ‘uBlackBox’ sees the whole workflow as one step. The user ‘uBio’ cannot see the detail of step-class align_warp and reslice separately (he sees box1 instead). He cannot see the details of step-class slicer or convert (he sees box2 instead). However, he can see the step-class softmean. UAdmin can see all step-classes.

USR                  STEPCLASS  
--------------------------------          
uBlackBox            box3                 
uBio                 box1                 
uBio                 softmean             
uBio                 box2                 
uAdmin               align_warp           
uAdmin               reslice              
uAdmin               softmean             
uAdmin               slicer               
uAdmin               convert    

The following two tables, Cinput and Coutput are input and output extended to composite steps.

  • Cinput
Similarly to input, Cinput stores information about inputs to a composite step and the time of the input to the box within the context of a user view. Specifically, an input to a box is an input to one of the steps the box contains, but must not be an output of any step contained in the box.
CREATE VIEW cInput           
AS              
 -- First union operand: "For each containment, the input is an input of  
   -- the higher level step class if and only if this input was not an output 
   -- from another of its sub steps        
   SELECT inst2.step, input.dataId, min(input.ts) ts      
    FROM input,            
     instanceOf inst1,          
       contains,           
       virtualInstanceOf inst2         
    WHERE             
     input.step = inst1.step         
      AND inst1.stepClass = contains.stepClass       
      AND contains.compoStepClass = inst2.stepClass      
      AND NOT EXISTS           
            (             
             SELECT o.step          
             FROM output o, instanceOf inst3, contains c2     
             WHERE o.dataId = input.dataId       
              AND o.step = inst3.step        
                AND inst3.stepClass = c2.stepClass       
                AND c2.compoStepClass = contains.compoStepClass    
          )             
    GROUP BY inst2.step, input.dataId        
   -- Second union operand          
   -- We want to keep the direct input relations       
   UNION             
   SELECT step, dataId, ts          
    FROM input            
As an example, the step 10-box2 takes DATAID 23 (Atlas Image) and DATAID 24 (Atlas Header) as inputs.
STEP    DATAID  TS    
--------------------------              
10-box2 23      8/17/2006           
10-box2 24      8/17/2006           

  • Coutput
Similarly to output, Coutput stores information of outputs produced from each composite step and the time of the output to the box within the context of a user view. An output from a box is an output from one of the steps the box contains, but must not be an input to any step contained in the box.
CREATE VIEW coutput                     
AS                          
  -- First union operand: "For each containment, the output is an output of 
  -- the higher level step class if and only if this output is used as an   
  -- input for another step class which is not contained in the same step class 
  -- or if this is a final output of the workflow           
   SELECT inst2.step, output.dataId, output.ts             
  FROM output,                      
   instanceOf inst1,                  
   contains,                      
   virtualInstanceOf inst2                  
  WHERE                         
    output.step = inst1.step                
      AND inst1.stepClass = contains.stepClass            
      AND contains.compoStepClass = inst2.stepClass           
      AND EXISTS                      
       (                         
      /* Step class is used as an entry of 
       *  at least one step       
       * class which is not a substep of 
         this step class */       
         SELECT i.step                    
         FROM input i, instanceOf inst3             
         WHERE i.dataId = output.dataId             
                  AND i.step = inst3.step                
         AND NOT EXISTS                   
            (                      
             SELECT contains.compoStepClass            
             FROM contains                 
             WHERE contains.stepClass = inst3.stepClass        
             AND contains.compoStepClass = inst2.stepClass       
            )                      
         UNION                      
          -- Step class is a final output?            
         (                       
         SELECT output.dataId                
         FROM dual                   
         MINUS                     
         SELECT i.dataId                   
         FROM input i                  
         )                       
        )                       
   UNION                         
   -- Second union operand                   
   -- We want to keep the direct output relations          
   SELECT step, dataId, ts                   
  FROM output        
As an example, step 10-box2 produced data ID 28 to 30 (Atlas Graphic X, Y and Z) on 08/20/2006, 08/21/2006, and 08/22/2006 respectively.
STEP         DATAID  TS  
---------------------------- 
10-box2      28 8/20/2006           
10-box2      29 8/21/2006           
10-box2      30 8/22/2006

  • UProcess
Similarly to Process, UProcess represents the detail of each execution which a user can see.
CREATE VIEW UProcess                                    
AS                                                      
    SELECT DISTINCT                                     
        userView.usr            usr,                        
        instanceOf.step         step,                     
        instanceOf.stepClass    stepClass,              
        input.dataId            input,                        
        dataInput.name          inputName,                
        output.dataId           output,                     
        dataOutput.name         outputName,               
        input.ts                time                            
     FROM                                               
           userView,                                    
         virtualInstanceOf instanceOf,                  
           cinput input,                                
           data dataInput,                              
           coutput output,                              
           data dataOutput                              
     WHERE                                              
               userView.stepClass = instanceOf.stepClass
         AND instanceOf.step = input.step               
       AND input.dataId = dataInput.dataId              
       AND input.step = output.step                     
       AND output.dataId = dataOutput.dataId            
       AND input.ts <= output.ts                        
   --ORDER BY usr, step, input                          
As an example, let us consider three user views and their perspectives on stage 1 of the workflow (align_warp).

The user ‘uAdmin’ can see all the steps of the workflow. First, he can see the wrap param images produced (DATAID 11 to 14). Then, he can see that only data ID 1, 2, 9, and 10 are used as inputs to step 1.


Changed:
<
<
Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.
>
>
The user ‘uBio’ sees stages 1 and 2 of the workflow as a single black box. First, he cannot see data whose DATAID are 11 to 13. Second, for him DATAIDs 1 through 10 are used to produce DATAIDs 15 through 22 as outputs.

Added:
>
>
The user ‘uBlackBox’ cannot see the details of step 1 either. He is in the same case as UBio except that he cannot see data whose DATAIDs are 15 through 22. The only output that he can see are the final outputs that is, DATAID 28 through 30. The implications of user views when it comes to query answering, are discussed in the following section.

Suggested Queries

In this section, we answer query 1 considering user views and we introduce two other queries to emphasize the benefit of our approach. We will show that user views allow

  • (i) to consider security levels (when the user is not allowed to see the details of a box and which are the inputs used to produced a given output), see the next query and the first version of new query, and
  • (ii) to focus on high-level information (when the user is able to know which are the inputs used to produced a given output but is not interested in knowing the details about the intermediate steps), see the second version of new query.

* Query1 for each user

    • Results obtained in a user friendly format (to be included soon)
    • Query for user “uBlackbox”
      • Query:
SELECT DISTINCT step, stepClass, input, inputName, output, outputName     
     FROM uProcess                                                        
   WHERE usr='uBlackBox'                                                     
     START WITH output = (                                                
               SELECT dataId FROM data WHERE name = 'Atlas X Graphic'     
               )                                                          
     CONNECT BY PRIOR input = output                                      
     ORDER BY step;                                                        
      • Result:
STEP       STEPCLASS   INPUT    INPUTNAME            OUTPUT   OUTPUTNAME        
-----------------------------------------------------------------------------
1-box3     box3        1        Anatomy Image1       28       Atlas X Graphic   
1-box3     box3        10       Reference Header     28       Atlas X Graphic   
1-box3     box3        2        Anatomy Header1      28       Atlas X Graphic   
1-box3     box3        3        Anatomy Image2       28       Atlas X Graphic   
1-box3     box3        4        Anatomy Header2      28       Atlas X Graphic   
1-box3     box3        5        Anatomy Image3       28       Atlas X Graphic   
1-box3     box3        6        Anatomy Header3      28       Atlas X Graphic   
1-box3     box3        7        Anatomy Image4       28       Atlas X Graphic   
1-box3     box3        8        Anatomy Header4      28       Atlas X Graphic   
1-box3     box3        9        Reference Image      28       Atlas X Graphic    
      • Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by UBlackBox?.
    • Query for user “uBio”
      • Query: As previously with WHERE users = ‘uBio’
      • Result:
STEP   STEPCLASS INPUT INPUTNAME        OUTPUT OUTPUTNAME      
1-box1 box1      1     Anatomy Image1   15     Resliced Image1 
1-box1 box1      1     Anatomy Image1   16     Resliced Header1
1-box1 box1      1     Anatomy Image1   17     Resliced Image2 
1-box1 box1      1     Anatomy Image1   18     Resliced Header2
1-box1 box1      1     Anatomy Image1   19     Resliced Image3 
1-box1 box1      1     Anatomy Image1   20     Resliced Header3
1-box1 box1      1     Anatomy Image1   21     Resliced Image4 
1-box1 box1      1     Anatomy Image1   22     Resliced Header4
1-box1 box1      10    Reference Header 15     Resliced Image1 
1-box1 box1      10    Reference Header 16     Resliced Header1
1-box1 box1      10    Reference Header 17     Resliced Image2 
1-box1 box1      10    Reference Header 18     Resliced Header2
1-box1 box1      10    Reference Header 19     Resliced Image3 
1-box1 box1      10    Reference Header 20     Resliced Header3
1-box1 box1      10    Reference Header 21     Resliced Image4 
1-box1 box1      10    Reference Header 22     Resliced Header4
1-box1 box1      2     Anatomy Header1  15     Resliced Image1 
1-box1 box1      2     Anatomy Header1  16     Resliced Header1
1-box1 box1      2     Anatomy Header1  17     Resliced Image2 
1-box1 box1      2     Anatomy Header1  18     Resliced Header2
1-box1 box1      2     Anatomy Header1  19     Resliced Image3 
1-box1 box1      2     Anatomy Header1  20     Resliced Header3
1-box1 box1      2     Anatomy Header1  21     Resliced Image4 
1-box1 box1      2     Anatomy Header1  22     Resliced Header4
1-box1 box1      3     Anatomy Image2   15     Resliced Image1 
1-box1 box1      3     Anatomy Image2   16     Resliced Header1
1-box1 box1      3     Anatomy Image2   17     Resliced Image2 
1-box1 box1      3     Anatomy Image2   18     Resliced Header2
1-box1 box1      3     Anatomy Image2   19     Resliced Image3 
1-box1 box1      3     Anatomy Image2   20     Resliced Header3
1-box1 box1      3     Anatomy Image2   21     Resliced Image4 
1-box1 box1      3     Anatomy Image2   22     Resliced Header4
1-box1 box1      4     Anatomy Header2  15     Resliced Image1 
1-box1 box1      4     Anatomy Header2  16     Resliced Header1
1-box1 box1      4     Anatomy Header2  17     Resliced Image2 
1-box1 box1      4     Anatomy Header2  18     Resliced Header2
1-box1 box1      4     Anatomy Header2  19     Resliced Image3 
1-box1 box1      4     Anatomy Header2  20     Resliced Header3
1-box1 box1      4     Anatomy Header2  21     Resliced Image4 
1-box1 box1      4     Anatomy Header2  22     Resliced Header4
1-box1 box1      5     Anatomy Image3   15     Resliced Image1 
1-box1 box1      5     Anatomy Image3   16     Resliced Header1
1-box1 box1      5     Anatomy Image3   17     Resliced Image2 
1-box1 box1      5     Anatomy Image3   18     Resliced Header2
1-box1 box1      5     Anatomy Image3   19     Resliced Image3 
1-box1 box1      5     Anatomy Image3   20     Resliced Header3
1-box1 box1      5     Anatomy Image3   21     Resliced Image4 
1-box1 box1      5     Anatomy Image3   22     Resliced Header4
1-box1 box1      6     Anatomy Header3  15     Resliced Image1 
1-box1 box1      6     Anatomy Header3  16     Resliced Header1
1-box1 box1      6     Anatomy Header3  17     Resliced Image2 
1-box1 box1      6     Anatomy Header3  18     Resliced Header2
1-box1 box1      6     Anatomy Header3  19     Resliced Image3 
1-box1 box1      6     Anatomy Header3  20     Resliced Header3
1-box1 box1      6     Anatomy Header3  21     Resliced Image4 
1-box1 box1      6     Anatomy Header3  22     Resliced Header4
1-box1 box1      7     Anatomy Image4   15     Resliced Image1 
1-box1 box1      7     Anatomy Image4   16     Resliced Header1
1-box1 box1      7     Anatomy Image4   17     Resliced Image2 
1-box1 box1      7     Anatomy Image4   18     Resliced Header2
1-box1 box1      7     Anatomy Image4   19     Resliced Image3 
1-box1 box1      7     Anatomy Image4   20     Resliced Header3
1-box1 box1      7     Anatomy Image4   21     Resliced Image4 
1-box1 box1      7     Anatomy Image4   22     Resliced Header4
1-box1 box1      8     Anatomy Header4  15     Resliced Image1 
1-box1 box1      8     Anatomy Header4  16     Resliced Header1
1-box1 box1      8     Anatomy Header4  17     Resliced Image2 
1-box1 box1      8     Anatomy Header4  18     Resliced Header2
1-box1 box1      8     Anatomy Header4  19     Resliced Image3 
1-box1 box1      8     Anatomy Header4  20     Resliced Header3
1-box1 box1      8     Anatomy Header4  21     Resliced Image4 
1-box1 box1      8     Anatomy Header4  22     Resliced Header4
1-box1 box1      9     Reference Image  15     Resliced Image1 
1-box1 box1      9     Reference Image  16     Resliced Header1
1-box1 box1      9     Reference Image  17     Resliced Image2 
1-box1 box1      9     Reference Image  18     Resliced Header2
1-box1 box1      9     Reference Image  19     Resliced Image3 
1-box1 box1      9     Reference Image  20     Resliced Header3
1-box1 box1      9     Reference Image  21     Resliced Image4 
1-box1 box1      9     Reference Image  22     Resliced Header4
10-box2box2      23    Atlas Image      28     Atlas X Graphic 
10-box2box2      24    Atlas Header     28     Atlas X Graphic 
9      softmean  15    Resliced Image1  23     Atlas Image     
9      softmean  15    Resliced Image1  24     Atlas Header    
9      softmean  16    Resliced Header1 23     Atlas Image     
9      softmean  16    Resliced Header1 24     Atlas Header    
9      softmean  17    Resliced Image2  23     Atlas Image     
9      softmean  17    Resliced Image2  24     Atlas Header    
9      softmean  18    Resliced Header2 23     Atlas Image     
9      softmean  18    Resliced Header2 24     Atlas Header    
9      softmean  19    Resliced Image3  23     Atlas Image     
9      softmean  19    Resliced Image3  24     Atlas Header    
9      softmean  20    Resliced Header3 23     Atlas Image     
9      softmean  20    Resliced Header3 24     Atlas Header    
9      softmean  21    Resliced Image4  23     Atlas Image     
9      softmean  21    Resliced Image4  24     Atlas Header    
9      softmean  22    Resliced Header4 23     Atlas Image     
9      softmean  22    Resliced Header4 24     Atlas Header    
      • Note: UBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_wrap. He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
    • Query for user “uAdmin”
      • Query: As previously with WHERE users = ‘uAdmin’
      • Result:
STEP   STEPCLASS      INPUT  INPUTNAME            OUTPUT  OUTPUTNAME        
--------------------------------------------------------------------------
1      align_warp     1      Anatomy Image1       11      Warp Parameters1  
1      align_warp     10     Reference Header     11      Warp Parameters1  
1      align_warp     2      Anatomy Header1      11      Warp Parameters1  
1      align_warp     9      Reference Image      11      Warp Parameters1  
10     slicer         23     Atlas Image          25      Atlas X Slice     
10     slicer         24     Atlas Header         25      Atlas X Slice     
13     convert        25     Atlas X Slice        28      Atlas X Graphic   
2      align_warp     10     Reference Header     12      Warp Parameters2  
2      align_warp     3      Anatomy Image2       12      Warp Parameters2  
2      align_warp     4      Anatomy Header2      12      Warp Parameters2  
2      align_warp     9      Reference Image      12      Warp Parameters2  
3      align_warp     10     Reference Header     13      Warp Parameters3  
3      align_warp     5      Anatomy Image3       13      Warp Parameters3  
3      align_warp     6      Anatomy Header3      13      Warp Parameters3  
3      align_warp     9      Reference Image      13      Warp Parameters3  
4      align_warp     10     Reference Header     14      Warp Parameters4  
4      align_warp     7      Anatomy Image4       14      Warp Parameters4  
4      align_warp     8      Anatomy Header4      14      Warp Parameters4  
4      align_warp     9      Reference Image      14      Warp Parameters4  
5      reslice        11     Warp Parameters1     15      Resliced Image1   
5      reslice        11     Warp Parameters1     16      Resliced Header1  
6      reslice        12     Warp Parameters2     17      Resliced Image2   
6      reslice        12     Warp Parameters2     18      Resliced Header2  
7      reslice        13     Warp Parameters3     19      Resliced Image3   
7      reslice        13     Warp Parameters3     20      Resliced Header3  
8      reslice        14     Warp Parameters4     21      Resliced Image4   
8      reslice        14     Warp Parameters4     22      Resliced Header4  
9      softmean       15     Resliced Image1      23      Atlas Image       
9      softmean       15     Resliced Image1      24      Atlas Header      
9      softmean       16     Resliced Header1     23      Atlas Image       
9      softmean       16     Resliced Header1     24      Atlas Header      
9      softmean       17     Resliced Image2      23      Atlas Image       
9      softmean       17     Resliced Image2      24      Atlas Header      
9      softmean       18     Resliced Header2     23      Atlas Image       
9      softmean       18     Resliced Header2     24      Atlas Header      
9      softmean       19     Resliced Image3      23      Atlas Image       
9      softmean       19     Resliced Image3      24      Atlas Header      
9      softmean       20     Resliced Header3     23      Atlas Image       
9      softmean       20     Resliced Header3     24      Atlas Header      
9      softmean       21     Resliced Image4      23      Atlas Image       
9      softmean       21     Resliced Image4      24      Atlas Header      
9      softmean       22     Resliced Header4     23      Atlas Image       
9      softmean       22     Resliced Header4     24      Atlas Header      
      • Note: All the data are obtained.

  • New Query: What are the inputs that “Resliced Image1” (ID 15) depends on?
We will consider two versions of this query. The first version (security) considers that for security reasons, the users are not allowed to know the details about the inputs used to produced a given output. The second version (high-level) considers that the users are allowed to know the details about the inputs used to produced a given output. In both situations, intermediate data and steps will not be known by the users according to their user view.

    • First version (security) of the query for ‘uBlack’
      • Query:
SELECT *
  FROM uProcess  upc 
     WHERE usr = 'uBlack'
     START WITH outputName = 'Resliced Image1'
     CONNECT BY PRIOR upc.output = upc.input;  
      • Result: no rows selected
(uBlackBox can’t even see ‘Resliced Image1’)

    • First version of the query for ‘uBio’
      • Query: Idem with ‘uBio’
      • Result:
  INPUT
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10
Note that as expected, a lot of inputs are provided since this user sees the two first stages as a black box. Consequently, he cannot distinguish which input is used to produce which output. This can be interesting for security reason.

    • First version (security) of the query for ‘uAdmin’
      • Query: Idem with ‘uAdmin’
      • Result:
INPUT
----------
         1
         2
         9
        10
        11
Since uAdmin can see everything, he knows that only DATAIDs 1, 2 and 9 through 11 have been used as inputs.

    • Second version (high-level) of the query for ‘uBlack’
      • Query:

      • Result: no rows selected
(As previously, uBlackBox can’t even see ‘Resliced Image1’)

    • Second version (high-level) of the query for ‘uBio’
      • Query: Idem with ‘uBio’
      • Result:
INPUT
----------
         1
         2
         9
        10
        11
Note that as expected, only the right inputs are provided.

    • Second version (high-level) of the query for ‘uAdmin’
      • Query: Idem with ‘uAdmin’
      • Result:
INPUT
----------
         1
         2
         9
        10
        11
Same results as previously.

Categorisation of queries

Changed:
<
<
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.
>
>
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale

Generally speaking, our provenance queries can be categorized as such:

  • step vs data provenance: depending on whether the user is interested in knowing the step(s) or the data that have been used to produce a given data item.
  • immediate vs deep provenance: depending on whether the user is interested in knowing only the previous step (and possibly the data used by this step) or the whole sequence of steps (and possibly the data used by these steps) used to produced a given data item.

According to our classification, all the queries proposed in this challenge are deep provenance queries, they are both step and data provenance queries.


Live systems

Changed:
<
<
If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.
>
>
Our implementation is not presently available to the public at large. However, we will be happy to share the code with those who are interested in seeing it. Please send us an email.

Further Comments

Deleted:
<
<
Provide here further comments.

Conclusions

Changed:
<
<
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
>
>
We have shown how our model is able to represent the challenge workflow and to answer the proposed queries. We have also shown how the notion of a user view can allow the user to manage the complexity of a workflow through higher levels of abstraction and layering.

We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a biologist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).

-- SarahCohenBoulakia - 10 Sep 2006

  • tables1.png:
    tables1.png

  • tables2.png:
    tables2.png

Added:
>
>
  • tables3.png:
    tables3.png

Changed:
<
<
-- SarahCohenBoulakia - 06 Sep 2006
>
>
  • tables4.png:
    tables4.png

Added:
>
>
META FILEATTACHMENT wf.png attr="" comment="" date="1157908776" path="C:\Documents and Settings\Sarah\Mes documents\wf.png" size="84256" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT er.jpg attr="" comment="" date="1157909001" path="C:\Documents and Settings\Sarah\Mes documents\er.jpg" size="163515" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables1.png attr="h" comment="" date="1157910807" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables1.png" size="1178542" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables2.png attr="" comment="" date="1157911122" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables2.png" size="547280" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables3.png attr="" comment="" date="1157911188" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables3.png" size="1178125" user="SarahCohenBoulakia" version="1.1"
META FILEATTACHMENT tables4.png attr="" comment="" date="1157911255" path="C:\Documents and Settings\Sarah\Mes documents\Mes images\tables4.png" size="1087598" user="SarahCohenBoulakia" version="1.1"
 <<O>>  Difference Topic UPenn (r1.3 - 07 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

Line: 9 to 9

  • Project URL: http://db.cis.upenn.edu
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is simple and general (i.e. can be used with existing workflow systems, such as Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, our model not only takes into account the chained and complex structure of scientific workflows, but allows asks for provenance at different levels of abstraction using user views.
  • Provenance-specific Overview: See above.
Changed:
<
<
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences,
Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.
>
>
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences, Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.

The following will be updated soon!

 <<O>>  Difference Topic UPenn (r1.2 - 07 Sep 2006 - SarahCohenBoulakia)

META TOPICPARENT WebHome

Provenance Challenge Template

 <<O>>  Difference Topic UPenn (r1.1 - 06 Sep 2006 - SarahCohenBoulakia)
Line: 1 to 1
Added:
>
>
META TOPICPARENT WebHome

Provenance Challenge Template

Participating Team (Under construction)

  • Short team name: Database Group, University of Pennsylvania
  • Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Thunyarat (Bam) Amornpetchkul
  • Project URL: http://db.cis.upenn.edu
  • Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is simple and general (i.e. can be used with existing workflow systems, such as Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, our model not only takes into account the chained and complex structure of scientific workflows, but allows asks for provenance at different levels of abstraction using user views.
  • Provenance-specific Overview: See above.
  • Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences,
Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.

The following will be updated soon!

Workflow Representation

Provide here a description of how you have encoded the Challenge workflow.

Provenance Trace

Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)

Provenance Queries

For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.

Also, make sure you complete the ProvenanceQueriesMatrix.

Suggested Workflow Variants

Suggest variants of the workflow that can exhibit capabilities that your system support.

Suggested Queries

Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.

Categorisation of queries

According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.

Live systems

If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.

Further Comments

Provide here further comments.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

-- SarahCohenBoulakia - 06 Sep 2006

Revision r1.1 - 06 Sep 2006 - 20:52 - SarahCohenBoulakia
Revision r1.14 - 20 Oct 2006 - 16:12 - SarahCohenBoulakia