Change Proposal: Remove IDs from Serialisation-Independent Model
Authors
SimonMiles
2009 July 21, extracted from previous discussion in ChangeProposalRemoveNonCore.
Subject
Core OPM specification
Background
Problem addressed
Artifact and process IDs appear to be serialisation-specific, so should not belong to the core model. In addition, an OPM graph is defined as edges between IDs rather than between artifacts and processes, which is not intuitive and contradicts the figures illustrating OPM (see my arguments for these assertions in the rationale and comments sections below).
Proposed solution
In the Provenance Graph Definition, place the same rules on artifacts and processes as accounts, i.e.
- "Artifacts are entities that we assume can be compared. Artifacts contain a placeholder for a domain specific value or reference to a piece of state. Two artifacts are equal if and only if they have the same identifier (irrespective of their placeholder contents). Artifacts can optionally belong to accounts: account membership is declared by listing the accounts an artifact belongs to."
- "Processes are entities that we assume can be compared. Processes can optionally belong to accounts: account membership is declared by listing the accounts a process belongs to."
In the formalisation, replace the following:
- "We assume the existence of a few primitive sets: identifiers for processes, artifacts and agents, roles, and accounts. These sets of identifiers provide identities to the corresponding entities within the scope of a given provenance graph. A given serialization will standardize on these sets, and provide concrete representations for them."
with:
- "A serialisation of OPM will provide means to declare two accounts, two artifacts or two processes to be equal to each other, e.g. IDs scoped locally to the serialised graph."
Remove the following, and put it instead in the specification of the XML serialisation of OPM:
- "It is important to stress that the purpose of these identifiers is to define the structure of graphs: they are not meant to define identities that are persistent and reliably resolvable over time."
In the formalism, take Account, Role, Process, Agent, Artifact, Value to be the primitive sets; define ArtifactValue to be a mapping from an Artifact to its Value; define ArtifactAccounts to be a mapping from an Artifact to its set of Accounts; define ProcessAccounts to be a mapping from a Process to its set of Accounts; use Artifact/Process instead of ArtifactID/ProcessID in defining the causal relationships.
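A minimal sketch of this revised formalism, assuming Python dataclasses stand in for the primitive sets. The class and field names (`Used`, `WasGeneratedBy`, `Graph`, `artifact_value` and so on) are illustrative choices, not names taken from the OPM specification, and roles are simplified to strings. The point it tries to show is that edges can hold the nodes themselves, with equality given by node identity rather than by an ID field.

```python
from dataclasses import dataclass, field


# eq=False keeps Python's default identity-based equality and hashing,
# so two nodes are equal only if they are the very same node.
@dataclass(eq=False)
class Account:
    pass


@dataclass(eq=False)
class Artifact:
    pass


@dataclass(eq=False)
class Process:
    pass


# Causal relationships refer to Artifact/Process directly, not to IDs.
@dataclass(frozen=True)
class Used:
    process: Process
    role: str
    artifact: Artifact


@dataclass(frozen=True)
class WasGeneratedBy:
    artifact: Artifact
    role: str
    process: Process


@dataclass
class Graph:
    artifact_value: dict = field(default_factory=dict)     # Artifact -> Value
    artifact_accounts: dict = field(default_factory=dict)  # Artifact -> set of Account
    process_accounts: dict = field(default_factory=dict)   # Process -> set of Account
    edges: set = field(default_factory=set)


a1, a2, p = Artifact(), Artifact(), Process()
g = Graph()
g.artifact_value[a1] = 5
g.artifact_value[a2] = 5  # same value, but a different artifact
g.edges.add(Used(p, "in", a1))
g.edges.add(WasGeneratedBy(a2, "out", p))
```

Here `a1` and `a2` carry the same value yet remain distinct nodes, and the shared process `p` appears in both edges without any identifier being declared.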
Rationale for the solution
I would normally consider graphs to be modelled as edges between nodes, but an OPM graph is modelled by edges between IDs which are parts of nodes.
Nodes need identity to allow sharing but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations, not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
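The deserialisation argument can be sketched as follows. This is only an illustration: the XML layout below is a made-up stand-in, not the actual OPM XML schema, and the `Node` class and `load` function are hypothetical. Python's standard `xml.etree.ElementTree` is used in place of a JavaBeans deserialiser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for illustration only; not the real OPM schema.
XML = """
<graph>
  <artifact id="a1" value="5"/>
  <artifact id="a2" value="5"/>
  <process id="p1"/>
  <used process="p1" artifact="a1"/>
  <wasGeneratedBy artifact="a2" process="p1"/>
</graph>
"""


class Node:
    """In-memory node; note that it stores no ID."""

    def __init__(self, kind, value=None):
        self.kind = kind
        self.value = value


def load(xml_text):
    root = ET.fromstring(xml_text)
    by_id = {}  # needed only while loading, then discarded
    for el in root:
        if el.tag in ("artifact", "process"):
            by_id[el.get("id")] = Node(el.tag, el.get("value"))
    edges = []
    for el in root:
        if el.tag == "used":
            edges.append(("used", by_id[el.get("process")], by_id[el.get("artifact")]))
        elif el.tag == "wasGeneratedBy":
            edges.append(("wasGeneratedBy", by_id[el.get("artifact")], by_id[el.get("process")]))
    return edges


edges = load(XML)
used, generated = edges
```

After loading, both edges point at the same process object, and the two artifacts with value 5 are distinct objects, although the XML IDs are no longer stored anywhere.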
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same thing or not, an alternative would be to include a serialisation-specific relation between two artifact descriptions saying that they denote the same thing (with the default assumption that they denote different things). I agree that including such a relationship requires identifying the artifacts but, as with the relationship, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it seems to achieve the same end if a particular serialisation chose to do this. Either way of establishing identity would do; the important thing for the core model is that (shared) identity is clear, rather than identifiers.
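One way such a "denotes the same" relation could be interpreted is sketched below, under stated assumptions: descriptions are identified only by their position in the serialised document (a serialisation-specific choice made up for this example), and the function name `merge_descriptions` is hypothetical. A small union-find groups descriptions into nodes, with unrelated descriptions staying distinct by default.

```python
def merge_descriptions(count, same_as):
    """Group description indices 0..count-1 into nodes using 'denotes the same' pairs."""
    parent = list(range(count))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in same_as:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(count):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four artifact descriptions; only descriptions 0 and 2 are declared the same.
nodes = merge_descriptions(4, same_as=[(0, 2)])
```

In this sketch `nodes` is `[[0, 2], [1], [3]]`: three graph nodes recovered from four descriptions without any explicit ID attribute.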
Comments
Community is invited to provide comments on proposals.
comment 1 by Luc Moreau
IDs are introduced to help us construct graphs and express sharing of nodes. Without an ID, how can we decide whether
<artifact value="5"/>
is the same or not as
<artifact value="5"/>
I therefore think that IDs are crucial to understand the shape of a graph, and an essential part of OPM.
Regarding the question "Even if artifact/process IDs are desirable, why are 'used' and 'wasGeneratedBy' arcs defined as being between IDs and not between processes and artifacts themselves? Surely the edges of the graph should be between the nodes of the graph?", this applies to the proposed XML serialisation. We might have serialised OPM differently, and proposals are welcomed (note the XML serialisation has never been reviewed!). It is however crucial that sharing is expressed in the graph. How would you do it if artifacts/processes are placed in the edges instead of their identifiers? If our underpinning data structure was a tree (without sharing of nodes/branches), then, agreed, we could do without IDs.
comment 2 by Luc Moreau
I see a big distinction between IDs in OPM graphs and (global) names of nodes. IDs help express the topology of the graph. Two nodes with different IDs are by definition distinct nodes in the graph. Naming schemes and naming conventions are different. It is not rare that a given entity could have different names. In such a case, different names do not imply that the entities are different.
comment 3 by Ben Clifford
Artifacts and processes have identity by virtue of their existence, not by virtue of being given an identifying label. In some representations (such as the present XML format) it's necessary to give IDs to artifacts and processes in order to describe the relations between them. But in another representation, where an OPM graph is drawn on a piece of paper, IDs are not necessary, whilst still being a complete representation of the OPM graph. My feeling is that local ID information, if necessary, should be pushed to the specifications defining the particular representation.
comment 4 by Luc Moreau in response to comment 3
When drawing a graph on a piece of paper, nodes have a unique "address" given by their position on the paper.
You express sharing in the graph by drawing lines between specific positions.
I don't understand the proposal of moving IDs out of the model to specific serialisations. How do we know whether two artifact descriptions in an OPM graph denote the same or not?
comment 5 by Simon Miles in response to comments 1, 2, 3, 4
My intuition is, I think, the same as Ben's: nodes need identity to allow sharing but that does not mean they have to have explicit identifiers outside of any one serialisation. If we want to assign identifiers for a particular purpose or in a particular serialisation we can, e.g. that proposed in ChangeProposalDCNaming. The artifact and process IDs seem tied to serialisation for a few reasons.
First, IDs being replaced seems to affect the serialisation but not the meaning of the graph. For example, if I took an OPM graph in the current XML serialisation and loaded it into memory using the JavaBeans deserialiser (effectively creating a new serialisation), then the JavaBeans for the graph edges would address the artifact and process node JavaBeans by memory locations, not their original IDs, and the graph can be fully interpreted without ever using the IDs used in the XML serialisation.
Second, for a particular class of applications, it may be sufficient not to ascribe explicit IDs in the serialisation, because only tree structures are ever present in the causal graphs. This would not make the graph non-interpretable or non-interoperable.
Third, a given serialisation or use of annotations may provide adequate identity for expressing shared nodes without requiring further IDs. For example, an XML serialisation automatically gives each part of the graph a unique XPath. Or, where global identifiers are provided in annotations for other purposes, these can also be used to express that two artifacts are one and the same. This suggests to me that the requirement for core OPM is for identity, not identifiers.
Finally, to answer the point about how to know whether two artifact descriptions denote the same thing or not, an alternative would be to include a serialisation-specific relation between two artifact descriptions saying that they denote the same thing (with the default assumption that they denote different things). I agree that including such a relationship requires identifying the artifacts but, as with the relationship, this can be done in a serialisation-specific way. I am not suggesting this is preferable to using IDs, only that it seems to achieve the same end if a particular serialisation chose to do this. Either way of establishing identity would do; the important thing for the core model is that (shared) identity is clear, rather than identifiers.
comment 6 by Luc Moreau in response to 5
I can't see what your proposal is.
To me, it is crucial that we can reason about node equality in the abstract model, independently of any serialisation. Serialisations (in xml and rdf) or representations (as Java objects) will have to preserve this notion of equality.
Given that we aim at inter-operability, I am not in favour of saying that the model "assumes a notion of equality over nodes based on their identity". This would lead to problems of interpretation, and ultimately, systems will not inter-operate.
Sharing is an essential aspect of a provenance graph, and we must have a precise, unambiguous way of doing it. This does not prevent a given serialisation from doing without identifiers, but it will be the duty of that serialisation to provide the means to reconstruct identifiers when reconstructing an OPM graph, and to drop them as it sees fit when serializing an OPM graph.
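The duty described here, producing identifiers only at the boundary of a serialisation, might be sketched as below. This is an illustration under assumptions rather than anything from the specification: the `Node` class, the `serialise` function, the triple format and the ID naming scheme are all invented for the example.

```python
import itertools


class Node:
    """In-memory node without any stored identifier."""

    def __init__(self, kind):
        self.kind = kind


def serialise(edges):
    """Write edges as (relation, id, id) triples, inventing local IDs on the fly."""
    counter = itertools.count(1)
    ids = {}  # node object -> generated local ID

    def id_of(node):
        if node not in ids:
            ids[node] = f"{node.kind[0]}{next(counter)}"
        return ids[node]

    return [(rel, id_of(src), id_of(dst)) for rel, src, dst in edges]


a, p, b = Node("artifact"), Node("process"), Node("artifact")
edges = [("used", p, a), ("wasGeneratedBy", b, p)]
triples = serialise(edges)
```

The shared process receives the same generated ID in both triples, so sharing survives the round trip even though the in-memory graph never held identifiers.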
Comment 7 by Luc on the revised proposal
I am opposed to this proposal for the reason I explained before. It is important that the OPM abstract model provides the means to decide if two nodes are equal. This is not an issue to be left to serialisations, because otherwise we will have no means of mapping serialisation X to serialisation Y, unless we have intimate knowledge of both X and Y. I also want to be able to implement OPM graph reasoning, independently of how I am going to serialise my graphs.
Your comment however raises the issue of account equality, and maybe we should introduce identifiers for them too.
Comment 8 by Simon Miles in reply to Comment 7
I am afraid I still don't understand why IDs are part of the abstract (serialisation-independent) model. I try to explain why it seems wrong from a couple of perspectives below, then answer specific points in your comment.
First, to make a comparison, if I create a UML model, for example, I would not have to add ID attributes to every class before I can make one object an aggregate of another, or represent one object passing a message to another. I can even make a graph out of inter-referencing objects. It is in the nature of modelling that entities are distinguished and become referenceable. This does not place particular restrictions on how the abstract model may be realised in implementation, i.e. how C++/Java/whatever chooses to give references to the classes and objects.
Second, the formal model also seems to contradict the example figures in the specification depicting OPM graphs (and I agree with the model used by the figures). In the formal model, the edges go between IDs of the graph nodes, but in the figures the edges go between the nodes themselves. It might be argued that the figures only simplify and approximate the OPM graph, but I can't see anything missing in them.
With regards to mapping between serialisations, I'm not sure if you mean translation of a graph from one serialisation to another or combination of two independently produced graphs including documentation of the same artifact/process. If the former, there seems no need for equivalence of IDs (if any) between one serialisation and another: X represents an OPM graph, possibly using IDs to express node sharing in the graph, Y represents the same graph in a different form, possibly using different IDs to express node sharing. If the latter, then you would require IDs whose scope of uniqueness exceeded the graph they are part of, to know that something in one graph is equivalent to something in another, and I understand the usefulness of globally unique ID annotations as a separate issue.
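The translation case can be illustrated with a small check that two serialisations describe the same graph even though their IDs differ. This is a simplified sketch: the triple format and the function name `same_up_to_renaming` are assumptions made for the example, and it further assumes both serialisations list their edges in the same order (a general comparison would need a graph isomorphism test).

```python
def same_up_to_renaming(xs, ys):
    """True if triples xs and ys match edge by edge under a one-to-one ID renaming."""
    if len(xs) != len(ys):
        return False
    fwd, back = {}, {}  # renaming X -> Y and its inverse Y -> X
    for (rx, sx, dx), (ry, sy, dy) in zip(xs, ys):
        if rx != ry:
            return False
        for x, y in ((sx, sy), (dx, dy)):
            # setdefault records a new pairing, or returns the existing one to compare.
            if fwd.setdefault(x, y) != y or back.setdefault(y, x) != x:
                return False
    return True


x = [("used", "p1", "a1"), ("wasGeneratedBy", "a2", "p1")]
y = [("used", "proc-7", "art-3"), ("wasGeneratedBy", "art-9", "proc-7")]
z = [("used", "proc-7", "art-3"), ("wasGeneratedBy", "art-3", "proc-7")]
```

Here `x` and `y` use unrelated IDs but express the same node sharing, whereas `z` merges the two artifacts into one and so denotes a different graph.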
With regards to reasoning, I still see no impediment. Isn't it the graph itself you are reasoning over, in some representation? The IDs are opaque, so provide no information over which to reason?
Vote
-- SimonMiles - 21 Jul 2009