Axioms of Versioning

Axioms of Versioning International Symposium on Versioning XML Vocabularies and Systems August 11, 2008 The problems of language versioning can be better understood with the help of some formal axioms defining the relations among the extensions and semantics of languages. Such axioms allow us to specify what makes one language extensionally, syntactically, or semantically a subset, superset, or equivalent of another. The difference between syntactic and semantic compatibility makes clear how languages can grow in a forward-compatible way. The key to compatible versioning is to assign new semantics in the new version of a language for syntax that was already accepted in the prior version, but to which the prior version assigned no semantics. Marc de Graauw Marc de Graauw is an independent consultant and has worked in IT for 20 years; as an IT architect, he has contributed to digital exchanges in health care, criminal justice, insurance and social security. Marc is a frequent speaker at international conferences and regularly writes articles on health care, versioning and REST-vs-SOA. He lives and works in Amsterdam, the Netherlands. Independent marc@marcdegraauw.com Copyright Marc de Graauw. This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 Unported License. versioning semantics compatibility

Introduction What is compatibility anyway? What does it mean when we say a version of a language is backward and/or forward compatible with another language? What does it mean when we say this of a document, or message, or schema, or an application? Does the notion of compatibility apply to languages at all, or does it only apply to documents or applications? And what is the difference between syntactical an semantical compatibility? Questions like these have lead me to explore the notion of compatibility in a more formal way. For efforts in achieving back- and forward compatibility in real life applications, a framework for understanding key notions in compatible versioning can be very useful. Key to understanding the working of advanced compatibility mechanisms is the semantical equivalence set. In this paper we shall define what it is, and explain why this notion is central to understanding compatibility. I will focus on message exchanges, since that is my main area of expertise, but the principles apply much wider. In general, they should apply to most computer-understandable document formats for data storage and/or exchange. More specific, I do not intend to cover natural language, other non-computer languages, programming languages or specifically multi-document data stores such as databases.

Languages, Sets and Extensions Consider a set U of language specifications in the above field. It's easy to define for each language specification the set of conforming document instances extensionally, as a set. For instance, the extension of UBL 1.0 is the set of all possible conforming UBL 1.0 instances, or the extension of HL7v3, May 2007 ballot is the set of all possible conforming HL7v3, May 2007 instances. (Often, language specifications for message exchange have many flavors and variations. HL7v3 has many versions ("ballots") and even those are hardly ever implemented as they are, but instead are localized to a particular national environment before implementation.)

Extension ELx of a Language If we show this pictorially, we get the above figure. Lx is a language specification which has all blue rectangles as conforming documents. The extension of Lx, ELx, are all blue rectangles. The complement (everything not in ELx) are all non-rectangular blue shapes, and all shapes in other colors.

ELx as Subset of ELy The extension of the language Ly consisting of all rectangles in any color is a superset of the extension of Lx; for shorthand, we say Ly is a superlanguage of Lx.

ELx as Subset of ELz Likewise, Lz, the language of all blue shapes, is a superlanguage of Lx.

ELx as Intersection of ELz and ELy And Lx is (extensionally) also the intersection of Ly and Lz. Starting with a rather loose notion of "conformance" (let's assume for the moment this notion is unproblematic) it's easy to define relation between languages extensionally. A language Lx can be an extensional sublanguage of Ly: if every document conforming to Lx is also a conforming Ly document. If the reverse is true, all conforming Ly documents are also conforming Lx documents, Lx is an extensional superlanguage of Ly. If both are true, Lx and Ly are extensionally equivalent. Note that we only speak of extensional equivalence: for all we know, Lx may be about pasta recipes and Ly about reggae albums: if by some weird happenstance both languages have encoded their information in similar data formats, we can still compare the pasta and reggae languages extensionally.

Accepting and Rejecting Documents

Accept and Reject Behavior of Px Applications can read and write (produce and consume, send and receive) documents. For now we still suppose an application reads and writes the same set of documents. In the above picture, the implementation Px1 of language Lx accepts all blue rectangles (ELx), and rejects all other documents. So through accepting and rejecting, Px1 decides which documents are syntactically conformant with Lx (as implemented in Px1).

From Syntax to Semantics When we move from pure extensions, we must consider the semantics of a language. This is a difficult notion. At one extreme, a language may not consider semantics at all, and just define a syntactical data format, and leave all semantics up to applications. It's often said that "XML is just syntax". But if this is true, XML itself is not very interesting, and the real-world applications of XML, such as in UBL or HL7v3, certainly must include some semantics. At its highest level, those semantics are descriptions in natural language, defining for instance what an invoice or a medication prescription is. At a lower level, those semantics may be expressed in more formal ways. One way to (try to) express the semantics of messages in a messaging system with respect to forward and backward compatibility is to use formal semantics. This is – in my opinion – the wrong way. For simple (abstracted, theoretical) systems this might work. One would have to model the system and the encapsulated exchanged information using a language such as OWL. For real life purposes, this is hardly feasible. To make an OWL description of the underlying semantics of UBL or HL7v3 is a (much) larger undertaking than specifying a decent versioning mechanism to provide back- and forward compatibility to a reasonable degree: so using formal semantics to provide a framework and definitions for the notion of compatibility defeats the problem.

Semantics and Behavior A language specification of the kind under consideration should enable a capable engineer to implement a conforming application: an application which presumably can read an write conforming documents (let's assume for the moment the application can read every document it writes and vice versa). The application as a total then exhibits some behavior: if we use a word processor to read a document, we expect the word processor to display the stored text and formatting. If we use a medical application to send a prescription, we expect the receiving application to display the intended medication and applicable quantity and dosage. Languages may not define behavior very strictly. HL7v3 even goes at great length to stress is does not restrict application behavior at all, but I do not think this really holds. Obviously there are a great deal of application behaviors which are at odds with HL7v3 specifications, such as displaying pasta recipes instead of medication prescriptions. The behavior of a message endpoint is not unproblematic. It's usually best to consider some message endpoint to a certain degree as a black box; to consider only what's going in and coming out, not what is inside. How it works is not relevant; what it does is relevant. But when sending messages, there often is no direct behavior (no response) associated. When I send you a message with my new home address, I may expect no response (or just a simple ask). I do expect future mail to be delivered to my new home address, however. the point is, there may never be new mail to me, so there may never be a response message showing the desired behavior. On a higher level, the behavior of a corresponding message endpoint may not be deterministic at all. When a general physician sends a medication prescription to a pharmacist, the pharmacist may overrule the general physician's decision: this is an independent professional judgment, based on the pharmacists own competence. So the behavior of a message endpoint is not unproblematic. I believe the solution is basically to take humans out of the box. We must consider the behavior of a receiving computer system when judging the behavior for a message, not the humans interpreting and acting on the information. The behavior of the application is also in need of human interpretation. For instance, a part of a medication prescription may be: <code code=”27” codeSystem=”2.16.840.1.113883.2.4.4.5” /> and the receiving application may display: "Dissolve in water" which is a faithful rendering of the information sent. However, instead of "Dissolve in water" the receiving application may also display some local code. In general, it is not possible to judge whether a receiving application behaves well without the judgment of a professional well acquainted with the messaging framework as well as the receiving application. So where does this leave us with behavior as a basis for a framework for compatible versioning? If we take a look a practical aspects, behavior is a near perfect fit. The most practical aspect of a framework for compatible versioning is testing the entire messaging chain. And especially for this aspect, behavior as a basis for semantic compatibility is perfect. The fact that there may – in real life – never be a response message showing the desired behavior is not relevant. Relevant is the fact that such behavior is a testable condition of the messaging chain. In a test, we can send an "update address" message and next test the printing of an address label. Likewise, the fact that behavior is not deterministic with a human "in the box" is not a problem: in testing, we generally want a professional judging whether the system does what it is supposed to do. Again, behavior as a basis for semantic compatibility is a perfect fit. If we have a certain implementation of a language specification, the reading of a document should result in a changed application state: for instance, the text on the screen has changed, or its internal data store has been updated, resulting in specific behavior if the data store is later queried. Let's assume, as most often is the case, that the behavior (the state change) is deterministic: if the application is in the same state before reading a document, then reading the document always results in the same end-state.

Behavior of Language Processors

Behavior of Px We can show the behavior of a processor Px1 pictorially as in the above figure: for every document in ELx, Px1 exhibits some particular behavior. Px1 is partitioned, as it were, in behavioral compartments, one for each document. Of course it is quite common for messaging systems to have different behaviors for exactly the same document: for instance, a common property of reliable messaging is for an order to be executed, and a duplicate to be rejected (unless the original was lost, which is the raison d'être of reliable messaging). This is no serious obstacle to the model: for testing purposes, it is sufficient (and desirable) to assume a fixed start state for any sequence of messages, which makes the responses deterministic. For real life systems, if we wanted to model this aspect, we could assume the behavior to be determined by the message and (some aspect of) the state of the processor. Like before, I believe a framework in which testable scenario's can be described sufficient. Next we need to make some idealizations to get rid of all the processors. We'll simply assume every language specification is flawless (not hard to imagine, is it?) and there is a perfect processor for every language; we'll take that processor as an exemplar of the language. And we no longer assume a processor makes the same set of documents as it accepts.

Syntactical Compatibility This allows us to define syntactic compatibility: a (language) processor is syntactically compatible with another (language) processor if they are able to exchange documents without syntactic failure: if they are able to accept each other's documents. A common case is when they implement the same language (version).

Behavior with Disjunct ELX and ELy But the processors need not produce the same documents at all. Another common case is where one processor makes and sends orders, but does not receive orders, and the other processor receives orders and makes order confirmations, but no orders. Even though the processors make and accept different documents, it makes perfect sense to say they are syntactically compatible. So we can say a processor Px is compatible with a processor Py if Px makes only documents which Py accepts, and Py makes only documents which Px accepts. (It would be more common to say Px and Py are partial implementations of one and the same language, but that does not change the point.

Behavior and Forward Compatiblity of Px and Px+1 For versioning, the interesting case is twofold: for backward syntactical compatibility, a processor Px+1 should accept all documents produced by Px. and for forward syntactical compatibility, a processor Px should accept all documents produced by Px+1. The above picture highlights one variant, where Px and Px+1 both accept the same set of documents, but make different sets. The accept/reject reaction to a document is fundamental to the entire compatibility and versioning problem. However, the accept/reject distinction is not syntactical in nature. It is a behavioral aspect of a processor: accepting or rejecting is a response to a document, it attaches meaning to the document: acceptable! (or not). So the basic aspect which decides syntactical conformance in a sender/receiver (writer/reader, producer/consumer) configuration is itself semantical in nature. The key to enhance (future) compatibility on a syntactic level is to make processor accept more than they produce. In other words, a processor should not reject everything it could not produce itself, and should not reject everything it does not completely understand. Next, we'll see what to do with that un-understood "extra".

Semantical Equivalence Sets

Behavior of Px In the above figure, we have a processor for a language which makes only blue rectangles, but accepts all rectangles. If we now associate some particular behavior which each class of documents, we get semantical equivalence sets. Not every document is normally associated with some unique behavior: we often have syntactical redundancy, as in C's i=1+1 and i+=1 or attribute order in XML: so there is a set of documents which excite the same behavior in a receiver. The pattern of syntactical forward compatibility, where a receiver accepts more than it can act upon, achieves the same: there is a set of documents for which trigger the same behavior: since the receiver does not understand the extra syntactical components, it cannot act upon them, and it therefore must act the same, whatever the extra content.

Semantics and Compatibility

Behavior of Px+1 If a language Lx has a successor Lx+1, Lx+1 may specify additional behavior for documents accepted, but not produced by Lx: in the above figure, green and red rectangles. This has two consequences: - for every document produced by Px, Px+1 will behave as Px expects: it is therefore safe for Px to send messages to Px+1 - for documents produced by Px+1 which Px cannot produce (again the green and red rectangles) Px will exhibit the same behavior as for other documents in the same semantical equivalence set: the language designer of Lx+1 will know what behavior this is, and if it is deemed acceptable, it is safe for Px+1 to send messages to Px (if the behavior is not deemed acceptable, Px+1 must make sure in some way that Px will reject those documents, maybe through a "mustUnderstand" flag).

Conclusion The basics of compatible versioning thus are: 1. make sure Px accepts more documents than it produces (or fully understands), and, 2. partition the semantical equivalence sets of Px into smaller, more refined semantical equivalence sets for Px+1 with additional (new) behavior.

Definitions This Appendix contains informal definitions of the concepts of compatibility. Two languages are extensionally equivalent if they accept the same set of documents. A language is an extensional sublanguage of a second language if all documents accepted by the first language are also accepted by the second. If two processors behave the same for every document which belongs to a language Lx, the processors are behaviourally equivalent for Lx. Two languages are syntactically compatible if they accept the documents produced by each other. A language change is syntactically backward compatible if a new receiver accepts all documents produced by an older sender. A language change is syntactically forward compatible if an old receiver accepts all documents produced by a new sender. If two languages take the same documents as input, and their processors behave the same for every document, the languages are semantically equivalent. The semantical equivalence set of a document d is the set of documents which make a processor behave the same as d. A language is a semantical superlanguage of a second language if for all documents produced by the second language, processors of the first language behave the same as processors of the second language. A later version of a language is semantically backward compatible with an earlier version if the later version is a semantical superlanguage of the earlier one (an old sender may expect a newer, but semantically backward compatible, receiver to behave as the sender intended). An earlier language is semantically forward compatible with a later language iff the later language is a semantical sublanguage of the earlier one (this is only possible if a language loses semantics). A later language semantically extends and earlier language if the later language introduces new behaviour for some documents accepted, but not produced by the earlier one. Formal Axiomatization This appendix contains a formal account of the principles which are introduced informally in the main article. Let U be a set of (specifications of) software processable languages {L1, L2, ... Ln} This axiomatization does not concern itself with natural language The extension of a language Lx is a set of conforming document instances ELx = {Dx1, Dx2, ... Dxn} Iff ELx = ELy then Lx and Ly are extensionally equivalent Lx and Ly may still be different: they may define different semantics for the same syntactical construct Iff ELx ⊂ ELy then Lx is an extensional sublanguage of Ly Iff ELx ⊃ ELy then Lx is an extensional superlanguage of Ly D is the set of all possible documents; or the union of all ELx where Lx ∈ U For each Lx ∈ U there is a set of processors Px = {Px1, Px2, ... Pxn} which implement Lx Each Pxy exhibits behaviour as defined in Lx Processors can produce and consume documents Each Pxy produces only documents it can consume itself At consumption, Pxy may accept or reject documents The behaviour of a processor Pxy of language Lx is a function Bxy The function Bxy takes as argument a document, and it's value is a processor state We assume a single processor state before each function execution, alternatively we could assume a <state, document> tuple as function argument If for two processors Pxy and Pxz for language Lx for a document d Bxy(d) = Bxz(d) then the two processors behave similar for d Two processor states for language Lx are deemed equivalent if a human with thorough knowledge of language specification Lx considers the states equivalent. Details may vary insofar as the language specification allows it. Processor equivalence is not intended to be formally or computably decidable; though in some cases it could be. If ∀d ( d ∈ ELx → Bxy(d) = Bxz(d) ) then Pxy and Pxz are behaviourally equivalent for Lx If two processors behave the same for every document which belongs to a language Lx, the processors are behaviourally equivalent for Lx. An ideal language specifies all aspects of desired processor behaviour completely and unambiguously; assume all languages in U are ideal A processor Px is an exemplary processor of a language Lx if it fully implements language specification Lx; assume all processors for all languages in U are exemplary Because they are (defined to be) exemplary, every two processors for a language Lx are behaviourally equivalent ELx = { d is a document ∧ Px accepts d } The complement of ELx is the set of everything (normally, every document) which is rejected by Px The make set MLx = { d is a document ∧ Px can produce d } A language Lx is syntactically compatible with Ly iff MLx ⊂ ELy and MLy ⊂ ELx Two languages are syntactically compatible if they accept the documents produced by each other. A language Ln+1 is syntactically backward compatible with Ln iff MLn ⊂ ELn+1 and Ln+1 is a successor of Ln A language change is syntactically backward compatible if a new receiver accepts all documents produced by an older sender. A language Ln is syntactically forward compatible with Ln+1 iff MLn+1 ⊂ ELn and Ln+1 is a successor of Ln A language change is syntactically forward compatible if an old receiver accepts all documents produced by a new sender. A document d can be a member of the extension of any number of languages Px is an (exemplary) processor of Lx, Py is an (exemplary) processor of language Two languages Lx and Ly are semantically equivalent iff ELx = ELy ∧ ∀d ( (d ∈ ELx ) → Bx(d) = By(d) ) If two languages Lx and Ly take the same documents as input, and their exemplary processors behave the same for every document, the languages are semantically equivalent. Two languages can only be compared if their exemplary processors are similar enough to be compared. Not every two languages can be compared. "Semantic" should not be interpreted in the sense of "formal semantics". The semantical equivalence set of a document d for Lx = { y ∈ ELx | Bx(d) = Bx(y) } Or: SLx,d = { y ∈ ELx | Bx(d) = Bx(y) } The semantical equivalence set of a document d is the set of documents which make a processor behave the same as d Semantical equivalence occurs for expressions which are semantically equivalent, such as i = i + 1 and i += 1 for C, or different attribute order in XML etc. d ∈ SLx,d Any two semantical equivalence sets of Lx are necessarily disjunct If z ∈ SLx,e were also z ∈ SLx,d then every member of SLx,e would be in SLx,d and vice versa and thus SLx,d = SLx,e A language Ly is a semantical superlanguage of Lx iff ∀d ( d ∈ MLx → By(d) = Bx(d) ) For all documents produced by Px, Py behaves the same as Px Equivalence in this case should be decided based on Lx; if Ly makes behavioural distinctions which are not mentioned in Lx, behaviour is still the same as far as Lx is concerned It follows: ∀d ( d ∈ MLx → ∃SLy,d ( SLy,d ⊂ ELy ∧ ( SLx,d ∩ MLx ) ⊂ SLy,d ∧ By(d) = Bx(d) ) ) For any document produced by Px, the part of its semantical equivalence set which Px can actually produce, is a subset of the semantical equivalence set of Py for this document For all d ∈ ELx ∧ d ∉ MLx there may be many equivalence sets in Ly for which By(d) ≠ Bx(d) In other words: for documents accepted but not produced by Px, Ly may define additional behaviours Lx is a semantical sublanguage of Ly iff Ly is a semantical superlanguage of Lx A language Ln+1 is semantically backward compatible with Ln iff Ln+1 is a semantical superlanguage of Ln and Ln+1 is a successor of Ln An old sender may expect a newer, but semantically backward compatible, receiver to behave as the sender intended A language Ln is semantically forward compatible with Ln+1 iff Ln+1 is a semantical sublanguage of Ln and Ln+1 is a successor of Ln Semantic forward compatibility is only possible if a language loses semantics; i.e. its processors exhibit less functionality, and produce less diverse documents A processor cannot understand what it does not know about yet A language Ln+1 semantically extends Ln iff ∃d ( d ∈ MLn+1 ∧ d ∈ ELn ∧ BLn+1 ≠ BLn ) Ln+1 introduces new behaviour for some documents accepted, but not produced by Ln Assume Ln is syntactically forward compatible with Ln+1, Ln+1 is semantically backward compatible with Ln, and Ln+1 semantically extends Ln Pn accepts all documents produced by Pn+1, or: MLn+1 ⊂ ELn For documents produced by Pn, Pn+1 behaves as Pn would expect, or: ∀d( d ∈ MLn → Bn+1(d) = Bn(d) ) But for some documents accepted, but not produced by Pn, Pn does not behave as Pn+1 would, or: ∃d ( d ∈ MLn+1 ∧ d ∈ ELn ∧ d ∉ MLn ∧ BLn+1 ≠ BLn ) New senders send documents to old receivers, which do not behave as new receivers would behave ∀d (d ∈ MLn+1 ∧ d ∈ ELn ∧ d ∉ MLn → ∃d'( d' ∈ MLn ∧ Bn(d') = Bn(d) ) For every document d, accepted but not produced by Pn, there is a document d' for which Pn behaves the same as for d, or: d' ∈ SLn,d (the semantical equivalence set of d for Ln) We can assume Ln transforms document d to document d' If a language Ln accepts a document which it would not produce, it transforms it to a document which it could produce There has to be no literal transformation We now have: Bn(d') = Bn(d) ∧ Bn+1(d') ≠ Bn+1(d) Pn behaves the same for d and the transformed d'; but Pn+1 behaves differently Iff it is acceptable that Bn(d') = Bn(d) for transformed documents, then Pn may ignore the semantical extension responsible for Bn+1(d') ≠ Bn+1(d) The fact that Ln is syntactically forward compatible with Ln+1 enables Pn+1 to introduce new semantics without breaking existing Pn We could say Ln is partially semantically forward compatible with Ln+1; it only ignores semantics it may ignore New senders can interact with old receivers; old receivers may ignore what they do not understand Iff it is not acceptable that Bn(d') = Bn(d) for transformed documents, then Pn must understand the semantical extension responsible for Bn+1(d') ≠ Bn+1(d) If Pn must understand d, then Pn should not process d A Pn+1 must not send d to a Pn; or Pn should be adapted so that Pn rejects d (or: d ∉ ELn) If a Pn+1 can send d to a Pn, then Ln should not be (completely) syntactically forward compatible with Ln+1 A processor should not accept what it must understand, yet cannot understand Bibliography UBL. http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ubl HL7v3. http://www.hl7.org/ Versioning XML with XML Schema, David Orchard, XTech 2008. http://2008.xtech.org/public/schedule/detail/530 Extending and Versioning Languages: Terminology, Draft TAG Finding 13 November 2007. http://www.w3.org/2001/tag/doc/versioning-20071113.html Extending and Versioning Languages: Strategies, Draft TAG Finding 17 September 2007. http://www.w3.org/2001/tag/doc/versioning-strategies-20070917.html.