Caution: work in progress. The last part, the conclusion, is a mess. But I'm mostly out of time :-(
The Web architecture document provides a starting point.
Language subset: one language is a subset (or "profile") of a second language if any document in the first language is also a valid document in the second language and has the same interpretation in the second language. Put another way, the set of allowable documents in the first language is a subset of the set of allowable documents in the second.
Language superset: one language is a superset of a second language if the second is a language subset of the first.
Language extension: one language is an extension of a second language if the first is a superset of the second. This follows our intuition that adding to a language (extending it) results in a bigger set than the original set.
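As a rough illustration of these three definitions, here is a minimal Python sketch. It assumes a toy model in which a language is just a mapping from each valid document to its interpretation; the function names and the two sample languages are made up for this example.

```python
# Toy model (an assumption): a language maps each valid document
# to its interpretation.

def is_subset(first: dict, second: dict) -> bool:
    """True if every document of `first` is valid in `second`
    and has the same interpretation there."""
    return all(doc in second and second[doc] == meaning
               for doc, meaning in first.items())

def is_superset(first: dict, second: dict) -> bool:
    """`first` is a superset of `second` if `second` is a subset of `first`."""
    return is_subset(second, first)

def is_extension(first: dict, second: dict) -> bool:
    """By the definition above, an extension is simply a superset."""
    return is_superset(first, second)

# Hypothetical sample languages.
small = {"<first>Dave": "first name Dave"}
large = {
    "<first>Dave": "first name Dave",
    "<first>Dave <last>Orchard": "full name Dave Orchard",
}

print(is_subset(small, large))     # True
print(is_extension(large, small))  # True
print(is_subset(large, small))     # False
```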
Now let’s take a look at XML languages. One of my favourite examples for talking about extensibility is the use of names. Let's define a Name structure and progressively elaborate.
The first version of our Name structure allows a first name:
There are 3 important observations to make:
We can define our language for name in terms of these sets. A vocabulary consists of the set of required combinations, the set of optional combinations, and the set of allowable combinations. We'll express this as:
V = R + O + A
Specifically, this language is V1 = R1 + O1 + A1.
As you look at the previous 4 examples, it's clear that XML is rather verbose. In fact, I think it is a little too verbose for what needs to be expressed here. We will introduce a shorthand notation. This notation probably won't be surprising, as it simply converts the "markup" into strings: remove the closing markup and put a space between the content. In the previous case, the invalid strings are "<first>Dave <last>Orchard", "<first>Dave <middle>b <last>Orchard" and "<first>Dave <last>Orchard <middle>b".
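The conversion to the shorthand can be sketched in a few lines of Python. This is only an illustration: the wrapping `name` element and the function name `to_shorthand` are assumptions of mine, and the sketch only handles flat child elements with text content.

```python
import xml.etree.ElementTree as ET

def to_shorthand(xml_text: str) -> str:
    """Drop the closing markup of each child element and join the
    resulting '<tag>text' tokens with single spaces."""
    root = ET.fromstring(xml_text)
    tokens = []
    for child in root:
        text = (child.text or "").strip()
        tokens.append(f"<{child.tag}>{text}")
    return " ".join(tokens)

# Assumed instance shape: a <name> wrapper around the name parts.
doc = "<name><first>Dave</first><last>Orchard</last></name>"
print(to_shorthand(doc))  # <first>Dave <last>Orchard
```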
To do the kind of comparison we want, we're going to create a shorthand for the schemas as well. V1 isn't really very intuitive. We want to be able to give an indication of the relative size of R, O, and A. The notation we propose lists the required, optional and allowable elements in the order they occur. We can write this shorthand for V1 as:
V1 = R
Combining the schema notation with the document notation, the following table shows the validity of each instance.
| Schema \ Instance | <first>Dave | <first>Dave <last>Orchard | <first>Dave <last>O | <first>Dave <last>Orchard <middle>B | <first>Dave <middle>B <last>Orchard | <title>Mr <first>Dave <last>Orchard |
| --- | --- | --- | --- | --- | --- | --- |
| R | Yes | No | No | No | No | No |
Taking a look at the schema, we obviously noticed that it didn't look like it was sufficient. We know about last names, middle names, titles, suffixes, and more. Let us say that we want to allow more complex names to be exchanged. We want to allow other "name" components to appear, but we don't want to constrain these in any way. We use XML Schema's wildcard capability. As described many times, the wildcard is called any and allows (surprise) any element to appear in a document where the any appears in the schema. The most obvious place that we would want this is after the first name. The schema for this is:
The short hand for this schema is "V2=RA"
The validity table is
| Schema \ Instance | <first>Dave | <first>Dave <last>Orchard | <first>Dave <last>O | <first>Dave <last>Orchard <middle>B | <first>Dave <middle>B <last>Orchard | <title>Mr <first>Dave <last>Orchard |
| --- | --- | --- | --- | --- | --- | --- |
| R | Yes | No | No | No | No | No |
| RA | Yes | Yes | Yes | Yes | Yes | No |
We have kept the same number of "known" elements and simply allowed more elements to appear. The observations we can make about this language are:
Using the previous notation for listing the sets, the "A" set for this vocabulary (A2) is much larger than the A set for the previous vocabulary (A1), and the R and O sets are the same.
Let's add an optional element to the first schema, the non-extensible schema. We want to constrain the contents of the last name to be a string type. We obviously could constrain this further to be alphabetic characters, with no numbers or punctuation allowed. For simplicity, though, let's keep this as just a string data type. We are "extending" our vocabulary because we are defining an additional term. In this example, we add an optional last name after the first name. The schema is:
The short hand for this schema is:
V3 = RO
| Schema \ Instance | <first>Dave | <first>Dave <last>Orchard | <first>Dave <last>O | <first>Dave <last>Orchard <middle>B | <first>Dave <middle>B <last>Orchard | <title>Mr <first>Dave <last>Orchard |
| --- | --- | --- | --- | --- | --- | --- |
| R | Yes | No | No | No | No | No |
| RA | Yes | Yes | Yes | Yes | Yes | No |
| RO | Yes | Yes | No | No | No | No |
Instead of an optional last name, let's make the last name required. The schema is:
The short hand for this schema is:
V4 = RR
| Schema \ Instance | <first>Dave | <first>Dave <last>Orchard | <first>Dave <last>O | <first>Dave <last>Orchard <middle>B | <first>Dave <middle>B <last>Orchard | <title>Mr <first>Dave <last>Orchard |
| --- | --- | --- | --- | --- | --- | --- |
| R | Yes | No | No | No | No | No |
| RA | Yes | Yes | Yes | Yes | Yes | No |
| RO | Yes | Yes | No | No | No | No |
| RR | No | Yes | No | No | No | No |
Let's combine the optional schema and the extensible schema. In this example, we add an optional last name after the first name and allow elements before and after the optional last name. Warning: due to XML Schema's Unique Particle Attribution rule, wildcards before and after optional elements are not allowed. I explain this problem and a clunky solution in an XML.com article on versioning. For now, we will assume that this is legal to express. The (illegal) schema is:
The short hand for this schema is:
V5 = RAOA
| Schema \ Instance | <first>Dave | <first>Dave <last>Orchard | <first>Dave <last>O | <first>Dave <last>Orchard <middle>B | <first>Dave <middle>B <last>Orchard | <title>Mr <first>Dave <last>Orchard |
| --- | --- | --- | --- | --- | --- | --- |
| R | Yes | No | No | No | No | No |
| RA | Yes | Yes | Yes | Yes | Yes | No |
| RO | Yes | Yes | No | No | No | No |
| RR | No | Yes | No | No | No | No |
| RAOA | Yes | Yes | No | Yes | Yes | No |
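The validity table can be approximated with a small Python sketch. It treats each shorthand schema as an ordered list of particles and compiles it to a regular expression over element names. This is a simplification, not XML Schema validation: the particle encoding and all the names here are assumptions, and because the sketch looks only at element names and order, not at element content, it does not try to reproduce the "<first>Dave <last>O" column, which the table treats differently from "<first>Dave <last>Orchard".

```python
import re

# Hypothetical encoding of the shorthand schemas:
# ("R", name) = required element, ("O", name) = optional element,
# ("A", None) = wildcard allowing any number of extra elements.
SCHEMAS = {
    "R":    [("R", "first")],
    "RA":   [("R", "first"), ("A", None)],
    "RO":   [("R", "first"), ("O", "last")],
    "RR":   [("R", "first"), ("R", "last")],
    "RAOA": [("R", "first"), ("A", None), ("O", "last"), ("A", None)],
}

def schema_regex(particles):
    """Compile a particle list into a regex over 'name name ... ' strings."""
    parts = []
    for kind, name in particles:
        if kind == "R":
            parts.append(f"{name} ")
        elif kind == "O":
            parts.append(f"(?:{name} )?")
        else:  # "A": any number of arbitrary elements
            parts.append(r"(?:\w+ )*")
    return re.compile("".join(parts))

def element_names(instance):
    """Extract element names, in order, from a shorthand instance."""
    return re.findall(r"<(\w+)>", instance)

def is_valid(schema, instance):
    """Check element names and order only; content is ignored."""
    names = "".join(name + " " for name in element_names(instance))
    return schema_regex(SCHEMAS[schema]).fullmatch(names) is not None

print(is_valid("RO", "<first>Dave <last>Orchard"))             # True
print(is_valid("RO", "<first>Dave <middle>B <last>Orchard"))   # False
print(is_valid("RAOA", "<first>Dave <middle>B <last>Orchard")) # True
```

Unlike XML Schema, a regular expression engine has no Unique Particle Attribution rule, which is why the RAOA pattern can be written directly here.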
In our set theory, we can compare single mandatory (V1 = R), extensible (V2 = RA), optional (V3 = RO), two mandatory (V4 = RR) and extensible with optional (V5 = RAOA).
There are a number of important observations to make with regards to these sets:
Now isn't this interesting... When we added the optional element to V2 to create V5, we increased the set of optional terms but we reduced the set of allowable terms. In this case, by extending an extensible schema we are supersetting the known terms (R + O) and subsetting the allowable terms (A).
Now what is a “compatible” change to a vocabulary, such as V0 or V2? We provided terms for backwards and forwards compatibility in the TAG finding, which we'll reprise here:
A language change is backwards compatible if newer processors can process all instances of the old language.
A language change is forwards compatible if older processors can process all instances of the newer language.
In this context, process means validate. We can compare the versions of our vocabularies for compatible changes.
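Reading "process" as "validate", the two definitions can be sketched as simple checks. This is a minimal Python illustration under stated assumptions: documents are modelled as tuples of element names, the three validators are hand-written stand-ins for V1 (R), V2 (RA) and V3 (RO), and a finite list of sample instances stands in for "all instances" of a language.

```python
def backwards_compatible(old_instances, new_validate):
    """Newer processors can process all instances of the old language."""
    return all(new_validate(doc) for doc in old_instances)

def forwards_compatible(new_instances, old_validate):
    """Older processors can process all instances of the newer language."""
    return all(old_validate(doc) for doc in new_instances)

# Hypothetical validators over tuples of element names.
def validate_v1(doc):   # R: exactly one first name
    return doc == ("first",)

def validate_v2(doc):   # RA: first name, then anything
    return doc[:1] == ("first",)

def validate_v3(doc):   # RO: first name, optional last name
    return doc in {("first",), ("first", "last")}

# Finite samples standing in for the full instance sets.
V1_SAMPLES = [("first",)]
V3_SAMPLES = [("first",), ("first", "last")]

print(backwards_compatible(V1_SAMPLES, validate_v3))  # True
print(forwards_compatible(V3_SAMPLES, validate_v1))   # False
print(forwards_compatible(V3_SAMPLES, validate_v2))   # True
```

The last two lines show why the wildcard matters: the non-extensible V1 rejects the new last name, while the extensible V2 accepts it.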
V1 can validate R2, R3, R5 but none of the optional or allowable flavours.
V2 can validate V1, V3, V4, and V5 documents.
V3 can validate V1, R2, V4, R5 and O5 documents. V3 cannot validate A2 and A5 documents.
V4 cannot validate any of the defined sets of V1, V2, V3 or V5 documents.
V5 can validate V1, R2, O2, V3, V4. V5 cannot validate A2 documents.
A table that compares which sets of documents a given schema can validate is below. The schemas are the rows and the sets of document instances are the columns.
| Schema \ Set of Documents | V1(R) | V2(RA) | V3(RO) | V4(RR) | V5(RAOA) |
| --- | --- | --- | --- | --- | --- |
| R | - | Required | Required | None | Required |
| RA | All | - | All | All | All |
| RO | All | Required | - | All | Optional |
| RR | None | None | None | - | None |
| RAOA | All | Optional | All | All | - |
Given these validation results, we can pick out individual table entries to compare compatibility between particular pairs of schemas.
In the previous vocabulary comparison, there is an interesting observation to make. There are 2 cases where compatibility existed between one version and a subset of another, and the subset is the "defined" terms.
In defining lastname, we expand the set of known terms by some amount of the unknown terms. So when we “extend” our language, we increase the set of known terms. But we also reduce the set of allowable combination of terms.
One way of looking at this is that extensibility in XML languages is actually the process of creating successive subsets of the allowable combinations of terms. What, you say? Extensibility is about subsetting? Allow me to explain.
V0 allows any terms after the firstname element and has only one known term. V1 allows only 1 term (lastname) after the firstname element, allows any terms after the lastname element, and has two known terms. V1 has a larger set of known combinations of terms but a subset of the allowable combinations of terms.
What we have discovered is that a language extension is a superset of the known combination of terms but is also a subset of the allowable combination of terms. Isn’t this deliciously ironic? Extensibility allows us to subset in the future.
We have a problem though: We can’t describe compatibility in terms of just V0 and V1. Backwards compatibility is where V0 can be interpreted as V1, and forwards compatibility is where V1 can be interpreted as V0. How do we define the set theory for this? Our intuition says that backwards compatibility is where V0 is a subset of V1. But this is based upon the “closed” set model, not the “open” set model that we’ve been dealing with. We know that in the example of V, V0 is a superset of the allowable combination of terms in V1 yet is a subset of the known terms.
But we know that there is a difference between the known combination of terms and the allowable combination of terms. A piece of software that knows about V0 will only send
We will refine our definition of V to be the set of allowable terms A and K to be the set of known terms. So K is a subset of A. K0 = firstname, and K1 = firstname, lastname. In K1, remember that lastname is optional. We need to introduce a function on our sets to determine the minimal set of terms. We will call this req() for required elements.
We use these sets and the function to determine compatibility. In the case of backwards compatibility, we can allow any of the optional known elements to be omitted. Omitting all the optional aspects of K1 is called req(K1).
V1 is backwards compatible with V0 if req(K1) = req(K0), K1 is a superset of K0, and A1 is a subset of A0.
In the case of forwards compatibility, an instance of V1 can be treated as an instance of K0.
V0 is forwards compatible with V1 if req(V1) = K0 and A1 is a subset of A0.
This follows our intuition: forwards compatibility requires accepting and ignoring unknown content, so A0 must be larger than K0 and we have to be able to map the A0 set down to K0. If A1 is not a subset of A0, then the portion of A1 that is outside A0 can’t be mapped to K0.
The mapping function we talk about is described as the “MustIgnore” rule, and is essential to enable the mapping of V1 to the K0 portion of V0.
We now have two definitions of compatibility based upon our sets of known and allowable terms.
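The two definitions can be tried out on a toy model in Python. Everything concrete here is an assumption made for illustration: combinations of terms are tuples of element names, the allowable sets are cut off at three elements so that they stay finite, req() is taken to mean "the shortest known combinations", and req(V1) in the forwards rule is read as req(K1).

```python
from itertools import product

def req(known):
    """Assumed reading of req(): keep only the shortest known
    combinations, i.e. those with every optional term omitted."""
    shortest = min(len(combo) for combo in known)
    return {combo for combo in known if len(combo) == shortest}

def backwards_compatible(k_old, a_old, k_new, a_new):
    """req(K1) = req(K0), K1 is a superset of K0, A1 is a subset of A0."""
    return req(k_new) == req(k_old) and k_new >= k_old and a_new <= a_old

def forwards_compatible(k_old, a_old, k_new, a_new):
    """req(K1) = K0 and A1 is a subset of A0."""
    return req(k_new) == k_old and a_new <= a_old

# Toy universe of terms that may follow firstname (hypothetical).
EXTRA = ("lastname", "middlename")

# A0: firstname followed by up to two arbitrary terms.
A0 = {("firstname",) + tail
      for n in range(3)
      for tail in product(EXTRA, repeat=n)}
# A1: as A0, but the term right after firstname must be lastname.
A1 = {combo for combo in A0 if len(combo) == 1 or combo[1] == "lastname"}

K0 = {("firstname",)}
K1 = {("firstname",), ("firstname", "lastname")}

print(backwards_compatible(K0, A0, K1, A1))  # True
print(forwards_compatible(K0, A0, K1, A1))   # True
print(backwards_compatible(K1, A1, K0, A0))  # False
```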