December 19, 2004

XML and Entropy

Lately, I've been spending a lot of time reading outside my usual diet of programming books, in large part because I find that inspiration often strikes when you look at other fields and how problems were solved in them. One book that has set me thinking is Minds, Machines, and the Multiverse: The Quest for the Quantum Computer by Julian Brown, an intriguing discussion of both quantum computers and the multiverse interpretation of quantum probabilities.

In one section, Brown discusses the role of thermodynamics in information theory, and more specifically, the role of entropy. Over the years entropy has gained an almost mystical aura as the basis for all of Murphy's Laws, but in point of fact it is a pretty simple concept to understand - and it has definite applications to one of the central problems that I see with how business people use XML technologies.

Entropy, in its information theory form, is a measure of the total number of states that a given system can be in. For instance, think about two bytes: together they can hold any one of 65,536 possible states. Entropy is typically measured as the logarithm of that count (for information sets, the logarithm base 2, or log2), so the total entropy of that system of bits would be log2(65,536) or 16, which is, not coincidentally, the number of bits in two bytes.
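The two-byte arithmetic above can be checked in a few lines of Python. This is only a sketch: the function name entropy_bits is made up for illustration, and it assumes every state is counted once, as in the example.

```python
import math

def entropy_bits(num_states: int) -> float:
    """Return log2 of the number of states a system can be in."""
    if num_states < 1:
        raise ValueError("a system must have at least one state")
    return math.log2(num_states)

# Two bytes are 16 bits, so they can hold 2**16 distinct states.
two_byte_states = 2 ** 16
print(two_byte_states)                # 65536
print(entropy_bits(two_byte_states))  # 16.0
```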

One of the challenges faced by IT professionals working with XML is figuring out which tools work best for the kind of XML they are going to be working with. DOM or SAX manipulation typically does not handle complex XML well. XQuery is perhaps better for slightly more complex XML, but it lacks the recursive templating structure that works best for documents. The difficulty comes in determining at what point an XML resource shifts from one area of complexity to another.

Every XML document has a schema that describes the structure. If you make a few basic assumptions, you can in fact get an idea about the number of states that the schema allows:

  • Multiplicities of an identical structure count only once if the upper limit of such multiplicities is "unbounded".
  • PCDATA data is immaterial, whether as text or as attributes. However, an attribute with multiple NMTokens will be treated as having one state for each enumeration.
  • Alternatives within the schema will each be treated as separate trees for determining the total count of states.
  • If a given element can contain another element of that same name, then the count stops at that second element. This avoids infinite recursion.
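One way to read these rules is as a recursive count over a toy schema model. The sketch below is an interpretation, not a definitive method: the Element and Choice classes and the count_states function are hypothetical, and the choice to multiply the counts of sibling children (and of enumerated attributes) is an assumption that the rules above do not spell out.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)    # child Elements or Choices, each listed once
    enum_sizes: list = field(default_factory=list)  # number of values of each enumerated attribute

@dataclass
class Choice:
    alternatives: list  # each alternative is counted as a separate tree

def count_states(node, ancestors=frozenset()):
    """Count schema states under the assumptions stated in the lead-in."""
    if isinstance(node, Choice):
        # Alternatives add up: each one is its own tree.
        return sum(count_states(alt, ancestors) for alt in node.alternatives)
    if node.name in ancestors:
        # Same-named element nested inside itself: stop counting here.
        return 1
    states = 1
    for size in node.enum_sizes:
        states *= size  # one state per enumeration value (assumed to multiply)
    inner = ancestors | {node.name}
    for child in node.children:
        # Unbounded repetitions are modelled by listing the child once.
        states *= count_states(child, inner)  # assumption: siblings multiply
    return states  # PCDATA adds nothing, so a bare element has one state

invoice = Element(
    "invoice",
    children=[Element("customer"),
              Choice([Element("card"), Element("transfer")])],
    enum_sizes=[2],  # e.g. a status attribute with two allowed values
)
print(count_states(invoice))             # 4
print(math.log2(count_states(invoice)))  # 2.0
```

Under these assumptions the toy invoice has four states, for an entropy of 2. A different rule for combining siblings would give different numbers, so treat the output as illustrative only.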

By this measure, a schema with no variability would have an entropy of log2(1) = 0, a schema with one element of variability would have an entropy of log2(2) = 1, one with two elements of variability would have log2(3) = 1.585, and so forth. Most business documents (invoices and so forth) would likely have entropies in the neighborhood of 0 to 4, XML processes (such as an XSLT transformation) might be in the neighborhood of 10-12, and literary documents might have entropies in the neighborhood of 15-20. Keep in mind that these are logarithmic values - an entropy of 20 corresponds to 2 to the 20th power, or 1,048,576 (roughly a million) possible schema instances.
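Because the scale is logarithmic, it can help to convert a few entropy values back into state counts. The helper name states_for_entropy below is made up for this sketch, and the sample values are simply the upper ends of the ranges quoted above.

```python
import math

def states_for_entropy(bits: int) -> int:
    """Invert the log2 measure: how many states does a given entropy imply?"""
    return 2 ** bits

# Small cases from the text: 1, 2 and 3 states.
for states in (1, 2, 3):
    print(states, round(math.log2(states), 3))  # 0.0, 1.0, 1.585

# Upper ends of the three ranges mentioned above.
for bits in (4, 12, 20):
    print(bits, states_for_entropy(bits))  # 16, 4096, 1048576
```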

Entropy is important because it can clarify the domain in which it is best to work with a given document. I think XQuery provides a good case in point here. XQuery supports XPath, and so it has some of the advantages that XSLT has, but it's not really all that useful for dealing with documents. Converting a DocBook document into WordML or vice versa would be impractical at best in XQuery, while for many business schemas with comparatively low entropies, XSLT is definitely overkill.

Lately, it's become somewhat fashionable, especially among the business set, to deprecate XSLT in favor of XQuery for enterprise-level applications. I see nothing wrong with this. XSLT is not always easy to work with, requires a different way of looking at a set of problems, and is probably not worth the effort or overhead for many types of transformations. That does not mean that XSLT does not have a very important place in the ecosystem, and it's my own opinion that treating the choice as either/or will limit you when you do have to deal with high-entropy documents, as is the case with most document management applications.
