The XBIS encoding format mirrors the standard text form of an XML document in that all components of the document are present in the same order they'd appear in text. What differs is that XBIS uses a more compact representation of the components, and presents them in a more easily processed form.

The compactness comes mainly from taking advantage of the highly repetitive structure of a normal XML document, where the same element and attribute names are typically used many times over. XBIS defines each name as text only once, then uses a handle value to refer back to the name when it is repeated. This same approach is used with namespaces, so that even namespace prefixes are not repeated as text.

XBIS can also apply this approach to attribute values and character data, which often use the same text repeatedly. The extent to which this is done is an encoding option, but it does not affect the format itself; a reader does not need to know what options were used to generate the encoding in order to build a document representation from the encoded form. This is an important concern for general usage, since it allows documents to be exchanged between XBIS users without the need for any external information (such as a schema or parameter settings).

Besides the more compact representation, XBIS gains speed on the input side by presenting the document data in predigested form. This eliminates the need for any complex parsing of the input and allows the document to be reconstructed with minimal overhead.

Building Blocks

The XBIS encoding is built up from several simple types, which are described in this section.

Figure 1. Integer value representations

Integer Values

Positive integer values are used extensively in the encoding. The standard format for representing these values uses the low-order 7 bits of a byte for the actual value, with the high-order 8th bit used as a continuation flag - when the 8th bit is set, the next byte in the encoded stream contains another 7 bits of the value.

Figure 1 shows how this looks when applied to values of various sizes. Values in the range of 0-127 can be represented in a single byte, as shown in the top image. Values of 128-16383 require two bytes, as shown in the second image (where the upper byte comes first in the encoded form). Larger values require more bytes, all the way to a maximum of five bytes to represent the maximum possible integer value. In actual use, the values being encoded can generally be represented in one or two bytes.
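
The integer format described above can be sketched in a few lines of Python. This is an illustration only, not the XBIS implementation: the function names encode_uint and decode_uint are made up for this example, and the byte order (most significant 7-bit group first) is taken from the statement that the upper byte comes first.

```python
def encode_uint(value):
    """Encode a non-negative integer as 7-bit groups, most significant first."""
    if value < 0:
        raise ValueError("only non-negative values can be encoded")
    groups = [value & 0x7F]                    # final byte: continuation flag clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes: continuation flag set
        value >>= 7
    return bytes(reversed(groups))


def decode_uint(data, pos=0):
    """Return (value, next position) for the integer starting at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                    # flag clear: this was the last byte
            return value, pos
```

With this sketch, 127 encodes to the single byte 0x7F, while 128 encodes to the two bytes 0x81 0x00.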

Quick Values

Quick values are a way of representing a limited range of positive integer values within a portion of a byte. This format is often used in combination with flags in a byte. When a value is to be encoded in this manner it is first incremented. If the incremented value fits within the portion of the byte allowed for the quick value, the value is stored directly within the byte. Otherwise, a 0 is stored within the byte and the incremented value is encoded in the following byte(s) using the normal integer value encoding defined above.
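
As a rough sketch of the quick value rule, the following Python splits a value into the field stored in the byte and any overflow bytes that follow. The names encode_quick and decode_quick and the bits parameter (the width of the field) are assumptions made for this example; combining the field with the flag bits of the byte is left to the caller.

```python
def encode_uint(value):
    """Normal integer encoding: 7 bits per byte, most significant group first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def decode_uint(data, pos=0):
    """Return (value, next position) for a normal integer at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def encode_quick(value, bits):
    """Return (field, extra_bytes) for a quick value field that is `bits` wide."""
    incremented = value + 1
    if incremented < (1 << bits):
        return incremented, b""               # fits: stored directly in the field
    return 0, encode_uint(incremented)        # overflow: 0 in field, value follows


def decode_quick(field, data, pos=0):
    """Recover the original value from a field and the bytes that follow it."""
    if field != 0:
        return field - 1, pos
    incremented, pos = decode_uint(data, pos)
    return incremented - 1, pos
```

For a 5-bit field, the value 30 is stored as 31 directly in the field, while 31 is stored as a 0 field followed by the single byte 0x20.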

Strings

Strings are the basic building blocks of the serial form. The general string format uses a leading length value which gives the number of characters (not bytes) in the string, plus one. The value 0 is used for a null string, as opposed to the value 1 which represents a string of zero characters.

This length value is encoded as a normal integer value, as described above. It is followed immediately by the actual characters of the string. Each 16-bit Java character is also encoded as an integer value, so the length of the string data in bytes can potentially be up to three times the number of characters in the string (but for characters in the standard ASCII range the length in bytes will be the same as the number of characters).

String lengths can also be encoded as quick values in some cases. These work slightly differently in that the actual character length of the string is encoded as a quick value, rather than the length-plus-one value used in the general format. Since there is no way of representing a null string with this encoding, quick values are used for string lengths only when the string is required to be non-null.
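
The general string format can be sketched as follows. This is an assumption-laden illustration rather than the XBIS code: the function names are invented, None stands in for a null string, and Java's 16-bit characters are modeled as UTF-16 code units.

```python
import struct


def encode_uint(value):
    """Normal integer encoding: 7 bits per byte, most significant group first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def decode_uint(data, pos=0):
    """Return (value, next position) for a normal integer at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def encode_string(text):
    """General format: (character count + 1), then one integer per character."""
    if text is None:
        return encode_uint(0)                  # 0 marks a null string
    raw = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(raw) // 2), raw)   # 16-bit code units
    out = bytearray(encode_uint(len(units) + 1))
    for unit in units:
        out += encode_uint(unit)
    return bytes(out)


def decode_string(data, pos=0):
    """Return (string or None, next position)."""
    length, pos = decode_uint(data, pos)
    if length == 0:
        return None, pos
    units = []
    for _ in range(length - 1):
        unit, pos = decode_uint(data, pos)
        units.append(unit)
    return struct.pack(">%dH" % len(units), *units).decode("utf-16-be"), pos
```

An ASCII string such as "abc" takes one byte per character after the length byte, while a character such as U+FFFF takes the full three bytes.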

Handles

Handle values are used to refer to previously defined items, which include element and attribute names, namespaces (both active and defined), namespace URIs, and optionally attribute value and character data strings. Each type of item listed uses a separate set of handles in order to keep the handle values as small as possible, giving the most efficient encoding. The context of a handle reference always determines which type of handle is being referenced.

Actual values start at 1. Except for namespaces (which use a pair of predefined handles), a handle value of 1 will always represent the first item defined of that type, a handle of 2 the second, and so on.

A 0 in a handle value is used to indicate that a new item of the appropriate type is being defined. The new item is implicitly assigned the next handle value of that type and may then be referenced by that handle value later in the encoding.

When handles are encoded in quick value fields the actual value stored is one greater than the handle value, since as described above the quick value format makes special use of the 0 value. In this case a 1 indicates that a new item is being defined.
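
The handle rules can be illustrated with a small table class, one instance of which would be kept per item type. This is only a sketch; the class HandleTable and its method names are hypothetical and are not taken from the XBIS code.

```python
class HandleTable:
    """Tracks the items of one type (for example, element names) by handle."""

    def __init__(self):
        self._items = []      # item with handle n is stored at index n - 1
        self._index = {}      # item -> handle, used on the writing side

    def reference(self, item):
        """Writer side: return the handle to emit, or 0 if the item is new.

        A new item is implicitly assigned the next handle of this type.
        """
        handle = self._index.get(item, 0)
        if handle == 0:
            self._items.append(item)
            self._index[item] = len(self._items)
        return handle

    def resolve(self, handle):
        """Reader side: look up a previously defined item by its handle."""
        if handle < 1 or handle > len(self._items):
            raise KeyError("undefined handle: %d" % handle)
        return self._items[handle - 1]


names = HandleTable()
first = names.reference("title")    # new item: 0 is written, handle 1 assigned
second = names.reference("author")  # new item: 0 is written, handle 2 assigned
again = names.reference("title")    # repeat: handle 1 is written
```

When the handle is written into a quick value field, the stored field value would be one greater than the number returned here.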

Names and Namespaces

Element and attribute name definitions use a common format. The first byte of the definition, shown in Figure 2, contains a quick value field for an active namespace handle, along with a separate quick value field for the local name length.

Figure 2. Name definition byte

If the active namespace handle value is larger than can be represented in the quick value field (indicated by a 0 value in the field), the actual handle value immediately follows this definition byte. Two active namespace handles are predefined. Handle 1 is always assigned to the "no namespace" namespace, and handle 2 is always assigned to the xml namespace. Other namespaces are assigned active handles as they're declared, and these handles may be reused for other namespaces outside the range of the first declaration.

If the namespace used for a name has not previously been defined it may be included in the name definition. This uses the handle value 0 to indicate a new definition (as discussed in the section on Handles, above). In this case the name definition byte is followed by the actual namespace definition, starting with a byte of the form shown in Figure 3. Here one quick value field gives the namespace URI handle, while the second quick value field gives the prefix length. This first byte of the namespace definition is optionally followed by the encoded characters of the namespace URI (if this is the first time that URI has been used, as indicated by a handle value of 0), and then by the encoded characters of the prefix (if any).

Figure 3. Namespace definition lead byte

Going back to the basic name definition, the additional information for the local name is encoded after any additional information for the namespace. If the name length quick value field in the name definition byte is too small to hold the length, the full length follows any namespace information. It is followed by the encoded characters of the local name.

Structure Encoding

XBIS is a stream encoding which is mainly intended for use with single documents. However, the format allows for encoding arbitrary combinations of elements and documents, and there are cases where this may be very useful to an application. Consider the case where multiple documents of the same type are being transferred from one program to another, for instance. The first document encoded would define most or all of the element and attribute names used in the entire series of documents, allowing the names to be referenced as simple handles in all the following encoded documents.

Each XBIS stream starts with four bytes reserved for XBIS itself. The first byte is a format identifier, which is set by the encoder to specify the format version used to encode the document and checked by the decoder to ensure that it is able to process that format. The only value used at present is 1, identifying the format defined by this document.

The second byte is an identifier for the adapter used to drive the encoding. This value is set by the encoder for information purposes only; the decoder reads this value and makes it available to the application but must not otherwise use it. This requirement is intended to preserve compatibility between all XBIS adapters. There are currently four values defined for this byte: 1 for the SAX2 adapter, 2 for the DOM adapter, 3 for the dom4j adapter, and 4 for the JDOM adapter.

The remaining two bytes of the XBIS header are reserved for future use. They are currently written as 0 values and are ignored on input.
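
A minimal sketch of writing and checking the four-byte header might look like this. The function and constant names are invented for this example; only the byte layout and the defined values come from the description above.

```python
FORMAT_VERSION = 1
ADAPTERS = {1: "SAX2", 2: "DOM", 3: "dom4j", 4: "JDOM"}


def write_header(adapter_id):
    """Build the header: format byte, adapter byte, two reserved 0 bytes."""
    return bytes([FORMAT_VERSION, adapter_id, 0, 0])


def read_header(data):
    """Check the format byte and return (adapter id, position after header)."""
    if len(data) < 4:
        raise ValueError("truncated XBIS header")
    version, adapter_id = data[0], data[1]
    if version != FORMAT_VERSION:
        raise ValueError("unsupported XBIS format version: %d" % version)
    # The adapter id is informational only, and the two reserved bytes are
    # ignored on input, so neither is validated here.
    return adapter_id, 4
```

The adapter id is simply handed back to the caller, which could look it up in ADAPTERS for display.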

After the header the stream consists of one or more nodes. These are the primary document structure components, representing everything from a complete document down to a comment or character data string. Attributes are not considered nodes in the XBIS encoding, though, and are handled separately.

At the top level only two types of nodes are valid, element nodes and document nodes (when XBIS is used for complete documents, only the document nodes are valid at the top level). Each of these may in turn contain other nodes (including element nodes) as content. The content node definitions are nested within the definition of the containing node.

Each node begins with a node definition byte, which may be followed by additional information for the node. This node definition byte uses different formats for different types of nodes, with the high-order bits used as flags to identify the format.

Element Nodes

Element nodes use the format shown in Figure 4. The high-order bit of the node definition byte is always a 1 for an element node, and the next two bits are used as flags for whether the element has, respectively, attributes (including namespace declarations) and content (0 if not, 1 if so). The remaining bits are a quick value for the element name handle, extended if necessary into the following byte(s). If the name has not previously been defined, the new name definition immediately follows the node definition byte.
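
The element node definition byte can be unpacked as in the sketch below. The bit positions are inferred from the text (high-order bit, then the attributes flag, then the content flag, then a 5-bit quick value) and should be treated as an assumption, since the figure itself is not reproduced here; the function and constant names are hypothetical.

```python
ELEMENT_FLAG = 0x80      # high-order bit: always 1 for an element node
HAS_ATTRIBUTES = 0x40    # assumed position of the "has attributes" flag
HAS_CONTENT = 0x20       # assumed position of the "has content" flag
NAME_FIELD_MASK = 0x1F   # remaining 5 bits: quick value for the name handle


def decode_uint(data, pos=0):
    """Return (value, next position) for a normal integer at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def read_element_lead(data, pos=0):
    """Return (has_attributes, has_content, name_handle, next position)."""
    byte = data[pos]
    pos += 1
    if not byte & ELEMENT_FLAG:
        raise ValueError("not an element node definition byte")
    stored = byte & NAME_FIELD_MASK
    if stored == 0:                       # quick value overflowed the field
        stored, pos = decode_uint(data, pos)
    name_handle = stored - 1              # 0 means a new name definition follows
    return bool(byte & HAS_ATTRIBUTES), bool(byte & HAS_CONTENT), name_handle, pos
```

Under these assumptions the byte 0xE2 describes an element with attributes and content whose name has handle 1, and 0x81 describes an empty element whose name definition follows.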

Figure 4. Node definition byte - Element

If the element has attributes, these are next. Attributes begin with an attribute definition byte, taking one of the forms shown in Figure 5. The top format is used for attributes with ordinary (unshared) values. The bottom format is used for attributes with shared values, which use handles to avoid encoding the same text repeatedly. Both ordinary and shared attribute values may be used in any combination.

Figure 5. Attribute definition byte formats

Both attribute definition byte formats use the low-order bits of the byte for a quick value of the attribute name handle (extended, if necessary, to the following byte(s)). If the name has not previously been defined the name definition immediately follows the attribute definition byte.

The actual value of the attribute is next. For ordinary values, and for new shared values (as indicated by the flag in the attribute definition byte), these are strings in the general format. For previously-defined shared values the value is represented by a handle which identifies the value text.

The list of attributes for an element is terminated by a 0 value in place of an attribute definition byte (which can never be 0). If the node definition byte for the element does not indicate that attributes are present this 0 value is not included in the encoding.

If the element has content, the content nodes are next. The content nodes can be of any type (subject to XML structure concerns - a document as content of an element is obviously invalid, for instance). Each begins with a node definition byte, and as with the attributes the list of content nodes is terminated by a 0 byte in place of a node definition byte.

Figure 6. Node definition byte - Plain text

Text Nodes

Plain text (ordinary character data) nodes use the format shown in Figure 6. This gives the text length as a quick value in the low-order bits of the node definition byte (extended, if necessary, to the following byte(s)). It is followed by the actual encoded characters of text.

Figure 7. Node definition byte - Shared text

Shared text nodes use the format shown in Figure 7. This gives the handle for shared text in the low-order bits of the node definition byte (extended, if necessary, to the following byte(s)). If the text has not previously been defined (as indicated by a 0 value for the handle), the text definition immediately follows the node definition byte, as a string in the general format.

Both types of text nodes can be used within a single document, in any combination.

Namespace Declaration Nodes

Namespace declaration nodes can be used both to define new namespaces and to reference previously-defined namespaces. The first byte uses the format shown in Figure 8, with a quick value field for a namespace definition handle in the low-order bits. If the node is redeclaring a previously-defined namespace, the handle value will reference that namespace definition. As usual, a handle value of 0 is used to indicate a new handle definition. In this case the actual namespace definition (in the format described above, under Names and Namespaces) follows the first byte.

Figure 8. Node definition byte - Namespace

Namespace declarations always apply in the scope of an element. When a namespace declaration node is used it must precede the element it applies to. Namespaces may also be declared as part of an element or attribute name definition, when the namespace applies to that name. When a namespace is first defined it's assigned a namespace definition handle, and each time it's declared it's assigned an active namespace handle. The active namespace handle is only valid within the scope of the element declaring the namespace, while the namespace definition handle is valid from the point of definition on.

Each namespace definition associates a particular prefix (which may be the empty prefix) with a particular namespace URI. If multiple prefixes are defined for the same namespace URI, a separate namespace definition is included in the serial form for each prefix. The actual URI will only be encoded with the first namespace definition, though, and will be referenced using a handle in any other namespace definitions.

Figure 9. Node definition byte - Other

Other Nodes

The other node types use a simple format in which the node definition byte just identifies the type of node, and any additional information for that node type is in the following bytes (with text items using the general string format, as described under Strings, above). These other node types are:

  1. Document node: followed by content node list, as for element node

  2. Comment node: followed by actual comment text

  3. CDATA node: followed by CDATA text

  4. Processing Instruction node: followed by target and value text items

  5. Document Type node: followed by name, public id, and system id text items

  6. Notation node: followed by name, public id, and system id text items

  7. Unparsed Entity declaration node: followed by name, public id, system id, and notation name text items

  8. Skipped Entity node: followed by the entity name text

  9. Element declaration node: followed by element name and content model text items

  10. Attribute declaration node: followed by owning element name, attribute name, attribute type, defaulting type, and default value text items

  11. External Entity declaration node: followed by name, public id, and system id text items

The value 0 for a node definition byte is used to indicate the end of a list of node definitions. All other values not included in the above list are reserved and currently unused.

Note that most of these node types are not required for preserving the canonical form of XML documents, since XML Canonicalization discards Document Type, Notation, Unparsed/Skipped/External Entity, and Element/Attribute declaration information. These types are supported by XBIS to allow its use as a serialization mechanism for XML document models, but they should be considered optional.