Structure and Syntax of XML Documents
Syntax defines how the words of a language are arranged into phrases and sentences and how components (like prefixes and suffixes) are combined to make words. XML documents are composed of markup and content—content (text) is hierarchically structured by markup tags. There are six kinds of markup that can occur in an XML document: elements, entity references, comments, processing instructions, marked sections, and document type declarations. The following subsections introduce each of these markup concepts.
XML Elements indicate logical parts of a document and they are the most common form of markup. An
element is delimited by tags which are surrounded by angle brackets (“<”, “>” and “</”, “/>”).
The tags give a name to the document part they surround—the element name should be given to
convey the nature or meaning of the content. A non-empty element begins with a start-tag,
<tag>, and ends with an end-tag,
</tag>. The text between the start-tag and end-tag is called
the element’s content. In the above example of a letter document, the element
<salutation>Dear Mrs. Robinson,</salutation> indicates the salutation part of
the letter. Rules for forming an element name are:
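- Must start with a letter character (the XML specification also permits an underscore or a colon as the first character)
- Can include all standard programming language identifier characters, i.e., [0-9A-Za-z] as well as underscore _, hyphen -, and colon :
- Is case sensitive, so <name> and <Name> are different element names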
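The naming rules can be sketched as a small check. This is a simplified, ASCII-only approximation written for illustration: the function name is_valid_name and the pattern are assumptions, not part of any XML library, and the full XML specification additionally allows a period inside a name as well as many non-ASCII letters.

```python
import re

# ASCII-only approximation of the XML name rules: the first character is a
# letter, underscore, or colon; later characters may also be digits, hyphens,
# or periods. Non-ASCII letters allowed by the specification are left out.
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` fits the simplified element-name rules."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("postal-code"))  # True
print(is_valid_name("2nd-line"))     # False: starts with a digit
print("Name" == "name")              # False: names are case sensitive
```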
Some elements may be empty, in which case they have no content. An empty element can begin
and end at the same place in which case it is denoted as
<tag/>. Elements can contain subelements. The start tag of an element can have, in addition to the element name, related (attribute,
value) pairs. Elements can also have mixed content where character data can appear alongside
subelements, and character data is not confined to the deepest subelements. Here is an example:
<salutation>Dear <name>Mrs. Robinson</name>, </salutation>
Notice the text appearing between the element
<salutation> and its child element
<name>.
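To see how a parser exposes mixed content, here is a minimal sketch using Python's standard xml.etree.ElementTree module (the choice of library is an assumption for illustration). ElementTree stores character data that precedes the first child in the parent's text attribute, and character data that follows a child in that child's tail attribute.

```python
import xml.etree.ElementTree as ET

salutation = ET.fromstring(
    "<salutation>Dear <name>Mrs. Robinson</name>, </salutation>"
)
name = salutation.find("name")

print(repr(salutation.text))  # 'Dear '          (text before the child)
print(repr(name.text))        # 'Mrs. Robinson'  (content of the child)
print(repr(name.tail))        # ', '             (text after the child)

# itertext() walks all character data in document order.
full_text = "".join(salutation.itertext())
print(repr(full_text))        # 'Dear Mrs. Robinson, '
```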
Attributes are name-value pairs that occur inside start-tags after the element name. A start tag can have zero or more attributes. For example,
<date format="English_US">February 29, 1997</date>
is an element named
date with the attribute
format having the value
English_US, which indicates
that month is shown first and named in English. Attribute names are formed using the same rules
as element names (see above). In XML, all attribute values must be quoted. Both single and
double quotes can be used, provided they are correctly matched.
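The quoting rule can be checked with a short sketch, again assuming Python's standard xml.etree.ElementTree as the parser. It uses the date element from the letter example in this chapter; single-quoted and double-quoted values parse identically, while an unquoted value is rejected.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<date format="English_US">February 29, 1997</date>')
single_quoted = ET.fromstring("<date format='English_US'>February 29, 1997</date>")

print(double_quoted.tag)            # date
print(double_quoted.get("format"))  # English_US
print(double_quoted.attrib == single_quoted.attrib)  # True

# An unquoted attribute value is a syntax error, so the parser raises.
try:
    ET.fromstring("<date format=English_US>February 29, 1997</date>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)            # False
```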
XML Entities and Entity References
XML reserves some characters to distinguish markup from plain text (content). The left angle bracket, <, for instance, identifies the beginning of an element’s start- or end-tag. To support the reserved characters as part of content and avoid confusion with markup, there must be an alternative way to represent them. In XML, entities are used to represent these reserved characters. Entities are also used to refer to often repeated or varying text and to include the content of external files. In this sense, entities are similar to macros.
Every entity must have a unique name. Defining your own entity names is discussed in the
section on entity declarations (below). In order to use an entity, you simply
reference it by name. Entity references begin with the ampersand and end with a semicolon, like
&entityname;. For example, the
lt entity inserts a literal
< into a document. So to
include the string
<non-element> as plain text, not markup, inside an XML document, all
reserved characters should be escaped, like so:
&lt;non-element&gt;
A special form of entity reference, called a character reference, can be used to insert arbitrary Unicode characters into your document. This is a mechanism for inserting characters that cannot be typed directly on your keyboard.
Character references take one of two forms: decimal references,
&#8478;, and hexadecimal references,
&#x211E;. Both of these refer to character number U+211E from Unicode (which is
the standard Rx prescription symbol).
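Both mechanisms can be tried out in a few lines. The sketch below assumes Python's standard library: xml.sax.saxutils.escape replaces the reserved characters &, <, and > with entity references, and xml.etree.ElementTree resolves entity and character references when it parses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escaping turns reserved characters into entity references.
escaped = escape("<non-element>")
print(escaped)         # &lt;non-element&gt;

# Parsing turns the references back into the literal characters.
paragraph = ET.fromstring(f"<p>{escaped}</p>")
print(paragraph.text)  # <non-element>

# Decimal and hexadecimal character references name the same code point.
decimal = ET.fromstring("<p>&#8478;</p>").text
hexadecimal = ET.fromstring("<p>&#x211E;</p>").text
print(decimal == hexadecimal == "\u211e")  # True
```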
A comment begins with the characters
<!-- and ends with -->. A comment can span multiple
lines in the document and contain any data except the literal string
"--". You can place
comments anywhere in your document outside other markup. Here is an example:
<!-- ******************** My comment is imminent. -->
Comments are not part of the textual content of an XML document and the parser will ignore them. The parser is not required to pass them along to the application, although it may do so.
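The optional nature of comment reporting is visible in practice. In the sketch below, which assumes Python 3.8 or later and the standard xml.etree.ElementTree module, the default parser drops comments, while a TreeBuilder created with insert_comments=True keeps them in the tree. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<letter><!-- My comment is imminent. -->"
    "<closing>Sincerely,</closing></letter>"
)

# Default behavior: the comment never reaches the application.
without_comments = ET.fromstring(DOC)

# Opt-in behavior: ask the tree builder to keep comments as nodes.
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
with_comments = ET.fromstring(DOC, parser=keeping_parser)

print([child.tag for child in without_comments])  # ['closing']
print(len(with_comments))                         # 2
print(with_comments[0].tag is ET.Comment)         # True
print(repr(with_comments[0].text))                # ' My comment is imminent. '
```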
Processing instructions (PIs) allow documents to contain instructions for applications that will import the document. Like comments, they are not textually part of the XML document, but this time around the XML processor is required to pass them to an application.
Processing instructions have the form:
<?name pidata?>. The name, called the PI target,
identifies the PI to the application. For example, you might have
<?font start italic?> and
<?font end italic?>, which tell the application to start italicizing the text
and to stop, respectively.
Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them. Processing instruction names beginning with xml are reserved for XML standardization.
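An application that follows this advice might look like the sketch below. It assumes Python's standard xml.dom.minidom, which exposes each processing instruction as a node with target and data attributes. The set RECOGNIZED_TARGETS, the function name, and the spell-check instruction are hypothetical, added only to show an unrecognized target being skipped.

```python
from xml.dom import Node, minidom

DOC = (
    "<p>plain <?font start italic?>emphasized<?font end italic?> plain"
    "<?spell-check off?></p>"
)

# Hypothetical: the only PI target this application understands.
RECOGNIZED_TARGETS = {"font"}


def recognized_instructions(xml_text):
    """Return (target, data) pairs for the PIs this application recognizes."""
    document = minidom.parseString(xml_text)
    found = []
    for node in document.documentElement.childNodes:
        if node.nodeType != Node.PROCESSING_INSTRUCTION_NODE:
            continue  # text and element nodes are handled elsewhere
        if node.target in RECOGNIZED_TARGETS:
            found.append((node.target, node.data))
    return found


print(recognized_instructions(DOC))
# [('font', 'start italic'), ('font', 'end italic')]
```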
CDATA Sections in XML
In a document, a CDATA section instructs the parser to ignore the reserved markup characters. So,
instead of using entities to include reserved characters in the content as in the above example of
<non-element>, we can write:
<![CDATA[ <non-element> ]]>
Between the start of the section,
<![CDATA[ and the end of the section, ]]>, all character data
are passed verbatim to the application, without interpretation. Elements, entity references,
comments, and processing instructions are all unrecognized and the characters that comprise them
are passed literally to the application. The only string that cannot occur in a
CDATA section is ]]>.
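A quick way to confirm that a CDATA section and entity escaping are two spellings of the same content is to parse both. This sketch assumes Python's standard xml.etree.ElementTree; the enclosing p element is made up for illustration.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<p><![CDATA[ <non-element> ]]></p>")
with_entities = ET.fromstring("<p> &lt;non-element&gt; </p>")

# The markup characters arrive as plain text; no child element is created.
print(repr(with_cdata.text))                   # ' <non-element> '
print(len(with_cdata))                         # 0
print(with_cdata.text == with_entities.text)   # True
```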
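Document Type Declarations and DTDs
A document type declaration is used mainly to define constraints on the logical structure of documents, that is, the valid tags and their arrangement/ordering. The constraints themselves make up the document type definition (DTD), which the declaration either contains or references.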
This is about as much as an average user needs to know about XML. Obviously, it is simple and concise. XML is designed to handle almost any kind of structured data—it constrains neither the vocabulary (set of tags) nor the grammar (rules of how the tags combine) of the markup language that the user intends to create. XML allows you to create your own tag names. Another way to think of it is that XML only defines punctuation symbols and rules for forming “sentences” and “paragraphs,” but it does not prescribe any vocabulary of words to be used. Inventing the vocabulary is left to the language designer.
But for any given application, it is probably not meaningful for tags to occur in a completely
arbitrary order, even though, from a strictly syntactic point of view, there is nothing wrong
with such an XML document. So, if the document is to have meaning, and certainly if you are
writing a stylesheet or application to process it, there must be some constraint on the sequence
and nesting of tags, stating, for example, that a
<chapter> is a sub-element of a
<book>, and not the other
way around. These constraints can be expressed using an XML schema (which we’ll see in the next chapter).
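Before turning to schemas, it may help to see what such a constraint looks like when checked by hand. The sketch below is only an illustration of the chapter-inside-book rule, written against Python's standard xml.etree.ElementTree; the function name chapters_outside_books is hypothetical, and a schema language would express the same rule declaratively.

```python
import xml.etree.ElementTree as ET


def chapters_outside_books(root):
    """Return every <chapter> element whose parent is not a <book>."""
    # A <chapter> at the root has no parent at all, so it is misplaced.
    misplaced = [root] if root.tag == "chapter" else []
    for parent in root.iter():
        for child in parent:
            if child.tag == "chapter" and parent.tag != "book":
                misplaced.append(child)
    return misplaced


good = ET.fromstring("<book><chapter>One</chapter><chapter>Two</chapter></book>")
bad = ET.fromstring("<chapter><book>One</book></chapter>")

# Both documents are syntactically fine; only the first obeys the rule.
print(len(chapters_outside_books(good)))  # 0
print(len(chapters_outside_books(bad)))   # 1
```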
Complete XML Example
The letter document discussed in this chapter can be represented in XML as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Comment: A personal letter marked up in XML. -->
<letter language="en-US" template="personal">
  <sender>
    <name>Mr. Charles Morse</name>
    <address kind="return">
      <street>13 Takeoff Lane</street>
      <city>Talkeetna</city>
      <state>AK</state>
      <postal-code>99676</postal-code>
    </address>
  </sender>
  <date format="English_US">February 29, 1997</date>
  <recipient>
    <name>Mrs. Robinson</name>
    <address kind="delivery">
      <street>1 Entertainment Way</street>
      <city>Los Angeles</city>
      <state>CA</state>
      <postal-code>91011</postal-code>
    </address>
  </recipient>
  <salutation style="formal">Dear Mrs. Robinson,</salutation>
  <body>
    Here's part of an update ...
  </body>
  <closing>Sincerely,</closing>
  <signature>Charlie</signature>
</letter>
Line 1 begins the document with the XML declaration,
<?xml ... ?>, which looks like a processing instruction. Although not required, the
declaration explicitly identifies the document as an XML
document and indicates the version of XML to which it was authored.
A variation on the above example is to define the components of a postal address (the <street>,
<city>, <state>, and <postal-code> sub-elements of each <address>) as element attributes:
<address kind="return" street="13 Takeoff Lane" city="Talkeetna" state="AK" postal-code="99676" />
Notice that this element has no content, i.e., it is an empty element. This produces a more concise markup, particularly suitable for elements with well-defined, simple, and short content. One quickly notices that XML encourages naming the elements so that the names describe the nature of the named object, as opposed to describing how it should be displayed or printed. In this way, the information is self-describing, so it can be located, extracted, and manipulated as desired. This kind of power has previously been reserved for organized scalar information managed by database systems.
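The two address encodings carry the same information, which a short sketch can confirm. It assumes Python's standard xml.etree.ElementTree; the FIELDS tuple is a helper introduced here for illustration. Sub-element content is read with findtext, attribute values with get.

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    """<address kind="return">
         <street>13 Takeoff Lane</street>
         <city>Talkeetna</city>
         <state>AK</state>
         <postal-code>99676</postal-code>
       </address>"""
)
as_attributes = ET.fromstring(
    '<address kind="return" street="13 Takeoff Lane" city="Talkeetna" '
    'state="AK" postal-code="99676" />'
)

FIELDS = ("street", "city", "state", "postal-code")
from_elements = {field: as_elements.findtext(field) for field in FIELDS}
from_attributes = {field: as_attributes.get(field) for field in FIELDS}

print(from_elements == from_attributes)  # True
print(len(as_attributes))                # 0: the empty element has no children
print(from_attributes["city"])           # Talkeetna
```

Because the names describe what each value is, either form lets an application extract the city or postal code without knowing anything about how the letter is displayed.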
A text document is an XML document if it has a proper syntax as per the XML specification. Such a document is called a well-formed document. An XML document is well-formed if it conforms to the XML syntax rules:
- Begins with the XML declaration
<?xml ... ?> (recommended, although, as noted above, not strictly required)
- Has exactly one root element, called the root or document element, and no part of it can appear in the content of any other element.
- Contains one or more elements delimited by start-tags and end-tags (also remember that XML tags are case sensitive)
- All elements are closed, that is, all start-tags must match end-tags
- All elements must be properly nested within each other, such as
<salutation>Dear <name>Mrs. Robinson</name>, </salutation>
- All attribute values must be enclosed in quotation marks
- XML entities must be used for special characters. Each of the parsed entities that are referenced directly or indirectly within the document is well-formed.
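In practice, the simplest well-formedness test is to hand the text to a parser and see whether it objects. The sketch below assumes Python's standard xml.etree.ElementTree, which raises ParseError on malformed input; the function name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<letter><closing>Sincerely,</closing></letter>",
    "overlapping tags": "<letter><closing>Sincerely,</letter></closing>",
    "unquoted attribute": "<letter language=en-US></letter>",
    "two root elements": "<letter/><letter/>",
    "case mismatch": "<Name>Charlie</name>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
# Only "properly nested" prints True.
```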
Even if documents are well-formed they can still contain errors, and those errors can have serious consequences. XML Schemas provide a further level of error checking. A well-formed XML document may in addition be valid if it meets constraints specified by an associated XML Schema.
Document-Centric vs. Data-Centric XML
Generally speaking, there are two broad application areas of XML technologies. The first relates to document-centric applications, and the second to data-centric applications. Because XML can be used in so many different ways, it is important to understand the difference between these two categories.
Initially, XML’s main application was in semi-structured document representation, such as technical manuals, legal documents, and product catalogs. The content of these documents is typically meant for human consumption, although it could be processed by any number of applications before it is presented to humans. The key element of these documents is semi-structured marked-up text. A good example is the correspondence letter (under “Complete XML Example”) above.
By contrast, data-centric XML is used to mark up highly structured information such as the textual representation of relational data from databases, financial transaction information, and programming language data structures. Data-centric XML is typically generated by machines and is meant for machine consumption. It is XML’s natural ability to nest and repeat markup that makes it the perfect choice for representing these types of data.
Key characteristics of data-centric XML:
- The ratio of markup to content is high. The XML includes many different types of tags. There is no long-running text.
- The XML includes machine-generated information, such as the submission date of a purchase order using a date-time format of year-month-day. A human authoring an XML document is unlikely to enter a date-time value in this format.
- The tags are organized in a highly structured manner. Order and positioning matter, relative to other tags.
- Markup is used to describe what a piece of information means rather than how it should be presented to a human.
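These characteristics are easy to see in a machine-generated fragment. The sketch below assumes Python's standard xml.etree.ElementTree and datetime modules; every element and attribute name in it (purchase-order, submitted, items, item, sku, quantity) is hypothetical, chosen only to illustrate a high markup-to-content ratio, a year-month-day date, and strictly structured nesting.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_order(order_id, submitted, items):
    """Build a hypothetical purchase-order element from plain Python data."""
    order = ET.Element("purchase-order", {"id": order_id})
    # isoformat() yields the machine-friendly year-month-day form.
    ET.SubElement(order, "submitted").text = submitted.isoformat()
    items_element = ET.SubElement(order, "items")
    for sku, quantity in items:
        item = ET.SubElement(items_element, "item", {"sku": sku})
        ET.SubElement(item, "quantity").text = str(quantity)
    return order


order = build_order("PO-1001", date(2024, 2, 29), [("A-17", 2), ("B-04", 1)])
print(ET.tostring(order, encoding="unicode"))
print(order.findtext("submitted"))  # 2024-02-29
```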
An interesting example of data-centric XML is the XML Metadata Interchange (XMI), which is an OMG standard for exchanging metadata information via XML. The most common use of XMI is as an interchange format for UML models, although it can also be used for serialization of models of other languages (metamodels). XMI enables easy interchange of metadata between UML-based modeling tools and MOF (Meta-Object Facility)-based metadata repositories in distributed heterogeneous environments.