Document Type Definition (DTD)

What is DTD?

Document Type Definition (DTD) is a schema language for XML inherited from SGML. It was the original means of describing XML documents, before XML Schema was developed. A DTD is one of the ways to define the structure of XML documents, i.e., the document’s metadata. Syntactically, a DTD is a sequence of declarations. There are four kinds of declarations in a DTD:

1. element type declarations, used to define tags;
2. attribute list declarations, used to define tag attributes;
3. entity declarations, used to define entities; and,
4. notation declarations, used to define data type notations.

Each declaration has the form of a markup representation, starting with a keyword followed by the production rule that specifies how the content is created:

<!keyword production-rule>


where the possible keywords are: ELEMENT, ATTLIST (for attribute list), ENTITY, and NOTATION. Next, I describe these declarations.

Element Type Declarations

Element type declarations identify the names of elements and the nature of their content, thus putting a type constraint on the element. A typical element type declaration looks like this:

The parts of the declaration are, in order: the declaration type (ELEMENT), the element name, and the element’s content model (the definition of allowed content, here a list of names of child elements).

<!ELEMENT chapter (title, paragraph+, figure?)>
<!ELEMENT title (#PCDATA)>

The first declaration identifies the element named chapter. Its content model follows the element name. The content model defines what an element may contain. In this case, a chapter must contain a title followed by one or more paragraphs, and may end with a figure. The commas between element names indicate that the elements must occur in the order listed. The plus after paragraph indicates that it must occur at least once and may be repeated. The question mark after figure indicates that it is optional (it may be absent). A name with no punctuation, such as title, must occur exactly once. The following table summarizes the meaning of the symbol after an element:

Kleene symbol   Meaning
none            The element must occur exactly once
?               The element is optional (zero or one occurrence allowed)
*               The element may occur zero or more times
+               The element must occur one or more times
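
As a minimal sketch of how the four forms combine in one content model, consider the declaration below. The element names (recipe, title, note, ingredient, step) are hypothetical and chosen only for illustration; they do not come from the chapter example.

```xml
<!-- Hypothetical element names, chosen only to illustrate the four forms -->
<!ELEMENT recipe (title, note?, ingredient*, step+)>
<!-- title:      no symbol, exactly once   -->
<!-- note:       ?, zero or one occurrence -->
<!-- ingredient: *, zero or more           -->
<!-- step:       +, one or more            -->
```

Under this declaration, a recipe holding only a title followed by a single step would be valid, while one with two note elements, or with a step placed before the title, would not.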

Declarations for paragraph, title, figure, and all other elements used in any content model must also be present for an XML processor to check the validity of a document. In addition to element names, the special symbol #PCDATA is reserved to indicate character data. PCDATA stands for parsed character data. Elements that contain only other elements are said to have element content. Elements that contain both other elements and #PCDATA are said to have mixed content. For example, the definition for paragraph might be:

<!ELEMENT paragraph (#PCDATA | quote)*>


The vertical bar indicates an “or” relationship, and the asterisk indicates that the group may occur zero or more times; therefore, by this definition, a paragraph may contain any amount of character data and any number of quote elements, mixed in any order. All mixed content models that include child elements must have this form: #PCDATA must come first, all of the elements must be separated by vertical bars, and the entire group must be followed by an asterisk. Two other content models are possible: EMPTY indicates that the element has no content (it is written either as an empty-element tag or as a start-tag immediately followed by its end-tag), and ANY indicates that any content is allowed. The ANY content model is sometimes useful during document conversion, but should be avoided in a production environment because it disables all content checking in that element.
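
The three kinds of content model described above can be placed side by side in a short sketch. The paragraph declaration is the one from the text; the quote declaration and the element names pagebreak and scratch are assumptions added for illustration.

```xml
<!-- Mixed content: character data and quote elements in any order -->
<!ELEMENT paragraph (#PCDATA | quote)*>
<!ELEMENT quote (#PCDATA)>
<!-- EMPTY: the element carries no content (hypothetical element name) -->
<!ELEMENT pagebreak EMPTY>
<!-- ANY: any mix of text and declared elements (hypothetical element name) -->
<!ELEMENT scratch ANY>
```

With these declarations, <paragraph>He said <quote>hello</quote> twice.</paragraph> would be valid, and the empty element could be written as <pagebreak/>.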

Attribute List Declarations

Elements that have one or more attributes are specified in the DTD using attribute list declarations. For example, a figure element with a caption attribute could be declared as <!ATTLIST figure caption CDATA #REQUIRED>, which names, in order, the element, the attribute, the attribute type, and the default declaration.

Here CDATA stands for character data, and #REQUIRED means that the caption attribute of figure must be present. Another default declaration is #FIXED followed by a value, which makes the attribute act like a constant. Yet another is #IMPLIED, which indicates an optional attribute. Besides CDATA, attribute types include ID and enumerated types, like so:

<!ATTLIST person sibling (brother | sister) #REQUIRED>


Enumerated attributes can take one of a list of values provided in the declaration.
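
As a sketch of how these attribute types and default declarations look together, the attribute list below declares four attributes for the figure element. Only caption is mentioned in the text; the attributes id, units, and align, and all of their values, are hypothetical.

```xml
<!-- Hypothetical attributes; only caption comes from the text -->
<!ATTLIST figure
    caption CDATA                   #REQUIRED
    id      ID                      #IMPLIED
    units   CDATA                   #FIXED "pixels"
    align   (left | center | right) "center">
<!-- caption: must be present                                      -->
<!-- id:      optional, and unique within the document if present  -->
<!-- units:   always has the value "pixels"                        -->
<!-- align:   one of three values, "center" when omitted           -->
```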

Entity Declarations

As stated above, entities are used as substitutes for reserved characters, but also to refer to often repeated or varying text and to include the content of external files. An entity is defined by its name and an associated value. An internal entity is the one for which the parsed content (replacement text) lies inside the document.
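
A minimal internal entity declaration might look like the sketch below. The entity name substitute matches the reference used in the next paragraph; the replacement text is a placeholder, since any text value could appear there.

```xml
<!-- Internal entity: the replacement text is given directly in the declaration -->
<!-- The value shown is a placeholder -->
<!ENTITY substitute "the full replacement text">
```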

Once the above example entity is defined, it can be used in the XML document as &substitute; wherever the full text should appear. Entities can contain markup as well as plain text. For example, this declaration defines &contact; as an abbreviation for a person’s contact information that may be repeated multiple times in one or more documents:

<!ENTITY contact '<a href="mailto:[email protected]">
e-mail</a><br/>
<a href="732-932-4636.tel">telephone</a>
'>


Conversely, the replacement text of an external entity resides in a file separate from the XML document. The content is located using either a system identifier, which is a URI, or a public identifier accompanied by a system identifier.
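
The two ways of locating an external entity can be sketched as follows. The entity names, the URI, the file name, and the public identifier are all hypothetical.

```xml
<!-- Located by a system identifier only (hypothetical URI) -->
<!ENTITY chapter1 SYSTEM "http://www.example.com/book/chapter1.xml">
<!-- Located by a public identifier, with a system identifier as fallback -->
<!ENTITY chapter2 PUBLIC "-//Example//TEXT Chapter 2//EN" "chapter2.xml">
```

Either entity would then be referenced in the document in the usual way, for instance as &chapter1;, and the processor would substitute the content of the named file.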

Notation Declarations

Notations are used to associate actions with entities. For example, a PDF file format can be associated with the Acrobat application program. A notation identifies, by name, the format of non-XML data, such as an unparsed entity, so that an application can decide how to handle it. Notation declarations provide the identifying name for the notation; that name is then used in entity or attribute list declarations and in attribute specifications. This is a complex and controversial feature of DTD and the interested reader should seek details elsewhere.
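
For orientation only, a notation declaration and its use might be sketched as below. Every name and identifier here (pdf, manual, attachment, file, and the identifier strings) is hypothetical, and how an application acts on a notation is outside the scope of the DTD itself.

```xml
<!-- Names the format; the system identifier is an arbitrary hint for applications -->
<!NOTATION pdf SYSTEM "application/pdf">
<!-- Unparsed external entity whose format is the pdf notation -->
<!ENTITY manual SYSTEM "manual.pdf" NDATA pdf>
<!-- An attribute of type ENTITY refers to the unparsed entity by name -->
<!ELEMENT attachment EMPTY>
<!ATTLIST attachment file ENTITY #REQUIRED>
```

A document could then contain <attachment file="manual"/>, leaving it to the application to decide what to do with data in the pdf notation.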

DTD in Use

A DTD can be embedded in the XML document whose syntax rules it describes; this is called an internal DTD. The alternative is to store the DTD in one or more separate files, called an external DTD. External DTDs are preferable since they can be reused in different XML documents by different users. The reader should by now be aware of the benefits of modular design, a key one being the ability to (re-)use modules that have been tested and fixed by previous users. However, this also means that if the reused DTD module is changed, all documents that use the DTD must be tested against the new DTD and possibly modified to conform to it. An XML document refers to an external DTD with a DOCTYPE declaration, which follows the XML declaration (<?xml ... ?>) and precedes the root element; it is conventionally placed in the second line of the document.
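
The difference between the two placements can be sketched with a deliberately small document. The element name note and the file name note.dtd are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Internal DTD: the declarations sit between the square brackets.
     The external form would instead name a file (hypothetical here):
     <!DOCTYPE note SYSTEM "note.dtd">  -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<note>This document carries its own DTD.</note>
```

In both forms, the name after DOCTYPE is the name of the document's root element.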

The following fragment of DTD code defines the production rules for constructing postal address elements.

<!ELEMENT address (street+, city, state, postal-code)>
<!ATTLIST address kind (return | delivery) #IMPLIED>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal-code (#PCDATA)>


Line 1 shows the element address definition, where all four sub-elements are required, and the street sub-element can appear more than once. Line 2 says that address has an optional attribute, kind, of the enumerated type.
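
An address element that would conform to these declarations might look like the following sketch; all values are made up for illustration.

```xml
<!-- Illustrative values only -->
<address kind="delivery">
  <street>100 Main Street</street>
  <street>Suite 4</street>
  <city>Springfield</city>
  <state>NJ</state>
  <postal-code>00000</postal-code>
</address>
```

Omitting the kind attribute would still be valid because it is declared #IMPLIED, whereas omitting city or placing it before street would not.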

We can (re-)use the postal address declaration as an external DTD, for example, in an XML document of a correspondence letter as shown below:

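<?xml version="1.0"?> <!-- Comment: Person DTD -->
<!DOCTYPE letter SYSTEM "address.dtd" [
<!ELEMENT letter (sender?, recipient+, body)>
<!ATTLIST letter language (en-US | en-UK | fr) #IMPLIED template (personal | business) #IMPLIED>
<!ELEMENT sender (name, address)> <!-- content model assumed -->
<!ELEMENT recipient (name, address)> <!-- content model assumed -->
<!ELEMENT name (#PCDATA)>
<!ELEMENT body ANY>
]>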
<letter language="en-US" template="personal">
<sender>
<name>Mr. Charles Morse</name>
<!-- continued as in Listing 3-1 above -->


In the above XML document, Lines 2 – 9 define the DTD for a correspondence letter document. The complete DTD is made up of two parts:

1. the external DTD subset, which in this case imports a single external DTD named address.dtd in Line 2; and
2. the internal DTD subset contained between the brackets in Lines 3 – 8.

The external DTD subset will be imported at the time the current document is parsed. The address element is used in Lines 5 and 6.

The content of the body of letter is specified using the keyword ANY (Line 8), which means that a body element can contain any content, including mixed content, nested elements, and even other body elements. Using ANY is appropriate initially when beginning to design the DTD and document structure to get quickly to a working version. However, it is a very poor practice to use ANY in finished DTD documents.

Limitations of DTDs

DTDs provided the first schema language for XML documents. Their limitations include:

• Language inconsistency since DTD uses a non-XML syntax
• Failure to support namespace integration
• Lack of modular vocabulary design
• Rigid content models (cannot derive new type definitions based on the old ones)
• Lack of integration with data-oriented applications

In contrast, XML Schema allows a much more expressive and precise specification of the content of XML documents. This flexibility comes at the price of complexity.