Document Type Definition (DTD)
What is DTD?
Document Type Definition (DTD) is a schema language for XML inherited from SGML; it was the original way to describe XML documents, before XML Schema was developed. A DTD is one of the ways to define the structure of XML documents, i.e., the document’s metadata. Syntactically, a DTD is a sequence of declarations. There are four kinds of declarations in a DTD:
- element type declarations, used to define tags;
- attribute list declarations, used to define tag attributes;
- entity declarations, used to define entities; and,
- notation declarations, used to define data type notations.
Each declaration has the form of a markup representation, starting with a keyword followed by the production rule that specifies how the content is created:
<!keyword production-rule>
where the possible keywords are:
ELEMENT (for element type),
ATTLIST (for attribute list),
ENTITY (for entity), and
NOTATION (for notation).
Next, I describe these declarations.
Element Type Declarations
Element type declarations identify the names of elements and the nature of their content, thus putting a type constraint on the element. A typical element type declaration looks like this:
|Declaration type||Element name||Element’s content model (definition of allowed content: list of names of child elements)|
|<!ELEMENT||chapter||(title, paragraph+, figure?)>|
This declaration identifies the element named chapter. Its content model follows the element name. The content model defines what an element may contain. In this case, a chapter must contain a title followed by one or more paragraphs and may end with a figure. The commas between element names indicate that they must occur in succession, in the order listed. The plus after paragraph indicates that it may be repeated more than once but must occur at least once. The question mark after figure indicates that it is optional (it may be absent). A name with no punctuation, such as title, must occur exactly once. The following table summarizes the meaning of the symbol after an element:
|none||The element must occur exactly once|
|?||The element is optional (zero or one occurrence allowed)|
|*||The element is optional and repeatable (zero or more occurrences allowed)|
|+||The element must be included one or more times|
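To make the occurrence symbols concrete, the two fragments below sketch one instance that satisfies the chapter content model and one that does not. The element text is invented for illustration, and figure is assumed here to be declared as an empty element, which the text does not specify.

```xml
<!-- Valid: exactly one title, at least one paragraph, optional figure -->
<chapter>
  <title>Introduction</title>
  <paragraph>First paragraph.</paragraph>
  <paragraph>Second paragraph.</paragraph>
  <figure/>
</chapter>

<!-- Invalid: paragraph+ requires at least one paragraph after the title -->
<chapter>
  <title>Introduction</title>
</chapter>
```

The two fragments are separate documents shown side by side; a single XML document may have only one root element.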
Declarations for paragraph, title, figure, and all other elements used in any content model must also be present for an XML processor to check the validity of a document. In addition to element names, the special symbol #PCDATA is reserved to indicate character data. PCDATA stands for parsed character data.
Elements that contain only other elements are said to have element content. Elements that contain
both other elements and
#PCDATA are said to have mixed content. For example, the definition
for paragraphs might be:
<!ELEMENT paragraph (#PCDATA | quote)*>
The vertical bar indicates an “or” relationship, and the asterisk indicates that the group may occur zero or more times; therefore, by this definition, paragraphs may contain zero or more characters and quote tags, mixed in any order. All mixed content models must have this form: #PCDATA must come first, all of the elements must be separated by vertical bars, and the entire group must be optional and repeatable (followed by an asterisk).
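As a brief illustration of mixed content, the fragment below repeats the paragraph declaration and shows two instances that conform to it. The declaration of quote as a text-only element is an assumption made for this sketch.

```xml
<!-- In the DTD -->
<!ELEMENT paragraph (#PCDATA | quote)*>
<!ELEMENT quote (#PCDATA)>   <!-- assumed for this sketch -->

<!-- In the document: text and quote elements interleaved in any order -->
<paragraph>She replied, <quote>not today</quote>, and left.</paragraph>

<!-- Also valid: the group may occur zero times -->
<paragraph></paragraph>
```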
Two other content models are possible: EMPTY indicates that the element has no content (in the document it is written as an empty-element tag such as <br/>, or as a start-tag immediately followed by its end-tag), and ANY indicates that any content is allowed. The ANY content model is sometimes useful during document conversion, but should be avoided at almost any cost in a production environment because it disables all content checking in that element.
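A minimal sketch of both content models follows. The element names pagebreak and notes are hypothetical and chosen only for illustration.

```xml
<!-- In the DTD -->
<!ELEMENT pagebreak EMPTY>   <!-- no text and no child elements allowed -->
<!ELEMENT notes ANY>         <!-- any mix of text and declared elements -->

<!-- In the document -->
<pagebreak/>
<notes>Free text, a <pagebreak/>, or any other declared element.</notes>
```

Even under ANY, a validating processor still expects every child element that appears to be declared somewhere in the DTD; only the order and number of children go unchecked.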
Attribute List Declarations
Elements that have one or more attributes must be specified in the DTD using attribute list declarations. An example for a figure element could look like so:
<!ATTLIST figure caption CDATA #REQUIRED>
CDATA stands for character data, and #REQUIRED means that the caption attribute of figure has to be present. Another marker is #FIXED with a value, which means that the attribute acts like a constant. Yet another marker is #IMPLIED, which indicates an optional attribute. Besides CDATA, other attribute types include ID and enumerated types, like so:
<!ATTLIST person sibling (brother | sister) #REQUIRED>
Enumerated attributes can take one of a list of values provided in the declaration.
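The markers and types described above can be combined in a single attribute list. The sketch below extends the person example; the attributes id, nickname, and species are hypothetical and added only to show each marker once.

```xml
<!-- In the DTD -->
<!ATTLIST person
    id       ID                 #REQUIRED
    sibling  (brother | sister) #REQUIRED
    nickname CDATA              #IMPLIED
    species  CDATA              #FIXED "human">

<!-- In the document: nickname is omitted, species defaults to "human" -->
<person id="p1" sibling="sister"/>
```

An ID value must be a valid XML name (so it cannot start with a digit) and must be unique within the document.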
Entity Declarations
As stated above, entities are used as substitutes for reserved characters, but also to refer to often repeated or varying text and to include the content of external files. An entity is defined by its name and an associated value. An internal entity is one whose replacement text lies inside the document.
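An internal entity declaration has the general shape sketched below. The name substitute matches the reference used in the text; the replacement text is a placeholder assumed for illustration.

```xml
<!-- In the DTD: name followed by the quoted replacement text -->
<!ENTITY substitute "the full text that this entity stands for">
```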
Once an entity such as substitute is defined, it can be used in the XML document as &substitute; anywhere the full text should appear. Entities can contain markup as well as plain text. For example, this declaration defines &contact; as an abbreviation for a person’s contact information that may be repeated multiple times in one or more documents:
<!ENTITY contact '<a href="mailto:[email protected]"> e-mail</a><br/>
  <a href="732-932-4636.tel">telephone</a>
  <address>13 Takeoff Lane<br/>
  Talkeetna, AK 99676</address>
'>
Conversely, the replacement text of an external entity resides in a file separate from the XML document. The content can be accessed using either a system identifier, which is a URI, or a public identifier, which is a name that the processor may map to a URI.
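The two forms of external entity declaration can be sketched as follows. The entity names, file names, public identifier, and URL are all hypothetical.

```xml
<!-- System identifier only: a URI, here relative to the declaring document -->
<!ENTITY chapter1 SYSTEM "chapter1.xml">

<!-- Public identifier followed by a system identifier as a fallback -->
<!ENTITY chapter2 PUBLIC "-//Example//TEXT Chapter 2//EN"
                         "http://www.example.com/chapter2.xml">
```

Either entity would then be referenced in the document body as &chapter1; or &chapter2;, just like an internal entity.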
Notation Declarations
Notations are used to associate actions with entities. For example, the PDF file format can be associated with the Acrobat application program. Notation declarations provide an identifying name for the format; the name is then used in entity or attribute list declarations and in attribute specifications. This is a complex and controversial feature of DTDs, and the interested reader should seek details elsewhere.
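For orientation only, a minimal sketch of how a notation ties a non-XML file to a named format is given below. The names pdf, manual, and attachment, as well as the "acroread" system identifier, are assumptions made for this example.

```xml
<!-- In the DTD -->
<!NOTATION pdf SYSTEM "acroread">                 <!-- names the format -->
<!ENTITY manual SYSTEM "manual.pdf" NDATA pdf>    <!-- unparsed entity in that format -->
<!ELEMENT attachment EMPTY>
<!ATTLIST attachment file ENTITY #REQUIRED>       <!-- attribute that names an entity -->

<!-- In the document -->
<attachment file="manual"/>
```

The XML processor does not open manual.pdf itself; it only reports the entity and its notation to the application, which decides what to do with them.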
DTD in Use
A DTD can be embedded in the XML document whose syntax rules it describes; this is called an internal DTD. The alternative is to have the DTD stored in one or more separate files, called an external DTD. External DTDs are preferable since they can be reused in different XML documents by different users. The reader should by now be aware of the benefits of modular design, a key one being the ability to (re-)use modules that have been tested and fixed by previous users. However, this also means that if the reused DTD module is changed, all documents that use the DTD must be tested against the new DTD and possibly modified to conform to it.
In an XML document, an external DTD is referred to with a DOCTYPE declaration placed after the XML declaration (<?xml ... ?>), typically on the second line of the document, and before the root element.
The following fragment of DTD code defines the production rules for constructing postal address elements.
<!ELEMENT address (street+, city, state, postal-code)>
<!ATTLIST address kind (return | delivery) #IMPLIED>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal-code (#PCDATA)>
Listing 3-13: Example DTD for a postal address element. File name: address.dtd
Line 1 shows the definition of the element address, where all four sub-elements are required and the street sub-element can appear more than once. Line 2 says that address has an optional attribute kind, of the enumerated type.
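As a sketch of how a standalone document might point at this file, the example below uses a DOCTYPE declaration with a relative system identifier. It assumes address.dtd is stored next to the document; the address data is taken from the contact entity example and split into fields for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE address SYSTEM "address.dtd">
<address kind="delivery">
  <street>13 Takeoff Lane</street>
  <city>Talkeetna</city>
  <state>AK</state>
  <postal-code>99676</postal-code>
</address>
```

The name after DOCTYPE must match the root element of the document, here address.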
We can (re-)use the postal address declaration as an external DTD, for example, in an XML document of a correspondence letter as shown below:
<?xml version="1.0"?> <!-- Comment: Person DTD -->
<!DOCTYPE letter SYSTEM "http://any.website.net/address.dtd" [
<!ELEMENT letter (sender?, recipient+, body)>
<!ATTLIST letter language (en-US | en-UK | fr) #IMPLIED template (personal | business) #IMPLIED>
<!ELEMENT sender (name, address)>
<!ELEMENT recipient (name, address)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT body ANY>
]>
<letter language="en-US" template="personal">
<sender>
<name>Mr. Charles Morse</name>
<address kind="return">
<!-- continued as in Listing 3-1 above -->
In the above XML document, Lines 2 – 9 define the DTD for a correspondence letter document. The complete DTD is made up of two parts:
- the external DTD subset, which in this case imports a single external DTD named address.dtd in Line 2; and
- the internal DTD subset contained between the brackets in Lines 3 – 8.
The external DTD subset will be imported at the
time the current document is parsed. The
address element is used in Lines 5 and 6.
The content of the body of letter is specified using the keyword
ANY (Line 8), which means that a
body element can contain any content, including mixed content, nested elements, and even other
body elements. Using
ANY is appropriate initially when beginning to design the DTD and
document structure to get quickly to a working version. However, it is a very poor practice to use
ANY in finished DTD documents.
Limitations of DTDs
DTDs provided the first schema language for XML documents. Their limitations include:
- Language inconsistency since DTD uses a non-XML syntax
- Failure to support namespace integration
- Lack of modular vocabulary design
- Rigid content models (cannot derive new type definitions based on the old ones)
- Lack of integration with data-oriented applications
By contrast, XML Schema allows much more expressive and precise specification of the content of XML documents. This flexibility also carries the price of complexity.