Working with Structured Text

An HTML document is like a text file, except that some of the characters are markup. Markup (tags) defines the structure of the document.

To identify information as HTML, each HTML document should start with the prologue:

<!doctype html public "-//W3O//DTD W3 HTML 2.0//EN">

NOTE:
If an existing HTML document does not contain a prologue, this prologue must be prepended.

HTML documents should also contain an <html> tag at the beginning of the file (after the prologue), and </html> at the end. Within those tags, an HTML document is organized as a head and a body, much like a memo or a mail message. Within the head, you can specify the title and other information about the document. Within the body, you can structure text into paragraphs and lists, highlight phrases, and create links. You do this using HTML elements.

This section describes the syntax of HTML elements and provides an example HTML document.

Understanding HTML Elements

In HTML documents, tags define the start and end of headings, paragraphs, lists, character highlighting and links. Most HTML elements are identified in a document as a start tag, which gives the element name and attributes, followed by the content, followed by the end tag. Start tags are delimited by < and >, and end tags are delimited by </ and >. For example:

<h1>This is a Heading</h1>

<p>This is a paragraph.</p>

Some elements appear as just a start tag. For example, to create a line break, you use <br>. Additionally, the end tags of some other elements (e.g. p, li, dt, dd) may be omitted.

NOTE:
Technically, the start and end tags for the html, head, and body elements are omissible. However, omitting them is not recommended, since the head/body structure allows an implementation to determine certain properties of a document (the title, for example) without parsing the entire document.

The content of an element is a sequence of characters and nested elements. Some elements, such as anchors, cannot be nested. Anchors and character highlighting may be put inside other constructs.

NOTE:
The SGML declaration for HTML specifies SHORTTAG YES, which means that there are some other valid syntaxes for tags, such as NET tags, <em/.../; empty start tags, <>; and empty end tags, </>. Until such time as support for these idioms is widely deployed, their use is strongly discouraged.

Names

The element name immediately follows the tag open delimiter. An element name consists of a letter followed by up to 72 letters, digits, periods, or hyphens. Names are not case sensitive. For example, H1 is equivalent to h1.
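The naming rule above can be checked mechanically. The following is a minimal sketch in Python, not part of this specification; the function names is_valid_name and same_element are hypothetical and chosen only for illustration.

```python
import re

# A letter, followed by up to 72 letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,72}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` fits the element-name syntax described above."""
    return NAME_RE.fullmatch(name) is not None


def same_element(a: str, b: str) -> bool:
    """Compare two element names without regard to case."""
    return a.lower() == b.lower()
```

For example, is_valid_name("h1") is true, is_valid_name("1h") is false because the name does not start with a letter, and same_element("H1", "h1") is true.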

Attributes

In a start tag, whitespace and attributes are allowed between the element name and the closing delimiter. An attribute typically consists of an attribute name, an equal sign, and a value (although some attributes may be just a value). Whitespace is allowed around the equal sign.

The value of the attribute may be either:

A string literal, delimited by single quotes or double quotes
A name token (a sequence of letters, digits, periods, or hyphens)

In this example, a is the element name, href is the attribute name, and http://host/dir/file.html is the attribute value:

<a href="http://host/dir/file.html">

Some implementations consider any occurrence of the > character to signal the end of a tag. For compatibility with such implementations, when > appears in an attribute value, you may want to represent it with an entity or numeric character reference, such as: <img src="eq1.ps" alt="a &gt; b">. To put quotes inside of quotes, you use the character representation &quot; as in: <img src="image.ps" alt="First &quot;real&quot; example">

The length of an attribute value (after replacing entity and numeric character references) is limited to 1024 characters.

NOTE:
Some implementations allow any character except space or > in a name token. Attribute values must be quoted only if they don't satisfy the syntax for a name token.
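The quoting rules above can be illustrated from the generator's side. The following is a sketch in Python under stated assumptions, not a definitive serializer: the function name format_attribute is hypothetical, the value is always wrapped in double quotes when quoting is needed, and & is also replaced (an assumption of this sketch) so that the references written out are read back unchanged.

```python
import re

# Name token: a sequence of letters, digits, periods, or hyphens.
NAME_TOKEN_RE = re.compile(r"[A-Za-z0-9.-]+")

# Limit on the attribute value after references have been replaced.
MAX_VALUE_LENGTH = 1024


def format_attribute(name: str, value: str) -> str:
    """Write one attribute, quoting the value only when it must be quoted."""
    if len(value) > MAX_VALUE_LENGTH:
        raise ValueError("attribute value exceeds 1024 characters")
    if NAME_TOKEN_RE.fullmatch(value):
        return f"{name}={value}"  # a name token needs no quotes
    escaped = (
        value.replace("&", "&amp;")   # assumption: keeps references unambiguous
        .replace('"', "&quot;")       # quotes inside a double-quoted literal
        .replace(">", "&gt;")         # some parsers end the tag at any >
    )
    return f'{name}="{escaped}"'
```

With this sketch, format_attribute("alt", "a > b") produces alt="a &gt; b", while a value such as compact is written without quotes.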

Attributes with a declared value of name (e.g. ismap, compact) may be written using a minimized syntax. The markup:

<ul compact="compact">

can be written as

<ul compact>

NOTE:
Some implementations understand only the minimized syntax.

Undefined tag and attribute names

It is an accepted networking principle to be conservative in that which one produces, and liberal in that which one accepts. HTML parsers should be liberal except when verifying code. HTML generators should generate strictly conforming HTML.

A WWW application that reads an HTML document and finds a tag or attribute name it does not understand should behave as though, in the case of a tag, the tag itself had not been there but its content had, or, in the case of an attribute, the attribute had not been present.
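The recovery behavior described above can be sketched with the HTMLParser class from the Python standard library. This is an illustration only: the KNOWN table and the names LenientFilter and drop_unknown_tags are hypothetical, and a real application would recognize far more elements. Comments and the prologue are simply dropped by this toy filter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical table: element names this toy application understands,
# each mapped to the attribute names it understands for that element.
KNOWN = {"p": set(), "h1": set(), "a": {"href"}}


class LenientFilter(HTMLParser):
    """Re-emit markup, dropping unknown tags (not their content) and unknown attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN:
            return  # unknown tag: ignore the tag itself, keep what follows
        kept = "".join(
            f' {name}="{escape(name if value is None else value)}"'
            for name, value in attrs
            if name in KNOWN[tag]  # unknown attribute: treat as absent
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in KNOWN:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # pass entity references through

    def handle_charref(self, name):
        self.out.append(f"&#{name};")  # pass numeric references through


def drop_unknown_tags(markup: str) -> str:
    parser = LenientFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Given the input <p>Some <blink>loud</blink> text</p>, this sketch keeps the word "loud" but discards the unrecognized blink tags around it.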

Special Characters

The characters between the tags represent text in the ISO-Latin-1 character set, which is a superset of ASCII. Because certain characters will be interpreted as markup, they should be represented by markup -- entity or numeric character references. See the Special Characters section of this specification for more information.
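As an illustration of the paragraph above, the following Python sketch converts plain text into element content by replacing the characters that would otherwise be read as markup. It relies on html.escape and html.unescape from the Python standard library; the names text_to_markup and numeric_reference are hypothetical.

```python
from html import escape, unescape


def text_to_markup(text: str) -> str:
    """Replace &, < and > with entity references so they are not read as markup."""
    return escape(text, quote=False)


def numeric_reference(ch: str) -> str:
    """Return the numeric character reference for a single character."""
    return f"&#{ord(ch)};"


# Round trip: a parser that replaces the references recovers the original text.
original = "if a < b & b > c"
assert unescape(text_to_markup(original)) == original
```

Here text_to_markup turns the sample string into "if a &lt; b &amp; b &gt; c", and numeric_reference("é") gives "&#233;", the position of that character in ISO-Latin-1.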

Comments

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of -- is ignored. Hence comments cannot be nested. Whitespace is allowed between the closing -- and >. (But not between the opening <! and --.)

For example:

<head>

<title>HTML Guide: Recommended Usage</title>

<!-- Id: Text.html,v 1.6 1994/04/25 17:33:48 connolly Exp -->

</head>

NOTE:
Some historical implementations incorrectly consider a > sign to terminate a comment.
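The comment rules can be approximated with a regular expression. The following Python sketch is a simplification, not a full SGML comment parser: it treats a comment as everything from <!-- up to the first -- that is followed by optional whitespace and >. The names COMMENT_RE and strip_comments are hypothetical.

```python
import re

# "<!--" with no whitespace inside it, then any text, then "--",
# optional whitespace, and ">". DOTALL lets a comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(markup: str) -> str:
    """Remove comments, leaving the rest of the markup untouched."""
    return COMMENT_RE.sub("", markup)
```

Under this sketch, a > sign inside a comment does not terminate it, so strip_comments("<p>a<!-- x > y -->b</p>") yields "<p>ab</p>", and text beginning with "<! --" is left alone because whitespace is not allowed inside the opening delimiter.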

Example HTML File

An example HTML file is:

<!doctype html public "-//W3O//DTD W3 HTML 2.0//EN">

<html><head>

<title>Structural Example</title>

</head>

<body>

<h1>First Header</h1>

<p>This is a paragraph in the example HTML file.

Keep in mind that the title does not appear in the 

document text, but that the header (defined by h1) does.

<ul>

<li>First item in an unordered list.

<li>Second item in an unordered list.

</ul>

<p>This is an additional paragraph. Technically, end tags

are not required for paragraphs, although they are allowed.

You can include character highlighting in a paragraph.

<i>This sentence of the paragraph is in italics.</i>

<img src ="triangle.gif" alt="Warning:"> Be sure to read

these instructions.

</body>

</html>
