XML Architecture, Tools, Techniques

Axel-Tobias Schreiner

Department of Computer Science

Rochester Institute of Technology

This volume contains copies of the overhead slides used in class. This information is available

online as part of the World Wide Web; it contains hypertext references to itself and to parts of

the system documentation. The example programs are included in this text from the original

sources.

So that it may be viewed on other platforms, this text also exists as a PDF document. With the

Acrobat Reader from Adobe the text can be printed on Windows systems.

The text is not a complete transcript of the lectures. For self study one would have to consult

books on Java, on HTML and the World Wide Web, and some original papers.

Contents

0 Introduction
1 First Steps
2 Hypertext Markup Language
3 Cascading Style Sheets
4 Compiler Construction
5 Extensible Markup Language
6 Sources
7 Simple API for XML
8 Document Object Model
9 XSL Transformations
A Glossary


Literature

These slides are developed in the Classic Environment of MacOS X using Adobe

Framemaker, PhotoShop, and Distiller. OmniGraffle is used for the drawings. The slides

are available on the Web.

There are very many books about Java, HTML, the World Wide Web, XML and the

related technologies. I found the following books to be more or less useful. Especially

the XML and XSL materials get out of date very quickly:

Bradley 0-201-67487-4 The XSL companion

Behme, Mintert 3-8273-1636-7 XML in der Praxis

Flanagan 0-596-00283-1 Java in a Nutshell (4th Edition)

Flanagan et al. 1-56592-483-5 Java Enterprise in a Nutshell

Flanagan 1-56592-371-5 Java Examples in a Nutshell

Flanagan 1-56592-488-6 Java Foundation Classes in a Nutshell

Harold, Means 0-596-00058-8 XML in a Nutshell

Kay 1-861003-12-9 XSLT Programmer’s Reference

McLaughlin 0-596-00197-5 Java and XML (2nd Edition)

Niederst 1-56592-579-3 HTML Pocket Reference

Quin 0-471-37522-5 Open Source XML Database Toolkit

1 First Steps

As a buzzword XML encompasses a host of technologies. One central subject is the

representation of information in the World Wide Web.

This chapter discusses several solutions to the fairly trivial problem of displaying a text in the

center of a browser window. This will involve some XML technologies.

The intent is only to glance at some ideas and sketch a few problems; details about the

technologies follow in the subsequent chapters.


1.1 HTML

The Hypertext Markup Language (HTML) describes information to be displayed by a web

browser. A document contains information (text), references to information (net addresses of

pictures etc.), as well as markup, i.e., more or less elaborate instructions how the information

is to be presented:





[Listing hello/1.html: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!" (twice).]

To center a text one uses a table which is stretched over the entire window. The table

contains a row (tr) and a cell (td) with the centered text. There are suggestions for the type

face, color, and size (font); the text is required to be bold (b). HTML is usually specified using

ASCII; therefore, special characters must be escaped (&uuml; for ü).

The document has a frame (html), the descriptive part (head) can contain a title which

might be rendered as a window title. body marks the information part and can request a

background color.

Markup consists of elements which are usually opened (<name>) and closed (</name>). The opening tag

can contain attributes in arbitrary order as key and value pairs; the values should be enclosed

in single or double quotes and should not contain the respective quote or < >. Elements may

be nested and they can contain text. Case is not (yet) relevant everywhere. White space

separates information but is usually not significant.

HTML is an application of the Standard Generalized Markup Language (SGML); by way of the

document type (


Even this trivial example shows a weakness: Internet Explorer 5.5 and above centers the text

independent of the document type, Netscape 6 centers for version 3.2 but not 4.0, OmniWeb 4

always displays a scrollbar.

1.2 HTML with CSS

HTML was designed to mark up information logically and to leave rendering to the browser.

The example shows that meanwhile form and content get intermixed thoroughly. It is very

difficult to achieve a uniform layout.





[Listing hello/2.html: the markup was lost in extraction. The surviving text and style rules follow.]

Hello, Osnabrück!

body { background: khaki }
table { height: 100%; width: 100% }
td { text-align: center;
     font: bold xx-large serif;
     color: blue
   }

Hello, Osnabrück!

It might be a good idea to use Cascading Style Sheets (CSS). A style element in the head part

of an HTML document can be used to set rendering properties for almost all elements.

Common properties for elements can be bundled and some properties can be inherited

among nested elements. With a suitable structure the style of an entire document can be

controlled in one central place.

CSS properties use different names and a completely different syntax from HTML attributes;

comments are enclosed in /* */. Some aspects of HTML rendering can only be controlled

using CSS.


Unfortunately, not all browsers support the same level of CSS and one cannot assume that the

supported features are interpreted in a similar way. Still, if CSS is used conservatively, more

maintenance friendly HTML documents will result.

1.3 Alternative without tables

div is an HTML element which hardly influences rendering. However, CSS can be used to

prescribe very individual properties for div elements, effectively creating new HTML elements:





[Listing hello/3.html: the markup was lost in extraction. The surviving text and style rules follow.]

Hello, Osnabrück!

body { background: khaki }
#top { height: 45% }
div.center { text-align: center;
             font: bold oblique xx-large serif;
             color: blue;
           }

Hello, Osnabrück!

An id attribute must have a value that is unique throughout the document; therefore, it can be

uniquely associated with CSS properties. A class attribute can have the same value in many

elements; this can be used to group like elements and associate them with a common set of

CSS properties.

The first, empty div element arranges for (almost) the top half of the window to be empty. The

second element centers the text below this area. There is no arithmetic in CSS; therefore, the

text cannot be exactly vertically centered.

CSS alone is not sufficient to separate form and content of an HTML document: the head part

of a document contains some content relevant information (title). Also, elements have to be

inserted (table or div) which do not contribute content but which are required to achieve the

desired form.

Finally, the document uses information twice (the same text as content and title) which has to

be specified twice; this is harder to maintain.


1.4 XML with CSS

The actual information can be specified much more easily:





[Listing hello/5.xml: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!"]

Like HTML, the Extensible Markup Language (XML) is based on SGML. Markup, elements,

and attributes follow very similar rules; however, upper and lower case are distinguished.

Attribute values must always be enclosed in quotes and elements must always be closed.

A document starts with an XML processing instruction (PI). It specifies the XML version

(currently only 1.0) and the character set which is used for the document. In the example

above it is UTF-8, a representation of Unicode which is identical to ASCII in the defined part

and which contains lots of national characters.
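The relationship between UTF-8 and ASCII can be checked with a few lines of Java. This is only a minimal sketch; the class Utf8Demo and its methods are made up for illustration and are not part of the example programs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original sources.
class Utf8Demo {
    // Number of bytes that UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 and US-ASCII encodings of the text are byte-identical.
    static boolean sameAsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(sameAsAscii("Hello, "));         // pure ASCII text
        System.out.println(sameAsAscii("Osnabr\u00fcck"));  // contains u-umlaut
        System.out.println(utf8Length("\u00fc"));           // bytes needed for u-umlaut
    }
}
```

Text that only uses ASCII characters is encoded identically; a character such as ü takes more than one byte in UTF-8.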

XML is extensible because one can invent arbitrary elements, attributes, and attribute values.

Elements can be (completely) nested. A well-formed document must only obey the formatting

rules.
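Whether a text obeys the formatting rules can be tested with any XML parser. The following sketch uses the JAXP DocumentBuilder without validation; the class name WellFormed and the sample strings are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original sources.
class WellFormed {
    // True if the text can be parsed as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and prints nothing.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // the parser reported a violation of the rules
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<hello>Hello</hello>"));
        System.out.println(isWellFormed("<Hello>Hello</hello>")); // case differs
        System.out.println(isWellFormed("<a><b></a></b>"));       // not nested completely
    }
}
```

No DTD is involved here: the parser only checks that elements are closed, nested completely, and spelled consistently.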

An XML based document can use


hello/5.dtd

An external DTD should also start with an xml PI to specify the character set. The rest of the

DTD consists of definitions with an XML-like syntax. There are SGML comments but a DTD is

not an XML based document; this is one of the reasons why XML Schema was made.

ELEMENT defines an XML element and specifies its content. #PCDATA (parsed character data)

is text without further elements.

Without any ATTLIST definition the element cannot use attributes.

An ENTITY defines text replacement. XML predefines &amp; &gt; &lt; &quot; and &apos;

for & > < " and '.


CSS tries to describe all properties related to representation and to tie these properties to

elements in an SGML based language. It sounds likely that CSS can be used to represent XML

directly in a web browser. It still takes artificial elements to center the text, however:






[Listing hello/4.xml: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!" (twice).]

The solution is first based on the Alternative without tables. This takes an element for the free

space above the text. vspace has no content. Empty elements can be abbreviated by

specifying / at the end of the opening tag and omitting the closing tag.

The content of an XML based document must be a single element. Here it is called document.

It encloses vspace and hello.

At least in theory CSS can specify that elements are parts of lists or tables. For a table, one

element (like box) has to have the display property value table, another element

(horizontal) is given table-row, and another (cell) is declared to be a table-cell. The

element hello is reused to centralize management of the text properties.


Given all these elements there is enough form information so that CSS can be specified. The

XML based document references CSS with an xml-stylesheet PI with the attributes href for

the address (as a URI) and type="text/css".

/* CSS2 for 4.xml
   display: table does not work in IE 5.5 and 6 and Netscape 6
   ats 2001-03
*/
* { background: khaki }
vspace { height: 45% }
hello { width: 100%;
        text-align: center;
        font: xx-large monospace;
        color: blue;
      }
box { display: table; height: 45% }
horizontal { display: table-row }
cell { display: table-cell }

hello/4.css

The background can be given a common color by specifying the property background with *

as a selector for all elements.

By default an element has the display property value block and blocks are displayed one

below the other. inline specifies that elements appear next to one another.

Internet Explorer implements the XML/CSS solution without tables correctly. Netscape 6 simply

shows the text at the top left, i.e., it ignores unknown elements but not their text content, and it

accepts the PI referencing the CSS file.

Internet Explorer 5.5 is rather selective. The Alternative without tables works but the original

solution with tables does not; apparently not all values of the display property have been

implemented so far. It is unfortunately intentional that unknown CSS specifications get silently

ignored.


1.5 XML with XSLT and CSS

CSS defines the properties of an SGML based language but it cannot influence the elements

themselves and it cannot (at present) calculate with strings or numbers.

The Extensible Stylesheet Language (XSL) is itself an XML based language and it is

supposed to far exceed CSS. XSL-FO defines Formatting Objects with very many properties

for rendering. Nested elements describe an object hierarchy. An XSL-FO document consists of

layout and content elements which are rendered together by a suitable processor. XSL-FO has

finally been defined as a W3C recommendation and there are some initial implementations.

Even with XSL-FO there remains the problem how to get from pure content, i.e., XML elements,

to data that can be displayed such as HTML or XSL-FO elements.

To solve that problem XSL defines a programming language for tree transformations as an

XML based extensible language: XSLT has been defined some time ago and there are several

implementations. As an example, the XML based document





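[Listing hello/5.xml: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!"]

can be transformed into this HTML document

[Listing hello/5.html: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!" (twice).]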


with the following XSLT program:























[Listing hello/5.xsl: lost in extraction.]

An XSLT document consists of an xsl:stylesheet element with the exact attributes shown

above. The xsl:output element modifies the output and avoids certain XML constructs that

cannot be used in HTML documents.

An XSLT program tries to apply xsl:template elements to a document tree which is often

presented as an XML based document.

match="/" recognizes a root node, which is directly outside the entire document and in

particular outside the root element of the document. Therefore, this xsl:template acts as a

main program.

An xsl:template contains xsl elements which are executed and other elements which are

copied into the result tree. The program above constructs a tree consisting of HTML elements

and in two places uses xsl:value-of to copy the content of the hello element from the

document tree to the resulting HTML tree.
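The mechanism can be tried from Java with the JAXP transformation classes. The following is only a sketch: the stylesheet is a guess at a minimal program of this kind and not the text of hello/5.xsl, the input document is assumed to consist of a single hello element, and the class name HelloTransform is made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the original sources.
class HelloTransform {
    // Assumed input: one hello element with text content.
    static final String XML = "<hello>Hello, Osnabr\u00fcck!</hello>";

    // Assumed stylesheet: one template for the root node which builds
    // HTML elements and copies the text of hello in two places.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/'>"
        + "<html><head><title><xsl:value-of select='hello'/></title></head>"
        + "<body><p><xsl:value-of select='hello'/></p></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the XSLT program to the document and returns the result text.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSL));
    }
}
```

The elements html, head, title, body, and p are copied into the result tree; the two xsl:value-of elements are executed. Details of the serialized output (for example an added meta element) depend on the XSLT processor.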

The values of the match and select attributes are so-called XPath expressions which

primarily select parts of the document tree. Unfortunately, from the XML point of view this tree

consists of the elements; from the XPath point of view there are other nodes, e.g., / for the root

node and separate nodes for the attributes and for the text in an element.
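XPath expressions can also be evaluated directly from Java with the javax.xml.xpath package. The sketch below assumes a document with a single hello element; the lang attribute and the class name XPathDemo are invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original sources.
class XPathDemo {
    // Evaluates an XPath expression on the document and returns the string value.
    static String evaluate(String expression, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression,
                                  new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed document; the attribute is only there to show attribute nodes.
        String xml = "<hello lang='de'>Hello, Osnabr\u00fcck!</hello>";
        System.out.println(evaluate("/hello", xml));        // text of the element
        System.out.println(evaluate("/hello/@lang", xml));  // attribute node
        System.out.println(evaluate("name(/*)", xml));      // root element has a name
        System.out.println(evaluate("name(/)", xml));       // root node has none
    }
}
```

The last two expressions show the difference between the root node / and the root element of the document.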


There is no DTD for XSLT programs because the program can contain arbitrary elements in

addition to the xsl elements. It is quite hard (if not impossible) to write a DTD that copes with

namespaces such as xsl:. An XSLT program is a well-formed XML document and only the

XSLT processor decides if the xsl elements are nested properly.

Execution

Internet Explorer applies an XSLT program to an XML based document if there is a PI such

as

<?xml-stylesheet type="text/xsl" href="5.xsl"?>

and if a newer version of MSXML3 has been installed and if XmlInst was used to register it

properly in the Windows Registry; newer versions of Internet Explorer do this automatically.

The conversion happens in the browser; errors are reported in the browser window.

The type value in the PI is not a standard. This solution does not make too much sense

anyhow because rumor has it that Microsoft does not yet support the final version of XSLT.

Current XSLT implementations such as Saxon or Xalan and XML parsers such as Crimson,

Xerces, or Ælfred (which is part of Saxon) can be used from the command line to perform the

transformation:

$ java -classpath saxon.jar com.icl.saxon.StyleSheet 5.xml 5.xsl > 5.html

Other conversions can be programmed. The following XSLT program only extracts the text and

recodes it from UTF-8 into ISO-8859-1:

[Listing lost in extraction.]










$ java -classpath saxon.jar com.icl.saxon.StyleSheet 5.xml 6.xsl | od -c

0000000 H e l l o , O s n a b r 374 c k

0000020 ! \n


hello/6.xsl

Dealing with white space is hard: white space in the context of text gets normalized, white

space between elements is ignored. Here the element xsl:text contains a line separator

which is thus preserved for output. If the XML based document was not designed carefully the

output of white space cannot be controlled exactly using XSLT.
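The byte 374 in the od output is the octal code of ü in ISO-8859-1. The recoding itself can be reproduced in Java without XSLT; this is a sketch with a made-up class name Recode, not the method used by the XSLT processor.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original sources.
class Recode {
    // Decodes UTF-8 bytes and encodes the same text as ISO-8859-1.
    static byte[] utf8ToLatin1(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = "Hello, Osnabr\u00fcck!\n".getBytes(StandardCharsets.UTF_8);
        byte[] out = utf8ToLatin1(in);
        // UTF-8 needs two bytes for the u-umlaut, ISO-8859-1 only one.
        System.out.println(in.length + " -> " + out.length);
        // The u-umlaut is the 14th character, i.e., index 13.
        System.out.println(Integer.toOctalString(out[13] & 0xff));
    }
}
```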


Saxon or Xalan do not validate an XML based document. This would have to be done

separately:

$ java -classpath ..:xerces.jar -Derror=sax.Errors \

-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \

-Dtrue='http://xml.org/sax/features/validation

http://xml.org/sax/features/namespaces' sax.Main 5.xml

sax.Main contains a main program to call certain XML parsers, sax.Errors is a class for

error reporting. Both are discussed later.



1.6 Server-side XML with XSLT

Obviously one can try to convert XML to HTML using XSLT in a web server only as needed. A

browser usually identifies itself to the server; it even seems feasible to convert differently for

different browsers.

tcp.Httpd is a very rudimentary web server which was implemented in Java for a course on

distributed systems and which relies on a server framework tcp.Server.

XML parsers and XSLT transformers are implemented as Java classes; therefore, they can

easily be integrated into such a web server:

protected void getFile (File file) throws IOException {
  String path = file.getCanonicalPath();
  if (!path.startsWith(root()))
    throw new FileNotFoundException(file.getName());
  if (!path.endsWith(".xml")) {
    super.getFile(file); return;
  }

hello/XHttpd.java

Basically the method getFile() answers a request for a file and has to be overridden. If the

filename does not end with .xml the original server takes care of answering.
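The first test in getFile() keeps requests inside the document root. If root() does not end with a path separator, a plain string prefix test also accepts sibling directories whose names merely begin with the root name. The following sketch shows a component-wise test with java.nio.file.Path; the class RootCheck and the sample paths are assumptions, and unlike getCanonicalPath() this variant does not resolve symbolic links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original XHttpd.
class RootCheck {
    // True if candidate, resolved against root and normalized, lies inside root.
    static boolean isInside(String root, String candidate) {
        Path rootPath = Paths.get(root).toAbsolutePath().normalize();
        Path path = rootPath.resolve(candidate).normalize();
        // Path.startsWith compares whole name elements, not characters.
        return path.startsWith(rootPath);
    }

    public static void main(String[] args) {
        System.out.println(isInside("/srv/www", "hello/5.xml"));
        System.out.println(isInside("/srv/www", "../etc/passwd"));
        System.out.println(isInside("/srv/www", "/srv/wwwx/5.xml"));
    }
}
```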

hello/XHttpd.java

  Exception ex = null;
  try {
    final String xsl[] = { null };
    SAXParserFactory.newInstance()
      .newSAXParser().parse(file, new DefaultHandler() {
        public void processingInstruction (String target, String data) {
          if (xsl[0] == null && target.equals("xml-stylesheet")
              && data.indexOf("text/xsl") != -1) { // need not be with type=
            int a = data.indexOf("href="); // could be inside "" or ''
            if (a == -1 || a+6 >= data.length()) return;
            int b = data.indexOf(data.charAt(a+5), a+6);
            if (b == -1 || b == a+6) return;
            xsl[0] = data.substring(a+6, b);
          }
        }
      });
    if (xsl[0] == null) {
      super.getFile(file); return;
    }

Otherwise a new XML parser is created, the XML file is sent there, and each PI is checked by a

callback whether it is for xml-stylesheet and references text/xsl. If so, the href attribute

is extracted, more or less by brute force. If any of this does not work out, the original server

takes over.
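The brute-force extraction of href can also be written with a regular expression. This is an alternative sketch and not the code of XHttpd; the class name PseudoAttribute is made up. Like the original it treats the PI data as plain text, because the "attributes" of a PI are not parsed by the XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original XHttpd.
class PseudoAttribute {
    // href, optional white space around =, then a value in "" or ''.
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the href value from the PI data, or null if absent or empty.
    static String href(String data) {
        Matcher m = HREF.matcher(data);
        if (m.find() && !m.group(2).isEmpty())
            return m.group(2);
        return null;
    }

    public static void main(String[] args) {
        System.out.println(href("type=\"text/xsl\" href=\"5.xsl\""));
        System.out.println(href("href='a.xsl' type='text/xsl'"));
        System.out.println(href("type=\"text/css\""));
    }
}
```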


}

    URL url = file.toURL();
    String xml = url.toString(); // file:/path
    Transformer transformer =
      TransformerFactory.newInstance().newTransformer(
        new StreamSource(new URL(url, xsl[0]).openStream()));
    if (argc > 2) {
      out.println("HTTP/1.0 200 ok");
      out.println("Content-Type: text/html\n");
      out.flush();
    }
    transformer.transform(new StreamSource(xml), new StreamResult(out));
  } catch (ParserConfigurationException e) { ex = e;
  } catch (SAXException e) { ex = e;
  } catch (TransformerConfigurationException e) { ex = e;
  } catch (TransformerException e) { ex = e;
  }
  if (ex != null) throw new IOException(ex.toString()); // yuck...

hello/XHttpd.java

If there is an href attribute, it is converted into a URL in the context of the name of the XML file

and a network connection is set up. A new XSLT transformer is created and the XSLT program

is loaded from the network connection. Finally, the XML file is transformed and the result is

delivered by way of the global OutputStream out to the caller of getFile().

Of course there can be a lot of errors which, however, getFile() may only report as an

IOException.

The solution is very rudimentary. For production purposes one would use Java servlets in a

suitable web server or a web publishing framework such as Cocoon. This section was only

meant to show how little effort it takes to use the XML/XSLT technology in one’s own Java

programs. The details of the Java API for XML Processing (JAXP) will be discussed later.

If this server is compiled and started

$ export CLASSPATH=.:httpd.jar:xerces.jar:saxon.jar:unsealed-jaxp.jar

$ javac XHttpd.java

$ java -Dhttp.server.port=8080 \

-Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl \

-Djavax.xml.transform.TransformerFactory=com.icl.saxon.TransformerFactoryImpl \

XHttpd `pwd`

then even Netscape 6 using http://localhost:8080/5.xml will see the text in the middle of

a browser window because Netscape only loads an HTML document.

Various transformers and parsers may be combined at startup. However, the jar files for JAXP

and Crimson contain Sealed: true in their manifests. This causes grief with some

combinations because classes from the same package are found in different jar files. The

problem can be circumvented by extracting the sealed archives with jar and repacking them

without the manifest.


1.7 Summary

A reasonably portable representation of information in a web browser can only be

accomplished using HTML.

One should really use CSS to separate form and content of the information, at least to some

extent, and in order to maintain the form in a central place.

If it is not guaranteed that all clients use a current version of Internet Explorer complete with a

properly installed XML/XSLT system (and this is hardly true even in a small intranet), one

should convert XML in the server and should not send it to the client. (MSXML3 is considered

to be somewhat incompatible with the future development of XSLT.)

XML has the obvious advantage that one can use modern character set encodings and reuse

information and present it differently in different places.

It is debatable if XSLT (at least by its appearance as an XML based document) is a usable

language, but manual transformations are significantly more labor intensive. XSLT is for XML

what awk is for lines of text: one can select parts of a document and create a new document

by way of a program.

2 Hypertext Markup Language

XML can be quickly displayed in a browser if it is transformed into HTML. This chapter is a

quick summary of the relevant part of HTML.

2.1 Terminology

Hypertext is a document with marks that reference other documents. A browser usually

implements a more or less elegant navigation with bookmarks and backtracking between the

connected documents. Typesetting and help systems often use hypertext.

The Hypertext Markup Language HTML (RFC 1866, outdated), under development by the

World-Wide-Web Consortium, is (among other things) a protocol that is shared between a

web server and a web browser. In principle HTML does not depend on the Hypertext Transfer

Protocol HTTP (RFC 2616).

Between server and browser HTML describes the mostly logical markup of a document with

sections, tables, typography, embedded multimedia objects, links, and forms. HTML is used to

represent multimedia hypertext.

A link is specified with a Uniform Resource Identifier (RFC 2396), which defines the transport

protocol and address but not necessarily the type of a resource. A link can be attached

to a piece of text or a picture and, depending on the resource, can reference a position in

the resource.

A form is specified in HTML and is used to get input in a browser. Depending on the form the

input is encoded and sent from the browser to the server. From that perspective HTML is a

protocol between browser and server. Depending on the transport protocol, the server will

answer, usually using HTML.

The Extensible Markup Language XML, under development by the World-Wide-Web

Consortium, can be viewed as a generalization of HTML to represent and logically structure

arbitrary data, often under control of a grammar which is specified with a Document Type

Definition DTD or using XML Schema. HTML is almost an XML application. Tools like an

implementation of the Extensible Stylesheet Language XSL are used to transform XML based

documents to render them on a screen or printer. Other tools such as XPath allow references

between XML based documents.

The following rudimentary description is based on HTML 4.0.


2.2 Structure

An HTML document has the following overall structure:





[Listing: the markup was lost in extraction. Surviving text: "Document Title".]







Markup uses elements consisting of tags, which contain a name and optional attributes. Case

is mostly insignificant, white space is mostly ignored, attribute values must be enclosed in

quotes or use only letters, digits, periods, and hyphens. In place of body there can be a

frameset:

[Listing lost in extraction.]











This results in three named frames which can be used independently as target. The

drawback is that bookmarks do not work well.

HTML is used to define structured documents, i.e., many tags (<name>) require end tags (</name>); the resulting elements cannot be nested in arbitrary ways (and usually not in a

recursive fashion).

The primary goal was to separate form and content and to specify very little about the form in

the document itself. The content is supposed to be logically structured:

<address> ... </address>        encloses address, can contain a, p
<blockquote> ... </blockquote>  encloses quotation, can contain a, p
<br>                            end of line without vertical white space, not in pre
<div> ... </div>                encloses some part (for style sheet)
<dl> ... </dl>                  definition list, contains pairs of dt and dd;
                                lists may be nested
<h1> ... </h1> ...              enclose headlines with more and more detail,
<h6> ... </h6>                  can contain font changes (illogical)
<hr>                            horizontal line
<ol> ... </ol>                  numbered list, contains li
<p>                             paragraph: end of line with vertical white space
<pre> ... </pre>                preformatted text in non-proportional font, can contain
                                a, p, and font changes
<table> ... </table>            table, see below
<ul> ... </ul>                  unnumbered list, contains li

2.3 Characters

HTML should be specified using ASCII; however, links can specify the encoding of their targets.

There are a number of very important escape sequences (entities) which all start with & and end

with ;

&amp;    &
&gt;     >
&lt;     <
&quot;   "
&szlig;  ß
&auml;   ä and similarly all other German umlauts in lower and upper case

Similar to uml there are acute, circ, grave, ring, tilde, and cedil, which can only be

applied to the usual characters, as well as the Icelandic upper and lower case eth and thorn

and many other characters, but there are, e.g., no Greek letters.

Using &#ddd; characters can be represented as decimal bytes.
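Numeric references make it possible to keep a document in pure ASCII. The following sketch replaces every non-ASCII character by its decimal reference; the class name NumericEntities is made up, and the markup characters & < > are deliberately left alone.

```java
// Hypothetical helper, not part of the original sources.
class NumericEntities {
    // Replaces each character outside ASCII by a decimal reference &#ddd;
    static String escape(String text) {
        StringBuilder result = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128)
                result.append((char) cp);                      // ASCII stays as is
            else
                result.append("&#").append(cp).append(';');    // decimal code
        });
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Hello, Osnabr\u00fcck!"));
    }
}
```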

2.4 Tables

Tables are usually abused to arrange for a two-dimensional layout. The structure is simple:

[Listing lost in extraction.]






There are lots of attributes, e.g., colspan, rowspan, align, and valign, to combine cells and

to control arrangement within a cell.

Just as for br, li, and p the end tags of rows and cells are usually omitted.

2.5 Embedded Objects

Depending on the browser, different graphics formats can be displayed:

[Listing lost in extraction.]

img can be nested as a link into a. Links can also be connected to parts of an image:

[Listing lost in extraction.]



With object (and previously with applet and embed) one can dedicate parts of a browser

window to another application.



2.6 Typography

There are logical tags like strong for emphasis, physical tags like b for bold and tt for

typewriter fonts and finally (deprecated) font to explicitly specify font characteristics and

colors. The names are claimed to have been taken from texinfo.

abbr abbreviation

acronym acronym

cite quote

code program text, non-proportional

div can reference a style sheet

em emphasis

kbd input, non-proportional, often bold

q quote

samp output, non-proportional

strong strong emphasis

var variable name, often italic

b bold

big larger

i italic

small smaller

sub subscript

sup superscript

tt non-proportional

One should really just use the logical tags for typographic markup. Browsers are not obligated

to really differentiate everything.

Font changes can be nested (illogical) and can contain img and br but not (for example) p or

lists.



2.7 Links


text or image

a, img, frame, and form use a Uniform Resource Identifier URI (RFC 2396), which can in

principle describe every resource in the internet. URI is the "superclass" for Uniform

Resource Locator URL, used to obtain a resource, and Uniform Resource Name URN;

the latter is supposed to be a unique, persistent name for a resource.

So that everything can be described, a URI starts with a registered scheme name, and the rest

of the syntax depends on that:

absoluteURI = scheme ":" ( hierarchical | opaque )

scheme = ALPHA *( ALPHA | DIGIT | "+" | "-" | "." )

opaque can contain just about all characters except for the slash / which is used as path

separator. Here is an example for a non-hierarchical URI for a write-only resource:

mailto:axel@uos.de

The rest of these ideas is silently based on protocols like FTP and HTTP which are used as

scheme names:

hierarchical = ( net_path | abs_path ) [ "?" query ]

net_path = "//" authority [ abs_path ]

authority = server | reg_name

reg_name is a registered name, defined elsewhere, without slashes.

server = [ [ userinfo "@" ] hostport ]

hostport = host [ ":" port ]

host = hostname | IPv4address

userinfo consists of a user name which is supposed to be used for authentication. There

could be a password after a colon, but that would be transmitted unencrypted. hostname is a

DNS name, port consists of digits and the IPv4address uses the usual representation.

abs_path = "/" path_segments

path_segments = segment *( "/" segment )

segment = *pchar *( ";" param )

param = *pchar

pchar = char | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","

escaped = "%" hex hex

char = alphanum | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

An abs_path can look like a non-canonical Unix path. param is not supposed to mean

anything but it could be a version number in a file system. In a query the characters ; / ? :

@ & = + , and $ are reserved.

In addition to the absoluteURI there is a relativeURI with elaborate rules how it can be

converted into an absoluteURI based on context.

relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]

rel_path = rel_segment [ abs_path ]

rel_segment = 1*( char | escaped | ";" | "@" | "&" | "=" | "+" | "$" | "," )

A segment may contain : and ; param, a rel_path must start with a rel_segment, which

may not contain : and ; param , but which may contain ; maybe for the benefit of a Mac.
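The class java.net.URI follows RFC 2396 and can be used to take a URI apart along the lines of this grammar and to convert a relative URI into an absolute one. The sketch below uses invented example addresses; only mailto:axel@uos.de is taken from the text, and the class name UriDemo is made up.

```java
import java.net.URI;

// Hypothetical demo class, not part of the original sources.
class UriDemo {
    // Converts a relative URI into an absolute one in the context of base.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Hierarchical URI with server, path with param, and query (invented).
        URI http = URI.create("http://axel@localhost:8080/hello/5.xml;v=1?style=5.xsl");
        System.out.println(http.getScheme());    // scheme
        System.out.println(http.getUserInfo());  // userinfo
        System.out.println(http.getHost());      // host
        System.out.println(http.getPort());      // port
        System.out.println(http.getPath());      // abs_path including ;param
        System.out.println(http.getQuery());     // query

        // Opaque URI: there is no hierarchy after the scheme.
        URI mail = URI.create("mailto:axel@uos.de");
        System.out.println(mail.isOpaque());
        System.out.println(mail.getSchemeSpecificPart());

        // A relative URI is resolved in the context of an absolute one.
        System.out.println(resolve("http://localhost:8080/hello/5.xml", "5.xsl"));
    }
}
```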



2.8 Tools

Dave Raggett, the author of the newer HTML specifications, provides tidy, an HTML parser and

pretty-printer, which can be used to find errors and to create canonical HTML:

$ tidy -config tidyrc -quiet ../hello/1.html

Can't open "/Users/ats/.tidyrc"






[tidy output: the markup was lost in extraction. Surviving text: "Hello, Osnabrück!" (twice).]





You are recommended to use CSS to specify the font and

properties such as its size and color. This will reduce

the size of HTML files and make them easier maintain

compared with using <FONT> elements.

tidy is available for very many platforms. Specifically for Windows there are free and

commercial versions of the editors notetab and html-kit which include tidy and other tools.

3 Cascading Style Sheets

Cascading Style Sheets (CSS) were introduced to precisely control the representation of

HTML in web browsers and to separate form (CSS) and content (HTML) as far as possible.

Level 1 is supported by many browsers, level 2 is only partially supported; this chapter will

not discuss level 2.

Theoretically, XML based documents can be displayed in a web browser using CSS (lots of

details have to be specified), but this only works beginning with version 5.5 in Internet

Explorer.

It makes more sense to transform XML to HTML first; this allows information to be modified

or used more than once. In this case, CSS should be used to keep form and content separated

as far as possible.

XSLT is usually used to transform. This could be done client-side, at least in Internet Explorer,

but that severely restricts what browsers can be addressed.

Validator

There is a relatively old, dormant CSS-Validator at the W3C; it is not too robust and difficult to

download.


3.1 Architecture

A document is viewed as a tree. Format properties are assigned to the nodes using constant

statements:

selector { property-name : value ! important ; ... } /* comment */

CSS is not line-oriented. White space is used for separation; comments are as in C.

A selector is a pattern that matches some nodes; e.g., a name matches all nodes

representing document elements with that tag name. In HTML some elements are allowed to

be missing but they are silently added to the tree.

Properties and values are explained below. Not all properties are considered for all nodes. To

be upward compatible, a CSS implementation is allowed to silently ignore properties.

A document must contain or reference its CSS statements:

[Listing: the markup was lost in extraction. The surviving style rules and text follow.]

@import url( URI );
h1 { color: blue }

blue
green.



Inheritance

Some property values are inherited, i.e., imported from the parent node, if they have not been

defined explicitly in the current node.

Example: font and color are inherited. background is not inherited but the default value is

transparent, i.e., the background color of the parent node will shine through.
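The lookup can be modeled with a node that knows its parent. This is a sketch of the idea only, not of a browser implementation; the class StyledNode and the default values passed in are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a document tree node with CSS declarations.
class StyledNode {
    final StyledNode parent;                       // null for the root
    final Map<String, String> declared = new HashMap<>();

    StyledNode(StyledNode parent) { this.parent = parent; }

    // Inherited property: own declaration, else the nearest ancestor's,
    // else the default value.
    String inherited(String property, String defaultValue) {
        for (StyledNode node = this; node != null; node = node.parent) {
            String value = node.declared.get(property);
            if (value != null) return value;
        }
        return defaultValue;
    }

    // Property which is not inherited: own declaration, else the default value.
    String own(String property, String defaultValue) {
        return declared.getOrDefault(property, defaultValue);
    }

    public static void main(String[] args) {
        StyledNode body = new StyledNode(null);
        body.declared.put("color", "blue");
        body.declared.put("background", "khaki");
        StyledNode td = new StyledNode(body);
        System.out.println(td.inherited("color", "black"));        // from body
        System.out.println(td.own("background", "transparent"));   // default
    }
}
```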

Cascading

A document in a web browser can reference several style sheets. Weights govern what

prevails:

Statements are considered if their selector matches; otherwise there is inheritance and

finally the default value of a property.

important has precedence.

The document takes precedence over the browser user and then over the browser default.

In a selector, more id attributes, then more class attributes, and then more element names increase precedence.

Finally, later rules take precedence. @import must be first.

Finding errors is very difficult because unknown things are silently ignored.
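The weight that a selector contributes can be sketched as a number built from three counts. The class Specificity below is a hypothetical helper, not a browser API. It assumes fewer than ten items per category, counts pseudo-classes as classes and the pseudo-elements first-line and first-letter as element names, and covers only the part of the cascade that depends on the selector itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: weight of a selector from its ids, classes, and names. */
class Specificity {
    // an optional prefix (# id, . class, : pseudo) followed by a name
    private static final Pattern PART =
        Pattern.compile("([#.:]?)([A-Za-z0-9_-]+)");

    static int of(String selector) {
        int ids = 0, classes = 0, names = 0;
        Matcher m = PART.matcher(selector);
        while (m.find()) {
            String prefix = m.group(1), name = m.group(2);
            if (prefix.equals("#")) ids++;
            else if (prefix.equals(".")) classes++;
            else if (prefix.equals(":")) {
                // pseudo-elements count like element names, pseudo-classes like classes
                if (name.equals("first-line") || name.equals("first-letter")) names++;
                else classes++;
            } else names++;
        }
        return 100 * ids + 10 * classes + names;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "h1", "td.number", "p#x24", "a:link img" })
            System.out.println(s + " -> " + of(s));
    }
}
```

A rule with the larger number prevails; for equal numbers the later rule takes precedence.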



Patterns

body            selects an element
*               selects all elements
#x24            selects by (unique) id attribute
p#x24           only selects if respective element has this id attribute
.number         selects by class attribute
td.number       only selects if respective element has this class attribute
a:link          selects untouched link
a:visited       selects visited link
a:active        selects active link
p:first-letter  selects first letter
p:first-line    selects first line
a img           selects (deeply) nested element
p, td           logical OR of patterns

Level 2 has more patterns, e.g., to select directly nested or sibling nodes. There are some pseudo-elements which accept a content property so that the style sheet can insert content.

3.2 Box Model

Every element is considered to be a box; properties can define additional areas:

[Figure: nested areas of a box, from the outside in: margin, border, padding, content. The figure marks the element width and the box width.]

content and padding both use background of the element. border can be, e.g., a line in a

different color. margin is always transparent, i.e., the parent node influences that.

The report describes how elements are arranged relative to one another. CSS2 additionally permits position: fixed, i.e., a position that does not move in the browser window.



3.3 Properties

/* CSS1 properties, ordered as per REC-css1
   *inherited, default (if any) first, ?preceding restricted to
 */
*color: <color>

The file code/css/css1 was extracted from the report. It describes the CSS1 properties,

values, and defaults.

* before a property indicates that the value is inherited.


css/css1

The first value is usually the default. A value like <color> means that there is a special syntax: color names or explicit RGB values like #rrggbb with hexadecimal values or rgb(r,g,b) with values in 0..255 or as percentages.

color is the foreground color which is used for text. It should be defined together with

background.
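The two explicit RGB notations can be decoded as sketched below. CssColor is a hypothetical class written for this text. It deliberately ignores color names and any other notation, and it clips values that are out of range, which is an assumption about reasonable behavior rather than a quote from the report.

```java
/** Hypothetical decoder for #rrggbb and rgb(r,g,b); color names are not handled. */
class CssColor {
    /** Returns the color as 0xRRGGBB. */
    static int parse(String text) {
        String s = text.trim();
        if (s.startsWith("#") && s.length() == 7)
            return Integer.parseInt(s.substring(1), 16);
        if (s.startsWith("rgb(") && s.endsWith(")")) {
            String[] parts = s.substring(4, s.length() - 1).split(",");
            if (parts.length != 3) throw new IllegalArgumentException(text);
            int rgb = 0;
            for (String part : parts) rgb = (rgb << 8) | component(part.trim());
            return rgb;
        }
        throw new IllegalArgumentException("unsupported color: " + text);
    }

    /** One component, either 0..255 or a percentage, clipped to 0..255. */
    private static int component(String p) {
        double v = p.endsWith("%")
            ? Double.parseDouble(p.substring(0, p.length() - 1)) * 255 / 100
            : Double.parseDouble(p);
        return (int) Math.max(0, Math.min(255, Math.round(v)));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(parse("rgb(100%, 0%, 0%)")));
    }
}
```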

background: -color -image -repeat -attachment -position
background-color: transparent | <color>
background-image: none | <url>
background-repeat: repeat | repeat-x | repeat-y | no-repeat
background-attachment: scroll | fixed
background-position: 0% 0% | <percentage>.. | <length>..
    | top|center|bottom left|center|right
    ?block-level and replaced elements

css/css1

background defines the background and should be defined together with color. It can be a

value or an image.

block-level are elements with the property display: block, i.e., all those that do not apply

to the typography of individual characters. replaced are elements like img which are replaced

by external materials.


*font: [-style -variant -weight] -size [/line-height] -family
*font-style: normal | italic | oblique
*font-variant: normal | small-caps
*font-weight: normal | bold | bolder | lighter
    | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900
*font-size: medium | xx-|x-|small|large | larger | smaller
    | <length> | <percentage>
*font-family: <family-name>,...

font defines the typeface. serif, sans-serif, monospace, cursive, and fantasy leave the exact choice up to the browser.


css/css1

<length> can be specified using em and ex relative to the typeface (for font-size, relative to the typeface of the parent node) and using px based on the screen. Absolute units are in, cm, mm, pt, and pc (pica, 12pt).

Percentage values are relative but they are inherited as absolute values.
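The absolute units are related by fixed factors, which can be sketched as a conversion to points. CssLength is a hypothetical helper. It assumes the usual definition of 72pt per inch and does not handle the relative units em, ex, and px, because those depend on the typeface or the screen.

```java
/** Hypothetical helper: converts an absolute CSS length to points. */
class CssLength {
    static double toPoints(String length) {
        String s = length.trim();
        double number = Double.parseDouble(s.substring(0, s.length() - 2));
        switch (s.substring(s.length() - 2)) {
            case "pt": return number;
            case "pc": return number * 12;          // 1 pica = 12pt
            case "in": return number * 72;          // assumption: 1in = 72pt
            case "cm": return number * 72 / 2.54;   // 1in = 2.54cm
            case "mm": return number * 72 / 25.4;
            default:
                throw new IllegalArgumentException("not an absolute unit: " + length);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPoints("2pc") + "pt");
    }
}
```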

*word-spacing: normal | <length>
*letter-spacing: normal | <length>
text-decoration: none | underline overline line-through blink ..
vertical-align: baseline | sub | super | top | middle | bottom
    | text-top | text-bottom | <percentage>
    ?inline elements
*text-transform: none | capitalize | uppercase | lowercase
*text-align: left | right | center | justify
    ?block-level elements
*text-indent: 0 | <length> | <percentage>
    ?block-level elements
*line-height: normal | <number> | <length> | <percentage>

word- and letter-spacing could even be negative.

text-decoration can be combined. It should also apply to nested elements.

css/css1

vertical-align is relative to the parent node. Only top and bottom apply to the formatted

line, but they may not be used to create circular dependencies.

text-align applies to the text relative to its element; center only takes effect if width is also manipulated.

text-indent could even be negative.

line-height can also be defined as (inherited) part of the typeface size.


padding: -top -right -bottom -left
padding-top: 0 | <length> | <percentage>
padding-right: 0 | <length> | <percentage>
padding-bottom: 0 | <length> | <percentage>
padding-left: 0 | <length> | <percentage>

padding increases the area of an element and observes background. There can be one

value for all four sides or two values that are used symmetrically.

border border-top|right|bottom|left: -color -style -width
border-width: -top -right -bottom -left
border-color: <color>..
border-style: none | dotted | dashed | solid | double | groove
    | ridge | inset | outset
border-top|right|bottom|left-width: medium | thin | thick | <length>

border surrounds the result. This can be used to create lines.

margin: -top -right -bottom -left
margin-top: 0 | <length> | <percentage> | auto
margin-right: 0 | <length> | <percentage> | auto
margin-bottom: 0 | <length> | <percentage> | auto
margin-left: 0 | <length> | <percentage> | auto

css/css1

margin once again surrounds the result, but is transparent and observes the background of

the parent node.


width: auto | <length> | <percentage>
    ?block-level and replaced elements
height: auto | <length>
    ?block-level and replaced elements
float: none | left | right
clear: none | left | right | both
display: block | inline | list-item | none
white-space: normal | pre | nowrap
    ?block-level elements
list-style: -type -image -position
    ?all only to elements with 'display' value 'list-item'
list-style-type: disc | circle | square | decimal | lower-roman | upper-roman
    | lower-alpha | upper-alpha | none
list-style-image: none | <url>
list-style-position: outside | inside

css/css1

width and height explicitly define the size of an element, e.g., for img.

float defines an element as display: block and lets text flow past on the opposite side.

This need not work to any depth.

clear moves an element down until there are no float elements on the indicated side.

display differentiates sections, lists, and typography, and it can suppress output. In CSS2

there are more values applying to tables.

white-space can be modified for normal, and must be kept as is for pre. Finally, nowrap requires br elements for each line break; this aspect of br cannot be expressed in CSS.


3.4 Examples

CSS1 suggests the following default for HTML:

BODY {

margin: 1em;

font-family: serif;

line-height: 1.1;

background: white;

color: black;

}

H1, H2, H3, H4, H5, H6, P, UL, OL, DIR, MENU, DIV,

DT, DD, ADDRESS, BLOCKQUOTE, PRE, BR, HR, FORM, DL {

display: block }

B, STRONG, I, EM, CITE, VAR, TT, CODE, KBD, SAMP,

IMG, SPAN { display: inline }

LI { display: list-item }

H1, H2, H3, H4 { margin-top: 1em; margin-bottom: 1em }

H5, H6 { margin-top: 1em }

H1 { text-align: center }

H1, H2, H4, H6 { font-weight: bold }

H3, H5 { font-style: italic }

H1 { font-size: xx-large }

H2 { font-size: x-large }

H3 { font-size: large }

B, STRONG { font-weight: bolder } /* relative to the parent */

I, CITE, EM, VAR, ADDRESS, BLOCKQUOTE { font-style: italic }

PRE, TT, CODE, KBD, SAMP { font-family: monospace }

PRE { white-space: pre }

ADDRESS { margin-left: 3em }

BLOCKQUOTE { margin-left: 3em; margin-right: 3em }

UL, DIR { list-style: disc }

OL { list-style: decimal }

MENU { margin: 0 } /* tight formatting */

LI { margin-left: 3em }

DT { margin-bottom: 0 }

DD { margin-top: 0; margin-left: 3em }

HR { border-top: solid } /* 'border-bottom' could also have been used */

A:link { color: blue } /* unvisited link */

A:visited { color: red } /* visited links */

A:active { color: lime } /* active links */

/* setting the anchor border around IMG elements

requires contextual selectors */

A:link IMG { border: 2px solid blue }

A:visited IMG { border: 2px solid red }

A:active IMG { border: 2px solid lime }



The lecture notes depend on XML, XSLT, and CSS:

Homepage index.xml

index.xsl

Navigation and CSS2

index.html

code/css/common.css

home.html

title-frame.html

code/index.html code/css/common.css

Code code/index.xml

code/index.xsl

FTP area ftp/index.xml

ftp/index.xsl

ftp/index.html code/css/common.css

Papers rec/index.xml

rec/index.xsl

rec/index.html code/css/common.css

Notes (FrameMaker) html/skript.html html/images/skript.css

code/css/common.css

Color Table code/xslt/colors.xsl code/xslt/colors.23.html

code/xslt/doColors.xsl

code/xslt/makefile

code/xslt/colors.127.html

RGB Table code/xslt/rgb.txt

code/xslt/rgb2xml

code/xslt/rgb.xsl

code/xslt/rgb.html

The Homepage shows that frameset and frame can be used to program something like the

bookmarks of PDF, which do not disappear when the page content is scrolled.

The drawback is that bookmarks and history in the browser are unlikely to achieve the desired effect, i.e., the complete restoration of a position.

The CSS2 properties position: fixed and z-index: 1 could be used to shift an area forward to get a comparable navigation within a single page. A similar example is at the W3C.

An iframe should keep the navigation accessible even if it is bigger than the visible area.

At least for Internet Explorer 5.1 on MacOS X the solution is less than optimal.

4 Compiler Construction

XML can be used to design and implement languages. This is normally the job of compiler construction, usually, however, with a much more user-friendly representation.

This chapter sketches basic concepts in compiler construction as a foundation and for

comparison to the definition of XML which follows in the next chapter.

4.1 Grammars

A grammar consists of a finite set of nonterminal symbols, a finite set of terminal symbols, a

start symbol which is a nonterminal, and a finite set of rules. A rule is an ordered pair of

sequences of nonterminal and terminal symbols. For example:

nonterminals: a, b

terminals: c, d

start symbol: a

rules: (ab, ac), (a, )

Chomsky distinguishes four different kinds of grammars based on the

structure of the rules.

In a context-free grammar each rule must have a single nonterminal as the first sequence. It is known that a push-down automaton (PDA, stack machine) is sufficient for recognition. For example, the grammar above is not context-free, but with the following rules it is:

rules: (a, b), (a, ac), (b, d)

In a regular grammar, a rule consists either of a nonterminal and a terminal or of a nonterminal

and a sequence consisting of a nonterminal and a terminal. A finite state automaton (FSA) can

be constructed from the grammar and perform recognition. For example, the grammar above

with the context-free rules is in fact regular.
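With the rules (a, b), (a, ac), (b, d) the sentences are d, dc, dcc, and so on. A finite state automaton for this language needs only two states. The Java sketch below is hand-written for this one grammar; the class name DcAutomaton and the state numbering are assumptions made for the example.

```java
/** Hypothetical finite state automaton for the sentences d, dc, dcc, ... */
class DcAutomaton {
    static boolean accepts(String input) {
        int state = 0;                       // 0: start, 1: d was seen (accepting)
        for (char ch : input.toCharArray()) {
            if (state == 0 && ch == 'd') state = 1;
            else if (state == 1 && ch == 'c') state = 1;
            else return false;               // no transition: reject
        }
        return state == 1;
    }

    public static void main(String[] args) {
        System.out.println(accepts("dcc"));  // a sentence
        System.out.println(accepts("cd"));   // not a sentence
    }
}
```

No stack is needed, only the current state, which is what distinguishes the automaton from a push-down automaton.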

It turns out that the pattern matching performed by commands like grep can (for the most part)

be done with a FSA. This is where the regular expressions describing the patterns got their

name.


4.2 Rule Specifications

Rules are often specified in Backus Naur Form (BNF). The typical grammar for arithmetic

expressions has the following rules:

sum: product | sum "+" product | sum "-" product;

product: term | product "*" term | product "/" term | product "%" term;

term: Number | "(" sum ")" | "+" term | "-" term;

Each nonterminal shows up on the left. Other names like Number represent categories of

terminals (token). Strings represent terminals directly (literal). All rules for the same

nonterminal are combined into one by joining alternative right hand sides with |. The first

nonterminal is the start symbol.

Extended Backus Naur Form (EBNF) largely avoids recursion by using special syntax for

iterations:

lines: ( sum? "\n" )*;

sum: product ( "+" product | "-" product )*;

product: term ( "*" term | "/" term | "%" term )*;

term: Number | "(" sum ")" | "+" term | "-" term;


oops/expr.ebnf

Parentheses are used for grouping and * follows an element that can be repeated zero or

more times. Similarly, + requires one or more iterations and ? specifies that the preceding

element appears once or not at all.

Wirth introduced (directed) syntax graphs where terminals are specified in round and

nonterminals in rectangular nodes:

[Figure: syntax graphs for lines, sum, product, and term.]


The topology of syntax graphs can be more restricted if a variant of Nassi-Shneiderman

diagrams is employed:

[Figure: Nassi-Shneiderman diagrams for lines, sum, product, and term.]

Both kinds of pictures suggest that a grammar can act as a blueprint for a recognition program, using the method of recursive descent:

sum: product ( "+" product | "-" product )*;

results in

sum() {
  product();
  while (true) {
    // input is the current input symbol; advancing past it is not shown
    if (input.equals("+")) product();
    else if (input.equals("-")) product();
    else break;
  }
}
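The blueprint idea can be carried through for the complete EBNF grammar of sum, product, and term. The following Java class is a minimal sketch and not the oops implementation: ExprParser and its method accept() are made-up names, white space is simply deleted, Number is restricted to unsigned integers, and values are computed immediately instead of building a tree.

```java
/** Hypothetical recursive descent evaluator; one method per nonterminal. */
class ExprParser {
    private final String text;
    private int pos;

    ExprParser(String text) { this.text = text.replaceAll("\\s+", ""); }

    static long evaluate(String text) {
        ExprParser p = new ExprParser(text);
        long value = p.sum();
        if (p.pos < p.text.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return value;
    }

    /** consumes ch if it is the next input symbol. */
    private boolean accept(char ch) {
        if (pos < text.length() && text.charAt(pos) == ch) { pos++; return true; }
        return false;
    }

    // sum: product ( "+" product | "-" product )*;
    long sum() {
        long value = product();
        while (true) {
            if (accept('+')) value += product();
            else if (accept('-')) value -= product();
            else return value;
        }
    }

    // product: term ( "*" term | "/" term | "%" term )*;
    long product() {
        long value = term();
        while (true) {
            if (accept('*')) value *= term();
            else if (accept('/')) value /= term();
            else if (accept('%')) value %= term();
            else return value;
        }
    }

    // term: Number | "(" sum ")" | "+" term | "-" term;
    long term() {
        if (accept('(')) {
            long value = sum();
            if (!accept(')')) throw new IllegalArgumentException("missing ) at " + pos);
            return value;
        }
        if (accept('+')) return term();
        if (accept('-')) return -term();
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return Long.parseLong(text.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("3+4*5"));
    }
}
```

Each iteration in the grammar becomes a loop and each alternative becomes a chain of if statements that look at a single next symbol.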


4.3 Trees

The start symbol of a grammar produces syntax trees: Nodes are nonterminals or terminals. If

a node has descendants the node must be a nonterminal and there must be a rule consisting

of the nonterminal and the (ordered) sequence of descendants. For example:

    a
   / \
  a   c
  |
  b
  |
  d

The ordered sequence of all terminal leaves of a syntax tree (with the start symbol as a root) is

called a sentence. In this example the sentence is dc.

A language is a set of sentences, i.e., sequences of terminals. A grammar for a language must

exactly produce all sentences. A language can have more than one grammar.

A grammar is called ambiguous if there is a sentence for which there is more than one syntax

tree. For example:

nonterminals: sum

terminals: number, +

start symbol: sum

rules: (sum, sum + sum), (sum, number)

[Figure: two different syntax trees for the sentence number + number + number.]
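How quickly the ambiguity grows can be checked by counting syntax trees. For the rules (sum, sum + sum) and (sum, number), a sentence with n numbers is split at one of its n-1 plus signs, and both sides are again sums. The Java sketch below counts the resulting trees; the class name AmbiguityCount is an assumption made for the example.

```java
/** Hypothetical counter of syntax trees for sum: sum "+" sum | number. */
class AmbiguityCount {
    /** number of syntax trees for a sentence with the given count of numbers (at least 1). */
    static long trees(int numbers) {
        long[] count = new long[numbers + 1];
        count[1] = 1;                                   // a single number: one tree
        for (int n = 2; n <= numbers; n++)
            for (int left = 1; left < n; left++)        // numbers left of the top-level +
                count[n] += count[left] * count[n - left];
        return count[numbers];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++)
            System.out.println(n + " numbers: " + trees(n) + " tree(s)");
    }
}
```

For three numbers the count is two, which matches the two trees for number + number + number.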


Arithmetic expressions are often represented as expression trees. They are similar to syntax

trees but they only contain nodes that are relevant for evaluation:

[Figure: expression tree for 2 + 3 * 4.]

Expression trees represent the meaning of an expression. They obviously result from pruning

syntax trees and this is why two syntax trees for the same sentence are not acceptable. For

compiler construction it is very important that a grammar is not ambiguous.

There is no easy way to prove that a grammar is not ambiguous, but there are sufficient

conditions which can be checked by programs, e.g. LL(1):


Consider a syntax graph to be a map that a recursive descent program has to travel: at each

intersection (in this case in sum) a single next terminal must be sufficient to decide which path

to take.


Looking at Nassi-Shneiderman diagrams one recognizes that this has to be decided for each building block of EBNF and that the blocks must help each other. E.g., here the nested alternative must tell the enclosing iteration that it can only use + and -.



4.4 Language Recognition

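Nassi-Shneiderman diagrams have the following building blocks:

Lit      a literal such as "\n"
Token    a category of input symbols such as Number
Id       a reference to a rule such as sum
Seq      a sequence
Alt      alternatives, |
Opt      optional, ?
Some     one or more, +
Many     zero or more, *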

oops.parser is a class hierarchy with these building blocks. It is used to represent and serialize trees for grammars that are defined in EBNF. expr.ebnf is converted to a serialized tree in expr.ser as follows:

$ java -classpath .:oops.jar oops.Compile -f Ebnf ebnf.ser Ebnf.Scanner \

> expr.ebnf > expr.ser

Grammar Checking

The classes implement methods to test the LL(1) property for a tree before it is serialized. The

following grammar describes bit sequences but it is not LL(1):

// a grammar which is not LL(1)

bits: ( "0" | "1" )* ( "0" | "1" );

$ java -classpath .:oops.jar oops.Compile -f Ebnf ebnf.ser Ebnf.Scanner \

> bits.ebnf >/dev/null

GoalMakerFactory is Ebnf

oops.parser.Many, warning:

[{ ( "0" | "1" ) }]: ambiguous, will shift

lookahead {[empty], "1", "0"}

follow {"1", "0"}


oops/bits.ebnf

The iteration Many cannot decide if it should stop. For convenience it will try to collect the longest sequence, which in this case is a mistake.
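The warning boils down to a set intersection: an iteration can decide with one symbol of lookahead only if no symbol can both start another repetition and follow the iteration. The Java sketch below shows just this test. IterationCheck is a hypothetical class, not the code in oops.parser, and the two sets are passed in rather than computed from a grammar.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical check of the LL(1) condition for one iteration. */
class IterationCheck {
    /** true if a single next symbol decides whether to repeat or to stop. */
    static boolean decidable(Set<String> lookaheadOfBody, Set<String> follow) {
        Set<String> both = new HashSet<>(lookaheadOfBody);
        both.retainAll(follow);              // symbols that permit both choices
        return both.isEmpty();
    }

    public static void main(String[] args) {
        // bits: ( "0" | "1" )* ( "0" | "1" );
        System.out.println(decidable(Set.of("0", "1"), Set.of("0", "1")));
        // sum: product ( "+" product | "-" product )*; followed by "\n" or ")"
        System.out.println(decidable(Set.of("+", "-"), Set.of("\n", ")")));
    }
}
```

The same bit sequences can presumably be described without the conflict by using the Some iteration, i.e., bits: ( "0" | "1" )+;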


Input Checking

The classes in oops.parser additionally implement methods for recursive descent so that a

grammar tree can be used to check if some input is a sentence.

oops.Compile defines a main program that reads a serialized grammar tree, instantiates a

scanner object, and then analyzes some input.

# example of an arithmetic expression

3 + 4 * 5

$ java -classpath .:oops.jar oops.Compile expr.ser Scanner a.expr > /dev/null

No suitable Goal for rule lines found; will use a GoalAdapter.

No suitable Goal for rule sum found; will use a GoalAdapter.

No suitable Goal for rule product found; will use a GoalAdapter.

No suitable Goal for rule term found; will use a GoalAdapter.

This is a sentence, so there is no interesting output.

Scanner

The scanner object must implement the oops.parser.Scanner interface:


oops/a.expr

scan(Reader,Parser) is called once to tell the scanner about the input and the parser. This

method should call advance() to proceed to the first input symbol.

advance() must proceed to the next input symbol and return true if there is one.

atEnd() must return true once advance() returned false at end of input.

tokenSet() is called to get the current input symbol from the scanner. The actual result is

determined by asking the parser.

For a literal the string from the grammar is sent to the parser with getLitSet() and the result

is either null indicating that the string is unknown or the value that tokenSet() has to

return for the literal.

A token, i.e., a name for a category of input symbols such as Number, is sent to the parser with

((Token)parser.getPeer("Number")).getLookahead() and the result has to be returned

by tokenSet() whenever the input symbol fits the category.

One usually needs to know something about the actual input symbol as well. Therefore, the method node() can return an object corresponding to the result of tokenSet(). In the case of Number one would return a Double or Long representing the actual value, etc.

[This interface is efficient for the parser, but it is harder for the scanner to implement. Newer

versions of oops.parser use a different interface.]

Here is a scanner for arithmetic expressions implemented with a StreamTokenizer:


import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.Set;
import oops.parser.Parser;   // package assumed, as for oops.parser.Scanner
import oops.parser.Token;    // package assumed

/** lexical analyzer for arithmetic expressions.
  */
public class Scanner implements oops.parser.Scanner {
  protected StreamTokenizer st;   // to read input
  protected Parser parser;        // to get Lit/Token tokenSet values
  protected Set NUMBER;           // tokenSet for Number
  protected Set tokenSet;         // null or lookahead identifying token
  protected Object node;          // Double/Long value for Number

  public Set tokenSet () { return tokenSet; }
  public Object node () { return node; }
  public String toString () { return st.toString(); }

  public void scan (Reader r, Parser parser) throws IOException {
    st = new StreamTokenizer(new FilterReader(new BufferedReader(r)) {
      boolean addSpace;           // kludge to duplicate \n
      public int read () throws IOException {
        int ch = '\n';
        addSpace = addSpace ? false : (ch = in.read()) == '\n';
        return ch;
      }
    });
    st.resetSyntax();
    st.commentChar('#');          // comments from # to end-of-line
    st.wordChars('0', '9');       // parse decimal numbers as words
    st.wordChars('.', '.');
    st.whitespaceChars(0, ' ');   // ignore control-* and space
    st.eolIsSignificant(true);    // need '\n'
    this.parser = parser;
    NUMBER = ((Token)parser.getPeer("Number")).getLookahead();
    advance();                    // one token lookahead
  }

  public boolean atEnd () { return st.ttype == StreamTokenizer.TT_EOF; }

  public boolean advance () throws IOException {
    if (atEnd()) return false;
    switch (st.nextToken()) {
    case StreamTokenizer.TT_EOF:  node = null; tokenSet = null; return false;
    case StreamTokenizer.TT_EOL:  node = null; tokenSet = parser.getLitSet("\n");
                                  break;
    case StreamTokenizer.TT_WORD: node = st.sval.indexOf(".") < 0
                                    ? (Number)new Long(st.sval)
                                    : (Number)new Double(st.sval);
                                  tokenSet = NUMBER;
                                  break;
    default:                      node = ""+(char)st.ttype;
                                  tokenSet = parser.getLitSet((String)node);
                                  break;
    }
    return true;
  }
}

oops/Scanner.java
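The effect of the syntax table can be tried without oops.jar. The following Java sketch configures a StreamTokenizer like the scanner above and collects the symbols as strings. TokenizerDemo is a made-up class, it reads from a StringReader without the kludge that duplicates \n, and the string EOL is just a label chosen here for the end-of-line symbol.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demonstration of the tokenizer settings; needs no parser. */
class TokenizerDemo {
    static List<String> tokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();               // every character is ordinary
        st.commentChar('#');            // comments from # to end-of-line
        st.wordChars('0', '9');         // decimal numbers are words
        st.wordChars('.', '.');
        st.whitespaceChars(0, ' ');     // control characters and space
        st.eolIsSignificant(true);      // report '\n' as TT_EOL
        List<String> result = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_EOL:  result.add("EOL"); break;
                    case StreamTokenizer.TT_WORD: result.add(st.sval); break;
                    default: result.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("3 + 4 * 5\n"));
    }
}
```

Numbers arrive as words and every other visible character arrives as a single-character symbol, which is what the default case of advance() relies on.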


4.5 Language Processing

If a grammar is converted into a serialized tree like expr.ser, which can decide together

with a scanner if an input is a sentence for the grammar, then one usually wants to find out

what symbols are in the input. This can be accomplished with an observer pattern:

Token and Lit nodes in activated rules send shift messages to an observer for each

accepted terminal:

public interface Goal {

void shift (Lit sender, Object value);

void shift (Token sender, Object value);

value is the value() returned by the scanner; i.e., the observer can see what number was

actually accepted for a Number.

Once a rule is finished the observer receives a reduce message:

Object reduce ();

The observer can return null or some other object. Either is sent with a shift message to that (encompassing) observer which activated the rule:

void shift (Goal sender, Object value);

}

[Figure: Nassi-Shneiderman diagrams for lines, sum, product, and term, annotated with messages such as Lit: found *, Goal: found term, and Token: found Number.]

This observer pattern is nested twice into a factory pattern:

public interface GoalMaker { Goal goal (); }

public interface GoalMakerFactory { GoalMaker goalMaker (String ruleName); }

oops.Compile can be called with -f and a class name for a GoalMakerFactory. This is

instantiated once and has to produce a GoalMaker for each nonterminal which in turn has to

produce a Goal whenever a rule is activated for the nonterminal during recursive descent.

There is a DefaultGoalMakerFactory with a GoalAdapter which remembers the very first value and returns it for reduce(). There also is a DebuggerGoalMakerFactory for tracing that is used if -d rather than -f is specified as argument for oops.Compile.


The following examples demonstrate three different observer patterns: a Handler with a

single Goal object, a Ruler with a single Goal object for each rule even if the rule is used

recursively and finally an Actor with one Goal for each rule activation. The patterns have

different uses.

Trace

Trace is a Goal for tracing which records its findings in a StringBuffer, which it does not return, however:

/** base class for tracing Goals.

*/

public class Trace implements Goal {

/** preserves information.

*/

protected StringBuffer result = new StringBuffer();

/** recognized a literal.

*/

public void shift (Lit sender, Object value) {

if (sender.toString().equals("\"\n\""))

System.err.println(this+"\tshift\t\"\\n\"\t"+value);

else

System.err.println(this+"\tshift\t"+sender+"\t"+value);

if (value != null) result.append(' ').append(value);

}

/** recognized a literal from a category.

*/

public void shift (Token sender, Object value) {

System.err.println(this+"\tshift\t"+sender+"\t"+value);

if (value != null) result.append(' ').append(value);

}

/** satisfied a nonterminal.

*/

public void shift (Goal sender, Object value) {

System.err.println(this+"\tshift\t"+sender+"\t"+value);

result.append(" [").append(value).append(']');

}

/** done with rule.

*/

public Object reduce () {

System.err.println(this+"\treduce\t"+result);

return "reduced";

}

}

There is a bit of trickery so that the trace looks reasonable for the literal "\n".


oops.Trace.java


Handler

A relatively simple shell script oops and some classes in oops.helpers cooperate so that a grammar can be specified together with its goals:

import oops.parser.Goal;

/** global trace.

*/

public class Handler extends oops.helpers.Handler {

%%

lines: ( sum? "\n" )*;

sum: product ( "+" product | "-" product )*;

product: term ( "*" term | "/" term | "%" term )*;

term: Number | "(" sum ")" | "+" term | "-" term;

%

return new Trace() {

public void shift (Goal sender, Object value) {

if (sender != this)

System.err.println("not a singleton: "+this+" "+sender);

super.shift(sender, value);

}

public String toString () { return "handler"; }

};

%

%%

}


oops/Handler.handler

Code before and after two lines, each consisting only of %%, must define a Java class which

extends one of the three observer patterns defined in oops.helpers.

The middle section contains the EBNF grammar.

In the case of a Handler, the file name must end in .handler and there should be one section enclosed by two lines, each consisting only of %. This section must return a new Goal. This is the single Goal that the Handler pattern offers as an observer to all rule activations.

The grammar can be extracted with the following command:

$ oops ebnf Handler.handler > expr.ebnf

The observer is created and compiled with the following commands:

$ oops java Handler.handler > Handler.java

$ javac -classpath .:oops.jar Handler.java


Handler is a GoalMakerFactory which constructs a single Trace object. shift() verifies

that there is only a single observer. The trace of 3+4*5 suggests that it could be converted into

an expression tree:

$ java -classpath .:oops.jar oops.Compile -f Handler expr.ser Scanner > /dev/null

GoalMakerFactory is Handler

3+4*5

handler shift Number 3

handler reduce 3

handler shift handler reduced

handler reduce 3 [reduced]

handler shift handler reduced

handler shift "+" +

handler shift Number 4

handler reduce 3 [reduced] [reduced] + 4

handler shift handler reduced

handler shift "*" *

handler shift Number 5

handler reduce 3 [reduced] [reduced] + 4 [reduced] * 5

handler shift handler reduced

handler reduce 3 [reduced] [reduced] + 4 [reduced] * 5 [reduced]

handler shift handler reduced

handler reduce 3 [reduced] [reduced] + 4 [reduced] * 5 [reduced] [reduced]

handler shift handler reduced

handler shift "\n" null

handler shift "\n" null

handler reduce 3 [reduced] [reduced] + 4 [reduced] * 5 [reduced] [reduced] [reduced]

The additional \n is synthesized by the scanner (which otherwise would have to read ahead).



Ruler

Ruler is a GoalMakerFactory which creates one Trace object per rule:

/** trace per rule.

*/

public class Ruler extends oops.helpers.Ruler {

%%

lines: ( sum? "\n" )*;

%

return new TraceN("lines");

%

sum: product ( "+" product | "-" product )*;

%

return new TraceN("sum");

%

product: term ( "*" term | "/" term | "%" term )*;

%

return new TraceN("product");

%

term: Number | "(" sum ")" | "+" term | "-" term;

%

return new TraceN("term");

%

%%

}


oops/Ruler.ruler

In the case of a Ruler, the file name must end in .ruler and following each rule there can be a section enclosed by two lines, each consisting only of %. This section must return a new Goal. This is the Goal that the Ruler pattern offers as an observer to each of the preceding rule's activations. If there is no section, a GoalAdapter is used.

oops is a bit primitive: each rule must start on a new line.

The same commands as above can be used to extract the grammar and compile the observer:

$ oops ebnf Ruler.ruler > expr.ebnf

The observer is created and compiled with the following commands:

$ oops java Ruler.ruler > Ruler.java

$ javac -classpath .:oops.jar Ruler.java


TraceN is a top-level nested class in Ruler. The TraceN objects show their nonterminal name

qualified with an index. reduce() returns the current StringBuffer and creates a new one:

/** counts Goal instances.

*/

protected static int nextN;

/** maintains indexed rule name,

reduces to rule name and string buffer.

*/

protected static class TraceN extends Trace {

private final int n = ++ nextN;

private final String ruleName;

public TraceN (String ruleName) {

this.ruleName = ruleName;

System.err.println(this+"\tnew");

}

public Object reduce () {

super.reduce();

String result = ruleName+this.result;

this.result = new StringBuffer(); // this botches recursive calls

return result;

}

public String toString () {

return ruleName + n;

}

}


oops/Ruler.ruler

The trace of 3+4*5 seems to suggest that it is trivial for this kind of observer to make an

expression tree:

$ java -classpath .:oops.jar oops.Compile -f Ruler expr.ser Scanner > /dev/null

lines1 new

sum2 new

product3 new

term4 new

GoalMakerFactory is Ruler

3+4*5

...

lines1 shift sum2

sum [product [term 3]] + [product [term 4] * [term 5]]

Unfortunately, 3*(4+5) uses sum recursively and shows that the Ruler cannot cope with this:

3*(4+5)

...

lines1 shift sum2

sum [product [term [sum [product [term 3] * [term ( 4]] + [product [term 5]]] )]]
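The misplaced parenthesis in [term ( 4] comes from sharing one buffer between nested activations of the same rule. The effect can be reproduced without oops. The Java sketch below is a made-up model: sharedTerm() imitates one Goal per rule that resets its buffer when it reduces, freshTerm() imitates one Goal per activation, and the strings they build are arbitrary.

```java
/** Hypothetical model of one buffer per rule versus one buffer per activation. */
class BufferSharing {
    private static StringBuilder shared = new StringBuilder();

    /** every activation writes into the same buffer and resets it at the end. */
    static String sharedTerm(int depth) {
        shared.append("(");
        if (depth > 0) {
            String inner = sharedTerm(depth - 1);   // the nested activation resets the buffer
            shared.append('[').append(inner).append(']');
        } else {
            shared.append("n");
        }
        shared.append(")");
        String result = shared.toString();
        shared = new StringBuilder();               // like reduce() in TraceN
        return result;
    }

    /** every activation has its own buffer. */
    static String freshTerm(int depth) {
        StringBuilder own = new StringBuilder("(");
        if (depth > 0) own.append('[').append(freshTerm(depth - 1)).append(']');
        else own.append("n");
        return own.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(freshTerm(1));    // parentheses nest properly
        System.out.println(sharedTerm(1));   // the outer ( ends up inside the inner result
    }
}
```

The opening parenthesis of the outer activation is swallowed by the inner one, just as in the trace above.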


Evaluation with a Stack

Stack is extended with a few methods for Integer arithmetic:

public class Cpu extends Stack {

public void add () {

Number right = (Number)pop();

push(new Integer(((Number)pop()).intValue() + right.intValue()));

}

// ...

public void minus () {

push(new Integer(- ((Number)pop()).intValue()));

}

}

The grammar is changed a bit:

lines: ( expr | "\n" )*;

expr: sum "\n";

sum: product ( add | sub )*;

add: "+" product;

sub: "-" product;

product: term ( mul | div | mod )*;

mul: "*" term;

div: "/" term;

mod: "%" term;

term: Number | "(" sum ")" | "+" term | minus;

minus: "-" term;


oops/Cpu.java

oops/eval.ebnf

The new rules like expr, add, or minus end exactly when an evaluation has to take place.


Now a Ruler observer can evaluate expressions elegantly:

/** evaluate arithmetic expressions using Cpu.

*/

public class Eval extends Ruler {

protected final Cpu cpu;

{ try {

cpu = (Cpu)Class.forName(

System.getProperty("cpu", "Cpu")).newInstance();

} catch (Exception e) {

System.err.println(e+": cannot create Cpu");

throw new ThreadDeath();

}

}

%%

lines: ( expr | "\n" )*;

expr: sum "\n";

%

return new GoalAdapter() {

public Object reduce () {

System.err.println("\t"+cpu); cpu.removeAllElements();

return null; // no tree to output

}

};

%

sum: product ( add | sub )*;

add: "+" product;

%

return new GoalAdapter() {

public Object reduce () { cpu.add(); return null; }

};

%

// ...

term: Number | "(" sum ")" | "+" term | minus;

%

return new GoalAdapter() {

public void shift (Token sender, Object value) { cpu.push(value); }

};

%

minus: "-" term;

%

return new GoalAdapter() {

public Object reduce () { cpu.minus(); return null; }

};

%

%%

}


oops/Eval.ruler


term has to put the value of a Number onto the stack. Other than that the trivial GoalAdapter is acceptable and reduce() is overridden to perform calculations on the stack. At expr there is a result on the stack which can be displayed and discarded.
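The order in which the rules end determines the order of the stack operations. For 3+4*5 the three numbers are pushed first, then mul ends, and then add ends. The Java sketch below replays this sequence by hand. StackEval is a stand-in written for this text, not the Cpu class; it uses an ArrayDeque and int values, and the mul operation is assumed to be analogous to add.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for Cpu, replaying the operations for 3+4*5. */
class StackEval {
    private final Deque<Integer> stack = new ArrayDeque<>();

    void push(int value) { stack.push(value); }
    void add()   { int right = stack.pop(); stack.push(stack.pop() + right); }
    void mul()   { int right = stack.pop(); stack.push(stack.pop() * right); }
    void minus() { stack.push(-stack.pop()); }
    int top()    { return stack.peek(); }

    /** the sequence that the observers would trigger for 3+4*5. */
    static int demo() {
        StackEval cpu = new StackEval();
        cpu.push(3);        // term shifts Number 3
        cpu.push(4);        // term shifts Number 4
        cpu.push(5);        // term shifts Number 5
        cpu.mul();          // mul is reduced: 4 * 5
        cpu.add();          // add is reduced: 3 + 20
        return cpu.top();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Precedence is never checked at this point; it is already encoded in the order in which the grammar rules are completed.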

The grammar is extracted and converted to an oops.parser tree as before:

$ oops ebnf Eval.ruler > eval.ebnf

$ java -classpath .:oops.jar oops.Compile -f Ebnf ebnf.ser Ebnf.Scanner \

> eval.ebnf > eval.ser

The observer is created and compiled with the following commands:

$ oops java Eval.ruler > Eval.java

$ javac -classpath .:oops.jar Eval.java

The resulting interpreter is executed with the following command:

$ java -classpath .:oops.jar oops.Compile -f Eval eval.ser Scanner

GoalMakerFactory is Eval

3+4*5

[23]

3*(4+5)

[27]

+ (1 + 22 - 33 * 4 / 5 % + 6)

[21]



Actor

Actor is a GoalMakerFactory which produces a rule-specific Trace object for every rule

activation:

/** trace each activation of each rule.

*/

public class Actor extends oops.helpers.Actor {

/** counts Goal instances.

*/

protected static int nextN;

/** maintains indexed rule name,

reduces to rule name.

*/

protected static class TraceN extends Trace {

private final int n = ++ nextN;

private final String ruleName;

public TraceN (String ruleName) {

this.ruleName = ruleName;

System.err.println(this+"\tnew");

}

public Object reduce () {

super.reduce();

return ruleName + result;

}

public String toString () {

return ruleName + n;

}

}

%%

lines: ( sum? "\n" )*;

%

return new TraceN("lines");

%

// ...


oops/Actor.actor

In the case of an Actor, the file name must end in .actor and following each rule there can
be a section enclosed by two lines, each consisting only of %. This section must return a new
Goal; a new Goal is created by the Ruler pattern for each of the preceding rule’s
activations. If there is no section, a GoalAdapter is created.

The observer is created and compiled with the following commands:

$ oops java Actor.actor > Actor.java

$ javac -classpath .:oops.jar Actor.java


TraceN is a top-level nested class in Actor. The TraceN objects show their nonterminal name

qualified with an index.

Now we get new StringBuffer objects all the time and they can be returned as result of

reduce(). The trace shows that this trivially produces a syntax tree:

$ java -classpath .:oops.jar oops.Compile -f Actor expr.ser Scanner

GoalMakerFactory is Actor

3*(4+5)

...

lines1 shift sum2

sum [product [term 3] * [term ( [sum [product [term 4]] + [product [term 5]]] )]]

Arithmetic Expressions in XML

The Element Construction Set contains classes to build a tree and render it in XML. Xml is an

Actor for expr.ebnf. It uses these classes to render arithmetic expressions in XML:

/** convert arithmetic expressions to XML.

*/

public class Xml extends Actor {

%%

lines: ( sum? "\n" )*;

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) {

if (result == null) result = new Root("expr");

((Root)result).add(value);

}

public Object reduce () {

if (result != null)

System.out.println(result);

return null; // no serialized tree to output

}

};

%

sum: product ( "+" product | "-" product )*;

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) {

if (result == null) result = value;

else ((Element)result).add(value);

}

public void shift (Lit sender, Object value) {

if (value.equals("+"))

result = new Element("add", new Object[]{ result });

else result = new Element("sub", new Object[]{ result });

}

};

%


oops/Xml.actor


lines creates the Root with tag expr and inserts the elements produced by sum. If the Root
was ever created it is rendered by reduce().

The GoalAdapter defines result, stores the value delivered by the first shift(), and returns

this value by reduce(). The Goal for lines returns null so that oops.Compile does not

produce a serialized object.

sum saves the first Element returned by product. If there is an operator like + a new Element

is created with a tag corresponding to the operator, the current result becomes the first child

of this Element, and the Element is the new result. The next Element returned by product

then becomes the second child, etc.

product works the same and creates Element objects named div, mod or mul.
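The way sum folds its operands into a left-nested tree can be sketched without the Element Construction Set. ElementSketch below is a hypothetical miniature and not the ECS API; fold() performs by hand the steps which the shift() methods for sum perform one call at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of an element tree; not the Element Construction Set API. */
final class ElementSketch {
    private final String tag;
    private final List<Object> children = new ArrayList<>();

    ElementSketch(String tag, Object... children) {
        this.tag = tag;
        for (Object child : children) this.children.add(child);
    }

    void add(Object child) { children.add(child); }

    /** Renders the element and its children without any indentation. */
    @Override public String toString() {
        if (children.isEmpty()) return "<" + tag + "/>";
        StringBuilder out = new StringBuilder("<" + tag + ">");
        for (Object child : children) out.append(child);
        return out.append("</").append(tag).append(">").toString();
    }

    /** Folds operands and operators the way the Goal for sum does. */
    static Object fold(Object[] operands, String[] operators) {
        Object result = operands[0];                        // first shift(Goal)
        for (int i = 0; i < operators.length; i++) {
            String tag = operators[i].equals("+") ? "add" : "sub";
            result = new ElementSketch(tag, result);        // shift(Lit): wrap current result
            ((ElementSketch) result).add(operands[i + 1]);  // next shift(Goal): second child
        }
        return result;
    }

    public static void main(String[] args) {
        Object[] operands = {
            new ElementSketch("a"), new ElementSketch("b"), new ElementSketch("c")
        };
        // a + b - c nests to the left:
        System.out.println(fold(operands, new String[] { "+", "-" }));
        // prints <sub><add><a/><b/></add><c/></sub>
    }
}
```

Because each operator wraps the result so far, operators of equal precedence associate to the left without any extra bookkeeping.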

term: Number | "(" sum ")" | "+" term | "-" term;

%

return new GoalAdapter() {

public void shift (Token sender, Object value) {

result = new Element("literal", new String[] {

"type=Integer", "value="+value });

}

public void shift (Lit sender, Object value) {

if (value.equals("-")) result = new Element("minus");

}

public void shift (Goal sender, Object value) {

if (result == null) result = value;

else ((Element)result).add(value);

}

};

%

%%

}


oops/Xml.actor

term is a bit trickier. Number is represented as an Element named literal with attributes

describing the value. shift(Lit) creates an Element named minus only for - and

shift(Goal) either nests a term result into this Element or it remembers the sum or term

result. reduce() delivers result in any case.

The observer pattern leads to simple and safe code: the grammar makes sure that all calls
on the Goal happen in the correct order. Differentiating Lit, Token, and Goal as senders
removes the need for a very subtle switch.
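How overloading on the sender type can replace a switch is sketched below. This is an assumption about one possible mechanism, not a description of how oops dispatches internally; all types in DispatchSketch are hypothetical and only share their names with the oops classes.

```java
/** Hypothetical sketch of sender-based dispatch; these are not the oops classes. */
final class DispatchSketch {
    interface Goal {
        void shift(Lit sender, Object value);
        void shift(Token sender, Object value);
    }

    interface Symbol { void deliver(Goal goal, Object value); }

    // Each sender passes itself with its own static type, so the compiler
    // selects the matching overload of shift() and no switch is needed.
    static final class Lit implements Symbol {
        public void deliver(Goal goal, Object value) { goal.shift(this, value); }
    }

    static final class Token implements Symbol {
        public void deliver(Goal goal, Object value) { goal.shift(this, value); }
    }

    /** Records which overload was reached for each delivered value. */
    static final class Recorder implements Goal {
        final StringBuilder log = new StringBuilder();
        public void shift(Lit sender, Object value) {
            log.append("lit:").append(value).append(' ');
        }
        public void shift(Token sender, Object value) {
            log.append("token:").append(value).append(' ');
        }
    }

    /** Delivers the symbols of 3+4 and reports the overloads that were called. */
    static String demo() {
        Recorder goal = new Recorder();
        Symbol[] senders = { new Token(), new Lit(), new Token() };
        Object[] values = { "3", "+", "4" };
        for (int i = 0; i < senders.length; i++) senders[i].deliver(goal, values[i]);
        return goal.log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints token:3 lit:+ token:4
    }
}
```

Java resolves overloads at compile time, which is why the sender has to make the call itself; a Goal which received all senders through a common supertype would have to fall back on instanceof tests.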


$ java -classpath .:oops.jar:problems/3 oops.Compile -f Xml expr.ser Scanner

GoalMakerFactory is Xml

3 + 4 * 5

3 * ( 4 + 5 )

+ (1 + 22 - 33 * 4 / 5 % + 6)




































4.6 Bootstrap

The algorithms are encapsulated in the tree. If we manage to describe EBNF in EBNF and

build the tree representation for that grammar, we can use the tree to process sentences

conforming to EBNF, i.e., we can recognize grammars written in EBNF, check them, build and

serialize trees for them using oops.parser, and such a tree can recognize a sentence

conforming to the grammar that the tree represents.

parser: rule+;

rule: Id ":" alt ";";

alt: seq ( "|" seq )*;

seq: ( term )*;

term: item ( "?" | "+" | "*" )?;

item: Id | Lit | "(" alt ")";

The first tree has to be constructed manually:


oops/ebnf.ebnf

oops/Boot.java

/** statically craft and serialize parser tree for ebnf.ebnf.

*/

public class Boot {

public static void main (String args []) throws Exception {

class Lits extends Hashtable { // tracks new Lit(x)

Lit make (String lit) { put(lit, lit); return new Lit(lit); }

}

class Ids extends Hashtable { // tracks new Id(x)

Id make (String id) { put(id, id); return new Id(id); }

}

Lits lits = new Lits();

Ids ids = new Ids();

// parser: rule+;

Parser ebnfParser = new Parser(

new Rule(ids.make("parser"), new Some(ids.make("rule")))

);

// rule: Id ":" alt ";";

{ Seq s = new Seq(ids.make("Id"));

s.add(lits.make(":")); s.add(ids.make("alt")); s.add(lits.make(";"));

ebnfParser.add(

new Rule(ids.make("rule"), s)

);

}

// alt: seq ( "|" seq )*;

{ Seq s = new Seq(ids.make("seq"));

Seq t = new Seq(lits.make("|")); t.add(ids.make("seq"));

s.add(new Many(t));

ebnfParser.add(

new Rule(ids.make("alt"), s)

);

}

Literals and names are collected in two hash tables and represented as Lit and Id nodes.


There is a single Parser object which contains one Rule object for each rule.

A Rule object requires an Id with the nonterminal name and a node for the right hand side

which is built up from classes like Seq, Some, Many, Opt, or Alt which eventually contain Lit

and Id objects.

// seq: ( term )*;
ebnfParser.add(
new Rule(ids.make("seq"), new Many(ids.make("term")))
);
// term: item ( "?" | "+" | "*" )?;
{ Seq s = new Seq(ids.make("item"));
Alt a = new Alt(lits.make("?"));
a.add(lits.make("+")); a.add(lits.make("*"));
s.add(new Opt(a));
ebnfParser.add(
new Rule(ids.make("term"), s)
);
}
// item: Id | Lit | "(" alt ")";
{ Alt a = new Alt(ids.make("Id")); a.add(ids.make("Lit"));
Seq s = new Seq(lits.make("("));
s.add(ids.make("alt")); s.add(lits.make(")"));
a.add(s);
ebnfParser.add(
new Rule(ids.make("item"), a)
);
}
// finalize and check the parser
ebnfParser.setSets(lits.keys(), ids.keys());
// serialize the parser to stdout
ObjectOutputStream out = new ObjectOutputStream(System.out);
out.writeObject(ebnfParser); out.close();
}
}


oops/Boot.java

Once the Parser contains all Rule objects it is informed about the literals and names; at
this point it checks the LL(1) condition. It is then serialized.

$ java -classpath .:oops.jar Boot > boot.ser
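Serializing a tree and reading it back only requires that all node classes implement Serializable. The following sketch uses a hypothetical Node class in place of the oops.parser classes and a byte array in place of the file boot.ser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a serialization round trip; Node is hypothetical, not an oops.parser class. */
final class SerializedTreeSketch {
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        Node add(Node child) { children.add(child); return this; }
        @Override public String toString() {
            return children.isEmpty() ? label : label + children;
        }
    }

    /** Writes the tree to a byte array, much as Boot writes its Parser to stdout. */
    static byte[] write(Node root) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
        }
        return bytes.toByteArray();
    }

    /** Reads a tree back, as a later run would read a serialized parser. */
    static Node read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Node) in.readObject();
        }
    }

    /** Builds a tree for rule: Id ":" alt ";" and sends it through a round trip. */
    static String demo() {
        Node rule = new Node("rule").add(new Node("Id")).add(new Node(":"))
            .add(new Node("alt")).add(new Node(";"));
        try {
            return read(write(rule)).toString();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints rule[Id, :, alt, ;]
    }
}
```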


This first serialized parser can be applied to EBNF described in EBNF to produce the same or

even an extended parser in a more maintainable fashion. Ebnf is an Actor that builds a tree

like Xml but uses the classes from oops.parser:

oops/Ebnf.actor

parser: rule+;

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) {

if (result == null) result = new Parser((Rule)value);

else ((Parser)result).add((Rule)value);

}

public Object reduce () {

Parser parser = (Parser)result;

// this is where the tree builder needs to get to the scanner...

parser.setSets(scanner.lits.keys(), scanner.ids.keys());

return parser;

}

};

%

rule: Id ":" alt ";";

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) {

result = new Rule((Id)result, ((Node)value).node());

}

};

%

Rules are described by Rule objects and collected in a Parser. Once all rules are in,

setSets() needs an Enumeration of the literal strings and the names which were used in the

grammar.

This information is collected by the scanner which is a bit kludged as an inner class of Ebnf.

setSets() checks the LL(1) condition.

oops.Compile writes the last value returned by reduce() to standard output, i.e., the

serialized tree for the grammar for EBNF defined using EBNF.



The other Goal objects are just as simple:

alt: seq ( "|" seq )*;

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) {

Node seq = ((Node)value).node();

if (result == null) result = new Alt(seq);

else ((Alt)result).add(seq);

}

};

%

seq: ( term )*;

%

return new GoalAdapter() {

public void shift (Goal sender, Object value) { // term

Node term = ((Node)value).node();

if (result == null) result = new Seq(term);

else ((Seq)result).add(term);

}

};

%

term: item ( "?" | "+" | "*" )?;

%

return new GoalAdapter() {

public void shift (Lit sender, Object value) {

Node item = ((Node)result).node();

if (value.equals("?")) result = new Opt(item);

else if (value.equals("+")) result = new Some(item);

else if (value.equals("*")) result = new Many(item);

else throw new Error("term "+value); // cannot happen

}

};

%

item: Id | Lit | "(" alt ")";

%

return new GoalAdapter() {

public void shift (Lit sender, Object value) { // ignore ( )

}

};

%

One can check with cmp that this tree can in fact compile itself:

$ java -classpath .:oops.jar oops.Compile -f Ebnf boot.ser Ebnf.Scanner \

> ebnf.ebnf > ebnf.ser

GoalMakerFactory is Ebnf


oops/Ebnf.actor

$ java -classpath .:oops.jar oops.Compile -f Ebnf ebnf.ser Ebnf.Scanner ebnf.ebnf |

> cmp - ebnf.ser

GoalMakerFactory is Ebnf


4.7 Pictorial Review

A compiler (scanner/parser for recognition and observer for building something) takes a
sentence and produces some output. The parser is for the same grammar to which the
sentence conforms:

[Diagram: "sentence to grammar" is processed by "parser/observer for grammar".]

In particular, a Java compiler produces a class file for a Java class:

[Diagram: "class to Java" is processed by "parser/observer for Java", yielding "class file".]

The class file takes some input and produces some output:

[Diagram: "input" is processed by "class file", yielding "output".]

A compiler for arithmetic expressions can build a tree for an arithmetic expression from
classes in the Element Construction Set:

[Diagram: "3+4*5 to expr" is processed by "parser/observer for expr", yielding "3+4*5 from ecs".]

A compiler for EBNF can build a tree from oops.parser classes for a grammar conforming to
EBNF. One grammar expr.ebnf can describe arithmetic expressions:

[Diagram: "expr to ebnf" is processed by "parser/observer for ebnf", yielding "expr from oops".]

This would create expr.ser.

Another grammar ebnf.ebnf can describe EBNF grammars and conform to EBNF:

[Diagram: "ebnf to ebnf" is processed by "parser/observer for ebnf", yielding "ebnf from oops".]

This would create ebnf.ser.



A tree for a grammar built from oops.parser can check itself, recognize a sentence
conforming to the grammar represented by the tree, and talk to an observer about it.

If the tree/grammar expr.ser describes arithmetic expressions, the observer Xml.actor can
build a tree for an arithmetic expression from the Element Construction Set:

[Diagram: "3+4*5 to expr" is processed by the tree "expr from oops" with its "observer", yielding "3+4*5 from ecs".]

The tree built from ecs can be “executed” to produce XML:

[Diagram: with empty input, the tree "3+4*5 from ecs" yields "XML text".]

If the tree/grammar ebnf.ser describes EBNF grammars, the observer Ebnf.actor can build a
tree based on oops.parser classes for a grammar specified in EBNF. One grammar expr.ebnf
describes arithmetic expressions:

[Diagram: "expr to ebnf" is processed by the tree "ebnf from oops" with its "observer", yielding "expr from oops".]

Given expr.ebnf and ebnf.ser plus Ebnf.actor we can create expr.ser.

Another grammar ebnf.ebnf describes EBNF grammars:

[Diagram: "ebnf to ebnf" is processed by the tree "ebnf from oops" with its "observer", yielding "ebnf from oops".]

Given ebnf.ebnf and ebnf.ser plus Ebnf.actor we can create ebnf.ser.

This last scenario is helpful to maintain ebnf.ser by editing ebnf.ebnf and Ebnf.actor once we

have an initial version of ebnf.ser.

The initial version, however, has to be built differently. One approach is the program Boot.java

which manually builds a tree based on oops.parser classes for a grammar specified in EBNF

and describing EBNF grammars. The output boot.ser of running Boot is functionally equivalent

to ebnf.ser; the order of the objects in the file might be different. If ebnf.ser is used to build

itself, the output must be identical to ebnf.ser.



4.8 Conclusion

A context-free grammar can be described in EBNF, represented as a tree, and used to

recognize the sentences which the grammar produces. The grammar must be checked,

however, e.g., it should satisfy a condition such as LL(1). Checking can be done mechanically.

A grammar combined with a scanner results in a recognizer for sentences. The scanner can

often be implemented on the basis of a StreamTokenizer.

The Goal interface is designed to deal with a sentence, one symbol at a time. Different

observer patterns can be implemented to evaluate an expression or generate a tree for a

sentence.

In general, a GoalMakerFactory must be implemented. oops.helpers contains abstract

classes for three types of factories: Handler delivers one Goal for everything, Ruler delivers

one rule-specific Goal for every activation of the rule, and Actor delivers a new rule-specific

Goal for each activation.

A simple shell script oops permits specification of the grammar and the grammar-specific part

of the factory in the same file.

Usually only very small methods must be implemented which are called in a very controlled

context. It should be quite simple to generate an XML tree automatically.

The technology is powerful enough to be used for its own implementation; therefore, the
syntax for EBNF, i.e., for the specification of grammars, can easily be changed.

This chapter used a DTD-like syntax in place of the original oops syntax. oops.parser

contains other classes which support significant extensions of EBNF.

oops has been ported to C#.


5

Extensible Markup Language

This chapter details how to write well-formed and valid XML documents. Valid means that a

Document Type Definition (DTD) or an XML Schema is fulfilled; therefore, this chapter also

explains how to write these.

If lexical analysis is charged with recognizing elements, attributes, entities, and tags, then
well-formed (and sometimes parsed) means that lexical analysis succeeds, and valid means
that the syntax and (rudimentary) semantics described by the DTD or Schema are satisfied.

A good source for this information is the XML Handbook by Goldfarb and Prescod and the XML

Recommendation of the W3C. The syntax rules are quoted from the Recommendation.

Unfortunately, the Recommendation throws practically everything, beginning with the
preprocessor (entities) and ending with the semantics (DTD), together into a pseudo
grammar which is probably a legacy from SGML. One consequence is, for example, that one
has to require of some entity replacement that it fulfill certain rules of the grammar; this would
not be necessary if the preprocessor and the actual language had been separated.

5.1 Design Goals

The design goals for XML are:

1. XML shall be straightforwardly usable over the Internet.

2. XML shall support a wide variety of applications.

3. XML shall be compatible with SGML.

4. It shall be easy to write programs which process XML documents.

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

6. XML documents should be human-legible and reasonably clear.

7. The XML design should be prepared quickly.

8. The design of XML shall be formal and concise.

9. XML documents shall be easy to create.

10. Terseness in XML markup is of minimal importance.

[From the Recommendation].

There is hardly anything to be added, although machine processing is paramount and the
creation of documents is all but ignored: it ranks after the idea that XML should be
defined quickly rather than formally.

The goals also don’t specify what XML is intended for: a portable description of (tree-)

structured data.


5.2 Representation

The Recommendation refers to Unicode (precisely ISO/IEC 10646):

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]

| [#x10000-#x10FFFF]

A document can be represented with a different character set which is specified in the prolog
of the document. Upper and lower case are distinguished.
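Production [2] translates directly into a range check on code points. The class and method names in this sketch are made up for illustration.

```java
/** Sketch of production [2] Char from the XML 1.0 Recommendation. */
final class XmlCharSketch {
    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Checks every code point of a string; surrogate pairs count as one code point. */
    static boolean allXmlChars(String text) {
        return text.codePoints().allMatch(XmlCharSketch::isXmlChar);
    }

    public static void main(String[] args) {
        System.out.println(allXmlChars("tab\tand newline\n")); // prints true
        System.out.println(allXmlChars("form feed\f"));        // prints false
    }
}
```

Most control characters below #x20 are excluded, as are the surrogate blocks and the two code points #xFFFE and #xFFFF.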

Markup

A document entity

[1] document ::= prolog element Misc*

consists of markup, that is start tags, end tags, empty-element tags, entity references,

character references, comments, CDATA section delimiters, document type declarations and

processing instructions, and finally character data.

References constitute a preprocessor and are used, among other things, to represent special

characters; they start with &, the rest of the markup starts with < and must not contain


Strings

Markup itself only contains white space and NameChar as well as literal data, i.e., arbitrary

strings in single or double quotes:

[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"


5.3 Logical Structure

Prolog

The prolog of the document entity specifies if it uses external entities. The prolog also

connects to the DTD:

[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

[24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')

[25] Eq ::= S? '=' S?

[26] VersionNum ::= ([a-zA-Z0-9_.:] | '-')+

[27] Misc ::= Comment | PI | S

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )

[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

[32] SDDecl ::= S 'standalone' Eq

(("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))

VersionNum currently is 1.0. The possible EncName are predefined; the default is UTF-8.
standalone='yes' promises that the document does not use external markup declarations
which really affect the document; this is intended for processing optimization.
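What a parser has read from the XML declaration can be inspected through the DOM. This is a minimal sketch using the JAXP parser bundled with Java; the class name XmlDeclSketch and the sample document are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Sketch: read back what the XML declaration in the prolog said. */
final class XmlDeclSketch {
    /** Parses a document held in a string; wraps checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<?xml version='1.0' encoding='UTF-8' standalone='yes'?><doc/>");
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());
    }
}
```

If the declaration is omitted altogether, getXmlVersion() still reports 1.0 and getXmlStandalone() reports false.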

The Document Type Declaration is in the prolog and contains or references markup

declarations which constitute the Document Type Definition as a grammar for a document.

Elements

A document contains one element which can contain further, nested elements.

[39] element ::= EmptyElemTag | STag content ETag
[40] STag ::= '<' Name (S Attribute)* S? '>'
[41] Attribute ::= Name Eq AttValue
[42] ETag ::= '</' Name S? '>'
[43] content ::= CharData? ((element | Reference | CDSect | PI
| Comment) CharData?)*
[44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
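Whether a text satisfies these productions, i.e., whether it is well-formed, can be checked by handing it to a non-validating parser. The sketch below uses the JAXP parser bundled with Java; the class name WellFormedSketch is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: a document is well-formed if a non-validating parser accepts it. */
final class WellFormedSketch {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing a message.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // some well-formedness rule is violated
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<doc><item/>text</doc>")); // prints true
        System.out.println(isWellFormed("<doc><item></doc>"));      // prints false
        System.out.println(isWellFormed("<a/><b/>"));               // prints false
    }
}
```

The second document fails because the start tag for item has no matching end tag, the third because production [1] permits only one element at the top level.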


5.4 Physical Structure

A document is stored in one or more files, the so-called entities.

The document entity and the external part of the DTD (if there is one) have no names, all other

entities are called by their names.

Parsed entities have a replacement text which consists of their content and which is returned

as part of processing.

There are general entities which are called inside the content and parameter entities which

can only be called in the DTD.

[69] PEReference ::= '%' Name ';'

Unparsed entities only have a name and a notation. They can only be called in certain

attributes.

Parsed Entities

External parsed entities may start with a TextDecl which must not use parsed entities:

[78] extParsedEnt ::= TextDecl? content

[77] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'

Interestingly this time the version can be omitted but the encoding must be there.

Replacement

Unfortunately there are complicated rules specifying how entities are used and replaced:

                              Parameter       Intern          Extern          Unparsed     Character
                              %name;          &name;          &name;          name         &#123;

reference in content          not recognized  replaced,       validating?     error        replaced
                                              processed
reference in attribute value  not recognized  replaced        error           error        replaced
name as attribute value       not recognized  error           error           validating?  not recognized
reference in entity value     replaced        not recognized  not recognized  error        replaced
reference in DTD              replaced,       error           error           error        error, depending
                              plus space                                                   on context

Not recognized means that the entity is not recognized as such and remains unchanged.

Only a validating processor is required to obtain external entities; other processors may or may
not consider them.

Looking at this table it is pretty clear why XML-based systems usually define their own text

replacement and inclusion mechanisms...
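The simplest cases can be observed directly. This is a minimal sketch using the JAXP parser bundled with Java; the entity who and the document are made up. An internal general entity is replaced in content and in an attribute value, and a character reference is replaced by its character.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch: internal general entities and character references are replaced. */
final class EntitySketch {
    // The entity who is declared in the internal DTD and used twice.
    static final String SAMPLE =
        "<!DOCTYPE doc [ <!ENTITY who \"world\"> ]>"
      + "<doc greeting=\"hello &who;\">&who; &#123;</doc>";

    /** Parses a document held in a string and returns its root element. */
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        Element doc = root(SAMPLE);
        System.out.println(doc.getAttribute("greeting")); // prints hello world
        System.out.println(doc.getTextContent());         // prints world {
    }
}
```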



5.5 Document Type Definition

An internal Document Type Definition (DTD), if present, is near the end of the prolog in the

document entity.

[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S?
('[' (markupdecl | DeclSep)* ']' S?)? '>'

[75] ExternalID ::= 'SYSTEM' S SystemLiteral

| 'PUBLIC' S PubidLiteral S SystemLiteral

[11] SystemLiteral ::= ('"' [^"]* '"') |("'" [^']* "'")

[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"

[13] PubidChar ::= #x20 | #xD | #xA |[a-zA-Z0-9] |[-'()+,./:=?;!*#@$_%]

The syntax for markup assures that the document entity can only contain parameter entity

references just like white space between markup:

[28a] DeclSep ::= PEReference | S

[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl

| PI | Comment

A SystemLiteral is a URI which can be relative, provoking strange results in some
processors. URIs employ character replacement %xx with hexadecimal digits for UTF-8 bytes.

If a SystemLiteral is used in DOCTYPE it has to reference an extSubset.

If a PEReference, i.e., %name;, is used as DeclSep its replacement text also must be an
extSubset:

[30] extSubset ::= TextDecl? extSubsetDecl

[31] extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*

Although the sequence may be different, the internal DTD is considered to logically precede

the external DTD. conditionalSect may only be used in the external DTD and PEReference

can then be used inside markupdecl as well.

Preprocessor

There is a rudimentary mechanism to exclude parts of a document:

[61] conditionalSect ::= includeSect | ignoreSect

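[62] includeSect ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
[63] ignoreSect ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
[64] ignoreSectContents ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)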

IGNORE can contain an INCLUDE area and vice versa. Such a section can only be completely

contained in the content of a parameter entity or not at all.

IGNORE and INCLUDE are typically controlled using parameter entities:




]]>


]]>



Elements

An elementdecl defines the element type for an element Name (generic identifier), i.e., it

defines the content for the element. A Name may only be declared once.

[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'

[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children

[47] children ::= (choice | seq) ('?' | '*' | '+')?

[48] cp ::= (Name | choice | seq) ('?' | '*' | '+')?

[49] choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')'

[50] seq ::= '(' S? cp ( S? ',' S? cp )* S? ')'

[51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'

| '(' S? '#PCDATA' S? ')'

EMPTY means that the element must be empty (it can but does not have to be represented
using EmptyElemTag). ANY means that the element can contain arbitrary declared elements
as well as character data.

Mixed permits text (parsed character data) combined with elements. Unfortunately, in a DTD

the number or sequence of the elements cannot be controlled in this case.



A variant of EBNF is used to describe nested elements. Sequences employ commas.

Alternatives and sequences must be enclosed in parentheses.
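Element declarations can be tried out with a validating parser. The following sketch uses the JAXP parser bundled with Java; the DTD for list and item is made up for illustration. Validity errors are reported to the ErrorHandler and do not abort parsing, so the sketch has to record them.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: validate a document against the DTD in its own prolog. */
final class DtdValidationSketch {
    // Made-up internal DTD: a list contains one or more items with text.
    static final String DTD =
        "<!DOCTYPE list [\n"
      + "  <!ELEMENT list (item+)>\n"
      + "  <!ELEMENT item (#PCDATA)>\n"
      + "]>\n";

    static boolean isValid(String xml) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the DTD, not just well-formedness
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<list><item>a</item><item>b</item></list>")); // true
        System.out.println(isValid(DTD + "<list/>"));                // false: item+ needs an item
        System.out.println(isValid(DTD + "<list><other/></list>"));  // false: other is not declared
    }
}
```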


Ambiguity is possible (but should not be allowed by a validating processor):




Attribute

An AttlistDecl declares the possible attributes which may, however, be specified in any

order. The possible values can be restricted to some degree and there can be defaults. If there

is more than one declaration for an attribute, the first one takes precedence:

[52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'

[53] AttDef ::= S Name S AttType S DefaultDecl

[54] AttType ::= StringType | TokenizedType| EnumeratedType

[55] StringType ::= 'CDATA'

[56] TokenizedType ::= 'ID' | 'IDREF' | 'IDREFS' | 'ENTITY' | 'ENTITIES'

| 'NMTOKEN' | 'NMTOKENS'

[57] EnumeratedType ::= NotationType | Enumeration

[58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')'

[59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')'

An attribute value can be a string, one or more unique or defined names, or a name from a list.

ID means that the attribute must have a value which is unique in the document. An element

can only have a single attribute of type ID.

IDREF and IDREFS are types for attributes which reference attributes with type ID, i.e., they are

cross-references within a document.

ENTITY and ENTITIES are types which reference unparsed entities which must be specified

as notations.

NMTOKEN and NMTOKENS are arbitrary lists of names with a slightly more permissive definition:

[7] Nmtoken ::= (NameChar)+

[8] Nmtokens ::= Nmtoken (S Nmtoken)*

An EnumeratedType defines the same types but restricts the possible names.

Default

Every attribute must have a default provision:

[60] DefaultDecl ::= '#REQUIRED' |'#IMPLIED' | (('#FIXED' S)? AttValue)

REQUIRED means that the attribute must be specified. IMPLIED means that the attribute can be

omitted. An AttValue is a default value that is used if an attribute was not specified explicitly;

FIXED means that only a certain value may be used.
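A default declared in the internal DTD is supplied even by a non-validating parser. This sketch uses the JAXP parser bundled with Java; the element note and its attributes kind and lang are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch: attribute defaults from the DTD show up in the parsed document. */
final class AttributeDefaultSketch {
    // Made-up DTD: kind has a default value, lang may be omitted.
    static final String DTD =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note EMPTY>\n"
      + "  <!ATTLIST note kind CDATA \"plain\"\n"
      + "                 lang CDATA #IMPLIED>\n"
      + "]>\n";

    /** Parses a document held in a string and returns its root element. */
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(root(DTD + "<note/>").getAttribute("kind"));               // plain
        System.out.println(root(DTD + "<note kind='urgent'/>").getAttribute("kind")); // urgent
        System.out.println(root(DTD + "<note/>").hasAttribute("lang"));               // false
    }
}
```

An application therefore cannot tell from the value alone whether kind was written in the document or supplied by the DTD; an IMPLIED attribute which is omitted is simply absent.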



Entities

ENTITY is used to specify names for general entities and for parameter entities.

[70] EntityDecl ::= GEDecl | PEDecl
[71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
[72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'

[73] EntityDef ::= EntityValue | (ExternalID NDataDecl?)

[74] PEDef ::= EntityValue | ExternalID

[9] EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"'

| "'" ([^%&'] | PEReference | Reference)* "'"

[76] NDataDecl ::= S 'NDATA' S Name

General entities are called with &name; and only outside of the DTD. This can also be used to

reference unparsed entities.

Parameter entities are called with %name; and only inside the DTD. They can reference

external entities which, however, must fit.

Notations

A NotationDecl connects a name to an unparsed entity. The name can then be used as a

value following NDATA in an entity definition; the name of the latter can then be used in ENTITY

or ENTITIES attributes, i.e., as a reference across documents or files.

[82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'

[83] PublicID ::= 'PUBLIC' S PubidLiteral



5.6 Namespaces

The XSLT examples indicate that there can be conflicts once elements from different

documents are mixed: an XSLT program contains elements that should simply be output;

things will get tricky once an XSLT program has to create an XSLT program.

Namespaces were introduced to make elements and attributes unique by linking them to

URIs.

The reserved attribute xmlns connects a prefix to a URI; without a prefix a URI is defined as
default namespace which is used whenever there is no prefix:





The prefix is visible in the element in which it is defined and in all nested elements. The default
namespace is used similarly.

The default namespace is empty if xmlns="" is specified. Elements without a prefix then
belong to no namespace.

The default namespace does not apply to attributes; by default attributes are in no
namespace.

A name consists of a prefix and a local name, separated by exactly one colon. Names are

equal if their local parts (following the colon if any) are equal and if, if present, the prefixes link

to URIs with the same sequence of characters (the prefixes may be different).
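These rules can be checked with a namespace-aware parser. The sketch below uses the JAXP parser bundled with Java; the document and the example.org URIs are made up. Two elements with different prefixes have equal names because both prefixes link to the same URI, and an attribute without a prefix is in no namespace.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Sketch: names are compared by namespace URI and local name, not by prefix. */
final class NamespaceSketch {
    // Made-up document: prefixes a and b are linked to the same URI.
    static final String SAMPLE =
        "<doc xmlns='http://example.org/default'"
      + " xmlns:a='http://example.org/ns' xmlns:b='http://example.org/ns'>"
      + "<a:item id='1'/><b:item/></doc>";

    /** Parses with namespace processing turned on and returns the root element. */
    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    /** Two names are equal if URI and local part agree; prefixes do not matter. */
    static boolean sameName(Node x, Node y) {
        return Objects.equals(x.getNamespaceURI(), y.getNamespaceURI())
            && Objects.equals(x.getLocalName(), y.getLocalName());
    }

    /** Returns the namespace URI of an attribute, or null if it is in no namespace. */
    static String attributeNamespace(Element element, String name) {
        return element.getAttributeNode(name).getNamespaceURI();
    }

    public static void main(String[] args) {
        Element doc = root(SAMPLE);
        Element first = (Element) doc.getFirstChild();              // a:item
        System.out.println(doc.getNamespaceURI());                  // http://example.org/default
        System.out.println(sameName(first, doc.getLastChild()));    // true
        System.out.println(attributeNamespace(first, "id"));        // null
    }
}
```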






DTDs are not aware of namespaces. To validate this example all four attributes have to be

declared:








The example turns out to be valid, but it is incorrect if namespaces are considered. Things
get really tricky if a namespace is introduced in an external, defaulted attribute.



5.7 XML Schema

The XML Schema Recommendation was passed by W3C in May 2001. It is supposed to

replace the DTD as a description for XML based documents. The Primer is relatively easy to

read as an introduction.

The Language Technology Group in Edinburgh provides a Schema-based XML-Validator xsv

implemented in Python and available for Windows and as a Web-Service:

c> xsv po.xml po.xsd



A Schema is specified in XML within the namespace http://www.w3.org/2001/XMLSchema.

XML Schema contains a large, extensible system of types that can be used for elements and

attributes. Nested elements can be associated with a count to indicate how often they may or

must appear. An attribute mixed controls if elements additionally may contain text, i.e., the

mixed content model can now be controlled much better.

Unfortunately an XML Schema is incredibly verbose and very hard to specify.

Types can be extended by restricting their ranges, by pattern matching, and by forming

aggregates but one cannot even use constant expressions.

There is an attribute xsi:schemaLocation to connect a namespace and a Schema address,

but this is only considered to be a hint.

XML Schema is, of course, specified in XML but it adds its own include and import
mechanisms and in addition to the XML comments and PIs there are annotation and
appinfo elements.
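Restricting the range of a type can be tried out with the validation API bundled with Java instead of xsv. This is only a sketch: the schema declares a single made-up element quantity as a positive integer below 100, and the class name SchemaSketch is made up as well.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Sketch: validate small documents against a Schema with a restricted type. */
final class SchemaSketch {
    // Made-up Schema: quantity is a positive integer less than 100.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity'>"
      + "<xs:simpleType>"
      + "<xs:restriction base='xs:positiveInteger'>"
      + "<xs:maxExclusive value='100'/>"
      + "</xs:restriction>"
      + "</xs:simpleType>"
      + "</xs:element>"
      + "</xs:schema>";

    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the Schema itself is broken", e);
        }
        try {
            // Without an ErrorHandler the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<quantity>42</quantity>"));  // prints true
        System.out.println(isValid("<quantity>100</quantity>")); // prints false
        System.out.println(isValid("<quantity>abc</quantity>")); // prints false
    }
}
```

Even this single restricted type takes nine lines of markup, which illustrates the verbosity mentioned above.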



5.8 Examples

Homepage        index.xml          index.dtd
Code            code/index.xml     code/index.dtd
FTP area        ftp/index.xml      ftp/index.dtd
Papers          rec/index.xml      rec/index.dtd
Small Examples  code/xml/makefile
XML Schema      code/xsd/po.xml    code/xsd/po.xsd


6

Sources

XML can be written by hand, edited with specific editors, created by text systems, or

generated from a database query. This chapter introduces a few, perhaps typical products in

alphabetic order.

The chapter was originally written in the summer of 2001; there should be more products and

further development by now. Comments on current status are in brackets.

6.1 Editors

Amaio xeddy

Commercial product, Java based, uses DTD, edits text and property sheet view; it is based on

Xerces and has a problem with a relative URL for the DTD. [Active]

AnnArbor epic / arbortext

Commercial product, full license quite expensive, edits different views. [Active]


Edinburgh Language Technology Group xed

Freely available, tested under Windows, edits text with tags; saves prolog in read-only window,

should be using XML Schema soon. [Halted]

IBM alphaWorks Xeena

Freely available for testing, Java based, edits tree with attributes; interesting tree operations,

can validate and use XSL. [Halted?]



Merlot

Open source, Java based, edits tree with property sheets; looks like it is going to be very

cluttered, seems to have trouble with UTF-8 and the tree view on MacOS X. [Gone?]

Microsoft XML Notepad

Freely available for Windows, compact, edits tree with attributes which are aligned next to the

tree nodes; white space seems to explode during repeated saves. [Halted]



Morphon

Commercial product, Java based, uses DTD and CSS, supports XML Schema, edits tree and

text view. [Active]

SoftQuad XMetal

Commercial product, quite expensive, edits different views, among them formatted text linked

by selection with a tree. [Active]



XMLSpy

Commercial product, edits nested structures and text with tags; elegant but tends to run out of

screen space for more deeply nested structures. [Active]



6.2 Text Systems

Adobe FrameMaker

FrameMaker can save text documents as HTML documents and XML based documents and

has tables for detailed control over the mapping of paragraph and character styles to

elements. [Active]

HTML Mapping Table

FrameMaker Source Item  XML Item                                       Include  Comments
                        Element                        New Web Page?  Auto#
P:Body                  P                              N              N
P:Heading1              H*                             N              N
P:Indented              P (Parent = UL, Depth = 0)     N              N
P:Numbered              LI (Parent = OL, Depth = 0)    N              N
C:Emphasis              EM                             N              N

XML Mapping Table

FrameMaker Source Item  XML Item                                                  Include  Comments
                        Element                                   New Web Page?  Auto#
P:Body                  Body                                      N              N
P:Heading1              Heading1                                  N              N
P:Indented              Indented                                  N              N
P:Numbered              Numbered (Parent = NumberedList,          N              N
                        Depth = 0)
C:Emphasis              Emphasis                                  N              N

For HTML the style name is set as CLASS attribute, for XML the style name becomes the

element name. The style is then mapped to CSS, differently for HTML and XML.

78


FrameMaker+SGML can import XML; a DTD is mapped to objects that control the appearance

in FrameMaker+SGML.

I am using FrameMaker and HTML to produce these notes, but I have to post-process
navigation and image presentation.

Adobe FrameMaker and Quadralay WebWorks Publisher

Publisher reads FrameMaker documents in Maker Interchange Format (MIF) and maps them

to HTML or XML with CSS or XSL based on -- theoretically editable -- templates. Styles are

mapped to predefined appearances: [Active]

Publisher does not work well under MacOS X; I did not find a template to match my objectives;

but the company wants to sell the design of templates as a service.

Microsoft Word

Word can edit HTML, but it uses a set of styles that only make HTML formatting features

available. [Active]

I was unable to control the mapping to HTML and, therefore, prefer FrameMaker. At the time I

did not find a way to create XML.

79


6.3 Oracle XML-SQL-Utility

XSU uses JDBC and returns the result of an SQL query as a simple XML based document. The

utility is now packaged as part of the Oracle database systems.

Here is a simple command db to send SQL queries from standard input to the database:

xsu/db.java

import java.io.BufferedReader;

import java.io.InputStreamReader;

import java.sql.Connection;

import java.sql.Driver;

import java.sql.DriverManager;

import java.sql.SQLException;

import java.sql.Statement;

/** trivial sql client.

*/

public class db {

/** utility to connect to a database; requires property driver with the

JDBC driver's class name and property db with a url to access the

database.

*/

  public static Connection connection () throws ClassNotFoundException,
      IllegalAccessException, InstantiationException, SQLException {
    String driver = System.getProperty("driver");
    if (driver == null)
      throw new IllegalArgumentException("no driver property");
    String db = System.getProperty("db");
    if (db == null)
      throw new IllegalArgumentException("no db property");
    // register driver
    DriverManager.registerDriver((Driver)Class.forName(driver).newInstance());
    // connect to database
    return DriverManager.getConnection(db);
  }

  /** execute standard input lines as SQL statements; no results...
    */
  public static void main (String args []) {
    try {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      // connect and create statement
      Connection connection = db.connection();
      Statement statement = connection.createStatement();
      while ((line = in.readLine()) != null)
        statement.execute(line); // result??
      // disconnect
      connection.close();
    } catch (Exception e) { e.printStackTrace(); }
  }
}

connection() provides a database connection. A Statement object executes strings.

80


get uses db.connection() to get to a database, takes a SQL query from the command line,

sends it to the database, and hands the result to XSU:

xsu/get.java

import java.sql.Connection;

import java.sql.Statement;

import oracle.xml.sql.OracleXMLSQLException;

import oracle.xml.sql.query.OracleXMLQuery;

/** run SQL query and convert result to XML.

*/

public class get {

/** output result of a command line query to stdout.

*/

public static void main (String args []) throws Exception {

if (args == null || args.length == 0) {

System.err.println("usage: java [-Ddriver=class] [-Ddb=url] get SELECT...");

System.exit(1);

}

try {

// register driver and connect to database

Connection connection = db.connection();

// create SQL statement

Statement statement = connection.createStatement();

// concatenate query

StringBuffer query = new StringBuffer(args[0]);

for (int a = 1; a < args.length; ++ a) query.append(' ').append(args[a]);

// execute query and create xml source

OracleXMLQuery xml = new OracleXMLQuery(connection, query.toString());

// print out the result

System.out.println(xml.getXMLString());

// disconnect

connection.close();

} catch (OracleXMLSQLException e) {

System.err.println("errcode "+e.getErrorCode()+

", parent "+e.getParentException()+

", xml "+e.getXMLErrorString()+

", sql "+e.getXMLSQLErrorString());

throw e;

}

}

}

These programs require a JDBC driver such as mm.mysql.

81


Execution

If mysql is installed, one can use db or a mysql client to create a table and look at it:

c> mysqld-opt

c> mysql -u guest -p

Enter password: guest

mysql> use test

mysql> create table company (

coid integer not null,

name varchar (254),

addr varchar (254),

primary key (coid)) \g

mysql> insert into company values (90, 'CISCO', 'CA'); ... \g

mysql> select * from company \g

+------+------------------+------+

| coid | name | addr |

+------+------------------+------+

| 10 | IBM | NY |

| 20 | CITIBANK | NY |

| 30 | American Express | NY |

| 40 | GM | MI |

| 50 | Microsoft | WA |

| 60 | IBM IGS | FL |

| 70 | HP | CA |

| 80 | Intel | CA |

| 90 | CISCO | CA |

| 100 | Sun Microsystem | CA |

+------+------------------+------+

10 rows in set (0.00 sec)

get can be used to query the table and return the result as an XML-based document:

$ java -Ddb="jdbc:mysql://venus:3306/test?user=guest&password=guest" \

> -Ddriver="org.gjt.mm.mysql.Driver" \

> -classpath .:xsu12.jar:xmlparserv2.jar:mm.mysql-2.0.4-bin.jar:classes12.zip \

> get 'select * from company where name = "CISCO"'




<ROWSET>
   <ROW num="1">
      <coid>90</coid>
      <name>CISCO</name>
      <addr>CA</addr>
   </ROW>
</ROWSET>



The XML-document is very flat; some attributes can be suppressed or renamed.
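The flat shape can be illustrated without a database. The following is only a sketch: FlatRows and toXml() are hypothetical names and not part of XSU, and ROWSET is assumed as the enclosing element; ROW and its num attribute are the ones discussed in the text.

```java
import java.util.Arrays;
import java.util.List;

/** sketch: rows of a query result as a flat XML-based document. */
public class FlatRows {
  /** escape the characters that are special in XML text. */
  static String escape (String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  /** one ROW element per row, one child element per column. */
  public static String toXml (String[] columns, List<String[]> rows) {
    StringBuilder b = new StringBuilder("<ROWSET>\n");
    int num = 0;
    for (String[] row : rows) {
      b.append(" <ROW num=\"").append(++ num).append("\">\n");
      for (int c = 0; c < columns.length; ++ c)
        b.append("  <").append(columns[c]).append('>')
         .append(escape(row[c]))
         .append("</").append(columns[c]).append(">\n");
      b.append(" </ROW>\n");
    }
    return b.append("</ROWSET>\n").toString();
  }

  public static void main (String args []) {
    String[] columns = { "coid", "name", "addr" };
    List<String[]> rows = Arrays.asList(new String[][] { { "90", "CISCO", "CA" } });
    System.out.print(toXml(columns, rows));
  }
}
```

Every column turns into an element directly below ROW; there is no nesting beyond that, which is why the result is called flat.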

XSU should also be able to load a suitable XML-based document into a database by way of
JDBC. However, at least the mysql JDBC driver by Mark Matthews does not understand the
generated SQL syntax savepoint SYS_XSU_hope_0001000.

If an XML-based document is created from the database and loaded back in, the num attribute
of ROW would turn into a new attribute; download and upload cannot be directly combined.

82


6.4 IBM alphaWorks XML Lightweight Extractor

XLE uses a DTD with source annotation (DTDSA). dtdsaCreator provides a GUI for editing.

The idea is that a DTD uniquely produces an XML-based document if, for every iteration with ?,
+, and *, for every selection with |, and for PCDATA and CDATA, it is known how many and
which values will be inserted.

A DTDSA extends the DTD with value and binding specifications at the relevant positions. It

helps to load the DTD into dtdsaCreator, edit the necessary specifications, and store the result

as a DTDSA.



83

xle/hello.dtd

hello.dtdsa

Strictly speaking the DTD is not correct because as an external DTD it cannot be enclosed by

DOCTYPE. XLE, however, requires DOCTYPE to find the root element, i.e., XLE considers the

DTDSA more or less as a document that just consists of the prolog.

In this case there is no database access; therefore XLE can immediately be executed:

$ cd code/xle; alias run=`make run`

$ run XLE hello.dtdsa


Hello World


A value specification follows after a colon and must produce the value for a selection with |

and for PCDATA and CDATA. The following constructs can be used:

string      "Hello World"               string value.

parameter   in0                         command line argument following the DTDSA.

field       row.column                  value from the database: row is a variable set by
                                        a binding specification, column is an attribute name.

function    field(table, column, row)   value from the database: table selects the table,
                                        column is the attribute name, row is a variable set
                                        by a binding specification.


A binding specification follows after two colons and connects many variable names with lists of

rows from the database which are then processed for iterations with ?, +, and *. Names are

scoped dynamically, i.e., they are available within the activated DTD elements until they are

concealed by an inner assignment.

The lists can be described using functions or SQL statements:

row(table)             SQL("SELECT * FROM table")
row(table, , )         SQL("SELECT * FROM table WHERE column=value, ...")
unique_row(...)        SQL("SELECT DISTINCT ...")
pjrow(table, )         SQL("SELECT col, ... FROM table")
pjrow(table, , , )     SQL("SELECT col, ... FROM table WHERE column=value, ...")
unique_pjrow(...)      SQL("SELECT DISTINCT ...")

value stands for any form of a value specification; with a SQL statement $ must precede a

variable name and a string may only appear in single quotes. The SQL statements are more

powerful than the functions.

The tables can be obtained from several databases and one can write new classes with

static functions to extend the value specifications.

dtdsaCreator is a syntax-driven graphical editor for DTDSA files which works under various

JDK versions with the possible exception of 1.3.0:

A DTD or DTDSA is selected and loaded into the left window. Buttons appear where bindings

are possible; required bindings are framed. One of the windows at right appears to edit the

bindings; the syntax is checked immediately. The DTDSA source can be viewed and saved.

84


Examples

One example is to produce an extract from the company table shown before:

$ run XLE company.dtdsa MI









GM

MI

xle/company.dtdsa

))>





]>

in0 is the first command line argument following the DTDSA. This way, entries with identical
addr attribute can be collected in individual XML-based documents. The same query can be
specified using SQL; note the single quotes when using in0 as a string:

xle/company.sql.dtdsa





]>

The database is accessed using JDBC and the following configuration file:

org.gjt.mm.mysql.Driver

jdbc:mysql://murphy.cs.rit.edu:3306/test?user=guest&password=guest

dontcare

dontcare

dontcare

85

xle/access.cfg


XLE is a proof of concept. Information from several tables can be collated very well into an
almost arbitrary document type, as is demonstrated by IBM’s own examples, but the
information should really just come from the database. It is quite difficult to generate a report
with external texts:







, ))>


]>

$ run XLE html.dtdsa NY

The result in a web browser is something like

The following is/are located in NY

* IBM

* CITIBANK

* American Express

Creating the title requires quite an effort and is a very specific solution.

86

xle/html.dtdsa


6.5 Persistent Java Beans

JDK 1.4 has added a persistence mechanism specifically for Java Beans based on XML. A
bean (mostly characterized by get and set methods for its visible state and by the
existence of a parameterless constructor) can be given to XMLEncoder:

beans/Client.java

import java.beans.ExceptionListener;

import java.beans.Statement;

import java.beans.XMLEncoder;

import java.awt.Button;

/** simple one-way remote method invocation using an XMLEncoder.

*/

public class Client {

public static void main (String args []) {

XMLEncoder enc = new XMLEncoder(System.out);

// intended use: encode a bean, maintain graph

Button b = new Button("hello");

enc.writeObject(b); enc.writeObject(b);

// possible abuse: send statements to a server

Object server = new Object();

enc.setOwner(server);

// needed because Statement is called locally, too

enc.setExceptionListener(new ExceptionListener() {

public void exceptionThrown (Exception e) { }

});

enc.writeStatement(new Statement(server, "show", new Object[]{ b, b }));

enc.close();

}

}

This produces





hello



hello



button0











87


Basically, a bean is mapped as a number of method calls to be executed in order to get it into
its current state. The significant property of the algorithm is that object graphs are preserved
by means of ID and IDREF attributes.
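That the graph survives can be checked with a round trip. This is a minimal sketch, not part of the original examples: BeanGraph and its nested bean Label are hypothetical, and a list holding the same bean twice stands in for a larger object graph.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** sketch: XMLEncoder keeps shared references shared. */
public class BeanGraph {
  /** hypothetical bean: parameterless constructor plus get/set pair. */
  public static class Label {
    private String text;
    public Label () { }
    public String getText () { return text; }
    public void setText (String text) { this.text = text; }
  }

  /** encode a list which holds the same bean twice. */
  public static String encode () {
    Label label = new Label();
    label.setText("hello");
    List<Object> list = new ArrayList<Object>();
    list.add(label); list.add(label);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    XMLEncoder enc = new XMLEncoder(out);
    enc.writeObject(list);
    enc.close();
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  /** decode the list again; classes are found through our class loader. */
  public static List<?> decode (String xml) {
    XMLDecoder dec = new XMLDecoder(
        new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
        null, null, BeanGraph.class.getClassLoader());
    Object result = dec.readObject();
    dec.close();
    return (List<?>)result;
  }

  public static void main (String args []) {
    String xml = encode();
    System.out.print(xml);
    List<?> list = decode(xml);
    System.out.println("same object: "+(list.get(0) == list.get(1)));
  }
}
```

The second list entry is written as a reference to the first, so decoding yields one Label object that is reachable twice rather than two equal copies.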

XMLEncoder also has provisions for transmitting method calls, with visions of remote method
invocation. However, as this server shows, the invocation in XMLDecoder by means of
reflection is a very poor design choice:

import java.awt.Button;

import java.beans.XMLDecoder;

/** simple one-way remote method execution using an XMLDecoder.

BUG: both methods fit the call equally well...

*/

public class Server {

public static void main (String args []) {

XMLDecoder dec = new XMLDecoder(System.in, new Server());

dec.close();

}

public void show (Button a, Object b) {

System.out.println("button "+a);

System.out.println(b);

}

public void show (Object a, Button b) {

System.out.println(a);

System.out.println("button "+b);

}

}

The example executes

$ java Client | java Server

button java.awt.Button[button0,0,0,0x0,invalid,label=hello]

java.awt.Button[button0,0,0,0x0,invalid,label=hello]

when in fact it should recognize the ambiguity.

88

beans/Server.java


89

7

Simple API for XML

The Simple API for XML (SAX), now in version 2 with namespace support, serves to recognize
an XML-based document in a Java program. An implementation of SAX is part of version 1.4 of
the Java Development Kit.

SAX defines an XMLReaderFactory which produces an XMLReader. After one or more observers
are registered there, one can use parse() to process an input, which results in callbacks to the
observers. SAX employs only something like the Handler, page 43. Some parsers
(XMLReader) can be configured by setting various boolean features; additionally, Object
properties can be set or viewed.
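The register-then-parse pattern can be sketched in a few lines. CountElements is a hypothetical example, and it obtains the XMLReader through the JAXP factory of the JDK rather than through XMLReaderFactory, so that no driver property is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** sketch: register one observer, parse, count the callbacks. */
public class CountElements {
  /** @return number of start tags seen in the document. */
  public static int count (String xml) {
    final int[] n = { 0 };
    try {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      // the observer: DefaultHandler ignores everything we do not override
      reader.setContentHandler(new DefaultHandler() {
        public void startElement (String uri, String localName, String qName,
            Attributes atts) {
          ++ n[0];
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return n[0];
  }

  public static void main (String args []) {
    System.out.println(count("<a><b/><b/></a>"));
  }
}
```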

7.1 Main Program

Main uses an XML parser such as Crimson or Xerces-J to process a file:

sax/Main.java

package sax;

import java.io.IOException;

import java.util.StringTokenizer;

import org.xml.sax.ContentHandler;

import org.xml.sax.DTDHandler;

import org.xml.sax.ErrorHandler;

import org.xml.sax.InputSource;

import org.xml.sax.SAXException;

import org.xml.sax.SAXParseException;

import org.xml.sax.XMLReader;

import org.xml.sax.ext.DeclHandler;

import org.xml.sax.ext.LexicalHandler;

import org.xml.sax.helpers.XMLReaderFactory;

/** processes an XML file with a ContentHandler.

Patterned after McLaughlin's SAXParserDemo.

*/

public class Main {

/** commandline.

*/

public static void main (String args []) {

if (args == null || args.length > 1) {

System.err.println("usage: java -Dorg.xml.sax.driver=classname "+

"sax.Main [file.xml]");

System.err.println(" -Dcontent=classname");

System.err.println(" -Ddtd=classname");

System.err.println(" -Derror=classname");

System.err.println(" -Dfalse='feature ...'");

System.err.println(" -Dlex=classname");

System.err.println(" -Dtrue='feature ...'");

System.err.println(" -Dverbose=true");

System.err.println(" -Dxml=classname");

System.exit(1);

}

InputSource is = args.length == 0

? new InputSource(System.in)

: new InputSource(args[0]);

if (!parse(is)) System.exit(1);

}


sax.Main sends standard input or a URL as a command line argument, i.e. a local file name

or a resource on the net, to parse().

sax.Main is controlled through properties. At least the name of the parser class has to be

defined. Additionally, class names for various SAX observers can be specified and some SAX

features can be set to true or false.

parse() creates the observers, obtains an XMLReader (parser) from the factory and configures
it, and finally applies the parser to the stream or URL:

/** create parser and parse one url.

@return true if parse is successful, else print message.

*/

  public static boolean parse (InputSource is) {
    ContentHandler contentHandler = (ContentHandler)handler("content");
    ErrorHandler errorHandler = (ErrorHandler)handler("error");
    // assumed: the slide omits these declarations; dtd, lex, and xml are
    // named in the usage message, the property name decl is a guess
    DeclHandler declHandler = (DeclHandler)handler("decl");
    DTDHandler dtdHandler = (DTDHandler)handler("dtd");
    LexicalHandler lexicalHandler = (LexicalHandler)handler("lex");
    String xml = System.getProperty("xml");
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      if (xml != null)
        try {
          // one object which takes a parser and serves as several observers
          Object h = Class.forName(xml)
              .getConstructor(new Class[]{ XMLReader.class })
              .newInstance(new Object[]{ parser });
          parser.setContentHandler((ContentHandler)h);
          parser.setDTDHandler((DTDHandler)h);
          parser.setProperty("http://xml.org/sax/properties/declaration-handler",
              h);
          parser.setProperty("http://xml.org/sax/properties/lexical-handler", h);
        } catch (Exception e) { System.err.println(xml+": "+e); }
      if (contentHandler != null)
        parser.setContentHandler(contentHandler);
      if (declHandler != null)
        parser.setProperty("http://xml.org/sax/properties/declaration-handler",
            declHandler);
      if (dtdHandler != null)
        parser.setDTDHandler(dtdHandler);
      if (errorHandler != null)
        parser.setErrorHandler(errorHandler);
      if (lexicalHandler != null)
        parser.setProperty("http://xml.org/sax/properties/lexical-handler",
            lexicalHandler);
      setFeature(parser, "false", false);
      setFeature(parser, "true", true);
      parser.parse(is);
      return true;
    } catch (IOException e) {
      System.err.println(is.getSystemId()+": "+e);
    } catch (SAXParseException e) {
      Errors.report("fatal syntax error", is.getSystemId(), e);
    } catch (SAXException e) {
      System.err.println(is.getSystemId()+": "+e);
    }
    return false;
  }

90


The following class method creates an observer if the class name is given as a property:

/** convenience method to create a handler from a property.

@return null if property was not specified.

*/

  public static Object handler (String propertyName) {
    String className = System.getProperty(propertyName);
    if (className == null || className.length() == 0) return null;
    Exception e;
    try {
      return Class.forName(className).newInstance();
    } catch (ClassNotFoundException _e) { e = _e; }
      catch (InstantiationException _e) { e = _e; }
      catch (IllegalAccessException _e) { e = _e; }
    System.err.println(propertyName+"="+className+": "+e); System.exit(1);
    return null; // not reached
  }

91

sax/Main.java

A SAX feature is a key/value pair that can be passed to an XMLReader with setFeature(). The
key is a URI such as http://xml.org/sax/features/validation, the value is boolean.

The following class method extracts one or more URIs from a property and sets or clears the

corresponding features:

  /** convenience method to set features from a property.
    */
  public static void setFeature (XMLReader parser, String propertyName,
      boolean how) {
    String features = System.getProperty(propertyName);
    if (features != null) {
      // the property contains one or more URIs separated by white space
      StringTokenizer st = new StringTokenizer(features);
      while (st.hasMoreTokens()) {
        String feature = st.nextToken();
        try {
          parser.setFeature(feature, how);
        } catch (SAXException e) {
          System.err.println(feature+": "+e);
        }
      }
    }
  }
}
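Whether a parser accepts a feature can be probed directly. The following sketch uses the parser bundled with the JDK; FeatureProbe and its methods are hypothetical helpers, and http://example.org/no-such-feature is a made-up URI that no parser is expected to know.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** sketch: set and query SAX features without failing on unknown ones. */
public class FeatureProbe {
  /** obtain whatever parser JAXP finds. */
  public static XMLReader reader () {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** @return true if the parser accepted the setting. */
  public static boolean set (XMLReader parser, String feature, boolean how) {
    try {
      parser.setFeature(feature, how);
      return true;
    } catch (SAXException e) {   // not recognized or not supported
      return false;
    }
  }

  /** @return current setting, null if the parser does not know the feature. */
  public static Boolean get (XMLReader parser, String feature) {
    try {
      return Boolean.valueOf(parser.getFeature(feature));
    } catch (SAXException e) {
      return null;
    }
  }

  public static void main (String args []) {
    XMLReader parser = reader();
    String namespaces = "http://xml.org/sax/features/namespaces";
    System.out.println(set(parser, namespaces, true)+" "+get(parser, namespaces));
    System.out.println(set(parser, "http://example.org/no-such-feature", true));
  }
}
```

Both SAXNotRecognizedException and SAXNotSupportedException are subclasses of SAXException, so one catch clause covers an unknown URI as well as a setting that the parser refuses.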

Main works even without an observer, but by default only fatal errors (mostly lexical problems)
are reported, i.e., even validation errors will go largely unnoticed:

$ java -classpath ..:../etc/xml.apache.org/xerces-1_3_0/xerces.jar \

-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser sax.Main



$ alias run='`make run`'

$ run -Dtrue=http://xml.org/sax/features/validation sax.Main



7.2 ErrorHandler

SAX distinguishes three kinds of problems: warning() (hardly ever), error() (most
validation problems), and fatalError() (mostly lexical errors and syntax errors in the DTD):

sax/Errors.java

package sax;

import org.xml.sax.ErrorHandler;

import org.xml.sax.SAXParseException;

/** gripes about any SAX problem.

Patterned after McLaughlin's SAXParserDemo.

*/

public class Errors implements ErrorHandler {

/** convenience method to report a SaxParseException.

*/

public static void report (String prefix, String url, SAXParseException e) {

String s;

if ((s = e.getPublicId()) == null

&& (s = e.getSystemId()) == null)

s = url;

System.err.println(prefix+": "+

s+"("+e.getLineNumber()+

":"+e.getColumnNumber()+

") "+e.getMessage());

}

/** reports a warning but does not throw an exception.

*/

public void warning (SAXParseException e) {

report("warning", "", e);

}

/** reports an error but parsing can continue.

*/

public void error (SAXParseException e) {

report("syntax error", "", e);

}

/** terminates with a fatal syntax error.

*/

public void fatalError (SAXParseException e) throws SAXParseException {

throw e;

}

}

Any error() should definitely be reported (unfortunately this is not the default), and only
after a fatalError() should parsing not be continued.

An XML-based document can be validated by registering a suitable ErrorHandler and setting
the appropriate feature:

$ alias run='`make run`'

$ run -Derror=sax.Errors -Dtrue=http://xml.org/sax/features/validation sax.Main



syntax error: (4:20) Attribute "bad" must be declared for element type "root".
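The same check can be made from within a program. This is a sketch under the assumption that the parser bundled with the JDK is used; ValidationErrors is a hypothetical class which collects the messages instead of printing them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** sketch: validate a document and collect all reported problems. */
public class ValidationErrors {
  public static List<String> validate (String xml) {
    final List<String> problems = new ArrayList<String>();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setValidating(true);           // turns on the validation feature
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setErrorHandler(new ErrorHandler() {
        public void warning (SAXParseException e) {
          problems.add("warning: "+e.getMessage());
        }
        public void error (SAXParseException e) {  // parsing continues
          problems.add("error: "+e.getMessage());
        }
        public void fatalError (SAXParseException e) throws SAXParseException {
          throw e;                                 // parsing ends
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (SAXParseException e) {
      problems.add("fatal: "+e.getMessage());
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return problems;
  }

  public static void main (String args []) {
    String dtd = "<!DOCTYPE root [<!ELEMENT root EMPTY>]>";
    System.out.println(validate(dtd+"<root bad='x'/>"));
  }
}
```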

92


7.3 ContentHandler

SAX defines a number of methods which are called if the parser recognizes certain parts of an
XML-based document. If all methods are defined, a document can be copied:

package sax;

import org.xml.sax.Attributes;

import org.xml.sax.ContentHandler;

import org.xml.sax.Locator;

import org.xml.sax.SAXException;

/** shows all tags.

-Dignore=true to skip ignorable whitespace.

-Dverbose=true for tracing.

<?sax.Dup locator?> to dump locator, if any.

Patterned after McLaughlin's SAXParserDemo.

*/

public class Dup implements ContentHandler {

/** document position of callback -- only valid within this scope.

*/

protected Locator locator;

public void setDocumentLocator (Locator locator) {

if (Boolean.getBoolean("verbose"))

System.err.println(toString(locator));

this.locator = locator;

}

/** convenience Method to dump a locator.

*/

public static String toString (Locator locator) {

if (locator == null) return "null";

else {

StringBuffer result = new StringBuffer();

String s = locator.getPublicId();

if (s == null) s = locator.getSystemId();

if (s != null) result.append(s);

result.append('(').append(locator.getLineNumber());

result.append(':').append(locator.getColumnNumber());

result.append(')');

return result.toString();

}

}

93

sax/Dup.java

locator is sent from the parser to the observer for a new input source and is modified as

recognition progresses.

Depending on the parser, locator might have different values:

$ java -classpath ..:xerces.jar \

> -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \

> -Dcontent=sax.Dup -Dverbose=true sax.Main



(1:1)

(1:1) start document

Null Entity(-1:-1) end document

Xerces produces a strange position indication at the end of standard input.


Crimson does a bit better:

$ java -classpath ..:../etc/xml.apache.org/crimson-1.1/crimson.jar \

> -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl \

> -Dcontent=sax.Dup -Dverbose=true sax.Main



(1:-1)

(1:-1) start document

(3:-1) end document

Different methods (or pairs of methods) are called back for different kinds of markup:

public void startDocument () throws SAXException {

if (Boolean.getBoolean("verbose"))

System.err.println(toString(locator)+" start document");

}

public void endDocument () throws SAXException {

if (Boolean.getBoolean("verbose"))

System.err.println(toString(locator)+" end document");

}

94

sax/Dup.java

  public void processingInstruction (String target, String data)
      throws SAXException {
    if ("sax.Dup".equals(target) && "locator".equals(data))
      System.err.println(toString(locator));
    else
      System.out.print("<?"+target+" "+data+"?>");
  }

  public void startPrefixMapping (String prefix, String uri) throws SAXException {
    if (Boolean.getBoolean("verbose"))
      System.err.println(toString(locator)+" start xmlns:"+prefix+"=\""+uri+"\"");
  }

  public void endPrefixMapping (String prefix) {
    if (Boolean.getBoolean("verbose"))
      System.err.println(toString(locator)+" end xmlns:"+prefix);
  }

  public void startElement (String namespaceURI, String localName,
      String rawName, Attributes atts) throws SAXException {
    // assumed: the tag is rebuilt from the raw name and the attribute list
    StringBuffer s = new StringBuffer("<").append(rawName);
    for (int a = 0; a < atts.getLength(); ++ a)
      s.append(' ').append(atts.getQName(a))
       .append("=\"").append(atts.getValue(a)).append('"');
    s.append('>');
    System.out.print(s);
  }

  public void endElement (String namespaceURI, String localName, String rawName)
      throws SAXException {
    System.out.print("</"+rawName+">");
  }

  // the third argument of the next two callbacks is a length, not an end index
  public void characters (char[] ch, int start, int length) throws SAXException {
    System.out.print(new String(ch, start, length));
  }

  public void ignorableWhitespace (char[] ch, int start, int length)
      throws SAXException {
    if (!Boolean.getBoolean("ignore"))
      System.out.print(new String(ch, start, length));
  }

  /** should not happen (external entities)...
    */
  public void skippedEntity (String name) throws SAXException {
    System.err.println(toString(locator)+": "+name+" skipped");
  }
}

The mixed content model results in many characters() messages between elements:

$ java -classpath ..:../etc/xml.apache.org/xerces-1_3_0/xerces.jar \

> -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \

> -Dcontent=sax.Dup sax.Main


pre

post


pre

post

What is considered to be whitespace depends on whether a validating parser is used or not;
it does not depend, however, on whether or not validation is turned on. A validating parser is
supposed to report whitespace in element content with ignorableWhitespace(), which here can
be removed under control of -Dignore=true; a non-validating parser should instead use
characters().

Unfortunately, different parsers produce different results:

$ java -classpath ..:../etc/xml.apache.org/crimson-1.1/crimson.jar \

> -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl \

> -Dcontent=sax.Dup -Dignore=true sax.Main


]>




$ java -classpath ..:../etc/xml.apache.org/xerces-1_3_0/xerces.jar \

> -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser -Dcontent=sax.Dup \

> -Dignore=true -Dtrue=http://xml.org/sax/features/validation sax.Main


]>





Xerces seems to make a mistake. The general advice is not to rely on whitespace in an
XML-based document.
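Which callback a particular parser uses can be measured. In this sketch the hypothetical class Whitespace counts the characters delivered through each of the two methods; the split between them depends on the parser, only the total is fixed by the document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** sketch: where does whitespace between elements get reported? */
public class Whitespace {
  /** @return { chars seen by characters(), chars seen by ignorableWhitespace() }. */
  public static int[] count (String xml) {
    final int[] n = { 0, 0 };
    try {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(new DefaultHandler() {
        public void characters (char[] ch, int start, int length) {
          n[0] += length;
        }
        public void ignorableWhitespace (char[] ch, int start, int length) {
          n[1] += length;
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return n;
  }

  public static void main (String args []) {
    // element content only: all four characters inside root are whitespace
    int[] n = count("<!DOCTYPE root [<!ELEMENT root (a)><!ELEMENT a EMPTY>]>\n"
        +"<root>\n  <a/>\n</root>");
    System.out.println("characters "+n[0]+", ignorable "+n[1]);
  }
}
```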

The observer sees a PI through processingInstruction().

95


A non-validating parser would be allowed to report an external entity with skippedEntity(). In
reality Xerces produces an error, and Crimson disallows an unparsed entity and inserts a
parsed entity; in addition there are errors in processing whitespace:

$ java -classpath ..:../etc/xml.apache.org/crimson-1.1/crimson.jar \

> -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl \

> -Dcontent=sax.Dup sax.Main


]>


&entity;

Welcome to Darwin!


LexicalHandler

Some parsers allow registration of a LexicalHandler as an Object property. This kind of

observer gets access to comments, CDATA, some entities, and the position of a DTD:

sax/LDup.java

package sax;

import org.xml.sax.SAXException;

import org.xml.sax.ext.LexicalHandler;

/** interface at the lexical level.

*/

public class LDup implements LexicalHandler {

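  public void comment (char[] ch, int start, int length) throws SAXException {
    System.out.print("<!--"+new String(ch, start, length)+"-->");
  }

  /** content reported through regular events.
    */
  public void startDTD (String name, String publicId, String systemId)
      throws SAXException {
    // assumed: the declaration is rebuilt from the three arguments
    StringBuffer b = new StringBuffer("<!DOCTYPE ").append(name);
    if (publicId != null) b.append(" PUBLIC \"").append(publicId).append('"');
    else if (systemId != null) b.append(" SYSTEM");
    if (systemId != null) b.append(" \"").append(systemId).append('"');
    System.out.println(b.append(" ["));
  }

  public void endDTD () throws SAXException {
    System.out.println("]>");
  }

  /** content reported through regular events.
      Calls in attribute values and declarations cannot be reported.
      % prepended to parameter entity name.
      [dtd] is name of external DTD subset.
    */
  public void startEntity (String name) throws SAXException {
    if (Boolean.getBoolean("verbose"))
      System.err.println("start entity "+name);
  }

  public void endEntity (String name) throws SAXException {
    if (Boolean.getBoolean("verbose"))
      System.err.println("end entity "+name);
  }

  /** content reported through characters().
    */
  public void startCDATA () throws SAXException {
    System.out.print("<![CDATA[");
  }

  public void endCDATA () throws SAXException {
    System.out.print("]]>");
  }
}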

$ java -classpath ..:../etc/xml.apache.org/crimson-1.1/crimson.jar \

> -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl \

> -Dlex=sax.LDup -Dcontent=sax.Dup -Dverbose=true sax.Main


]>


&entity;

(1:-1)

(1:-1) start document


start entity entity

Welcome to Darwin!

end entity entity

(8:-1) end document
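Registration as an Object property can be sketched with the parser bundled with the JDK, which accepts a LexicalHandler. Comments is a hypothetical class; DefaultHandler2 from org.xml.sax.ext supplies empty implementations of the remaining methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** sketch: collect comments through a LexicalHandler property. */
public class Comments {
  public static List<String> comments (String xml) {
    final List<String> result = new ArrayList<String>();
    try {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      // a LexicalHandler is not set with a method but as a property
      reader.setProperty("http://xml.org/sax/properties/lexical-handler",
          new DefaultHandler2() {
            public void comment (char[] ch, int start, int length) {
              result.add(new String(ch, start, length));
            }
          });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return result;
  }

  public static void main (String args []) {
    System.out.println(comments("<a><!-- hi --><b/><!--there--></a>"));
  }
}
```

A ContentHandler alone never sees the comments; without the property they are silently dropped.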

97

sax/LDup.java


7.4 DTDHandler and DeclHandler

A DTDHandler sees unparsed entity declarations in the DTD. Additionally, it might be possible

to register a DeclHandler as an Object property with the parser. This handler sees the rest

of a DTD:

package sax;

import org.xml.sax.DTDHandler;

import org.xml.sax.SAXException;

import org.xml.sax.ext.DeclHandler;

/** interfaces to DTD information.

*/

public class DDup implements DTDHandler, DeclHandler {

// DTDHandler

98

sax/DDup.java

public void notationDecl (String name, String publicId, String systemId)

throws SAXException {

StringBuffer b = new StringBuffer("');

System.out.println(b); // \n does not get reported.

}


DeclHandler

}

99

sax/DDup.java

public void elementDecl (String name, String model) throws SAXException {

System.out.println("
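A DeclHandler is registered in the same way as a LexicalHandler. The following sketch is not the DDup class above; Declarations is a hypothetical class which only records the names of the declared elements, using the parser bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** sketch: list element declarations through a DeclHandler property. */
public class Declarations {
  public static List<String> elements (String xml) {
    final List<String> names = new ArrayList<String>();
    try {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setProperty("http://xml.org/sax/properties/declaration-handler",
          new DefaultHandler2() {
            // model is the content model as text; its layout is up to the parser
            public void elementDecl (String name, String model) {
              names.add(name);
            }
          });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return names;
  }

  public static void main (String args []) {
    System.out.println(elements(
        "<!DOCTYPE root [<!ELEMENT root (a)><!ELEMENT a EMPTY>]><root><a/></root>"));
  }
}
```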


7.5 Arithmetic Expressions

Arithmetic expressions can be represented in XML, see Arithmetic Expressions in XML, page

51. The following ContentHandler sends literals and operators to a Cpu object and displays

the result:

sax/Eval.java

/** connects expression based on expr.dtd with a cpu, prints result.

Throws all errors as exceptions.

*/

public class Eval extends DefaultHandler {

/** evaluates an expression.

Add methods like void add () throws SAXException; for operator tags.

*/

public interface Cpu {

void reset();

Object push (Object value);

Object pop () throws EmptyStackException;

}

}

public void startDocument () throws SAXException {

cpu.reset();

}

public void endDocument () throws SAXException {

System.out.println(cpu.pop());

}

public void startElement (String namespaceURI, String localName,

String rawName, Attributes atts) throws SAXException {

if ("literal".equals(localName)) {

String type = atts.getValue("type");

if (type.indexOf('.')


A trivial Cpu just remembers the last literal. A PI defines the class of the Cpu object where

the current value stack gets copied:

101

sax/Eval.java

/** by default: trivial cpu remembers last value.

*/

protected Cpu cpu = new Cpu() {

public void reset() { }

public Object push (Object value) {

this.value = new Object[] { value }; return value;

}

private Object[] value;

public Object pop () {

if (value == null) throw new EmptyStackException();

Object result = value[0]; value = null; return result;

}

};

/** replaces cpu.

*/

public void processingInstruction (String target, String data)

throws SAXException {

if ("sax.Eval".equals(target)) {

try {

cpu = (Cpu)Class.forName(data)

.getConstructor(new Class[] { Cpu.class })

.newInstance(new Object[] { cpu });

} catch (Exception e) {

throw new SAXException ("error in creating cpu", e);

}

}

}

The following class implements integer arithmetic:

public class IntCpu extends Stack implements Cpu {

public IntCpu () { }

public void reset () { removeAllElements(); }

public void add () {

Number right = (Number)pop();

push(new Integer(((Number)pop()).intValue() + right.intValue()));

}

// ...

For floating point arithmetic the operators have to be replaced:

public class FloatCpu extends IntCpu {

public FloatCpu (Cpu cpu) { super(cpu); }

public FloatCpu () { }

public void add () {

Number right = (Number)pop();

push(new Float(((Number)pop()).floatValue() + right.floatValue()));

}

sax/IntCpu.java

sax/FloatCpu.java


Each constructor imports the current stack:

/** retrieve stack from incoming cpu.

*/

public IntCpu (Cpu cpu) {

if (cpu != null) {

Stack kcats = new Stack();

try {

for (;;) kcats.addElement(cpu.pop());

} catch (EmptyStackException e) { }

try {

for (;;) addElement(kcats.pop());

} catch (EmptyStackException e) { }

}

}
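Two loops are needed because popping reverses the order of the values. A minimal sketch of the same idea with plain java.util.Stack objects, where StackImport is a hypothetical stand-in for the constructor above:

```java
import java.util.EmptyStackException;
import java.util.Stack;

/** sketch: move all values to a new stack and keep their order. */
public class StackImport {
  public static Stack<Object> importFrom (Stack<Object> source) {
    Stack<Object> reversed = new Stack<Object>();
    try {
      for (;;) reversed.push(source.pop());   // top of source ends up at the bottom
    } catch (EmptyStackException e) { }       // source is empty now
    Stack<Object> result = new Stack<Object>();
    try {
      for (;;) result.push(reversed.pop());   // second reversal restores the order
    } catch (EmptyStackException e) { }
    return result;
  }

  public static void main (String args []) {
    Stack<Object> source = new Stack<Object>();
    source.push("bottom"); source.push("top");
    System.out.println(importFrom(source));
  }
}
```

The exception serves as the loop exit because the Cpu interface offers pop() but no test for an empty stack.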

This way the type of arithmetic can be changed using PIs directed to sax.Eval:






























102

sax/IntCpu.java

sax/expr.xml


7.6 FrameMaker

FrameMaker can create XML. This text contains references to source code and extracts from
the sources. sax.Code is a ContentHandler to extract both; however, this only works with
Xerces, Crimson produces garbage:

$ java -classpath ..:../etc/xml.apache.org/xerces-1_3_0/xerces.jar \

> -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \

> -Dcontent=sax.Code sax.Main sax.fm.xml

file: "sax/Main.java"

body: "package sax;

import java.io.IOException;

import java.util.StringTokenizer;

import org.xml.sax.ContentHandler;

import org.xml.sax.DTDHandler;

One could write a SAX based program to check links in a FrameMaker document. However,

considerable effort is required to deal correctly with white space.

7.7 JAXP

JAXP is a factory model provided by Sun and integrated in JDK version 1.4 to access SAX

parsers, DOM document builders, and XSLT transformers in a transparent fashion. A simple

application was shown in Server-side XML with XSLT, page 14.
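The factory model looks the same for the three kinds of tools. This sketch is not the application referred to above; JaxpTour is a hypothetical class which uses a DOM document builder and an XSLT transformer without a style sheet, i.e., an identity transformation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

/** sketch: JAXP factories hide which implementation is used. */
public class JaxpTour {
  /** parse into a DOM tree and return the name of the root element. */
  public static String rootName (String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement().getTagName();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** copy a document with a transformer that has no style sheet. */
  public static String copy (String xml) {
    try {
      Transformer identity = TransformerFactory.newInstance().newTransformer();
      identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      identity.transform(new StreamSource(new StringReader(xml)),
          new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main (String args []) {
    System.out.println(rootName("<a><b/></a>"));
    System.out.println(copy("<a><b/></a>"));
  }
}
```

No class name of a parser or transformer appears in the program; the factories locate an implementation at run time.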

103


104


105

8

Document Object Model

Depending on the available memory, smaller XML-based documents can also be represented
completely in memory as a tree. This requires a class hierarchy for the tree nodes.

8.1 XOML

An ArrayList suffices as a primitive tree node; it stores the nested texts as String and the
nested tree nodes in sequence. Additionally, each such element contains a HashMap which
maps attribute names to attribute values, both as String.

With a ContentHandler like Tree a SAX parser can represent a document as a tree.
It is useful if the tree can be represented as XML again. At most, dealing with white space is
problematic.
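A tree node and a tree-building ContentHandler along these lines might look as follows. This is only a sketch under assumptions: XNode and TreeSketch are hypothetical names and do not reproduce xoml.Tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** hypothetical tree node: the list holds nested texts (String) and nested XNodes in order. */
class XNode extends ArrayList<Object> {
    final String name;
    final Map<String, String> attributes = new HashMap<>();
    XNode(String name) { this.name = name; }
}

/** hypothetical ContentHandler which builds a tree of XNodes. */
class TreeSketch extends DefaultHandler {
    private final ArrayList<XNode> open = new ArrayList<>(); // currently open elements
    XNode root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        XNode node = new XNode(qName);
        for (int i = 0; i < atts.getLength(); ++i)
            node.attributes.put(atts.getQName(i), atts.getValue(i));
        if (open.isEmpty()) root = node;
        else open.get(open.size() - 1).add(node);
        open.add(node);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        XNode top = open.get(open.size() - 1);
        String text = new String(ch, start, length);
        int last = top.size() - 1;
        // a parser may deliver one text in several chunks: merge them
        if (last >= 0 && top.get(last) instanceof String) top.set(last, top.get(last) + text);
        else top.add(text);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.remove(open.size() - 1);
    }

    static XNode parse(String xml) {
        TreeSketch handler = new TreeSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.root;
    }

    public static void main(String[] args) {
        XNode root = parse("<person name='Ann'>works for <boss/></person>");
        System.out.println(root.name + " " + root.attributes + " " + root.size());
    }
}
```

White space between elements arrives through characters() like any other text, which is why it ends up in the tree unless it is filtered.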

$ java -classpath ..:xerces.jar:oops.jar \

> -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \

> -Dcontent=xoml.Tree -Derror=sax.Errors -Dverbose=true sax.Main



It is tempting to implement a simple language xoml with an interpreter that can manipulate
such trees. Xoml is an actor which, based on xoml.ebnf, translates a program into
instructions for an interpreter VM.

Based on loops, a program like chef.xoml can, for example, assign employees to their bosses
in a document personen.xml. Hash tables, which are accessed syntactically as arrays, can be
used, for example, to count how often certain values of an attribute occur.

A program with recursive functions like traverse.xoml can traverse and display an unknown
document.

If assignments to tree nodes and attributes are permitted, a program like html.xoml can also
create a new document.

The implementation is naive, but by making targeted use of the Java Foundation Classes it is
already very powerful. xoml could be extended, for example, with regular expressions, more
string functions, syntax for validating documents, and direct access to new Java classes.


8.2 jaxb

Sun is working on a Java API for XML Binding (JAXB). The idea is to generate
document-specific classes from a document schema (currently only a DTD); these classes can
read, validate, and output a document.

$ xjc expr.dtd -roots expr

./Add.java

./Div.java

./Expr.java

./Literal.java

./Minus.java

./Mul.java

./Sub.java

From the DTD for arithmetic expressions considered earlier, for example, classes are generated
which can represent arithmetic expressions. Because several complex elements can be nested
everywhere, every class uses a List as node content.

With a simple main program an expression can be read as an XML document (which, however,
must not contain PIs) or created by constructing objects.

A document can be validated and output, entirely or in part, by a simple method call:

$ java -classpath .:jaxb-rt-1.0-ea.jar Main






xjc takes into account a binding schema, formulated in XML, which influences the generated
classes. For example, one can require most classes to store the binary nested nodes as
explicit, individual properties left and right rather than as a List, and consequently to
validate correctly. The root Expr, however, despite all attempts, incorrectly uses a List for
validation. The binding schema is also unable to link two attributes of a Literal so that
together (or individually) they influence the generated class. However, for an element one can
define that the properties are represented by special Java classes.

Finally, the generated classes can be subclassed and extended with one's own methods such as
eval(). If a mapping from the original to the derived classes is registered with the so-called
Dispatcher, a document that is read is represented using the new classes.
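The idea of deriving classes with eval() can be sketched without JAXB. The classes Literal, Add, and Mul below are hand-written stand-ins with assumed properties; they are not actual xjc output, and the Dispatcher registration is not shown.

```java
/** hypothetical interface for evaluation. */
interface Eval {
    double eval();
}

// stand-ins for generated classes (assumed shapes, not actual xjc output)
class Literal { String value; }
class Add { Object left, right; }
class Mul { Object left, right; }

// derived classes add eval() to the stand-ins
class EvalLiteral extends Literal implements Eval {
    EvalLiteral(String value) { this.value = value; }
    public double eval() { return Double.parseDouble(value); }
}

class EvalAdd extends Add implements Eval {
    EvalAdd(Eval left, Eval right) { this.left = left; this.right = right; }
    public double eval() { return ((Eval) left).eval() + ((Eval) right).eval(); }
}

class EvalMul extends Mul implements Eval {
    EvalMul(Eval left, Eval right) { this.left = left; this.right = right; }
    public double eval() { return ((Eval) left).eval() * ((Eval) right).eval(); }
}

class EvalDemo {
    /** builds 1 + 2 * 3 by constructing objects and evaluates it. */
    static double sample() {
        return new EvalAdd(new EvalLiteral("1"),
                           new EvalMul(new EvalLiteral("2"), new EvalLiteral("3"))).eval();
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

The casts to Eval only succeed if every node of the tree is an instance of a derived class, which is what the mapping in the Dispatcher is meant to ensure for documents that are read.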

JAXB is an interesting approach. A simple document such as a personnel list can be mapped
with a simple binding schema so that a Java program can be coded in a very problem-specific
way.

Unfortunately, the system is by no means mature; once XML Schema is supported, the
mappings to Java types should also become much simpler. At present a binding schema
defines many things twice that are already in the DTD, and reproduces in XML some syntax
that could be expressed much more elegantly in Java. Without Java programming the system
cannot be used anyway.



8.3 DOM

The Document Object Model (DOM) of the W3C specifies a class hierarchy which is to be used
to represent XML-based documents.

The specification is language-independent because it was written using the Interface
Definition Language (IDL) of the Common Object Request Broker Architecture (CORBA).
CORBA then defines language mappings from IDL to very many different programming
languages, and in principle a DOM tree can be distributed via CORBA. In reality there are
certain catches, see Stefan Rauch's diploma thesis.

The Java specification does not correspond to the mapping that an IDL compiler would
generate from the specification: there, an interface like Document also contains CORBA
capabilities; only an interface like DocumentOperations refers exclusively to the attributes
given in the IDL specification.

An XML parser is normally delivered with a DOM implementation. JAXP also implements a
factory pattern for parsers which deliver DOM trees. However, the DOM implementations for
two different parsers only have to satisfy the same declarative IDL specification; they are not
the same classes. Sometimes one can interoperate nevertheless. Stefan Rauch has
demonstrated that the implementations differ significantly in functionality and efficiency.

DOM has a somewhat odd hierarchy in which, in particular, the placement of some methods is
surprising:

DOMImplementation   access to special methods (in place of constructors)
NamedNodeMap        modifiable, result of getAttributes() in Node
Node                base interface, defines a superset of methods
Attr                describes an attribute, does not count as a node
CharacterData       methods to access text in a Node
Comment             describes a comment
Text
CDATASection        describes a CDATA section
Document            root of a document, above the outermost element
DocumentFragment    container for unstructured nodes
DocumentType        describes (rudimentarily) the DTD
Element             describes an element
Entity              can
NodeList            modifiable, result of getChildNodes() in Node

Documents can be edited, but they do not necessarily remain valid (there is no transaction
concept). If nodes are fetched into a NodeList and moved from there, for example via a
DocumentFragment, into a new document, they are removed from the original document.
However, there is cloneNode(boolean deep).
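The difference between moving and copying can be tried with the standard org.w3c.dom interfaces. The following is a small sketch; the class name DomEditDemo and the sample document are assumptions. It uses importNode() from DOM Level 2 to copy into another document, because many implementations refuse to insert a node that belongs to a different document.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomEditDemo {
    /** returns child counts: dogs after cloning, dogs after moving, cats, root of the copy. */
    static int[] demo() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(
                "<pets><dogs><dog/></dogs><cats/></pets>")));
            Node dogs = doc.getElementsByTagName("dogs").item(0);
            Node cats = doc.getElementsByTagName("cats").item(0);
            Node dog = dogs.getFirstChild();

            cats.appendChild(dog.cloneNode(true));   // deep copy: the original stays in place
            int afterClone = dogs.getChildNodes().getLength();

            cats.appendChild(dog);                   // same node: it is removed from dogs
            int afterMove = dogs.getChildNodes().getLength();

            Document other = builder.newDocument();  // importNode copies into another document
            other.appendChild(other.importNode(doc.getDocumentElement(), true));

            return new int[] { afterClone, afterMove, cats.getChildNodes().getLength(),
                               other.getDocumentElement().getChildNodes().getLength() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```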

A DOM implementation itself need be neither Serializable nor able to output itself as XML.
However, an org.apache.xml.serialize.XMLSerializer can output even a foreign DOM
implementation as XML, see dom.Main.
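As an alternative to the Xerces-specific serializer, a DOM tree can also be written with the identity transformation of JAXP. This swaps in a different technique than dom.Main; the class name DomOutputDemo is an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomOutputDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** writes any DOM node as XML; a transformer without stylesheet copies its input. */
    static String toXml(Node node) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(parse("<a x='1'><b>text</b></a>")));
    }
}
```

DOMSource only relies on the org.w3c.dom interfaces, so the tree need not come from the same vendor as the transformer.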



DOM Level 2 defines, in particular, Range to manipulate a region in a document, and
DocumentTraversal to traverse subtrees iteratively or as a tree, see dom.Chef.
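An iterative traversal with DocumentTraversal might look as follows. The sketch assumes that the DOM implementation found by JAXP supports the optional traversal module (it checks with instanceof); the class name TraversalDemo and the sample document are made up and unrelated to dom.Chef.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class TraversalDemo {
    /** returns the names of all elements in document order. */
    static List<String> elementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            if (!(doc instanceof DocumentTraversal))
                throw new UnsupportedOperationException("no DOM Level 2 traversal");
            NodeIterator iterator = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
            List<String> names = new ArrayList<>();
            for (Node node; (node = iterator.nextNode()) != null; )
                names.add(node.getNodeName());
            return names;
        } catch (UnsupportedOperationException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
            "<pets><dogs><dog/><dog/></dogs><cats><cat/></cats></pets>"));
    }
}
```

A TreeWalker, created with createTreeWalker() and the same arguments, would instead offer parentNode(), firstChild(), and nextSibling() to move through the filtered view as a tree.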



8.4 JDOM

McLaughlin's JDOM is a class library, tailored specifically to Java, for processing XML-based
documents. JDOM relies on the Java Collection Classes (as of JDK 1.2) and avoids the
additional overhead which DOM incurs through its general specification.

The classes permit more elegant code than DOM, see jdom.Chef, but to use further
developments such as XPath or XSLT, a JDOM tree must first be exported as a DOM tree.


9

XSL Transformations

The Extensible Stylesheet Language (XSL) consists of two parts with completely different
goals: an XSL transformer like Saxon or Xalan transforms a tree (typically an XML based
document) under control of a stylesheet (normally another XML based document) into
another tree and outputs it as XML, HTML, or text. An XSL formatter like FOP accepts a tree of
Formatting Objects (typically an XML based document transformed with XSLT) and
produces printable text, e.g., in Portable Document Format (PDF).

XSLT, the stylesheet language, is an XML application which contains elements of conventional
programming languages and combines them with pattern matching and substitution on trees.
XPath expressions with additional functions take the place of arithmetic expressions in more
conventional languages. The book by Michael Kay is a definitive description with lots of
examples and useful hints.

9.1 XPath

XPath selects parts of a tree; therefore, XPath defines what XSLT considers to be a tree. This
is more or less the same as the DOM. Xalan contains org.apache.xml.xpath and other
packages which can be used to implement a program to test XPath expressions, see
xpath.Main:

$ java -classpath ..:xerces.jar:xalan.jar \

> -Djavax.xml.parsers.DocumentBuilderFactory=\

>org.apache.xerces.jaxp.DocumentBuilderFactoryImpl xpath.Main pets.xml






Black



Golden



...


//*[@breed='Labrador']/color


Black


Golden
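A query like this can also be tried with the javax.xml.xpath package which later became part of JAXP. This is not how xpath.Main is implemented; the sketch only illustrates the selection. The document in PETS is an assumed, whitespace-free stand-in for pets.xml, whose markup is not reproduced here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // assumed structure, only loosely modeled on the example output in the text
    static final String PETS =
        "<pets><dogs>"
        + "<dog breed='Labrador'><color>Black</color></dog>"
        + "<dog breed='Labrador'><color>Golden</color></dog>"
        + "</dogs><cats>"
        + "<cat breed='Siamese'><color>Cream</color></cat>"
        + "</cats></pets>";

    /** evaluates an XPath expression as a nodeset and returns the text of each node. */
    static List<String> texts(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PETS)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); ++i)
                result.add(nodes.item(i).getTextContent());
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(texts("//*[@breed='Labrador']/color"));
    }
}
```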


Language Elements

XPath deals with floating point values, strings, logical values, and tree nodes, and has very

tolerant conversion rules:

12e3 Inf -Inf NaN       floating point values
+ - * div mod           arithmetic operations with the usual precedence
-                       unary, sign change
ceiling(...) floor(...) number() round()
                        arithmetic and conversion functions
count(...)              number of nodes in a nodeset
last()                  number of nodes in a context
position()              position in context, from 1 to last(); starts at the rear in some
                        contexts
string-length(...)      string length
sum(...)                each node of the nodeset is converted with number() and the values
                        are added up
true() false()          logical values
and or not(...)         logical operations, usual precedence, evaluated from left to right
                        with short-circuiting
= != < <= > >=          comparisons with logical result, precedence as in C; might require
                        character entities such as &lt;; node comparisons are strange
boolean(...)            zero, NaN, empty strings, and empty nodesets are false(), the rest
                        is true()
contains(s, s)          true() if second string is contained in first
'...' "..."             Unicode strings, might require character entities
concat(..., ...) starts-with(...) substring(...) substring-after(...)
substring-before(...) translate(...)
                        string and conversion functions
normalize-space(...)    removes surrounding and multiple white space and line separators
local-name() name() namespace-uri()
                        of context or first element in argument
string(...)             conversion, for a nodeset the text value of the first node
location path           selects nodeset
|                       combines two nodesets
id(...)                 returns nodeset containing ID(s); tricky if not valid

A location path selects a nodeset; it takes precedence over these operations.
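A few of the conversion rules can be observed with javax.xml.xpath (again a stand-in for xpath.Main, not its implementation). The class name XPathValues is an assumption; every result is converted to a string as string() would do it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class XPathValues {
    /** evaluates an expression against an empty document and returns the result as string. */
    static String value(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            return XPathFactory.newInstance().newXPath().evaluate(expression, empty);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "1 + 2 * 3",           // usual precedence
            "7 div 2",             // floating point division
            "boolean(0)",          // zero is false
            "boolean('false')",    // but any non-empty string is true
            "number('abc')",       // not a number
            "1 div 0",             // no exception, infinity
            "string-length(normalize-space('  a   b '))"
        };
        for (String expression : expressions)
            System.out.println(expression + " = " + value(expression));
    }
}
```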


Location Paths

location-path: "/"? location-step ( "/" location-step )*;

location-step: axis "::" node-test ( "[" predicate "]" )*;

An absolute location-path begins with / and starts at the document node. A relative
location-path starts at a context node; for xpath.Main this is the last node shown.

Each location-step does a node-test along an axis and thus selects a nodeset. A
predicate is a test which reduces the nodeset to those nodes for which the logical value is
true or the numerical position matches.

The result is the context for the next location-step. There are lots of abbreviations...

axis                     abbreviation   selects
child                    default        nodes directly nested into the context; no attributes
                                        or namespaces (nodes for xmlns attribute)
descendant                              all nodes contained in context; no attributes or
                                        namespaces
descendant-or-self       //             context and descendant
parent                   ..             nodes into which context is directly nested
ancestor                                document node and all nodes into which context is
                                        nested
ancestor-or-self                        ancestor and context
following-sibling                       all nodes after context which are directly nested into
                                        the same node; attributes and namespaces have no
                                        siblings
preceding-sibling                       all nodes before context which are directly nested into
                                        the same node
following                               all nodes after context, but no attributes and
                                        namespaces
preceding                               all nodes before context
attribute                @              all attributes of context
namespace                               all namespaces which apply to the context
self                     .              context

node-test                selects
name                     all elements or all attributes with this name
prefix:name              all elements or all attributes in the same namespace with this name
*                        all elements or all attributes or all namespaces
prefix:*                 all elements or all attributes in the same namespace
comment()                all comment nodes
text()                   all text nodes
node()                   all nodes of whatever type
processing-instruction() all PIs, also for a specific target



A predicate is either a number, i.e., the position of a node in a nodeset beginning with 1
(potentially counted from back to front), or a logical value; in the example above:

//dog[2]



Golden


//dog[@breed='Labrador'][color='Black']



Black


//dog[starts-with(.//text()[2], 'B')]



Black


The inner position is necessary because there are two text nodes with whitespace.

A location-step moves on to a new context; a predicate only prunes the current context.
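Positions and logical predicates can be compared in a small program. As before, javax.xml.xpath stands in for xpath.Main, and PETS is an assumed document without white space, so no text nodes get in the way. The last expression shows a position counted from back to front: on the reverse axis ancestor, position 1 is the nearest ancestor.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateDemo {
    // assumed structure, only loosely modeled on the example output in the text
    static final String PETS =
        "<pets><dogs>"
        + "<dog breed='Labrador'><color>Black</color></dog>"
        + "<dog breed='Labrador'><color>Golden</color></dog>"
        + "</dogs><cats>"
        + "<cat breed='Siamese'><color>Cream</color></cat>"
        + "<cat breed='Burmese'><color>Grey</color></cat>"
        + "</cats></pets>";

    /** returns name=text for each node selected by the expression. */
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PETS)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); ++i)
                result.add(nodes.item(i).getNodeName() + "=" + nodes.item(i).getTextContent());
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("//dog[2]/color"));                                // position
        System.out.println(select("//dog[@breed='Labrador'][color='Black']/color")); // logical
        System.out.println(select("//cat[last()]/color"));                           // last one
        System.out.println(select("//color[.='Black']/ancestor::*[1]"));             // reverse axis
    }
}
```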



9.2 Transformation Paradigm

A stylesheet contains templates which are applied to a tree. A minimal stylesheet is as follows:

xslt/trivial.xsl


If it is applied to the previous example it simply extracts the text:

$ java -classpath saxon.jar com.icl.saxon.StyleSheet pets.xml trivial.xsl


Black

Golden

...

This stylesheet does not define any templates. It relies on the predefined templates:

document operates on all elements

element operates on all nested elements

attribute copied as text

text() copied

comment ignored

PI ignored

namespace ignored
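The effect of the predefined templates can be reproduced with the JAXP transformer API instead of the Saxon command line. The two stylesheets below are assumptions written for this sketch (they request text output and are not the files from the lecture), and PETS is an assumed input without white space.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BuiltinTemplatesDemo {
    static final String PETS =
        "<pets><dog breed='Labrador'><color>Black</color></dog>"
        + "<dog breed='Labrador'><color>Golden</color></dog></pets>";

    // no templates at all: only the predefined templates are used
    static final String EMPTY =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "</xsl:stylesheet>";

    // one template which enters the attribute axis explicitly
    static final String WITH_ATTRIBUTES =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='*'>"
        +   "<xsl:apply-templates select='@*|node()'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String stylesheet, String document) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(document)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(EMPTY, PETS));            // texts only
        System.out.println(transform(WITH_ATTRIBUTES, PETS));  // attribute values and texts
    }
}
```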

The predefined templates do not enter attributes; this has to be done explicitly:

xslt/text.xsl






This stylesheet prints all attributes and texts: node() selects all kinds of nodes, but the
attributes are on their own axis.

$ java -classpath saxon.jar com.icl.saxon.StyleSheet pets.xml text.xsl


Labrador

Black

...



There are subtle differences. * matches element nodes, node() matches all.

xslt/node1.xsl






This matches text nodes, too; therefore, only the attributes are shown:

$ java -classpath saxon.jar com.icl.saxon.StyleSheet pets.xml node1.xsl

LabradorLabradorSiameseBurmeseTortoiseshell

It is a bit tricky to redifferentiate the nodes:

xslt/node2.xsl






colored













$ java -classpath saxon.jar com.icl.saxon.StyleSheet pets.xml node2.xsl

Labrador colored Black

Labrador colored Golden

Siamese colored Cream

Burmese colored Grey

Tortoiseshell colored Brown

A transformation always starts by searching for a template matching the document node. This
template can be overridden to perform special actions and/or it can continue the search for
other nodes using apply-templates.

Templates are selected by import order and then by priority. The latter depends on how
specific the match pattern is. The predefined templates are always found last; the only way
to prevent that is to override them. mode partitions template recognition; this is useful to
traverse a tree differently for different purposes.



9.3 Controlling the Traversal

By defining a template to fit a node one controls how a node is processed:

xslt/pets.xsl











Dogs











Nodes below pets are only processed because apply-templates is called.

If the blue apply-templates is missing, the output only contains the Dogs title and the text for

the cats. If it is present, table rows are added for the dogs.

The text for the cats can be omitted by restricting the red apply-templates:


Now the cats are not reached and the predefined templates are not applied to them.

The order in which nodes are processed first depends on the order in which apply-templates
is specified. The blue apply-templates can be revised:



Now the dogs do not appear in document order. A selection can be sorted:




Now the dogs are sorted in descending order of colors.
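Sorting a selection can be sketched as follows. The stylesheet is an assumption written for this illustration and not the revised pets.xsl; it is run through the JAXP transformer API on an assumed input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SortDemo {
    static final String PETS =
        "<pets><dogs>"
        + "<dog breed='Labrador'><color>Black</color></dog>"
        + "<dog breed='Labrador'><color>Golden</color></dog>"
        + "</dogs></pets>";

    // apply-templates selects the dogs, sort rearranges the selection before processing
    static final String SORTED =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        +   "<xsl:apply-templates select='//dog'>"
        +     "<xsl:sort select='color' order='descending'/>"
        +   "</xsl:apply-templates>"
        + "</xsl:template>"
        + "<xsl:template match='dog'>"
        +   "<xsl:value-of select='color'/><xsl:text> </xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String stylesheet, String document) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(document)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SORTED, PETS));
    }
}
```

Without the sort element the dogs would be processed in document order, i.e., Black before Golden.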

More examples: rendering the relations of bosses to workers, rendering a list of instructors.



Patterns

apply-templates@select uses an XPath expression to select a nodeset as context nodes.

This nodeset is then targeted by all template@match patterns.

For efficiency a pattern is a restricted XPath expression. Here are the possible constructs with

increasing precedence:

|                        connects alternatives
/                        matches the document node and connects location steps
//                       only connects location steps; matches deep nesting, too
id('x')                  matches ID attribute
key('x', ...)            matches key with name x
attribute::   @          refers to attributes
child::       default    refers to elements directly nested into context
node-test                just as for XPath
predicate                just as for XPath, but without variables and current()

A pattern matches a node if there is a context (the node itself or any enclosing node) for
which the pattern, evaluated as an XPath expression, produces a nodeset containing the node.

This means that a pattern is usually evaluated from right to left beginning before the

predicates. select delivers nodes which themselves have to satisfy the last node test in the

pattern.

match="foo//bar[position()=last()-3]"

This is only matched by bar nodes which are somehow nested into a foo node.

The predicate does not apply to the nodeset delivered by select. It applies to the position of

the node in the preceding step in the pattern.



9.4 Language Elements for Programming

XSLT contains simple control structures. Here is an implementation of Euclid’s algorithm for

the greatest common divisor as an example for more or less conventional programming:

xslt/euclid.xsl




36 54


27 99

variable name='name'     defines global and local variables which are referenced as $name
                         in XPath expressions.
  select='value'         defines the value; there is no assignment later.
                         Alternatively, the content of the element can define the value.


pairs:


message                  communicates with the XSLT processor.
                         Typically, the content is written as diagnostic output.
  terminate='yes'        should terminate execution.
copy                     shallow copy; content is inserted into document or element node.
copy-of select='rtf'     deep copy of a result tree fragment.
















Caution: Only the second selection above produces the desired result:


for-each select='nodeset'  applies its content to nodeset.
sort                       can control sequence; lots of attributes.
call-template name='name'  calls template by name; may be recursive.
with-param name='name'     sends value to template parameter,
  select='value'           defines value; content of element can be used instead.
literal                    defines result node.
  name='{value}'           { ... } within attributes is replaced just like value-of.
element name='name'        defines result node; also comment and processing-instruction.
  namespace='uri'
attribute name='name'      defines attribute in current result node;
                           can be combined with attribute-set; content can also be value.

$ java -classpath saxon.jar com.icl.saxon.StyleSheet /dev/fd/0 euclid.xsl


pairs: 36542799


Only the result elements b are correct.



















The language, as a worthy successor to DSSSL, resembles Lisp. There is no
conventional assignment and there are no loops.

Loops have to be replaced by (tail) recursion, see the gcd template.
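A recursive template for Euclid's algorithm might look as follows. This is a sketch written for this text under assumptions, not a copy of xslt/euclid.xsl: the stylesheet is generated for one pair of numbers and run through the JAXP transformer API; the class name GcdDemo is made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class GcdDemo {
    /** builds a stylesheet which computes gcd(a, b) by recursion instead of a loop. */
    static String stylesheet(int a, int b) {
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/'>"
            +   "<xsl:call-template name='gcd'>"
            +     "<xsl:with-param name='a' select='" + a + "'/>"
            +     "<xsl:with-param name='b' select='" + b + "'/>"
            +   "</xsl:call-template>"
            + "</xsl:template>"
            + "<xsl:template name='gcd'>"
            +   "<xsl:param name='a'/><xsl:param name='b'/>"
            +   "<xsl:choose>"
            +     "<xsl:when test='$b = 0'><xsl:value-of select='$a'/></xsl:when>"
            +     "<xsl:otherwise>"            // recursion replaces the loop
            +       "<xsl:call-template name='gcd'>"
            +         "<xsl:with-param name='a' select='$b'/>"
            +         "<xsl:with-param name='b' select='$a mod $b'/>"
            +       "</xsl:call-template>"
            +     "</xsl:otherwise>"
            +   "</xsl:choose>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
    }

    static String gcd(int a, int b) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet(a, b))));
            StringWriter out = new StringWriter();
            // the input document is irrelevant here, the stylesheet carries the numbers
            transformer.transform(new StreamSource(new StringReader("<x/>")),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(gcd(36, 54));
        System.out.println(gcd(27, 99));
    }
}
```

The parameters are never assigned to; each recursive call simply binds new values with with-param.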


template match='pattern'        defines template; with pattern for apply-templates.
  mode='name'                   restricts; same name for apply-templates.
  name='name'                   defines template; with name for call-template.
  priority='number'             helps disambiguate selection.
param name='name'               declares parameter; can be global.
  select='value'                sets default; content can be used instead.
value-of select='value'         defines result string: text of (first) node or by converting
                                from numerical values.
  disable-output-escaping='yes' prevents escaping of < and &.
text                            defines result string by its text content; space is
                                significant.
  disable-output-escaping='yes' prevents escaping of < and &.
choose                          switch.
when test='boolean'             case in choose; content is executed depending on test.
otherwise                       last alternative in choose.
if test='boolean'               decision; no else; content is executed depending on test.

Functions can be made by embedding call-template in a variable which then stores the

result.

Data structures can be built as contents of variable and used with for-each and through

XPath.

Precisely controlling white space output is quite hard.


9.5 More Examples

Search       pets.xml, key.xsl                                   create keys with key and
                                                                 search with key().
Color table  colors.xsl, doColors.xsl,   colors.23.html,         import, recursion for loops,
             makefile                    colors.127.html         output not wellformed.
RGB table    rgb.txt, rgb2xml, rgb.xsl   rgb.html
Homepage     index.xml, index.xsl        index.html, home.html,  several output files;
                                         title-frame.html        computing calendar days.
Code         code/index.xml,             code/index.html         collecting paths.
             code/index.xsl
FTP area     ftp/index.xml,              ftp/index.html          several output files;
             ftp/index.xsl                                       collecting table entries for
                                                                 rowspan.
Papers       rec/index.xml,              rec/index.html
             rec/index.xsl

Michael Kay shows how to program the Knight's Tour using XSLT... (however, this problem can
also be solved with a Turing machine).



A

Glossary

ASCII American Standard Code for Information Interchange: defines special

characters, letters, digits, and punctuation with values between 0 and 127;

contains essentially no national characters such as German umlauts, etc.

CSS Cascading Style Sheets: for detailed control of formatting XML based and

HTML documents.

DSSSL Document Style Semantics and Specification Language: ISO norm in

1996, language defined by Clark et al. to transform and represent SGML

and thus XML based documents.

DOM Document Object Model: specification of a class hierarchy to represent

XML based documents.

DTD Document Type Definition: grammar for a validatable, XML based
document; used with a similar meta syntax for SGML, too. There are
published DTDs:





Document Type Declaration: doctypedecl in the prolog of an XML based
document; contains or references markupdecl items, which contain a
grammar, which finally is the Document Type Definition.

DTDSA DTD with Source Annotation: XLE uses this with JDBC database queries

to generate XML based documents.

element Node in the document tree of an XML based document; limited by tags.

FO Formatting Objects: SGML and XML based documents are transformed

into these and then rendered in languages like PDF and RTF.

GML Generalized Markup Language: developed 1969 by Goldfarb, Mosher and

Lorie for document production at IBM.

grammar Rules to describe the syntax of a language; often used to mechanically

generate a parser.

HTML Hypertext Markup Language: developed 1989 by Berners-Lee for

electronic document transmission among physicists at CERN.

HTTP Hypertext Transfer Protocol: preferred transport protocol for HTML.

JAXB Java API for XML Binding: classes and a compiler from Sun which create
document-specific classes from a DTD which can be used to read
documents, validate them, and write them (very early access).

JAXP Java API for XML Processing: factory classes from Sun which provide
standardized access to XML parsers and XSLT transformers.


JDBC Java Database Connectivity: Java package for SQL based access to

databases; separates access and communication; the latter is the

responsibility of a database-specific JDBC adapter.

Jade James’ DSSSL Engine: Clark’s C++ based implementation of the

representation part of DSSSL.

markup All but the pure text in an XML based document, see Markup, page 62.

namespace Association of a name with a URI in an XML based document, to
differentiate simple names. The following might be important:

xmlns:Cocoon="http://xml.apache.org/cocoon/"

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:svg="http://www.w3.org/2000/svg"

xmlns:xhtml="http://www.w3.org/1999/xhtml"

xmlns:xlink="http://www.w3.org/1999/xlink"

xmlns:xsi="http://www.w3.org/1999/XMLSchema/instance"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

parser Program (part) to recognize a language.

PDF Portable Document Format, a language by Adobe for printer and screen

output.

PI Processing Instruction: <?target data?> in place of an element in an
XML based document; contains data which a special processor (target)
is supposed to interpret.

representation Rules which define the representation of symbols of a language using an

input alphabet; often used to mechanically generate a scanner.

RTF Rich Text Format: Microsoft language used in text systems and for data exchange.

SAX Simple API for XML: defined 2000 by David Megginson; interface between
an XML parser and its semantic actions.

scanner Program (part) to recognize the symbols of a language.

SGML Standard Generalized Markup Language: 1986 ISO Standard, ancestor of

XML.

SQL Structured Query Language: for access to databases.

syntax Rules that govern the sequence of symbols in sentences in a

programming language.

tag markup at the beginning and end of an element; the start tag can
contain key value pairs, the end tag looks like </name>. In XML an element
can be a single tag like <name/> which may also contain key value pairs.

Unicode Character set encoding which starts with the same characters as ASCII

but which adds symbols and characters from very many languages.

URI Uniform Resource Identifier: combines URL and URN; at present more
or less synonymous with URL.

URL Uniform Resource Locator: address in the Internet which can be used to
access a resource; e.g., protocol://host:port/path.

URN Uniform Resource Name: is supposed to uniquely identify a resource in
the Internet; thus far undefined.



UTF-8 A representation of Unicode which is identical to ASCII for ASCII
characters and which uses up to 4 bytes for other characters.

valid An XML based document is valid if it satisfies its DTD.

validation Check by an XML parser whether an XML based document is valid; it
always must be well-formed.

well-formed An XML based document is well-formed if its representation is correct;
this should always be the case.

XHTML HTML newly defined with XML: all elements must be closed and case is

significant in element names.

XLE XML Lightweight Extractor from IBM alphaWorks: uses a DTDSA and

JDBC to create XML based documents.

XML Extensible Markup Language: defined 1997 by Bosak, Clark, et al.;
simplification of SGML.

XML Schema Replacement for DTD: defines grammar and especially data types for

XML based documents.

XPath Path based syntax to address parts of an XML based document.

XSD XML Schema.

XSL Extensible Style Sheet Language: defined by Clark et al. to transform and

render XML based documents.

XSL-FO Formatting part of XSL.

XSLT Transform part of XSL.
