SAX document processing

While DOM provides an object-oriented model, SAX is event-driven. As the parser encounters tags and data in the input document, it calls methods on a user-supplied object. That object, which this article loosely refers to as a parser, can also be replaced while parsing is taking place.

This ability to swap the object that receives the events is significant, because a handler can be installed according to the tags found in the input. You could, for example, have a class of handlers specialized for certain constructs. For a database query result set, we might have one handler for the document body and another for the returned records.

Before we begin to develop code for this application, we need a framework. The actual SAX parser is org.apache.xerces.parsers.SAXParser. Your objects should implement org.xml.sax.ContentHandler, which specifies 11 methods. Call setContentHandler on the parser with your object reference as the argument.
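
A minimal sketch of this framework follows. It makes two assumptions that are not part of the original code: it obtains the parser bundled with the JDK through javax.xml.parsers.SAXParserFactory rather than instantiating the Xerces class directly, and it extends org.xml.sax.helpers.DefaultHandler, which supplies empty versions of the ContentHandler methods so that only the ones of interest need to be written. The names ElementCounter and countElements are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts the start tags reported by the parser.
class ElementCounter extends DefaultHandler {

	int	count = 0;

	// Called once for every start tag in the document.
	@Override
	public void startElement( String namespaceURI, String localName,
	  String qName, Attributes attrs ) {
		count++;
	}

	// Parses the given XML text and returns the number of elements seen.
	static int countElements( String xml ) {
		try {
			SAXParserFactory	factory = SAXParserFactory.newInstance();
			XMLReader	reader = factory.newSAXParser().getXMLReader();
			ElementCounter	handler = new ElementCounter();

			// Register the handler, then start the parse.
			reader.setContentHandler( handler );
			reader.parse( new InputSource( new StringReader( xml ) ) );
			return handler.count;
		}
		catch( Exception e ) {
			throw new IllegalStateException( e );
		}
	}

	public static void main( String args[] ) {
		System.out.println( countElements( "<a><b/><b/></a>" ) );
	}
}
```

Extending DefaultHandler is only a convenience; implementing ContentHandler directly, as the article does, works the same way but requires a body for every method.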

We're getting there, but we now need to specify the input document. As with the DOM parser, it's not very difficult. The org.xml.sax.InputSource class has constructors that accept a byte stream (java.io.InputStream), a character stream (java.io.Reader), or a system identifier (a URI passed as a String). Finally, call the parse method on the SAX parser and provide the InputSource object as the argument.
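
The three ways of building an InputSource can be sketched as follows. The class InputSources, its method names, and the file URI in main are hypothetical; nothing is opened or parsed here, the sources are only constructed.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper showing the three InputSource constructors.
class InputSources {

	// Character stream: the text is already decoded.
	static InputSource fromText( String xml ) {
		return new InputSource( new StringReader( xml ) );
	}

	// Byte stream: the parser works out the encoding itself.
	static InputSource fromBytes( String xml ) {
		byte	data[] = xml.getBytes( StandardCharsets.UTF_8 );
		return new InputSource( new ByteArrayInputStream( data ) );
	}

	// System identifier: the parser opens the URI when parse is called.
	static InputSource fromSystemId( String uri ) {
		return new InputSource( uri );
	}

	public static void main( String args[] ) {
		// The URI below is a placeholder, not a file the article provides.
		InputSource	source = fromSystemId( "file:///tmp/records.xml" );
		System.out.println( source.getSystemId() );
	}
}
```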

Here's an abbreviated version of the source code:

import	org.apache.xerces.parsers.*;
import	org.xml.sax.*;
import	java.io.FileReader;

public class MyParser implements ContentHandler {

	SAXParser	parser;

	public MyParser( SAXParser parser ) {
		this.parser = parser;
	}

	...

	public static void main( String args[] ) {
		SAXParser	parser = new SAXParser();
		MyParser	app = new MyParser( parser );
		InputSource	source = null;

		parser.setContentHandler( app );

		...

		try {
			source = new InputSource( new FileReader( args[0] ) );
		}
		catch( Exception e ) {
			...
		}

		try {
			parser.parse( source );
		}
		catch( Exception e ) {
			...
		}
	}
}

We instantiate the SAXParser and our own application, MyParser, which implements ContentHandler. Note that we pass a reference to the SAXParser to our constructor; the reason will be discussed shortly. We call the setContentHandler method on the parser, giving it our new object reference. Finally, we create a new InputSource and pass it to the parse method of the parser.

Once parsing begins, your methods will be called as tags and text are detected. The method called when a start tag is discovered is startElement. Suppose your application recognizes the tag as requiring special treatment. That is why we kept a reference to the original parser in the constructor. We could use code such as the following:

	public void startElement( String namespaceURI, String localName,
	  String qName, org.xml.sax.Attributes attrs ) {
		SpecialParser	special = null;

		// Hand all further events to a specialized handler
		if( localName.equals( "SPECIAL" ) ) {
			special = new SpecialParser( parser, this );
			parser.setContentHandler( special );
		}
	}

The fourth argument to startElement is an Attributes object. This interface declares methods which let you obtain the names and values of the attributes specified in a tag. Of course, what would be the utility of all this if there weren't a method able to access the document contents? It's called characters, and it is called with a reference to a character array plus the offset and length of the contents within that array.
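
As a hedged illustration of these two callbacks, the sketch below writes each element name, its attributes, and its text into a buffer. It assumes the JDK's bundled parser obtained through SAXParserFactory rather than the Xerces class used in this article, and the names AttributeDumper and dump are invented. Note that a parser is free to deliver one run of text in several calls to characters, so the handler appends rather than assigns.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records tag names, attributes and text.
class AttributeDumper extends DefaultHandler {

	StringBuilder	out = new StringBuilder();

	@Override
	public void startElement( String namespaceURI, String localName,
	  String qName, Attributes attrs ) {
		out.append( qName );
		// Attributes are indexed from 0 to getLength() - 1.
		for( int i = 0; i < attrs.getLength(); i++ )
			out.append( ' ' ).append( attrs.getQName( i ) )
			   .append( '=' ).append( attrs.getValue( i ) );
		out.append( ':' );
	}

	@Override
	public void characters( char ch[], int start, int length ) {
		// Only the slice ch[start .. start + length - 1] belongs to us.
		out.append( ch, start, length );
	}

	// Parses the given XML text and returns what the handler recorded.
	static String dump( String xml ) {
		try {
			XMLReader	reader =
			  SAXParserFactory.newInstance().newSAXParser().getXMLReader();
			AttributeDumper	handler = new AttributeDumper();

			reader.setContentHandler( handler );
			reader.parse( new InputSource( new StringReader( xml ) ) );
			return handler.out.toString();
		}
		catch( Exception e ) {
			throw new IllegalStateException( e );
		}
	}

	public static void main( String args[] ) {
		System.out.println( dump( "<row id=\"7\">hello</row>" ) );
	}
}
```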

The SpecialParser class referred to above could be partially defined as follows:

public class SpecialParser implements ContentHandler {

	SAXParser	parser;
	ContentHandler	parent;

	public SpecialParser( SAXParser parser, ContentHandler parent ) {
		this.parser = parser;
		this.parent = parent;
	}

At this point, we've noted a special tag in the input stream in our main ContentHandler and created a new instance of SpecialParser. We then call setContentHandler on the parser so that our new object will receive events instead. The opposite of startElement is, not surprisingly, called endElement. The final step is to restore the original application (MyParser) as the ContentHandler. The following code, placed in SpecialParser, will accomplish what we want:

	public void endElement( String namespaceURI, String localName,
	  String qName ) {
		// Closing tag of the special element: give control back
		if( localName.equals( "SPECIAL" ) )
			parser.setContentHandler( parent );
	}

All parsing events will now be directed to the parent. Of course, this is all only by way of example. Depending on the complexity of the DTD, the special element might be nested within itself, so we would have to keep track of our "depth" in the stack. Even so, if we have to handle large documents then the SAX approach is a much lighter-weight alternative to DOM. Some advocates also point out that with DOM a document is effectively processed twice, once by the parser to build the tree and again by the application to walk it, while SAX requires only a single pass.
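
The hand-off and the depth tracking mentioned above can be sketched together as follows. This is an illustration under stated assumptions, not the article's full source: it uses the JDK's bundled parser through SAXParserFactory (with namespace processing switched on so that localName is filled in) instead of the Xerces class, and the names HandOffDemo, OuterHandler, NestedHandler and trace are invented. The nested handler counts how many SPECIAL elements are open and restores its parent only when the outermost one closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical driver: returns a log of which handler saw which element.
class HandOffDemo {

	static List<String> trace( String xml ) {
		try {
			SAXParserFactory	factory = SAXParserFactory.newInstance();
			factory.setNamespaceAware( true );
			XMLReader	reader = factory.newSAXParser().getXMLReader();
			List<String>	log = new ArrayList<String>();

			reader.setContentHandler( new OuterHandler( reader, log ) );
			reader.parse( new InputSource( new StringReader( xml ) ) );
			return log;
		}
		catch( Exception e ) {
			throw new IllegalStateException( e );
		}
	}

	public static void main( String args[] ) {
		System.out.println( trace(
		  "<doc><a/><SPECIAL><b/><SPECIAL><c/></SPECIAL></SPECIAL><e/></doc>" ) );
	}
}

// Plays the role of MyParser: hands control over on a SPECIAL tag.
class OuterHandler extends DefaultHandler {

	XMLReader	reader;
	List<String>	log;

	OuterHandler( XMLReader reader, List<String> log ) {
		this.reader = reader;
		this.log = log;
	}

	@Override
	public void startElement( String namespaceURI, String localName,
	  String qName, Attributes attrs ) {
		if( localName.equals( "SPECIAL" ) )
			reader.setContentHandler( new NestedHandler( reader, this, log ) );
		else
			log.add( "main:" + localName );
	}
}

// Plays the role of SpecialParser, with a depth counter added.
class NestedHandler extends DefaultHandler {

	XMLReader	reader;
	ContentHandler	parent;
	List<String>	log;
	int	depth = 1;	// the SPECIAL tag that created us is already open

	NestedHandler( XMLReader reader, ContentHandler parent, List<String> log ) {
		this.reader = reader;
		this.parent = parent;
		this.log = log;
	}

	@Override
	public void startElement( String namespaceURI, String localName,
	  String qName, Attributes attrs ) {
		if( localName.equals( "SPECIAL" ) )
			depth++;
		else
			log.add( "special:" + localName );
	}

	@Override
	public void endElement( String namespaceURI, String localName,
	  String qName ) {
		// Restore the parent only when the outermost SPECIAL closes.
		if( localName.equals( "SPECIAL" ) && --depth == 0 )
			reader.setContentHandler( parent );
	}
}
```

Without the counter, the inner closing SPECIAL tag would return control to the parent too early, and the rest of the outer SPECIAL element would be seen by the wrong handler.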

Also, unlike DOM, SAX offers no framework for creating XML documents. Then again, any competent programmer should be able to create simple text files in any programming language. While this has been a simpler treatise than the corresponding one on DOM, I'm going to be adding the full source code, which receives a record set in an applet and writes it to a text area. Come back soon!