XML Parsing – Do it the right way!

I recently came across a situation where a web application took a performance hit because the wrong XML parser had been chosen. That’s when I decided to write a blog post on how to do XML parsing the right way.

Choosing the right parser from the many tools available in the industry can be confusing. It is important to understand that each tool was created to serve a specific purpose. There are a lot of XML parsing libraries available, such as JDOM, Xerces, Crimson and Woodstox, but they all build on a handful of core specifications/standards:

DOM
SAX
StAX
JAXB

DOM and SAX came first; StAX was introduced later as an improvement over SAX. JAXB is most often used alongside JAX-WS (Java API for XML Web Services). For small XML documents JAXB might be overkill, but if you have a complex XML schema and a lot of data, JAXB is usually the better option.

We will discuss each of these in detail in the following sections.

DOM

The DOM (Document Object Model) API is intended to work with XML that has first been loaded into memory as an object graph (a tree-like structure). The parser reads through the whole XML document and creates DOM nodes corresponding to the nodes in the XML file.

Many of you may be confused about what a DOM actually is. Is it the XML or HTML that you write? Is it what you see through ‘View Source’ in your browser? The answer is no. View Source will most likely show you the exact HTML/XML that you wrote, but that is not the DOM. Let me explain this with a small piece of HTML and JavaScript code.

Consider the below HTML code:

<div id="empName"></div>

A small piece of JavaScript has been added to the HTML code:

<script>
var empName = document.getElementById("empName");
empName.innerHTML = "John";
</script>

Now what you see in the DOM should be:

<div id="empName">John</div>

This means that what you wrote is not the DOM itself. The DOM is generated by the API (in this case, the browser) by following the instructions provided in the HTML/XML, and it is then held in memory for manipulation.
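The same idea can be sketched in Java with the JDK's built-in DOM parser. This is a minimal illustration, not a definitive implementation: the class name DomExample and the method setEmployeeName are made up for this post, and the element is looked up by tag name because getElementById only works when the parser knows which attribute is an ID (for example through a DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class: loads XML into a DOM tree, changes it in memory,
// and serializes the tree back to text.
public class DomExample {

    static String setEmployeeName(String xml, String name) throws Exception {
        // The whole document is parsed into an in-memory tree first.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Navigate the tree and modify it; the source text itself is never touched.
        Element div = (Element) doc.getElementsByTagName("div").item(0);
        div.setTextContent(name);

        // Serialize the modified tree so we can look at it.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(setEmployeeName("<div id=\"empName\"></div>", "John"));
    }
}
```

The text that was parsed and the tree that ends up in memory are two different things, which is exactly the point of the browser example above.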

SAX

SAX (Simple API for XML) is another way to parse XML. Unlike DOM, it doesn’t create a structure in memory. It works by reading through the XML and calling certain methods on a “listener” (handler) object when it meets certain structural elements in the XML. In other words, SAX parsers are event based: the parser triggers a handler method whenever it encounters an event. An event can be the start or end of the document, the start or end of an element, a processing instruction, character data, or a comment (comments are reported through the optional LexicalHandler extension).

Let me explain this with an example. Consider the XML below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore>
  <book style="autobiography">
    <author>
      <first-name>Joe</first-name>
      <last-name>Bob</last-name>
      <award>Trenton Literary Review Honorable Mention</award>
    </author>
    <price>12</price>
  </book>
  <book>
    ******
  </book>
  <book>
    ******
  </book>
</bookstore>

Consider a scenario in which you need to initialize a Java bean and add it to a List every time you encounter a <book> element, and when you encounter the </bookstore> end tag you need to pass the List to a DAO class for persistence. Here you use the handler methods to run your own logic when the parser encounters specific XML elements.
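A rough sketch of such a handler is shown below. The Book bean, the BookHandler class and the persisted list are all hypothetical names invented for this illustration; in particular, the persisted list merely stands in for the DAO call, since the real DAO is not part of this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxBookExample {

    // Hypothetical bean holding a few values from each <book>.
    static class Book {
        String style;
        String lastName;
        String price;
    }

    static class BookHandler extends DefaultHandler {
        private final List<Book> books = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private Book current;
        // Stand-in for the DAO: filled when </bookstore> is reached.
        final List<Book> persisted = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // start collecting text for this element
            if ("book".equals(qName)) {
                current = new Book();
                current.style = attrs.getValue("style"); // null when the attribute is absent
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("last-name".equals(qName) && current != null) {
                current.lastName = text.toString().trim();
            } else if ("price".equals(qName) && current != null) {
                current.price = text.toString().trim();
            } else if ("book".equals(qName)) {
                books.add(current);
                current = null;
            } else if ("bookstore".equals(qName)) {
                persisted.addAll(books); // a real application would call its DAO here
            }
        }
    }

    static List<Book> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BookHandler handler = new BookHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.persisted;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookstore><book style=\"autobiography\"><price>12</price></book></bookstore>";
        System.out.println(parse(xml).size() + " book(s) collected");
    }
}
```

Notice that the handler has to keep its own state (the current book and the text buffer) between callbacks, because the parser itself remembers nothing about what it has already read.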

StAX

SAX pushes XML events to your code, whereas StAX (Streaming API for XML) lets your code pull them. StAX leaves it up to you to decide where in your program the pulled XML data/events are received.

Consider the below StAX parser code block:

int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
// .. your logic here for start element
}

Here your code asks the parser for the next event and then decides how to handle it; the parser reads through the XML content only as far as you tell it to. This lets you structure your parsing code according to the structure of the XML.
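To put the snippet above into context, here is a small self-contained sketch of a complete pull loop. The class name StaxExample and the method prices are made up for illustration; the method collects the text of every <price> element, using the bookstore XML from the SAX section as the assumed input shape.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: pulls events from the parser in a loop.
public class StaxExample {

    static List<String> prices(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> prices = new ArrayList<>();
        try {
            // Our code drives the parser: it advances only when we call next().
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "price".equals(reader.getLocalName())) {
                    // Reads the text up to the matching end tag.
                    prices.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return prices;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookstore><book><price>12</price></book></bookstore>";
        System.out.println(prices(xml));
    }
}
```

Because the loop is ordinary code, you can stop reading early or hand the reader to another method to process a sub-tree, which is much harder to do with SAX callbacks.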

JAXB

JAXB (Java Architecture for XML Binding) allows Java developers to map Java classes to XML representations. With JAXB, an application accesses the XML data through ordinary Java objects rather than using DOM or SAX to retrieve it from a direct representation of the XML itself. It is most often used alongside the Java API for XML Web Services (JAX-WS) and makes object creation and mapping easy.

Here is the sample code:

File file = new File("C:\\file.xml");
JAXBContext jaxbContext = JAXBContext.newInstance(Customer.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
Customer customer = (Customer) jaxbUnmarshaller.unmarshal(file);

Here an XML file is loaded, its contents are read by an Unmarshaller object, and a Customer (POJO) object is initialized from them. This assumes that the Customer class is annotated with @XmlRootElement so that JAXB can map the root element of the file to it.

Most of the time you can stick with these implementations or whatever is available in your JDK or application server. At times, though, you might want to switch away from the default implementation, and that is when you will find JAXP (Java API for XML Processing) useful. JAXP is a wrapper framework that allows you to switch implementations through configuration without modifying your code.
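As a small illustration of that indirection, the sketch below asks the JAXP factories which implementation classes they resolve to. The class name JaxpLookup is made up for this post, and the names printed depend entirely on your JDK and classpath, so no particular output is assumed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;

// Hypothetical example class: shows which implementations JAXP picks at runtime.
public class JaxpLookup {

    static String domFactoryImpl() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    static String saxFactoryImpl() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    static String staxFactoryImpl() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The application code only names the abstract factories; the concrete
        // classes are resolved by JAXP when newInstance() is called.
        System.out.println("DOM:  " + domFactoryImpl());
        System.out.println("SAX:  " + saxFactoryImpl());
        System.out.println("StAX: " + staxFactoryImpl());
    }
}
```

One way to switch implementations without touching this code is to set a system property named after the factory class, for example javax.xml.parsers.SAXParserFactory, to the class name of another factory that is available on the classpath.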

Here is a quick comparison of the XML specifications discussed above:

Pros	Cons
I – DOM
1. In-memory object model	Memory hog for larger XML documents (Suitable for XML documents less than 10MB)
2. Preserves element order	Slow
3. Bi-directional	Generic model
4. Read and write API
5. Supports XML manipulations
6. Easy to use
7. Supports schema validation

II – SAX
1. Event based	No object model; you have to tap into the events and build one yourself
2. Memory efficient	Single pass over the XML; can only go forward
3. Faster than DOM	Read-only API
4. Supports schema validation	No XPath support
5.	Hard to use

III – StAX
1. Ease of DOM and efficiency of SAX	No schema validation support
2. Memory efficient	Can only go forward
3. Pull model	No XML manipulation
4. Read and write API
5. Supports sub-parsing
6. Can read multiple documents at the same time in a single thread
7. Parallel processing of XML is easier

IV – JAXB
1. Allows you to access and process XML data without having to know XML	Can only parse valid XML
2. Bi-directional
3. More memory efficient than DOM
4. SAX and DOM are generic parsers, whereas JAXB creates a parser specific to your XML Schema
5. Data conversion: JAXB can convert XML to POJOs
6. Supports XML manipulation via object API

Now, there are two things to keep in mind while choosing an XML parser for your project: the standard and the implementation. There are many implementations available, and choosing one depends entirely on your requirements. All these standards/specifications were created over time to solve different problems or use cases. So make the right choice and do it right!