I recently came across a web application that took a performance hit because the wrong XML parser had been chosen. That’s when I decided to write a blog post on how to do XML parsing the right way.

Choosing the right tool from the many available in the industry can be confusing. It is important to understand that each one was created to serve a specific purpose. There are many XML parsing libraries available, such as JDOM, Xerces, Crimson and Woodstox, but most of them implement or build on a handful of core specifications/standards:

  • DOM
  • SAX
  • StAX
  • JAXB

DOM and SAX came first; StAX was introduced later as a pull-based alternative to SAX. JAXB is most often used alongside JAX-WS (Java API for XML Web Services). For small XML documents, JAXB might be overkill, but if you have a complex XML schema and a lot of data, JAXB is the right option.

We will discuss each of these in detail in the coming sections of the article.

DOM

The DOM (Document Object Model) API is intended for working with XML that is first loaded into memory as an object graph (a tree-like structure). The parser first reads through the entire XML document and creates DOM nodes corresponding to the nodes in the XML file. Because the whole document is held in memory, large files can be expensive to process this way.

Many of you may wonder what exactly a DOM is. Is it what you write in an XML or HTML file? Is it what you see through ‘View Source’ in your browser? The answer is no. View Source shows you more or less the exact HTML/XML that you wrote, but that is not the DOM. Let me explain this with a small piece of HTML and JavaScript code.

Consider the below HTML code:

<div id="empName"></div>

A small piece of JavaScript has been added to the HTML code:

<script>
var empName = document.getElementById("empName");
empName.innerHTML = "John";
</script>

Now what you see in the DOM should be:

<div id="empName">John</div>

This means that what you have written is not the DOM itself. The DOM is generated by the API (in this case, the browser) by following the instructions provided in the HTML/XML, and is then loaded into memory for manipulation.
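The same idea can be sketched in Java with the DOM parser that ships with the JDK. This is only a minimal illustration, not a definitive implementation: the class name DomExample, its helper methods, and the tiny <employee> document are all assumptions made up for this example. It parses the XML into an in-memory tree and then changes a node, much like the JavaScript above changes the div.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; every name in it is illustrative only.
final class DomExample {

    // Parses the whole XML string into an in-memory DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the text of the first <name> element in the tree.
    static String readName(Document doc) {
        return doc.getElementsByTagName("name").item(0).getTextContent();
    }

    // Changes the first <name> element in memory; the source text is untouched.
    static void rename(Document doc, String newName) {
        Node name = doc.getElementsByTagName("name").item(0);
        name.setTextContent(newName);
    }

    public static void main(String[] args) {
        String xml = "<employee><name></name></employee>";
        Document doc = parse(xml);
        rename(doc, "John");
        System.out.println(readName(doc));
    }
}
```

After rename() runs, the tree holds the new value while the original xml string is unchanged, which is the difference between the source you wrote and the DOM built from it.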

SAX

SAX (Simple API for XML) is another way to parse XML. Unlike DOM, it doesn’t build a structure in memory. It works by iterating over the XML and calling certain methods on a “listener” (handler) object when it meets certain structural elements in the XML. SAX parsers are event based: the parser triggers a handler method whenever it encounters an event. An event can be the start or end of the document, the start or end of an element, a processing instruction, character data, or a comment (comments are reported only through the optional LexicalHandler extension).

Let me explain this with an example. Consider the XML below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore>
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
<book>
******
</book>
<book>
******
</book>
</bookstore>

Consider a scenario in which you need to initialize a Java bean and add it to a List every time you encounter a <book> element, and push the List to a DAO class for persistence when you encounter </bookstore> (the end tag). Here you would use the handler methods to run the logic you need when the parser encounters those specific XML elements.
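The scenario above could look roughly like the sketch below, which uses the SAX parser bundled with the JDK. Treat it as one possible shape rather than the definitive way to do it: the Book bean, the BookDao stand-in (which only remembers what it was asked to save) and the BookHandler class are hypothetical names invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout: Book, BookDao and BookHandler are illustrative only.
final class SaxBookExample {

    static final class Book {
        String style;     // value of the style attribute, or null when absent
        String lastName;  // text of <last-name>
        String price;     // text of <price>
    }

    // Stand-in for a real DAO; it only remembers what it was asked to save.
    static final class BookDao {
        final List<Book> saved = new ArrayList<>();
        void saveAll(List<Book> books) { saved.addAll(books); }
    }

    static final class BookHandler extends DefaultHandler {
        private final BookDao dao;
        private final List<Book> books = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private Book current;

        BookHandler(BookDao dao) { this.dao = dao; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0);                     // start collecting text for this element
            if ("book".equals(qName)) {
                current = new Book();              // a new bean for every <book>
                current.style = attrs.getValue("style");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);        // text may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && "last-name".equals(qName)) {
                current.lastName = text.toString().trim();
            } else if (current != null && "price".equals(qName)) {
                current.price = text.toString().trim();
            } else if ("book".equals(qName)) {
                books.add(current);                // </book>: the bean is complete
                current = null;
            } else if ("bookstore".equals(qName)) {
                dao.saveAll(books);                // </bookstore>: hand the list to the DAO
            }
        }
    }

    // Parses the XML and returns whatever the handler pushed to the DAO.
    static List<Book> parse(String xml) {
        BookDao dao = new BookDao();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new BookHandler(dao));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return dao.saved;
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book style=\"autobiography\">"
                + "<author><last-name>Bob</last-name></author><price>12</price>"
                + "</book></bookstore>";
        for (Book b : parse(xml)) {
            System.out.println(b.style + ", " + b.lastName + ", " + b.price);
        }
    }
}
```

Note that the handler has to keep its own state (the current bean and the text buffer) between callbacks, because the parser decides when each method is called.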

StAX

SAX pushes XML events to your code, whereas StAX (Streaming API for XML) lets your code pull them. StAX leaves it up to you to decide where in your program the pulled XML data/events are received.

Consider the below StAX parser code block:

int event = reader.next();
if (event == XMLStreamConstants.START_ELEMENT) {
    // .. your logic here for start element
}

Here the parser reads through the XML content only as far as you ask it to: each call to next() pulls the next event, and your code decides how to handle it. StAX lets you structure your parsing code according to the XML structure.
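To show how the fragment above fits into a complete loop, here is a small sketch using the StAX cursor API from the JDK. The class name StaxExample, the lastNames helper and the choice of collecting <last-name> values are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; the names are illustrative only.
final class StaxExample {

    // Collects the text of every <last-name> element by pulling events one at a time.
    static List<String> lastNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();         // our code asks for the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && "last-name".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag
                    names.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book><author><last-name>Bob</last-name></author></book></bookstore>";
        System.out.println(lastNames(xml));
    }
}
```

Because the loop belongs to your code, you can stop reading early or hand the reader to another method for a nested element, which is harder to arrange with SAX callbacks.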

JAXB

JAXB (Java Architecture for XML Binding) allows Java developers to map Java classes to XML representations. With JAXB, applications access the data in the XML through Java objects rather than using DOM or SAX to retrieve it from a direct representation of the XML itself. It is most often used alongside the Java API for XML Web Services (JAX-WS) and makes object creation and mapping easy.

Here is the sample code:

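import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
// javax.xml.bind was removed from the JDK in Java 11; newer JDKs need a JAXB dependency
File file = new File("C:\\file.xml");
JAXBContext jaxbContext = JAXBContext.newInstance(Customer.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
Customer customer = (Customer) jaxbUnmarshaller.unmarshal(file);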

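Here an XML file is loaded, its contents are read by an Unmarshaller object, and a Customer (POJO) object is populated from them. For this to work, Customer needs to be mapped to the XML root element, typically with the @XmlRootElement annotation.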

Most of the time you can stick with these implementations, or with whatever is available in your JDK or application server. At times, though, you might want to switch away from the default implementation, and that is when you will find JAXP (Java API for XML Processing) useful. JAXP is a wrapper framework that allows you to switch implementations through configuration without modifying the code.
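As a rough illustration of that indirection, the sketch below asks the JAXP factories which implementation they resolved. The class name JaxpLookupExample and its helper methods are hypothetical; the factory calls are the standard JAXP ones. The printed class names depend on your JDK and classpath, so no particular output is assumed here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical example class; the names are illustrative only.
final class JaxpLookupExample {

    // Class name of whichever DOM factory implementation JAXP resolved.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of whichever SAX factory implementation JAXP resolved.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The application code only names the abstract factories. One of the places
        // JAXP consults is a system property named after the factory class, e.g.
        //   java -Djavax.xml.parsers.DocumentBuilderFactory=<factory class> ...
        // where <factory class> is a placeholder for the implementation you want.
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX factory: " + saxFactoryName());
    }
}
```

The code never mentions a concrete parser class, which is why swapping the implementation becomes a configuration change rather than a code change.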

Here is a quick comparison of all the above discussed XML specifications:

[table id=13 /]

Now, there are two things to keep in mind while choosing an XML parser for your project: the standard and the implementation. There are many implementations available, and choosing one depends entirely on your requirements. All these standards/specifications were created over time to solve different problems or use cases. So make the right choice and do it right!

Latest posts by Ratheesh Narayanan (see all)