XML Parsing with DOM in C++

Having the ability to parse XML files is a requirement for many applications these days. XML is a standard format for exchanging data between programs and storing configuration data.

If you want to parse XML documents in C++, you can benefit from an external library such as the Xerces-C++ XML Parser. Xerces provides an elaborate but somewhat complex API for navigating XML files. To simplify matters, I’ll describe a C++ class that encapsulates the Xerces calls needed to index and retrieve XML element values and attributes.

XML Parsing Models

XML Elements

XML documents consist of elements that are denoted by beginning and ending tags. XML elements are of the general form:

<element attribute>
    <element>value</element>
</element>

where value consists of either a string value or additional XML elements. An attribute is a value associated with the given element.

Here is an example of an XML document that is intended to represent two books contained in a bookstore. The bookstore element contains two book elements, each with a category attribute. Each book element contains fields that describe the book.

<bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="children">
        <title lang="en">Harry Potter and the Half-Blood Prince</title>
        <author>J. K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
</bookstore>

The bookstore in this case is analogous to a database table with two book rows, and the title, author, year and price fields are the columns of the rows.
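To make the table analogy concrete, here is a minimal sketch of how the example document might map onto plain C++ data. The Book struct and the makeBookstore() function are hypothetical names used only for illustration; they are not part of Xerces or of the class described later in this article.

```cpp
#include <string>
#include <vector>

// One "row" of the bookstore "table"; each member is a "column".
struct Book {
    std::string category;   // the category attribute of <book>
    std::string title;
    std::string author;
    int         year;
    double      price;
};

// Builds the two rows that correspond to the example XML document.
std::vector<Book> makeBookstore()
{
    return {
        {"cooking",  "Everyday Italian", "Giada De Laurentis", 2005, 30.00},
        {"children", "Harry Potter and the Half-Blood Prince", "J. K. Rowling", 2005, 29.99},
    };
}
```

Indexing the returned vector with 1 gives the Harry Potter row, which is the same zero-based indexing used for book elements in the rest of the article.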

SAX Model

XML files can be parsed using two different models: SAX (Simple API for XML) and DOM (Document Object Model). A SAX parser traverses the XML document and, as each element is visited, passes its contents back to the calling application through callbacks. When the beginning and ending tags of a section are encountered, e.g. the book sections in the example XML, the caller is notified so that it can keep track of each section and knows that other elements will follow.

Since SAX parsing visits each element one at a time, it is fast and does not make heavy demands on memory. It is also possible to process XML documents of arbitrary size. However, SAX requires the calling application to do all the heavy lifting when it comes to storing the XML field values.
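The following is a rough sketch of what that heavy lifting looks like. The EventHandler interface, the TitleCollector class and the replayBookEvents() function are all hypothetical; they only imitate the callback style of SAX and are not the Xerces SAX API. No real parsing happens here: replayBookEvents() simply replays the events a SAX parser would be expected to emit for one book element.

```cpp
#include <string>
#include <vector>

// Hypothetical event interface, loosely modeled on the SAX callback style.
// These names are illustrative and are not the Xerces SAX API.
struct EventHandler {
    virtual ~EventHandler() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void endElement(const std::string& name) = 0;
};

// The application does the bookkeeping: it remembers which element it is
// inside and stores only the values it cares about (here, book titles).
class TitleCollector : public EventHandler {
    std::string m_current;              // name of the element being visited
    std::vector<std::string> m_titles;  // values kept by the application

  public:
    void startElement(const std::string& name) { m_current = name; }
    void characters(const std::string& text) {
        if (m_current == "title") m_titles.push_back(text);
    }
    void endElement(const std::string&) { m_current.clear(); }
    const std::vector<std::string>& titles() const { return m_titles; }
};

// Stands in for a parser: replays the events a SAX parser would emit for
// one <book> element of the example document.
void replayBookEvents(EventHandler& handler)
{
    handler.startElement("book");
    handler.startElement("title");
    handler.characters("Everyday Italian");
    handler.endElement("title");
    handler.startElement("year");
    handler.characters("2005");
    handler.endElement("year");
    handler.endElement("book");
}
```

Nothing is retained unless the handler stores it, which is why SAX uses so little memory and also why the application has to maintain its own notion of where it is in the document.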

DOM Model

With DOM parsing the entire XML document is read into memory and organized in the form of a tree as shown in the following diagram.

Example XML Document Diagram
Source: XML DOM Node Tree by w3schools.com

The root element is the bookstore and child elements are book. The bookstore is the parent element of the book elements. Each book element is the parent of four child elements, title, author, year and price.

When using DOM it is possible to index through each parent and child element, so the calling application does not have to maintain the document structure as it does with SAX.

The downside of using DOM is that the size of the document you can parse is limited by the amount of memory available to the application, and parsing is less efficient than with SAX because the whole tree must be built in memory before any of it can be used.
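As a rough illustration of the idea, the sketch below builds a tiny in-memory tree for a trimmed version of the bookstore document and looks elements up by tag. The Node struct, findByTag() and buildBookstore() are hypothetical names invented for this example; the real Xerces DOM classes are covered later and carry far more state.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical in-memory node; a real DOM node carries much more state.
struct Node {
    std::string tag;
    std::string text;                             // empty for container elements
    std::vector<std::unique_ptr<Node>> children;

    Node(const std::string& t, const std::string& v = "") : tag(t), text(v) {}

    // Appends a child and returns a pointer to it so it can be filled in.
    Node* add(const std::string& t, const std::string& v = "") {
        children.push_back(std::unique_ptr<Node>(new Node(t, v)));
        return children.back().get();
    }
};

// Collects every descendant with the given tag in document order.
void findByTag(const Node& node, const std::string& tag,
               std::vector<const Node*>& out)
{
    for (std::size_t i = 0; i < node.children.size(); ++i) {
        const Node& child = *node.children[i];
        if (child.tag == tag) out.push_back(&child);
        findByTag(child, tag, out);
    }
}

// Builds a tree for a trimmed version of the example document.
std::unique_ptr<Node> buildBookstore()
{
    std::unique_ptr<Node> root(new Node("bookstore"));
    Node* first = root->add("book");
    first->add("title", "Everyday Italian");
    first->add("price", "30.00");
    Node* second = root->add("book");
    second->add("title", "Harry Potter and the Half-Blood Prince");
    second->add("price", "29.99");
    return root;
}
```

Because the whole tree stays in memory, the caller can ask for the second book and then for its price in any order, at the cost of holding every node at once.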

Xerces Installation

Before diving into the XML DOM parsing API, let’s go over how to install Xerces. You can get the Xerces library in binary form for various platforms, but I built my example on Mac OS so I elected to build from source.

  1. Download Xerces 3.1.1 from the download site.
  2. Place the tarball in your home directory or wherever.
  3. tar zxvf xerces-c-3.1.1.tar.gz
  4. cd xerces-c-3.1.1/
  5. ./configure
  6. make
  7. sudo su
  8. make install

This will place the Xerces headers and library in /usr/local on your system.

Xerces Platform Initialization

Before we do any parsing, the Xerces platform must first be initialized, which involves the following three steps:

  1. Call XMLPlatformUtils::Initialize()
  2. Create a XercesDOMParser object.
  3. Create an error handler for the parser.

For convenience we’ll do these three steps in a single function call.

XercesDOMParser*   parser = NULL;
ErrorHandler*      errorHandler = NULL;

void createParser()
{
    if (!parser)
    {
        XMLPlatformUtils::Initialize();
        parser = new XercesDOMParser();
        errorHandler = (ErrorHandler*) new XmlDomErrorHandler();
        parser->setErrorHandler(errorHandler);
    }
}

We only need one parser, so createParser() does the platform initialization and parser creation just once. The error handler class is derived from the Xerces HandlerBase class as follows:

class XmlDomErrorHandler : public HandlerBase
{
  public:
    void fatalError(const SAXParseException &exc) {
        printf("Fatal parsing error at line %d\n", (int)exc.getLineNumber());
        exit(-1);
    }
};

When the parser encounters a fatal error in the XML, it calls fatalError(), which displays an error message indicating the line number of the offending XML and then exits the program.
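The create-once guard in createParser() is enough for a single-threaded program. If several threads might trigger the setup at the same time, one alternative is a function-local static, which C++11 and later initialize exactly once even under concurrent calls. The sketch below shows only that pattern: FakeParser and sharedParser() are hypothetical stand-ins so the example compiles without Xerces, and this is not how the code in this article is written.

```cpp
// Hypothetical stand-in for the parser; it only counts how many times it
// has been constructed so the create-once behavior can be observed.
struct FakeParser {
    static int& constructions() {
        static int count = 0;
        return count;
    }
    FakeParser() { ++constructions(); }
};

// Returns the single shared parser. The static local is constructed on the
// first call only; since C++11 that initialization is also thread-safe.
FakeParser& sharedParser()
{
    static FakeParser parser;
    return parser;
}
```

Every call to sharedParser() returns a reference to the same object, so the constructor, where the real setup would go, runs a single time.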

XmlDomDocument Class

The XmlDomDocument class encapsulates the Xerces DOM API. The class interface and definition are contained in the XmlDomDocument.h and XmlDomDocument.cpp files respectively.  Note that the createParser() code in the previous section is also defined in the XmlDomDocument.cpp file.

#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/sax/HandlerBase.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <string>

using namespace std;
using namespace xercesc;

class XmlDomDocument
{
    DOMDocument* m_doc;

  public:
    XmlDomDocument(const char* xmlfile);
    ~XmlDomDocument();

    string getChildValue(const char* parentTag, int parentIndex, 
                         const char* childTag);
    string getAttributeValue(const char* elementTag,  
                             int elementIndex, 
                             const char* attributeTag);
    int getChildCount(const char* parentTag, int parentIndex, 
                      const char* childTag);

  private:
    XmlDomDocument();
    XmlDomDocument(const XmlDomDocument&);
};

Constructor

The constructor calls createParser(), which is defined in the XmlDomDocument.cpp file and visible outside this file, to initialize the Xerces platform. It then calls XercesDOMParser::parse() to parse the given XML file and XercesDOMParser::adoptDocument() to take ownership of the resulting DOMDocument object, the pointer to which is stored in the m_doc member variable. The XmlDomDocument default and copy constructors are declared private since we only want this object created one way, with a constructor that accepts the name of the XML file to be parsed.

XmlDomDocument::XmlDomDocument(const char* xmlfile) : m_doc(NULL)
{
    createParser();
    // parse() returns void; adoptDocument() hands ownership of the
    // parsed DOMDocument to the caller.
    parser->parse(xmlfile);
    m_doc = parser->adoptDocument();
}

Destructor

Since the DOMDocument is “adopted” by the XmlDomDocument, we must release the memory consumed by the document when the XmlDomDocument is destroyed.

XmlDomDocument::~XmlDomDocument()
{
    if (m_doc) m_doc->release();
}

Get Child Element Value

XmlDomDocument::getChildValue() takes the name of a parent tag, the index of that parent tag in the XML file and the name of a child tag. For example, if I want to get the price of the Harry Potter book from the example XML file, the parent tag is “book”, the parent index is “1” (as in C/C++, indexing starts from 0) and the child tag is “price”.

string XmlDomDocument::getChildValue(const char* parentTag,
                                     int parentIndex,
                                     const char* childTag)
{
    XMLCh* temp = XMLString::transcode(parentTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);

    DOMElement* parent =
        dynamic_cast<DOMElement*>(list->item(parentIndex));
    if (!parent) return "";   // no parent element at this index

    // Transcode the child tag into a temporary so it can be released.
    temp = XMLString::transcode(childTag);
    DOMElement* child =
        dynamic_cast<DOMElement*>(
            parent->getElementsByTagName(temp)->item(0));
    XMLString::release(&temp);

    string value;
    if (child) {
        char* temp2 = XMLString::transcode(child->getTextContent());
        value = temp2;
        XMLString::release(&temp2);
    }
    return value;
}

[Lines 5-7] Xerces represents strings as arrays of XMLCh characters rather than char strings, so whenever we want to exchange strings with the platform we must convert them with a call to XMLString::transcode(), which returns an XMLCh pointer when passed a pointer to a character string. The XMLCh pointer is then used in the call to DOMDocument::getElementsByTagName(), which returns a pointer to a DOMNodeList object. After we are done with the transcoded string we must release its memory back to the heap with a call to XMLString::release(). This is a very common Xerces string usage pattern.

[Lines 9-18] In the Xerces DOM model an XML document is a tree with a single root element. The DOMNodeList returned by DOMDocument::getElementsByTagName() holds every element with the given tag, retrievable by index, and each of those parent elements has 0 or more children, retrievable by child name. Getting back to our Harry Potter book example, the root element is “bookstore”, we want the second “book” parent referenced by index “1” and we want the child referenced by name “price”. DOMNodeList::item() returns a pointer to the parent node at the given index, which is cast to a DOMElement pointer. If there is no parent at that index, item() returns NULL and an empty string is returned to the caller. Similarly, the list of matching child elements for this parent is returned by a call to DOMElement::getElementsByTagName(), and its first item is cast to a DOMElement pointer. The function always takes the first item in the child element list, so if a parent has several children with the same tag only the first is returned.

[Lines 20-26] If we get a non-NULL child element, its value is obtained from a call to DOMElement::getTextContent(), which returns a pointer to an XMLCh string that is transcoded, copied to a string object and returned to the caller. Otherwise an empty string is returned.
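Because every transcode() call has to be paired with a release(), it is easy to leak a buffer on an early return. One way to reduce that risk is to tie the release to scope exit with a smart pointer and a custom deleter. The sketch below shows only the pattern: fakeTranscode() and fakeRelease() are hypothetical stand-ins for XMLString::transcode() and XMLString::release() so the example compiles without Xerces, and Releaser, ScopedBuffer and toStdString() are names made up for this illustration.

```cpp
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-ins for XMLString::transcode() and
// XMLString::release(), so the sketch compiles without Xerces.
int g_liveBuffers = 0;   // buffers allocated but not yet released

char* fakeTranscode(const char* s)
{
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    ++g_liveBuffers;
    return copy;
}

// Takes a pointer to the pointer, like the release() calls in the article.
void fakeRelease(char** buf)
{
    delete[] *buf;
    *buf = 0;
    --g_liveBuffers;
}

// Deleter that forwards to the release function when the owner goes away.
struct Releaser {
    void operator()(char* p) const { fakeRelease(&p); }
};
typedef std::unique_ptr<char, Releaser> ScopedBuffer;

// Copies the transcoded text into a std::string; the temporary buffer is
// released automatically on every return path.
std::string toStdString(const char* s)
{
    ScopedBuffer buf(fakeTranscode(s));
    return std::string(buf.get());
}
```

With a wrapper like this, a function can return early without an explicit release call for each temporary string.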

Get Element Attribute Value

Retrieving XML element attribute values is very similar to retrieving child element values. The element tag and index, analogous to the parent tag and index, and the attribute tag are specified. For example, if we wanted the book category for the Harry Potter book, the element tag is “book”, the element index is “1” and the attribute tag is “category”.

string XmlDomDocument::getAttributeValue(const char* elementTag,
                                         int elementIndex,
                                         const char* attributeTag)
{
    XMLCh* temp = XMLString::transcode(elementTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);

    DOMElement* element =
        dynamic_cast<DOMElement*>(list->item(elementIndex));
    if (!element) return "";   // no element at this index

    temp = XMLString::transcode(attributeTag);
    char* temp2 = XMLString::transcode(element->getAttribute(temp));

    string value = temp2;
    XMLString::release(&temp);
    XMLString::release(&temp2);
    return value;
}

[Lines 5-14] Retrieve the attribute value in a manner similar to the child value, except we get the value directly from the element itself with a call to DOMElement::getAttribute(). If there is no element at the given index, an empty string is returned.

[Lines 16-19] Copy the transcoded attribute value to a standard string, release the temporary buffers, then return the string to the caller.

Get Child Count

To get the number of child elements contained under a given parent, we call DOMDocument::getElementsByTagName() with the parent name, which returns a list of parent elements. We get the parent element at parentIndex then call DOMElement::getElementsByTagName(), this time with the childTag. As before, this gives us a pointer to a DOMNodeList from which we can get the child count directly with a call to DOMNodeList::getLength().

int XmlDomDocument::getChildCount(const char* parentTag, int parentIndex, 
                                  const char* childTag)
{
    XMLCh* temp = XMLString::transcode(parentTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);

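    DOMElement* parent = dynamic_cast<DOMElement*>(list->item(parentIndex));
    if (!parent) return 0;   // no parent element at this index

    // Transcode the child tag into a temporary so it can be released.
    temp = XMLString::transcode(childTag);
    DOMNodeList* childList = parent->getElementsByTagName(temp);
    XMLString::release(&temp);
    return (int)childList->getLength();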
}

Test Application

Code

The test application is defined in the main.cpp file. It parses the example XML file, gets the attribute and child values of every book, then prints the values to stdout.

#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <iostream>
#include "XmlDomDocument.h"

int main(int argc, char** argv)
{
    string value;
    XmlDomDocument* doc = new XmlDomDocument("./bookstore.xml");
    if (doc) {
        for (int i = 0; i < doc->getChildCount("bookstore", 0, "book"); i++) {
            printf("Book %d\n", i+1);
            value = doc->getAttributeValue("book", i, "category");
            printf("book category - %s\n", value.c_str());
            value = doc->getChildValue("book", i, "title");
            printf("book title    - %s\n", value.c_str());
            value = doc->getChildValue("book", i, "author");
            printf("book author   - %s\n", value.c_str());
            value = doc->getChildValue("book", i, "year");
            printf("book year     - %s\n", value.c_str());
            value = doc->getChildValue("book", i, "price");
            printf("book price    - %s\n", value.c_str());
        }
        delete doc;
    }
    exit(0);
}

Build and Run

You can get the source code for the project from GitHub - https://github.com/vichargrave/xmldom.git. To build it, just cd into the project directory and type make.

After building the test app run it as follows:

$ ./xmldom 
Book 1
book category - cooking
book title    - Everyday Italian
book author   - Giada De Laurentis
book year     - 2005
book price    - 30.00
Book 2
book category - children
book title    - Harry Potter and the Half-Blood Prince
book author   - J. K. Rowling
book year     - 2005
book price    - 29.99


23 Responses to XML Parsing with DOM in C++

  2. mike says:

    Great roundup on xml parsing, i was wondering whether you have used an xml editor at all for parsing or what you think of them as a parsing tool?

    • vic says:

      They are fine for visualizing an entire file, but I don’t know of any that you can embed in your programs, like sed, awk or xerces, to parse files.

  3. Michael Knafo says:

    Thanks, it helped me a lot, your example is simple and clear.

  4. Boris Nasir says:

    Thank you for your information this is a very good example for beginners but when I try to execute your example I get

    make: *** No targets specified and no makefile found. Stop.

    What is the problem ?

    • Boris Nasir says:

      Sorry, I named the Makefile as MakeFile, this correction solved the problem.

      However this time I get a lot of undefined reference error and exit with
      make: *** [xmldom] Error 1

      full form of error :

      g++ -lxerces-c main.o xmldom.o -o xmldom
      main.o: In function `xercesc_3_1::XMLAttDefList::~XMLAttDefList()':
      main.cpp:(.text._ZN11xercesc_3_113XMLAttDefListD2Ev[_ZN11xercesc_3_113XMLAttDefListD5Ev]+0x37): undefined reference to `xercesc_3_1::XMemory::operator delete(void*)'
      main.o: In function `xercesc_3_1::XMLAttDefList::~XMLAttDefList()':
      main.cpp:(.text._ZN11xercesc_3_113XMLAttDefListD0Ev[_ZN11xercesc_3_113XMLAttDefListD5Ev]+0x20): undefined reference to `xercesc_3_1::XMemory::operator delete(void*)'
      main.o: In function `xercesc_3_1::DTDEntityDecl::~DTDEntityDecl()':

      collect2: ld returned 1 exit status
      make: *** [xmldom] Error 1

      • vic says:

        It appears you don’t have xerces installed. Download and install it then you should get better results.

        • Boris Nasir says:

          Thank you for your answer, but the problem was your makefile :) I tried it with the makefile below and it worked.

          • vic says:

            OK glad you got it to work. However the Makefile I provided works fine on Linux and Mac OS. There must have been a problem with your installation.

  5. Chris says:

    Thanks a lot for this very good tutorial. Unfortunatelly, the makefile did not work for me. Instead I used the following makefile:

    CPPFLAGS=-g -ggdb3
    LDFLAGS=-g -ggdb3
    LDLIBS= -lxerces-c

    SRCS=$(wildcard *.cpp)
    OBJS=$(subst .cpp,.o,$(SRCS))

    OUTFILE=xmldemo
    all: $(OUTFILE)

    $(OUTFILE): $(OBJS)
    	g++ $(LDFLAGS) -o $(OUTFILE) $(OBJS) $(LDLIBS)

    depend: .depend

    .depend: $(SRCS)
    	rm -f ./.depend
    	$(CXX) $(CPPFLAGS) -MM $^>>./.depend;

    clean:
    	$(RM) $(OBJS)

    dist-clean: clean
    	$(RM) *~ .dependtool

    include .depend

    (taken from http://stackoverflow.com/questions/2481269/how-to-make-simple-c-makefile )

    PS.: The Captcha is almost impossible to get right and the comment is gone afterwards…

  6. parveen says:

    Really good work…thanks a lot.

  7. Elvis says:

    Hi
    I have problem with parsing a string instead of an xml file. I tried with MemBufInputSource but still XercesDOMParser can’t parse it. here is my source code:

    XercesDOMParser* parser = new XercesDOMParser();
    ErrorHandler* errHandler = (ErrorHandler*) new HandlerBase();
    parser->setErrorHandler(errHandler);

    MemBufInputSource source((const XMLByte*) clearInput.c_str(), clearInput.length(), "dummy");
    parser->parse( source );

    • vic says:

      I’m not familiar with the MemBufInputSource class, but there is an example of how to use it in the xerces-c-3.1.1/samples/src/MemParse/MemParse.cpp file.

  8. Soumya Prasad Ukil says:

    Does it support xpath-based query?

  9. kp says:

    vichargrave ur code is working fine..but when i try to build in vc++ 2010..i am getting a error like “fatal error parsing line 0”…fatal error comes only when the xml file is corrupted but when i tried to parse with some example i am getting the same error..can u pls suggest?..thanks in advance

  10. milton ortiz says:

    great article, the better and clearer i’ve seen regarded to xerces, do you plan making a tutorial about parsing with sax? i need to read a xml of maybe 250 articles similar to your bookstore and the memory available is pretty limited, is sax the more convenient way to do this? how can i determine the memory amount, jus a vage idea is helpfull.
    congrats for your nice tutorial

  11. milton ortiz says:

    sorry if duplicated, i just don’t see my previous intend…
    i liked a lot you information, pretty usefull, is there any chance you could make a sax parsing tutorial? i am trying to read a xml, it is similar structure as your bookstore example, maybe 250 items and i wonder if sax is the recommended approach since i am pretty short on memory.
    there’s any way to have an idea on the memory to be used?
    thanks a lot in advance

    • vic says:

      Sorry I have to approve comments before they appear. I get some weird stuff sometimes that I have to filter.

      SAX parsing will work if your application is memory constrained. Note, however, it does not let you search for fields the way a DOM parsing scheme does. Here is a pretty good tutorial on SAX parsing: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/.

      • milton ortiz says:

        thanks a lot for your answer, i’ve seen that tutorial but i need this to be done in c++ because a library is written in c++, i’ll try dom method to see if it works in my project.
        another question and is the last one, how can i search for the parent by name and not by index in your example? let’s say i have 250 items and i want to find “alice in wonderland” and retrieve all it’s child values but i don’t now the index of that book?
        really appreciate your help

  12. Eugin says:

    Helpful article. Please, what do you know about standart methods in C++ for DOM in last Microsoft libraries?

  13. Grant says:

    Have you ever seen this? The default error handler is called if an XML size is over 700k bytes, on Solaris 10:

    =>[1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffbffeff, 0x0), at 0xff24ebd4
    [2] raise(0x6, 0x0, 0xff2c7080, 0xff22e0f0, 0xffffffff, 0x6), at 0xff1e7bb0
    [3] abort(0x21133238, 0x1, 0xff0f54b4, 0xffb04, 0xff2c5518, 0x0), at 0xff1c29f0
    [4] __Cimpl::default_terminate(0x21133238, 0xff2c7940, 0x1c00, 0x1793c, 0x0, 0xff0f5010), at 0xff0f5014
    [5] __Cimpl::ex_terminate(0xff10d618, 0x0, 0x0, 0xff10d618, 0xff10cd10, 0x1), at 0xff0f4e24
    ---- hidden frames, use 'where -h' to see them all ----
    [8] xercesc_3_1::AbstractDOMParser::parse(0x18dc010, 0x9e27b950, 0xb4db9000, 0xb4621a4c, 0xb4642c00, 0x0), at 0xb445a878

    Any idea how to handle this.

    Appreciate your help!

    • vic says:

      I have not seen that, but then I have not been working with XML documents that big. It may be that you are running into the limits of what DOM can handle. You may want to consider using SAX parsing which does not load the entire parsed document into memory.
