Having the ability to parse XML files is a requirement for a lot of applications these days. XML is a standard format for exchanging data between programs and storing configuration data.
If you want to parse XML documents in C++ you can benefit from using an external library like the Xerces-C++ XML Parser. Xerces provides an elaborate, but somewhat complex API for navigating XML files. To simplify matters, I’ll describe two C++ classes that encapsulate the Xerces calls to index and retrieve XML element values and attributes.
XML Parsing Models
XML Elements
XML documents consist of elements that are denoted by beginning and ending tags. XML elements are of the general form:
<element attribute>
  <element>value</element>
</element>
where value consists of either a string value or additional XML elements. An attribute is a value associated with the given element.
Here is an example of an XML document that is intended to represent two books contained in a bookstore. The bookstore element contains two book elements, each with a category attribute. Each book element contains fields that describe the book.
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter and the Half-Blood Prince</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
The bookstore in this case is analogous to a database table with two book rows, and the title, author, year and price fields are the columns of the rows.
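To make the table analogy concrete, here is a rough sketch of the same data held as rows of a plain C++ struct. The Book struct and makeBookstore() are made-up names for illustration only; they are not part of Xerces or of the classes described later.

```cpp
#include <string>
#include <vector>

// One "row" of the bookstore "table". Values are kept as strings,
// exactly as they appear in the XML text.
struct Book {
    std::string category;  // the category attribute of <book>
    std::string title;
    std::string author;
    std::string year;
    std::string price;
};

// Builds the two rows of the example document by hand (no parsing involved).
std::vector<Book> makeBookstore()
{
    return {
        {"cooking", "Everyday Italian", "Giada De Laurentis", "2005", "30.00"},
        {"children", "Harry Potter and the Half-Blood Prince", "J. K. Rowling", "2005", "29.99"},
    };
}
```

With this layout, makeBookstore()[1].price plays the role of "the price column of the second row".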
SAX Model
XML files can be parsed using two different models, SAX (Simple API for XML) and DOM (Document Object Model). SAX parsing is event driven: the parser traverses the XML document and, as each XML element is visited, passes its contents back to the calling application. When the beginning and ending tags of a section, e.g. the book sections in the example XML, are encountered, the caller is notified so it can keep track of each section and knows that other elements will follow.
Since SAX parsing visits each element one at a time, it is fast and does not make heavy demands on memory, which also makes it possible to process XML documents of arbitrary size. However, SAX requires the calling application to do all the heavy lifting when it comes to storing the XML field values.
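The event-driven idea can be sketched with a toy scanner written in standard C++. This is not the Xerces SAX API; SaxCallbacks and toySaxParse() are hypothetical names, and the scanner only understands start tags, end tags and plain text (no comments, CDATA, entities, self-closing tags or error recovery).

```cpp
#include <functional>
#include <string>

// Callbacks the scanner invokes as it walks the text (hypothetical names).
struct SaxCallbacks {
    std::function<void(const std::string&)> startElement;
    std::function<void(const std::string&)> endElement;
    std::function<void(const std::string&)> characters;
};

// Toy event-driven scan of an XML string. Nothing is stored by the scanner;
// every piece of the document is handed to the caller as it is encountered.
void toySaxParse(const std::string& xml, const SaxCallbacks& cb)
{
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            std::string::size_type close = xml.find('>', pos);
            if (close == std::string::npos) break;  // malformed input: stop
            std::string tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/') {
                if (cb.endElement) cb.endElement(tag.substr(1));
            } else {
                // The element name ends at the first whitespace; attributes are skipped.
                std::string::size_type space = tag.find_first_of(" \t\r\n");
                if (cb.startElement) cb.startElement(tag.substr(0, space));
            }
            pos = close + 1;
        } else {
            std::string::size_type next = xml.find('<', pos);
            if (next == std::string::npos) next = xml.size();
            std::string text = xml.substr(pos, next - pos);
            // Report only text that contains something other than whitespace.
            if (cb.characters && text.find_first_not_of(" \t\r\n") != std::string::npos)
                cb.characters(text);
            pos = next;
        }
    }
}
```

A caller that wants, say, every book title has to remember on its own that it is currently inside a title element, which is the "heavy lifting" mentioned above.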
DOM Model
With DOM parsing the entire XML document is read into memory and organized in the form of a tree as shown in the following diagram.
Example XML Document Diagram
Source: XML DOM Node Tree by w3schools.com
The root element is the bookstore and child elements are book. The bookstore is the parent element of the book elements. Each book element is the parent of four child elements, title, author, year and price.
When using DOM it is possible to index through each parent and child element, so the calling application does not have to maintain the document structure as it does with SAX.
The downside of using DOM is that the size of the document you can parse is limited by the amount of memory the application has to work with, and parsing is less efficient than with SAX.
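As a rough illustration of what "the whole document as a tree" means, here is a minimal in-memory tree in standard C++. Node, addChild() and collectByTag() are hypothetical names and are far simpler than the Xerces DOM classes; the sketch only shows the parent/child structure and a lookup by tag name in document order.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Minimal element node: a tag, optional text and owned child elements.
struct Node {
    std::string tag;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Appends a child element and returns a pointer to it for further building.
Node* addChild(Node& parent, const std::string& tag, const std::string& text = "")
{
    std::unique_ptr<Node> child(new Node);
    child->tag = tag;
    child->text = text;
    parent.children.push_back(std::move(child));
    return parent.children.back().get();
}

// Collects every descendant of node with the given tag, in document order.
void collectByTag(const Node& node, const std::string& tag,
                  std::vector<const Node*>& out)
{
    for (const auto& child : node.children) {
        if (child->tag == tag) out.push_back(child.get());
        collectByTag(*child, tag, out);
    }
}
```

Because the whole tree stays in memory, the application can come back to any element by index at any time, at the cost of holding the entire document.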
Xerces Installation
Before diving into the XML DOM parsing API, let’s go over how to install Xerces. You can get the Xerces library in binary form for various platforms, but I built my example on macOS so I elected to build from source.
- Download Xerces 3.1.1 from the download site.
- Place the tarball in your home directory or wherever.
- tar zxvf xerces-c-3.1.1.tar.gz
- cd xerces-c-3.1.1/
- ./configure
- make
- sudo su
- make install
This will place the Xerces headers and library in /usr/local on your system.
XmlDOMParser Class
Interface
Our DOM parsing API consists of two classes.
- XmlDOMParser - encapsulates the Xerces API to parse XML with the DOM model.
- XmlDOMDocument - uses the XmlDOMParser to parse a given document and provides methods for retrieving XML element values from this document.
For this discussion, both class interfaces will be included in a single file - xmldom.h - that starts off with the required Xerces and Standard C++ headers. Both the std and xercesc namespaces are brought into scope with using directives so calls to either facility do not have to be explicitly qualified with std:: or xercesc::. Similarly the methods for both classes will be defined in a single file - xmldom.cpp.
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/sax/HandlerBase.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <string>
using namespace std;
using namespace xercesc;
The XmlDOMParser class contains pointers to the XercesDOMParser and ErrorHandler objects, which are created when the Xerces framework is initialized. Notice that the constructor is declared as private. As you’ll see in the next section, the constructor has the job of initializing the Xerces platform, which should only be done once, so the XmlDOMParser has to be declared as a Singleton class. Calling applications can only get the one and only XmlDOMParser object through a call to getInstance().
class XmlDOMParser
{
    XercesDOMParser* m_parser;
    ErrorHandler*    m_errHandler;
  public:
    ~XmlDOMParser();
    // static so it can be called without an existing object
    static XmlDOMParser* getInstance();
    DOMDocument* parse(const char* xmlfile);
  private:
    XmlDOMParser();
};
The parse() method accepts the name of an XML file to parse and returns the XML element objects in a Xerces DOMDocument.
Construction
The Xerces platform should only be initialized once so the XmlDOMParser is constructed as a Singleton object. To do this the XmlDOMParser constructor is declared private and a static getInstance() method is provided to create and return a pointer to the parser.
The domParser variable is statically defined in the xmldom.cpp file. It will contain a pointer to the one and only XmlDOMParser object, which is created by getInstance(). When getInstance() is called it will create the XmlDOMParser if domParser is set to NULL, or return a pointer to the existing XmlDOMParser if the instance has already been created.
#include <stdio.h>
#include <stdlib.h>   // exit(), used by the error handler
#include "xmldom.h"
static XmlDOMParser* domParser = NULL;
XmlDOMParser* XmlDOMParser::getInstance()
{
    // Create the parser on first use, then hand back the same object
    if (domParser == NULL) {
        domParser = new XmlDOMParser();
    }
    return domParser;
}
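As an aside, a lazily created pointer is not the only way to build a singleton. A minimal sketch of a common alternative, assuming a C++11 compiler, uses a function-local static object. Platform is a hypothetical stand-in class with a counter added purely so the behavior can be observed; it is not the XmlDOMParser from this article.

```cpp
// Hypothetical stand-in for a class that must perform one-time setup.
class Platform {
  public:
    // Returns the single instance, constructed on the first call.
    // Since C++11 this initialization is guaranteed to be thread-safe.
    static Platform& getInstance()
    {
        static Platform instance;
        return instance;
    }
    int initCount() const { return s_initCount; }

    // Copying would defeat the purpose of a singleton.
    Platform(const Platform&) = delete;
    Platform& operator=(const Platform&) = delete;

  private:
    Platform() { ++s_initCount; }  // one-time initialization would go here
    static int s_initCount;
};

int Platform::s_initCount = 0;
```

Unlike the pointer-based version, this instance is destroyed automatically at program exit, so callers must not delete it.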
The XmlDOMParser default constructor initializes the Xerces platform, creates XercesDOMParser and XmlDOMErrorHandler objects, then sets the XercesDOMParser error handler to point to the XmlDOMErrorHandler object.
XmlDOMParser::XmlDOMParser() : m_parser(NULL), m_errHandler(NULL)
{
    XMLPlatformUtils::Initialize();
    m_parser = new XercesDOMParser();
    m_errHandler = (ErrorHandler*) new XmlDOMErrorHandler();
    m_parser->setErrorHandler(m_errHandler);
}
Error Handling
The HandlerBase class is derived from ErrorHandler and supplies methods that are called when the XercesDOMParser encounters errors while parsing an XML file. As shown below, XmlDOMErrorHandler is in turn subclassed from HandlerBase and overrides HandlerBase::fatalError(). The overridden method prints a message to stdout giving the line number where the fatal parsing error was detected, then exits the application.
class XmlDOMErrorHandler : public HandlerBase
{
  public:
    void fatalError(const SAXParseException &exc) {
        printf("Fatal parsing error at line %d\n",
               (int)exc.getLineNumber());
        exit(-1);
    }
};
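Exiting from inside the handler is the simplest choice, but it takes the decision away from the caller. The sketch below shows the same override-one-callback pattern with a handler that throws instead. ParseProblem, ErrorSink and ThrowingErrorSink are hypothetical stand-ins written in plain C++; they only mimic the shape of the HandlerBase arrangement and are not Xerces classes.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical description of a parsing problem.
struct ParseProblem {
    int line;
    std::string message;
};

// Base class with do-nothing defaults, so subclasses override only what they need.
class ErrorSink {
  public:
    virtual ~ErrorSink() {}
    virtual void warning(const ParseProblem&) {}
    virtual void fatalError(const ParseProblem&) {}
};

// Reports fatal errors by throwing, so the caller decides whether to
// abort, retry or skip the file.
class ThrowingErrorSink : public ErrorSink {
  public:
    void fatalError(const ParseProblem& p) override
    {
        throw std::runtime_error("Fatal parsing error at line " +
                                 std::to_string(p.line) + ": " + p.message);
    }
};
```

The calling code would then wrap its parse call in a try/catch block and handle std::runtime_error as it sees fit.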
Destructor
The XmlDOMParser destructor deletes the Xerces parser and the error handler if the Xerces platform has been initialized, i.e. when the m_parser member variable has a non-NULL value assigned to it. Then XMLPlatformUtils::Terminate() is called to shut down the Xerces facility. Presumably this would happen only when the application exits. Finally the domParser pointer is set to NULL in the event that another XmlDOMParser is created before the application terminates.
XmlDOMParser::~XmlDOMParser()
{
    if (m_parser) {
        delete m_parser;
        delete m_errHandler;   // not owned by the parser, so free it here
        XMLPlatformUtils::Terminate();
        domParser = NULL;
    }
}
Parse an XML File
The XercesDOMParser::parse() is called to parse the given XML file. XercesDOMParser::adoptDocument() returns a pointer to a DOMDocument which is a Xerces native object that will be used to create the XmlDOMDocument object described in the next section. Adopting a document from the Xerces platform means the caller is responsible for releasing the document memory when the caller is done with it.
DOMDocument* XmlDOMParser::parse(const char* xmlfile)
{
    m_parser->parse(xmlfile);
    return m_parser->adoptDocument();
}
XmlDOMDocument Class
Interface
The XmlDOMDocument default and copy constructors are declared private since we only want this object created one way, with a constructor that accepts an XmlDOMParser pointer and the name of the XML file to be parsed.
class XmlDOMDocument
{
    DOMDocument* m_doc;
  public:
    XmlDOMDocument(XmlDOMParser* parser, const char* xmlfile);
    ~XmlDOMDocument();
    string getChildValue(const char* parentTag, int parentIndex,
                         const char* childTag);
    string getAttributeValue(const char* elementTag,
                             int elementIndex,
                             const char* attributeTag);
    int getElementCount(const char* elementTag);
  private:
    XmlDOMDocument();
    XmlDOMDocument(const XmlDOMDocument&);
};
The three get methods wrap the complexities of the Xerces XML element search and value retrieval.
Constructor
The constructor is simple. It just calls the XmlDOMParser::parse() method to parse the given XML file and produce a DOMDocument object, a pointer to which is stored in the m_doc member variable.
XmlDOMDocument::XmlDOMDocument(XmlDOMParser* parser,
                               const char* xmlfile) : m_doc(NULL)
{
    m_doc = parser->parse(xmlfile);
}
Destructor
Since the DOMDocument is “adopted” by the XmlDOMDocument, we must release the memory consumed by the document when XmlDOMDocument is destroyed.
XmlDOMDocument::~XmlDOMDocument()
{
    if (m_doc) m_doc->release();
}
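Objects that must be freed with a release() call rather than delete can also be managed with a smart pointer. Here is a minimal sketch, assuming C++11, of a std::unique_ptr with a custom deleter. Releasable, Releaser and ReleasePtr are hypothetical names; Releasable only imitates the "free me through release()" contract and is not a Xerces type.

```cpp
#include <memory>

// Hypothetical stand-in for an object that must be freed via release().
struct Releasable {
    explicit Releasable(int* counter) : m_counter(counter) {}
    // Records the call in the counter, then frees the object.
    void release() { ++*m_counter; delete this; }
  private:
    ~Releasable() {}   // private: callers are forced to go through release()
    int* m_counter;
};

// Deleter that calls release() when the owning smart pointer goes away.
struct Releaser {
    template <typename T>
    void operator()(T* p) const { if (p) p->release(); }
};

// unique_ptr that releases instead of deleting.
template <typename T>
using ReleasePtr = std::unique_ptr<T, Releaser>;
```

A class holding a ReleasePtr member would not need a hand-written destructor for that member at all.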
Get Child Element Value
XmlDOMDocument::getChildValue() takes the name of a parent tag, the index of the parent tag in the XML file and the name of the child tag. For example, if I want to get the price of the Harry Potter book from the example XML file, the parent tag is “book”, the parent index is “1” (as in C/C++, indexing starts from 0) and the child tag is “price”.
string XmlDOMDocument::getChildValue(const char* parentTag,
                                     int parentIndex,
                                     const char* childTag)
{
    XMLCh* temp = XMLString::transcode(parentTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);
    DOMElement* parent =
        dynamic_cast<DOMElement*>(list->item(parentIndex));
    if (!parent) return "";   // no parent element at this index
    temp = XMLString::transcode(childTag);
    DOMElement* child =
        dynamic_cast<DOMElement*>(parent->getElementsByTagName(temp)->item(0));
    XMLString::release(&temp);
    string value;
    if (child) {
        char* temp2 = XMLString::transcode(child->getTextContent());
        value = temp2;
        XMLString::release(&temp2);
    }
    return value;
}
[Lines 5-7] Xerces does not use C or C++ strings internally. It works with strings of XMLCh characters, so whenever we want to exchange strings with the platform we must convert them with a call to XMLString::transcode(), which returns an XMLCh pointer when passed a pointer to a character string. The XMLCh pointer is then used in the call to DOMDocument::getElementsByTagName(), which returns a pointer to a DOMNodeList object. After we are done with the transcoded string we must release its memory back to the heap with a call to XMLString::release(). This is a very common Xerces string usage pattern.
[Lines 8-14] In the Xerces DOM model an XML document has a single root element with 0 or more parent elements, retrievable by index from a DOMNodeList, and each parent has 0 or more children, retrievable by child name. Getting back to our Harry Potter book example, the root element is “bookstore”, we want the second “book” parent referenced by index “1” and we want the child referenced by name “price”. DOMNodeList::item() returns a pointer to the parent node at the given index, which is cast to a DOMElement pointer. If there is no parent at that index an empty string is returned right away. Similarly a list of matching child elements for this parent is returned by a call to DOMElement::getElementsByTagName(), and the first item in that list is cast to a DOMElement pointer. The child tag is transcoded and released in the same way as the parent tag. The child element we want is always the first item in the child element list.
[Lines 15-22] If we get a non-NULL child element, its value can be obtained from a call to DOMElement::getTextContent(), which returns a pointer to an XMLCh string that is transcoded, copied to a string object and returned to the caller. Otherwise an empty string is returned.
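Every acquire call that needs a matching release call is a chance to leak memory on an early return. One way to guard against that is a small scope guard, sketched here in standard C++. acquireCopy() and releaseCopy() are hypothetical stand-ins with the same shape as a transcode/release pair (the release function takes a pointer to the buffer pointer), and g_liveBuffers exists only so the behavior can be checked; none of these names come from Xerces.

```cpp
#include <cstring>
#include <string>

// Number of buffers handed out and not yet given back (for illustration).
static int g_liveBuffers = 0;

// Hypothetical allocator: returns a heap copy the caller must release.
char* acquireCopy(const char* src)
{
    char* buf = new char[std::strlen(src) + 1];
    std::strcpy(buf, src);
    ++g_liveBuffers;
    return buf;
}

// Hypothetical release: frees the buffer and nulls the caller's pointer.
void releaseCopy(char** buf)
{
    delete[] *buf;
    *buf = nullptr;
    --g_liveBuffers;
}

// Owns one buffer and releases it when the guard goes out of scope,
// including on an early return or an exception.
class ScopedCopy {
  public:
    explicit ScopedCopy(const char* src) : m_buf(acquireCopy(src)) {}
    ~ScopedCopy() { releaseCopy(&m_buf); }
    ScopedCopy(const ScopedCopy&) = delete;
    ScopedCopy& operator=(const ScopedCopy&) = delete;
    const char* get() const { return m_buf; }
  private:
    char* m_buf;
};
```

The same idea could be applied to transcoded strings by swapping the two stand-in functions for the real pair, though I have not done that in the classes described here.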
Get Element Attribute Value
Retrieving XML element attribute values is very similar to retrieving child element values. The element tag and index, analogous to the parent tag and index, and the attribute tag are specified. For example, if we wanted the book category for the Harry Potter book, the element tag is “book”, the element index is “1” and the attribute tag is “category”.
string XmlDOMDocument::getAttributeValue(const char* elementTag,
                                         int elementIndex,
                                         const char* attributeTag)
{
    XMLCh* temp = XMLString::transcode(elementTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);
    DOMElement* element =
        dynamic_cast<DOMElement*>(list->item(elementIndex));
    if (!element) return "";   // no element at this index
    temp = XMLString::transcode(attributeTag);
    char* temp2 = XMLString::transcode(element->getAttribute(temp));
    string value = temp2;
    XMLString::release(&temp);
    XMLString::release(&temp2);
    return value;
}
[Lines 5-12] Retrieve the attribute value in a manner similar to the child value, except we get the value directly from the element itself. If there is no element at the given index an empty string is returned.
[Lines 13-16] Copy the transcoded attribute value to a standard string, release the temporary strings, then return the value to the caller.
Get Element Count
To get the number of elements with a specified tag name, we just call DOMDocument::getElementsByTagName() with the element name. As before this gives us a pointer to a DOMNodeList from which we can get the element count directly with a call to DOMNodeList::getLength().
int XmlDOMDocument::getElementCount(const char* elementTag)
{
    XMLCh* temp = XMLString::transcode(elementTag);
    DOMNodeList* list = m_doc->getElementsByTagName(temp);
    XMLString::release(&temp);
    return (int)list->getLength();
}
Test Application
Code
The test application is defined in the main.cpp file. It parses the bookstore.xml file, then gets the attribute and child values of every book and prints the values to stdout.
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <iostream>
#include "xmldom.h"
int main(int argc, char** argv)
{
    string value;
    XmlDOMParser* parser = XmlDOMParser::getInstance();
    if (parser) {
        XmlDOMDocument* doc = new XmlDOMDocument(parser, "./bookstore.xml");
        if (doc) {
            for (int i = 0; i < doc->getElementCount("book"); i++) {
                printf("Book %d\n", i+1);
                value = doc->getAttributeValue("book", i, "category");
                printf("book category - %s\n", value.c_str());
                value = doc->getChildValue("book", i, "title");
                printf("book title - %s\n", value.c_str());
                value = doc->getChildValue("book", i, "author");
                printf("book author - %s\n", value.c_str());
                value = doc->getChildValue("book", i, "year");
                printf("book year - %s\n", value.c_str());
                value = doc->getChildValue("book", i, "price");
                printf("book price - %s\n", value.c_str());
            }
            delete doc;
        }
        delete parser;
    }
    exit(0);
}
Build and Run
You can get the source code for the project from Github - https://github.com/vichargrave/xmldom.git. To build it just cd into the project directory and type make.
After building the test app run it as follows:
$ ./xmldom
Book 1
book category - cooking
book title - Everyday Italian
book author - Giada De Laurentis
book year - 2005
book price - 30.00
Book 2
book category - children
book title - Harry Potter and the Half-Blood Prince
book author - J. K. Rowling
book year - 2005
book price - 29.99
Author: Vic Hargrave







10 Comments
mike on February 21, 2013 at 7:18 am.
Great roundup on xml parsing, i was wondering whether you have used an xml editor at all for parsing or what you think of them as a parsing tool?
vic on February 21, 2013 at 8:09 am.
They are fine for visualizing an entire file, but I don’t know of any that you can embed in your programs, like sed, awk or xerces, to parse files.
Michael Knafo on April 11, 2013 at 5:47 am.
Thanks, it helped me a lot, your example is simple and clear.
Boris Nasir on April 19, 2013 at 1:24 am.
Thank you for your information this is a very good example for beginners but when I try to execute your example I get
make: *** No targets specified and no makefile found. Stop.
What is the problem ?
Boris Nasir on April 19, 2013 at 3:03 am.
Sorry, I named the Makefile as MakeFile, this correction solved the problem.
However this time I get a lot of undefined reference error and exit with
make: *** [xmldom] Error 1
full form of error :
g++ -lxerces-c main.o xmldom.o -o xmldom
main.o: In function `xercesc_3_1::XMLAttDefList::~XMLAttDefList()':
main.cpp:(.text._ZN11xercesc_3_113XMLAttDefListD2Ev[_ZN11xercesc_3_113XMLAttDefListD5Ev]+0x37): undefined reference to `xercesc_3_1::XMemory::operator delete(void*)'
main.o: In function `xercesc_3_1::XMLAttDefList::~XMLAttDefList()':
main.cpp:(.text._ZN11xercesc_3_113XMLAttDefListD0Ev[_ZN11xercesc_3_113XMLAttDefListD5Ev]+0x20): undefined reference to `xercesc_3_1::XMemory::operator delete(void*)'
main.o: In function `xercesc_3_1::DTDEntityDecl::~DTDEntityDecl()':
…
collect2: ld returned 1 exit status
make: *** [xmldom] Error 1
vic on April 19, 2013 at 8:38 am.
It appears you don’t have xerces installed. Download and install it, then you should get better results.
Boris Nasir on April 23, 2013 at 11:54 pm.
Thank you for your answer, but the problem was your makefile
I tried it with the makefile below and it worked.
vic on April 24, 2013 at 8:12 am.
OK glad you got it to work. However the Makefile I provided works fine on Linux and Mac OS. There must have been a problem with your installation.
Chris on April 20, 2013 at 5:46 pm.
Thanks a lot for this very good tutorial. Unfortunately, the makefile did not work for me. Instead I used the following makefile:
CPPFLAGS=-g -ggdb3
LDFLAGS=-g -ggdb3
LDLIBS= -lxerces-c
SRCS=$(wildcard *.cpp)
OBJS=$(subst .cpp,.o,$(SRCS))
OUTFILE=xmldemo
all: $(OUTFILE)
$(OUTFILE): $(OBJS)
	g++ $(LDFLAGS) -o $(OUTFILE) $(OBJS) $(LDLIBS)
depend: .depend
.depend: $(SRCS)
	rm -f ./.depend
	$(CXX) $(CPPFLAGS) -MM $^>>./.depend;
clean:
	$(RM) $(OBJS)
dist-clean: clean
	$(RM) *~ .dependtool
include .depend
(taken from http://stackoverflow.com/questions/2481269/how-to-make-simple-c-makefile )
PS.: The Captcha is almost impossible to get right and the comment is gone afterwards…