http://www.xml.com/lpt/a/2001/11/14/xml-libxml.html
By Kip Hampton
Introduction
The vast majority of Perl's XML modules are built on top of XML::Parser,
Larry Wall and Clark Cooper's Perl interface to James Clark's expat
parser. The expat-XML::Parser combination is not the only full-featured XML
parser available in the Perl world. This month we'll look at XML::LibXML,
Matt Sergeant and Christian Glahn's Perl interface to Daniel Veillard's
libxml2.
Why Would You Want Yet Another XML Parser?
Expat and XML::Parser have proven themselves to be quite capable, but they
are not without limitations. Expat was among the first XML parsers available
and, as a result, its interfaces reflect the expectations of users at the
time it was written. Expat and XML::Parser do not implement the Document
Object Model, SAX, or XPath language interfaces (things that most modern XML
users take for granted) because each of those interfaces either did not yet
exist or was still being heavily evaluated, and not yet considered
"standard," when the parsers were written.
The somewhat unfortunate result of this is that most of the available Perl
XML modules are built upon one of XML::Parser's non- or not-quite-standard
interfaces with the presumption that the input will be some sort of textual
representation of an XML document (file, filehandle, string, socket stream)
that must be parsed before proceeding. While this works for many simple
cases, most advanced XML applications need to do more than one thing with a
given document and that means that for each stage in the process, the
document must be serialized to a string and then re-parsed by the next
module.
By contrast, libxml2 was written after the DOM, XPath, and SAX interfaces
became common, and so it implements all three. In-memory trees can be built
by parsing documents stored in files, strings, and so on, or generated from
a series of SAX events. Those trees can then be operated on using the W3C
DOM and XPath interfaces or used to generate SAX events that are handed off
to external event handlers. This added flexibility, which reflects current
XML processing expectations, makes XML::LibXML a strong contender for
XML::Parser's throne.
Using XML::LibXML
This month's column may be seen as an addendum to the Perl/XML Quickstart
Guide published earlier this year, when XML::LibXML was in its infancy, and
we'll use the same tests from the Quickstart to put XML::LibXML through its
paces. For a detailed overview of the test cases see the first installment
in the Quickstart; but, to summarize, the two tests illustrate how to
extract and print data from an XML document, and how to build and print,
programmatically, an XML document from data stored in a Perl HASH using the
facilities offered by a given XML module.
Reading
For accessing the data stored in XML documents, XML::LibXML provides a
standard W3C DOM interface. Documents are treated as a tree of nodes and the
data those nodes contain are accessed by calling methods on the node objects
themselves.
use strict;
use XML::LibXML;
my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;
my @species = $root->getElementsByTagName('species');
foreach my $camelid (@species) {
    my $latin_name = $camelid->getAttribute('name');
    my @name_node = $camelid->getElementsByTagName('common-name');
    my $common_name = $name_node[0]->getFirstChild->getData;
    my @c_node = $camelid->getElementsByTagName('conservation');
    my $status = $c_node[0]->getAttribute('status');
    print "$common_name ($latin_name) $status \n";
}
One of the more exciting features of XML::LibXML is that, in addition to the
DOM interface, it allows you to select nodes using the XPath language. The
following illustrates how to achieve the same effect as the previous example
using XPath to select the desired nodes:
use strict;
use XML::LibXML;
my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;
foreach my $camelid ($root->findnodes('species')) {
    my $latin_name = $camelid->findvalue('@name');
    my $common_name = $camelid->findvalue('common-name');
    my $status = $camelid->findvalue('conservation/@status');
    print "$common_name ($latin_name) $status \n";
}
What makes this exciting is that you can mix and match methods from
the DOM and XPath interfaces to best suit the needs of your application,
while operating on the same tree of nodes.
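To make the mix-and-match point concrete, here is a minimal sketch that selects nodes with XPath and then reads from those same node objects with DOM methods. The inline document is a hypothetical stand-in for camelids.xml (its real contents are not reproduced in this column), and it is parsed with parse_string() only so that the example is self-contained.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data standing in for files/camelids.xml.
my $xml = <<'XML';
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Lama guanicoe">
    <common-name>Guanaco</common-name>
    <conservation status="special concern"/>
  </species>
</camelids>
XML

my $parser = XML::LibXML->new();
my $tree   = $parser->parse_string($xml);
my $root   = $tree->getDocumentElement;

my @lines;
# XPath picks out the species that carry a conservation status ...
foreach my $camelid ($root->findnodes('species[conservation/@status]')) {
    # ... and DOM methods read from the very same node objects.
    my $latin_name  = $camelid->getAttribute('name');
    my ($name_node) = $camelid->getElementsByTagName('common-name');
    my $common_name = $name_node->textContent;

    # Switching back to XPath for the nested attribute is just as easy.
    my $status = $camelid->findvalue('conservation/@status');
    push @lines, "$common_name ($latin_name) $status";
}
print "$_\n" for @lines;
```

Nothing is copied or re-parsed between the XPath and DOM calls; both interfaces are simply different views of one tree.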
Writing
To create an XML document programmatically with XML::LibXML you simply use
the provided DOM interface:
use strict;
use XML::LibXML;
# %camelid_links is the hash of link data from the Quickstart Guide; it must
# be declared (with "my" or "our") and populated before this point.
my $doc = XML::LibXML::Document->new();
my $root = $doc->createElement('html');
$doc->setDocumentElement($root);
my $body = $doc->createElement('body');
$root->appendChild($body);
foreach my $item (keys (%camelid_links)) {
    my $link = $doc->createElement('a');
    $link->setAttribute('href', $camelid_links{$item}->{url});
    my $text = XML::LibXML::Text->new($camelid_links{$item}->{description});
    $link->appendChild($text);
    $body->appendChild($link);
}
print $doc->toString;
An important difference between XML::LibXML and XML::DOM is that libxml2's
object model conforms to the W3C DOM Level 2 interface, which is better able
to cope with documents containing XML Namespaces. So, where XML::DOM is
limited to:
@nodeset = $node->getElementsByTagName($element_name);
and
$node = $doc->createElement($element_name);
XML::LibXML also provides:
@nodeset = $node->getElementsByTagNameNS($namespace_uri, $element_name);
and
$node = $doc->createElementNS($namespace_uri, $element_name);
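As a rough illustration of why the namespace-aware methods matter, the sketch below builds a small document in which two elements share the local name "a" but belong to different namespaces. The second namespace URI, the "c" prefix, and the link text are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# The second URI and the "c" prefix are hypothetical, for illustration only.
my $xhtml_ns = 'http://www.w3.org/1999/xhtml';
my $other_ns = 'http://example.com/ns/camelids';

my $doc  = XML::LibXML::Document->new();
my $root = $doc->createElementNS($xhtml_ns, 'html');
$doc->setDocumentElement($root);

my $body = $doc->createElementNS($xhtml_ns, 'body');
$root->appendChild($body);

# An XHTML link ...
my $link = $doc->createElementNS($xhtml_ns, 'a');
$link->setAttribute('href', 'http://example.com/');
$link->appendChild($doc->createTextNode('A link'));
$body->appendChild($link);

# ... and an unrelated element that happens to share the local name "a".
my $note = $doc->createElementNS($other_ns, 'c:a');
$note->appendChild($doc->createTextNode('Not a link'));
$body->appendChild($note);

# The namespace-aware lookups keep the two apart.
my @xhtml_links = $root->getElementsByTagNameNS($xhtml_ns, 'a');
my @other       = $root->getElementsByTagNameNS($other_ns, 'a');

print scalar(@xhtml_links), " XHTML link(s), ", scalar(@other), " other\n";
print $doc->toString(1);
```

Each lookup should find only the element in the namespace it was asked about, which is the distinction a lookup by tag name alone cannot express.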
The Joy of SAX
We've seen the DOM and XPath goodness that XML::LibXML provides, but the
story does not end there. The libxml2 library also offers a SAX interface
that can be used to create DOM trees from SAX events or generate SAX events
from DOM trees.
The following creates a DOM tree programmatically from a SAX driver built on
XML::SAX::Base. In this example, the initial SAX events are generated from a
custom driver implemented in the CamelDriver class that calls the handler
events in the XML::LibXML::SAX::Builder class to build the DOM tree.
use XML::LibXML;
use XML::LibXML::SAX::Builder;
my $builder = XML::LibXML::SAX::Builder->new();
my $driver = CamelDriver->new(Handler => $builder);
my $doc = $driver->parse(%camelid_links);
# doc is an XML::LibXML::Document object
print $doc->toString;
package CamelDriver;
use base qw(XML::SAX::Base);
sub parse {
    my $self = shift;
    my %links = @_;
    $self->SUPER::start_document;
    $self->SUPER::start_element({Name => 'html'});
    $self->SUPER::start_element({Name => 'body'});
    # iterate over the hash passed to parse(), not the caller's variable
    foreach my $item (keys (%links)) {
        $self->SUPER::start_element({Name => 'a',
                                     Attributes => {
                                         'href' => $links{$item}->{url}
                                     }
                                    });
        $self->SUPER::characters({Data => $links{$item}->{description}});
        $self->SUPER::end_element({Name => 'a'});
    }
    $self->SUPER::end_element({Name => 'body'});
    $self->SUPER::end_element({Name => 'html'});
    $self->SUPER::end_document;
}
1;
You can also generate SAX events from an existing DOM tree using
XML::LibXML::SAX::Generator. In the following snippet, the DOM tree created
by parsing the file camelids.xml is handed to XML::LibXML::SAX::Generator's
generate() method which in turn calls the event handlers in
XML::Handler::XMLWriter to print the document to STDOUT.
use strict;
use XML::LibXML;
use XML::LibXML::SAX::Generator;
use XML::Handler::XMLWriter;
my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);
my $handler = XML::Handler::XMLWriter->new();
my $driver = XML::LibXML::SAX::Generator->new(Handler => $handler);
# generate SAX events that are captured
# by a SAX Handler or Filter.
$driver->generate($doc);
This ability to accept and emit SAX events is especially useful in light of
the recent discussion in this column of generating SAX events from non-XML
data and writing SAX filter chains. You could, for example, use a SAX driver
written in Perl to emit events based on the data returned from a database
query and build a DOM object from those events. That DOM object could then
be transformed for display in C-space using XSLT and the mind-numbingly fast
libxslt library (which expects libxml2 DOM objects). Finally, you could emit
SAX events from the transformed DOM tree and pass them through custom SAX
filters for the finishing touches, all without once having to serialize the
document to a string for re-parsing. Wow.
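The round trip described above can be sketched in a few lines, with the XSLT step left out to keep the example short. RowDriver, ElementCounter, and the row data are hypothetical names invented for this sketch; the rows stand in for a database result set. The DOM-to-SAX step uses XML::LibXML::SAX::Parser, which, as far as I know, later XML::LibXML releases recommend for this job in place of XML::LibXML::SAX::Generator.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Builder;
use XML::LibXML::SAX::Parser;

{
    # Hypothetical driver: turns rows (standing in for a database
    # result set) into SAX events.
    package RowDriver;
    use base qw(XML::SAX::Base);

    sub parse_rows {
        my ($self, @rows) = @_;
        $self->start_document({});
        $self->start_element({ Name => 'links' });
        foreach my $row (@rows) {
            $self->start_element({ Name => 'link' });
            $self->characters({ Data => $row->{description} });
            $self->end_element({ Name => 'link' });
        }
        $self->end_element({ Name => 'links' });
        # The builder hands back the finished document from end_document.
        return $self->end_document({});
    }
}

{
    # Hypothetical handler: counts the start_element events it receives.
    package ElementCounter;
    use base qw(XML::SAX::Base);

    sub start_element {
        my ($self, $el) = @_;
        $self->{count}{ $el->{Name} }++;
    }
}

my @rows = (
    { description => 'Dromedary facts' },
    { description => 'Llama care' },
);

# Stage 1: SAX events in, DOM tree out.
my $builder = XML::LibXML::SAX::Builder->new();
my $driver  = RowDriver->new(Handler => $builder);
my $doc     = $driver->parse_rows(@rows);

# Stage 2: work on the tree directly, here with XPath.
my $n_links = $doc->findvalue('count(/links/link)');

# Stage 3: DOM tree in, SAX events out, with no serialization in between.
my $counter = ElementCounter->new();
XML::LibXML::SAX::Parser->new(Handler => $counter)->generate($doc);

print "XPath saw $n_links link elements; ",
      "the SAX handler saw $counter->{count}{link}\n";
```

In a fuller pipeline, a transformation with libxslt would sit between stages 1 and 3 and operate on the same in-memory document.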
Conclusions
As we have seen, XML::LibXML offers a fast, updated approach to XML
processing that may be superior to the first-generation XML::Parser for many
cases. Do not misunderstand: XML::Parser and its dependents are still quite
useful, well-supported, and are not likely to go away any time soon. But it
is not the only game in town, and given the added flexibility that
XML::LibXML provides, I would strongly encourage you to give XML::LibXML a
closer look before beginning your next Perl/XML project.