General

Disclaimer

Here you will find the notes that I took while I was learning Perl and XML, i.e., useful articles that I archived when searching the Internet for tutorials. All articles have links to the original sites and places. The copyright belongs to the respective authors and sites.

Collecting and arranging them in one central location lets me go through them in a logical order. I hope you find it helpful as well. Moreover, I hate giving merely an HTML link to those articles, because the URLs might change after a site redesign, and worse, a valuable article might vanish from the Internet at any time. Still, if copyright is a concern to you, please click through and read the original articles. Further, not all hyperlinks are preserved, so it might be a good idea to see the original articles as well.

To the authors and site owners: if you think my notes violate your rights, please drop me a note, and I'll be happy to remove your article.

documented on: 2006.10.09

Perl-XML Frequently Asked Questions

http://perl-xml.sourceforge.net/faq/

Grant McLean Last updated: December 12, 2005

Tutorial and Reference Sources

1.1. Where can I get a gentle introduction to XML and Perl?
1.2. Where can I find an XML tutorial?
1.3. Where can I find reference documentation for the various XML Modules?

Selecting a Parser Module

2.1. Don't select a parser module.
2.2. Tree versus stream parsers
2.3. Pros and cons of the tree style
2.4. Pros and cons of the stream style
2.5. How to choose a parser module
2.6. Rolling your own parser

CPAN Modules

3.1. XML::Parser
3.2. XML::LibXML
3.3. XML::XPath
3.4. XML::DOM
3.5. XML::Simple
3.6. XML::Twig
3.7. Win32::OLE and MSXML.DLL
3.8. XML::PYX
3.9. XML::SAX
3.10. XML::SAX::Expat
3.11. XML::SAX::Machines
3.12. XML::XPathScript
3.13. How can I install XML::Parser under Windows?
3.14. How can I install other binary modules under Windows?
3.15. What if a module is not available in PPM format?
3.16. "could not find ParserDetails.ini"

XSLT Support

4.1. XML::LibXSLT
4.2. XML::Sablotron
4.3. XML::XSLT
4.4. XML::Filter::XSLT
4.5. AxKit

Encodings

5.1. Why do we need encodings?
5.2. What is UTF-8?
5.3. What can I do with a UTF-8 string?
5.4. What can Perl do with a UTF-8 string?
5.5. What can Perl 5.8 do with a UTF-8 string?
5.6. How can I convert from UTF-8 to another encoding?
5.7. What does 'use utf8;' do?
5.8. What are some commonly encountered problems with encodings?

Validation

6.1. DTD Validation Using XML::Checker
6.2. DTD Validation Using XML::LibXML
6.3. W3C Schema Validation With XML::LibXML
6.4. W3C Schema Validation With XML::Xerces
6.5. W3C Schema Validation With XML::Validator::Schema
6.6. Simple XML Validation with Perl
6.7. XML::Schematron

Common Coding Problems

7.1. How should I handle errors?
7.2. Why is my character data split into multiple events?
7.3. How can I split a huge XML file into smaller chunks

Common XML Problems

8.1. 'xml processing instruction not at start of external entity'
8.2. 'junk after document element'
8.3. 'not well-formed (invalid token)'
8.4. 'undefined entity'
8.5. 'reference to invalid character number'
8.6. Embedding Arbitrary Text in XML

Miscellaneous

9.1. Is there a mailing list for Perl and XML?
9.2. How do I unsubscribe from the perl-xml mailing list?
9.3. What happened to Enno?

By far the best Perl-XML entry level document.

http://perl-xml.sourceforge.net/faq/

documented on: 2006.10.09

XML-related Perl modules installed by default in RH8

Foomatic::DB_perl_xml
Pod::Perldoc::ToXml
XML::Checker
XML::Checker::DOM
XML::Checker::Parser
XML::DOM
XML::DOM::DOMException
XML::DOM::NamedNodeMap
XML::DOM::NodeList
XML::DOM::PerlSAX
XML::DOM::ValParser
XML::Dumper
XML::ESISParser
XML::Encoding
XML::Filter::DetectWS
XML::Filter::Reindent
XML::Filter::SAXT
XML::Grove
XML::Grove::AsCanonXML
XML::Grove::AsString
XML::Grove::Builder
XML::Grove::Factory
XML::Grove::IDs
XML::Grove::Path
XML::Grove::PerlSAX
XML::Grove::Sub
XML::Grove::Subst
XML::Grove::XPointer
XML::Handler::BuildDOM
XML::Handler::CanonXMLWriter
XML::Handler::Composer
XML::Handler::PrintEvents
XML::Handler::Sample
XML::Handler::Subs
XML::Handler::XMLWriter
XML::Parser
XML::Parser::Expat
XML::Parser::PerlSAX
XML::PatAct::ActionTempl
XML::PatAct::Amsterdam
XML::PatAct::MatchName
XML::PatAct::PatternTempl
XML::PatAct::ToObjects
XML::Perl2SAX
XML::RegExp
XML::SAX2Perl
XML::Twig
XML::UM
XML::XQL
XML::XQL::DOM
XML::XQL::Date
XML::XQL::Debug
XML::XQL::DirXQL
XML::XQL::Parser
XML::XQL::Plus
XML::XQL::Strict

Perl XML Quickstart: The Perl XML Interfaces

http://www.xml.com/pub/a/2001/04/18/perlxmlqstart1.html http://www.xml.com/lpt/a/2001/04/18/perlxmlqstart1.html

By Kip Hampton

Introduction

A recent flurry of questions to the Perl-XML mailing list points to the need for a document that gives new users a quick, how-to overview of the various Perl XML modules. For the next few months I will be devoting this column solely to that purpose.

The XML modules available from CPAN can be divided into three main categories: modules that provide unique interfaces to XML data (usually concerned with translating data between an XML instance and Perl data structures), modules that implement one of the standard XML APIs, and special-purpose modules that seek to simplify the execution of some specific XML-related task. This month we will be looking at the first of these, the Perl-specific XML interfaces.

use Disclaimer qw(:standard);

This is not an exercise in comparative performance benchmarking, nor is it my intention to suggest that any one module is inherently more useful than another. Choosing the right XML module for your project depends largely upon the nature of the project and your past experience. Different interfaces lend themselves to different kinds of tasks and to different kinds of people. My only goal is to offer working examples of the various interfaces by defining two simple tasks, and then showing how to achieve the same net result using each of the selected modules.

The Tasks

While the uses for XML are rich and varied, most XML-related tasks can be divided into two groups: those related to extracting data from existing XML documents, and those related to creating new XML documents using data from other sources. With this in mind, the examples that we will use for our module introductions will consist of extracting a specific set of data from an XML file, and marking up a Perl data structure in a specific XML format.

Task One: Extracting Information

First, consider the following XML fragment:

<?xml version="1.0"?>
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary, or Arabian Camel</common-name>
    <physical-characteristics>
      <mass>300 to 690 kg.</mass>
      <appearance>
        The dromedary camel is characterized by a long-curved
        neck, deep-narrow chest, and a single hump.
        ...
      </appearance>
    </physical-characteristics>
    <natural-history>
       <food-habits>
         The dromedary camel is an herbivore.
         ...
       </food-habits>
       <reproduction>
         The dromedary camel has a lifespan of about 40-50 years
         ...
       </reproduction>
       <behavior>
         With the exception of rutting males, dromedaries show
         very little aggressive behavior.
         ...
       </behavior>
       <habitat>
         The camels prefer desert conditions characterized by a
         long dry season and a short rainy season.
         ...
       </habitat>
    </natural-history>
    <conservation status="no special status">
      <detail>
        Since the dromedary camel is domesticated, the camel has
        no special status in conservation.
      </detail>
    </conservation>
  </species>
  ...
</camelids>

Now let's say that the complete document (available with this month's sample code) contains the same information for all the members of the Camelidae family, not just our friend the single-humped Dromedary Camel. To illustrate how each module might be used to extract a subset of the data stored in this document, we will write a tiny script that parses the camelids.xml document and, for each species found, prints a line to STDOUT containing that species' common name, Latin name (in parentheses), and conservation status. So, having processed the entire document, each script should produce the following output:

Bactrian Camel (Camelus bactrianus) endangered
Dromedary, or Arabian Camel (Camelus dromedarius) no special status
Llama (Lama glama) no special status
Guanaco (Lama guanicoe) special concern
Vicuna (Vicugna vicugna) endangered

Task Two: Creating An XML Document

To demonstrate how each of the selected modules may be used to create XML documents from other data sources, we will write a small script that marks up a simple Perl hash containing URLs to a few cool camelid-related pages on the Web as a simple XHTML document.

Here's the hash:

my %camelid_links = (
    one   => { url         => 'http://www.online.discovery.com/news/picture/may99/photo20.html',
               description => 'Bactrian Camel in front of Great ' .
                              'Pyramids in Giza, Egypt.'},
    two   => { url         => 'http://www.fotos-online.de/english/m/09/9532.htm',
               description => 'Dromedary Camel illustrates the ' .
                              'importance of accessorizing.'},
    three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
               description => 'Charlie - biography of a narcissistic llama.'},
    four  => { url         => 'http://arrow.colorado.edu/travels/other/turkey.html',
               description => 'A visual metaphor for the perl5-porters ' .
                              'list?'},
    five  => { url         => 'http://www.galaonline.org/pics.htm',
               description => 'Many cool alpacas.'},
    six   => { url         => 'http://www.thpf.de/suedamerikareise/galerie/vicunas.htm',
               description => 'Wild Vicunas in a scenic landscape.'}
);
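The writing samples later in these notes load this hash with require "files/camelid_links.pl" and a call to get_camelid_data(). That file ships with the article's sample code and is not reproduced here, so the following is only a minimal sketch of what it might contain, assuming it does nothing more than return the hash above as a list.

```perl
# files/camelid_links.pl (hypothetical contents; the real file is part of
# the article's sample code download and may differ)
use strict;
use warnings;

# Returns the link data as a flat key/value list, so callers can write
#   my %camelid_links = get_camelid_data();
sub get_camelid_data {
    return (
        one   => { url         => 'http://www.online.discovery.com/news/picture/may99/photo20.html',
                   description => 'Bactrian Camel in front of Great Pyramids in Giza, Egypt.' },
        two   => { url         => 'http://www.fotos-online.de/english/m/09/9532.htm',
                   description => 'Dromedary Camel illustrates the importance of accessorizing.' },
        three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
                   description => 'Charlie - biography of a narcissistic llama.' },
        four  => { url         => 'http://arrow.colorado.edu/travels/other/turkey.html',
                   description => 'A visual metaphor for the perl5-porters list?' },
        five  => { url         => 'http://www.galaonline.org/pics.htm',
                   description => 'Many cool alpacas.' },
        six   => { url         => 'http://www.thpf.de/suedamerikareise/galerie/vicunas.htm',
                   description => 'Wild Vicunas in a scenic landscape.' },
    );
}

1;    # a file loaded with require must end with a true value
```

Note that Perl hashes are unordered, so a loop over keys(%camelid_links) will not necessarily emit the links in the order shown here; use sort keys if a stable order matters.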

And here is an example of the document that we hope to create from that hash:

<?xml version="1.0"?>
<html>
  <body>
    <a href="http://www.eskimo.com/~wallama/funny.htm">Charlie -
      biography of a narcissistic llama.</a>
    <a href="http://www.online.discovery.com/news/picture/may99/photo20.html">Bactrian
      Camel in front of Great Pyramids in Giza, Egypt.</a>
    <a href="http://www.fotos-online.de/english/m/09/9532.htm">Dromedary
      Camel illustrates the importance of accessorizing.</a>
    <a href="http://www.galaonline.org/pics.htm">Many cool alpacas.</a>
    <a href="http://arrow.colorado.edu/travels/other/turkey.html">A visual
      metaphor for the perl5-porters list?</a>
    <a href="http://www.thpf.de/suedamerikareise/galerie/vicunas.htm">Wild
      Vicunas in a scenic landscape.</a>
  </body>
</html>

It's important to note that while the resulting XML is indented for readability (as shown above), this sort of fine-grained whitespace handling is not part of our sample requirement. All we care about is that the resulting document is well-formed XML, and that it accurately reflects the data stored in our hash.

With our tasks defined, let's get straight to the code samples.

Samples of the Perl-specific XML Interfaces

XML::Simple

Originally created to simplify the task of reading and writing config files in an XML format, XML::Simple translates data between XML documents and native Perl data structures with no intervening abstract interface. Elements and attributes are accessed using nested references.

Reading

use XML::Simple;

my $file = 'files/camelids.xml';
my $xs1 = XML::Simple->new();

my $doc = $xs1->XMLin($file);

foreach my $key (keys (%{$doc->{species}})){
   print $doc->{species}->{$key}->{'common-name'} . ' (' . $key . ') ';
   print $doc->{species}->{$key}->{conservation}->{status} . "\n";
}
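To make the "nested references" concrete, here is a small sketch of the structure XMLin builds. It parses an inline, trimmed-down stand-in for camelids.xml (an assumption made so the example is self-contained) and passes the ForceArray and KeyAttr options explicitly. The sample above relies on XML::Simple's default folding on the name attribute; spelling the options out keeps the structure the same even when the file contains only one species element.

```perl
use strict;
use warnings;
use XML::Simple;

# Inline stand-in for files/camelids.xml, reduced to the fields the task needs.
my $xml = <<'XML';
<?xml version="1.0"?>
<camelids>
  <species name="Camelus bactrianus">
    <common-name>Bactrian Camel</common-name>
    <conservation status="endangered"/>
  </species>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
</camelids>
XML

# ForceArray: always treat <species> as a list, even if there is only one.
# KeyAttr:    fold that list into a hash keyed on the "name" attribute.
my $doc = XML::Simple->new->XMLin(
    $xml,
    ForceArray => ['species'],
    KeyAttr    => { species => 'name' },
);

# $doc->{species} is now a hash: Latin name => { 'common-name' => ..., conservation => { status => ... } }
for my $latin (sort keys %{ $doc->{species} }) {
    my $species = $doc->{species}{$latin};
    print "$species->{'common-name'} ($latin) $species->{conservation}{status}\n";
}
```

Sorting the keys is a choice made here for repeatable output; the original sample iterates in whatever order the hash yields.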

Writing

use XML::Simple;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $xsimple = XML::Simple->new();

print $xsimple->XMLout(\%camelid_links,
                       noattr => 1,
                       xmldecl => '<?xml version="1.0"?>');

The requirements of the data-to-document task reveal one of XML::Simple's few weaknesses: it doesn't allow us to decide which keys in our hash should be returned as elements and which should be returned as attributes. The output from the sample above would be close to the requirement, but it wouldn't be close enough. For those cases where we prefer to manipulate the contents of an XML document using native Perl data structures, but need finer control over the output, a combination of XML::Simple and XML::Writer works nicely.

The following illustrates how to use XML::Writer to meet the output requirement.

use XML::Writer;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $writer = XML::Writer->new();

$writer->xmlDecl();
$writer->startTag('html');
$writer->startTag('body');

foreach my $item ( keys (%camelid_links) ) {
    $writer->startTag('a', 'href' => $camelid_links{$item}->{url});
    $writer->characters($camelid_links{$item}->{description});
    $writer->endTag('a');
}

$writer->endTag('body');
$writer->endTag('html');

$writer->end();
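The sample above uses XML::Writer on its own. As a rough sketch of the combination the article recommends, the following reads a document into plain Perl data with XML::Simple and then writes a differently shaped document with XML::Writer, choosing explicitly what becomes an attribute and what becomes text. The inline input and the summary and camelid element names are assumptions made for illustration only.

```perl
use strict;
use warnings;
use XML::Simple;
use XML::Writer;

# Inline stand-in for an input file (illustrative data only).
my $xml = <<'XML';
<camelids>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

# Step 1: XML::Simple turns the document into nested hashes.
my $doc = XML::Simple->new->XMLin(
    $xml,
    ForceArray => ['species'],
    KeyAttr    => { species => 'name' },
);

# Step 2: XML::Writer gives full control over elements versus attributes.
my $out    = '';
my $writer = XML::Writer->new(OUTPUT => \$out);    # capture output in a string

$writer->startTag('summary');
for my $latin (sort keys %{ $doc->{species} }) {
    my $species = $doc->{species}{$latin};
    $writer->startTag('camelid',
                      latin  => $latin,
                      status => $species->{conservation}{status});
    $writer->characters($species->{'common-name'});
    $writer->endTag('camelid');
}
$writer->endTag('summary');
$writer->end();

print $out;
```

XML::Writer also escapes special characters in text and attribute values, which is one reason to prefer it over assembling the markup with print statements.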

XML::SimpleObject

XML::SimpleObject provides an object-oriented interface to XML data using accessor methods that are reminiscent of the Document Object Model.

Reading

use XML::Parser;
use XML::SimpleObject;

my $file = 'files/camelids.xml';

my $parser = XML::Parser->new(ErrorContext => 2, Style => "Tree");
my $xso = XML::SimpleObject->new( $parser->parsefile($file) );

foreach my $species ($xso->child('camelids')->children('species')) {
    print $species->child('common-name')->{VALUE};
    print ' (' . $species->attribute('name') . ') ';
    print $species->child('conservation')->attribute('status');
    print "\n";
}

Writing

XML::SimpleObject has no facility for creating new XML documents from scratch. It can, however, easily be used in conjunction with XML::Writer in the way illustrated in the XML::Simple example above.

XML::TreeBuilder

The XML::TreeBuilder distribution ships with two modules: XML::Element, for creating or accessing the contents of XML element nodes, and XML::TreeBuilder, a factory package that simplifies the building of document trees from existing XML files. Those who have had past experience with the venerable HTML::Element and HTML::Tree modules will find XML::TreeBuilder very easy to use, since the interfaces are identical apart from a few XML-specific methods.

Reading

use XML::TreeBuilder;

my $file = 'files/camelids.xml';
my $tree = XML::TreeBuilder->new();

$tree->parse_file($file);

foreach my $species ($tree->find_by_tag_name('species')){
    print $species->find_by_tag_name('common-name')->as_text;
    print ' (' . $species->attr_get_i('name') . ') ';
    print $species->find_by_tag_name('conservation')->attr_get_i('status');
    print "\n";
}

Writing

use XML::Element;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();


my $root = XML::Element->new('html');
my $body = XML::Element->new('body');
my $xml_pi = XML::Element->new('~pi', text => 'xml version="1.0"');
$root->push_content($body);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::Element->new('a', 'href' => $camelid_links{$item}->{url});
    $link->push_content($camelid_links{$item}->{description});
    $body->push_content($link);
}

print $xml_pi->as_XML;
print $root->as_XML();

XML::Twig

XML::Twig stands apart from the other Perl-only XML interfaces in that it combines an inventive Perlish interface with many of the features found in the standard XML APIs. For a more detailed introduction to XML::Twig, see the XML.com article "Using XML::Twig" listed under Resources.

Reading

use XML::Twig;

my $file = 'files/camelids.xml';
my $twig = XML::Twig->new();

$twig->parsefile($file);

my $root = $twig->root;

foreach my $species ($root->children('species')){
    print $species->first_child_text('common-name');
    print ' (' . $species->att('name') . ') ';
    print $species->first_child('conservation')->att('status');
    print "\n";
}

Writing

use XML::Twig;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $root = XML::Twig::Elt->new('html');
my $body = XML::Twig::Elt->new('body');
$body->paste($root);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::Twig::Elt->new('a');
    $link->set_att('href', $camelid_links{$item}->{url});
    $link->set_text($camelid_links{$item}->{description});
    $link->paste('last_child', $body);
}

print qq|<?xml version="1.0"?>|;
$root->print;

These examples have illustrated the basic usage for the more generic Perl XML modules. My goal has been to give just enough example code to give you a feel for what it is like to work with each of these modules. Next month we will look at those Perl modules that implement one of the standard XML interfaces; specifically, XML::DOM, XML::XPath, and the various SAX and SAX-like modules.

Resources

Download sample code.
A complete list of the XML modules available from CPAN
Perl-XML mailing list archives
Using XML::Twig

Perl XML Quickstart: The Standard XML Interfaces

http://www.xml.com/pub/a/2001/05/16/perlxml.html

by Kip Hampton May 16, 2001

Introduction

This is the second part in a series of articles meant to quickly introduce some of the more popular Perl XML modules. This month we look at the Perl implementations of the standard XML APIs: The Document Object Model, The XPath language, and the Simple API for XML.

As stated in part one, this series is not concerned with comparing the relative merits of the various XML modules. My only goal is to provide enough sample code to help you decide for yourself which module or approach is most appropriate for your situation by showing you how to achieve the same result with each module given two simple tasks. Those tasks are 1) extracting data from an XML document and 2) producing an XML document from a Perl hash. Please see last month's column for a complete description of the sample requirements: http://www.xml.com/pub/a/2001/04/18/perlxmlqstart1.html

Samples of the Perl Implementations of the Standard XML Interfaces

The Document Object Model (XML::DOM)

The Document Object Model, or DOM for short, provides a language neutral interface to XML data by representing the document's contents as a hierarchical structure of objects whose properties describe the relationships between one object and another. The Perl implementation of the DOM is called, unsurprisingly, XML::DOM.

Reading

use XML::DOM;

my $file = 'files/camelids.xml';
my $parser = XML::DOM::Parser->new();

my $doc = $parser->parsefile($file);

foreach my $species ($doc->getElementsByTagName('species')){
  print $species->getElementsByTagName('common-name')->item(0)
            ->getFirstChild->getNodeValue;
  print ' (' . $species->getAttribute('name') . ') ';
  print $species->getElementsByTagName('conservation')->item(0)
            ->getAttribute('status');
  print "\n";
}

Writing

use XML::DOM;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $doc = XML::DOM::Document->new;
my $xml_pi = $doc->createXMLDecl ('1.0');
my $root = $doc->createElement('html');
my $body = $doc->createElement('body');
$root->appendChild($body);

foreach my $item ( keys (%camelid_links) ) {
  my $link = $doc->createElement('a');
  $link->setAttribute('href', $camelid_links{$item}->{url});
  my $text = $doc->createTextNode($camelid_links{$item}->{description});
  $link->appendChild($text);
  $body->appendChild($link);
}

print $xml_pi->toString;
print $root->toString;

XPath (XML::XPath)

Originally developed to provide a node matching syntax for the eXtensible Stylesheet Language Transformations (XSLT) and, later, for XPointer projects, the XPath language provides an interface to an XML document's contents using a compact set of expressions and functions that, like the DOM, treat the data as a tree of nodes. XPath differs significantly from the DOM in that it allows developers fine-grained access to a document's contents based on both the structural relationships between nodes (paths) and the properties of those nodes (expression evaluation). For example, in XPath syntax you can say, "give me all the div elements that have a background attribute with the value of blue" by writing //div[@background="blue"].
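The same kind of predicate can be applied to the camelid data. The sketch below, which uses an inline stand-in for camelids.xml (an assumption made to keep it self-contained), selects only the species whose conservation element carries status="endangered", combining a path step with a test on an attribute value.

```perl
use strict;
use warnings;
use XML::XPath;

# Inline stand-in for files/camelids.xml (illustrative data only).
my $xml = <<'XML';
<camelids>
  <species name="Camelus bactrianus">
    <common-name>Bactrian Camel</common-name>
    <conservation status="endangered"/>
  </species>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

my $xp = XML::XPath->new(xml => $xml);

# Path (//species) plus predicate ([conservation/@status="endangered"]):
# only species with a matching child attribute are returned.
my @endangered;
foreach my $species ($xp->find('//species[conservation/@status="endangered"]')->get_nodelist) {
    my $common = $species->find('common-name')->string_value;
    my $latin  = $species->getAttribute('name');
    push @endangered, "$common ($latin)";
}

print "$_\n" for @endangered;
```

Doing the filtering inside the expression keeps the Perl loop free of conditionals; with the DOM the same selection would need an explicit test on each node.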

Reading

use XML::XPath;

my $file = 'files/camelids.xml';
my $xp = XML::XPath->new(filename => $file);

foreach my $species ($xp->find('//species')->get_nodelist){
    print $species->find('common-name')->string_value;
    print ' (' . $species->find('@name') . ') ';
    print $species->find('conservation/@status');
    print "\n";
}

Writing

use XML::XPath;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $xp = XML::XPath->new();
my $xml_pi = XML::XPath::Node::PI->new('xml', 'version="1.0"');
my $root = XML::XPath::Node::Element->new('html');
my $body = XML::XPath::Node::Element->new('body');
$root->appendChild($body);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::XPath::Node::Element->new('a');
    my $href = XML::XPath::Node::Attribute->new('href',
         $camelid_links{$item}->{url});
    $link->appendAttribute($href);
    my $text = XML::XPath::Node::Text->new(
         $camelid_links{$item}->{description});
    $link->appendChild($text);
    $body->appendChild($link);
}

print $xml_pi->toString;
print $root->toString;

SAX 1 (XML::Parser::PerlSAX)

The SAX, or Simple API for XML, interface provides access to XML data using an event model in which the contents of an XML document are made available through callback subroutines, which it calls handlers. In contrast to the DOM and XPath APIs, the SAX interface does not build an internal representation of the entire XML document. Instead, data is passed to the handlers in response to the various events (the beginning of an element, the end of an element, etc.) that occur as the document is parsed. This makes SAX extremely fast and memory efficient, but it leaves the task of defining node relationships entirely up to the developer.

Reading

use XML::Parser::PerlSAX;
my $file = "files/camelids.xml";

my $handler = CamelHandler->new();
my $parser = XML::Parser::PerlSAX->new(Handler => $handler);

$parser->parse(Source => { SystemId => $file});

package CamelHandler;

use strict;

sub new {
    my $type = shift;
    return bless {}, $type;
}

my $current_element = '';
my $latin_name = '';
my $common_name = '';

sub start_element {
    my ($self, $element) = @_;

    my %attrs = %{$element->{Attributes}};
    $current_element = $element->{Name};

    if ($current_element eq 'species') {
        $latin_name = $element->{Attributes}->{'name'};
    }
    elsif ($current_element eq 'conservation') {
        print $common_name .' (' . $latin_name .') '
        .  $element->{Attributes}->{'status'} . "\n";
    }
}

sub end_element {
    my ($self, $element) = @_;

    if ($element->{Name} eq 'species') {
        $common_name = undef;
        $latin_name  = undef;
    }
}

sub characters {
    my ($self, $characters) = @_;
    my $text = $characters->{Data};
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    return '' unless $text;

    if ($current_element eq 'common-name') {
        $common_name = $text;
    }
}

1;

Writing

Unlike DOM and XPath, SAX offers no in-memory representation of an XML document and, consequently, has no API facilities for directly creating such a representation. However, there is theoretically no limit to the logic that can be embedded in the various event handlers, so creating one or more XML documents based on the SAX events generated by another is quite common.
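As a rough illustration of that pattern, the sketch below parses an inline stand-in for camelids.xml with XML::Parser::PerlSAX and has the handler emit a new, smaller document through XML::Writer as the events arrive. The SummaryHandler package and the summary and camelid element names are hypothetical, and pairing the parser with XML::Writer is just one way to do this, not something the article prescribes.

```perl
use strict;
use warnings;
use XML::Parser::PerlSAX;
use XML::Writer;

{
    package SummaryHandler;    # hypothetical handler name

    sub new {
        my ($class, $writer) = @_;
        return bless { writer => $writer, text => '' }, $class;
    }

    sub start_document { $_[0]{writer}->startTag('summary') }

    sub start_element {
        my ($self, $element) = @_;
        $self->{text} = '';    # begin collecting text for this element
        my $name = $element->{Name};
        $self->{latin}  = $element->{Attributes}{name}   if $name eq 'species';
        $self->{status} = $element->{Attributes}{status} if $name eq 'conservation';
    }

    # Character data may arrive in several pieces, so append rather than assign.
    sub characters {
        my ($self, $chars) = @_;
        $self->{text} .= $chars->{Data};
    }

    sub end_element {
        my ($self, $element) = @_;
        if ($element->{Name} eq 'common-name') {
            ($self->{common} = $self->{text}) =~ s/^\s+|\s+$//g;
        }
        elsif ($element->{Name} eq 'species') {
            # One output element per input species.
            $self->{writer}->dataElement('camelid', $self->{common},
                                         latin  => $self->{latin},
                                         status => $self->{status});
        }
    }

    sub end_document {
        my ($self) = @_;
        $self->{writer}->endTag('summary');
        $self->{writer}->end();
    }
}

# Inline stand-in for files/camelids.xml (illustrative data only).
my $xml = <<'XML';
<camelids>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

my $out    = '';
my $writer = XML::Writer->new(OUTPUT => \$out);    # capture output in a string
my $parser = XML::Parser::PerlSAX->new(Handler => SummaryHandler->new($writer));
$parser->parse(Source => { String => $xml });

print $out;
```

Because each camelid element is written as soon as its species element closes, the handler never holds more than one species' worth of state, which is the memory advantage described above.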

SAX 2 (Orchard::SAXDriver::Expat)

The most important difference between the SAX 1 and SAX 2 APIs is SAX 2's support for XML namespaces. A complete SAX 2 implementation is available as part of Ken MacLeod's Orchard project. Since a sample for Orchard::SAXDriver::Expat would look largely the same as the previous SAX 1 example, I omit it here. However, if you are curious, you can browse orchard_saxdriver_read.pl in this month's sample code.

Familiarity with the standard XML APIs, their strengths and weaknesses relative to a given task, is key to a mature understanding of XML technology. Much has been written about the interfaces covered here, and I strongly encourage you to follow the links in this month's "Resources" section for more information.

Up to this point, each module we've looked at shares the common goal of providing a generic interface to the contents of any well-formed XML document. Next month we will depart from this pattern a bit by exploring some of the modules that, while perhaps less generically useful, seek to simplify the execution of some specific XML-related task.

Resources

Download sample code. http://xml.com/2001/05/16/files/perlxmlkickstart2.zip

Perl XML Quickstart: Convenience Modules

http://www.xml.com/pub/a/2001/06/13/perlxml.html

by Kip Hampton June 13, 2001

This is the third and final part of a series of articles meant to give quick introductions to some of the more popular Perl XML modules. In the last two months we have looked at the modules that implement the standard XML APIs and those that provide more Perlish XML interfaces. This month we will be looking at some of the modules that seek to simplify a specific XML-related task.

Keeping It ( Real | Simple )

Getting started with XML processing in Perl can be a daunting task. A quick CPAN search reveals 77 XML-related distributions containing more than 200 modules. How do you know which one to choose? Selecting a module based on its ability to cover common use cases (usually a useful guide) seems a bit absurd in light of the fact that XML is being used successfully for everything from storing simple configuration data to enabling communications between complex AI systems. If you find yourself wondering where to begin, don't despair. Despite the apparent complexities, you do not need expert knowledge of Perl or XML to put their combined power to work for you.

Check the rest at

http://www.xml.com/pub/a/2001/06/13/perlxml.html

All Kip Hampton's Articles

http://www.xml.com/pub/au/83

documented on: 2006.10.09