XML::Parser And Tutorials

mod:XML::Parser

NAME

XML::Parser - A perl module for parsing XML documents

DESCRIPTION

This module provides ways to parse XML documents. It is built on top of XML::Parser::Expat, which is a lower level interface to James Clark's expat library.

Each call to one of the parsing methods creates a new instance of XML::Parser::Expat which is then used to parse the document. Expat options may be provided when the XML::Parser object is created. These options are then passed on to the Expat object on each parse call. They can also be given as extra arguments to the parse methods, in which case they override options given at XML::Parser creation time.

The behavior of the parser is controlled either by the "Style" and/or "Handlers" options, or by the "setHandlers" method. These all provide mechanisms for XML::Parser to set the handlers needed by XML::Parser::Expat. If neither "Style" nor "Handlers" is specified, then parsing just checks the document for being well-formed.
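As a minimal sketch of the last point, a parser created with neither Style nor Handlers can serve as a plain well-formedness check. The helper name is_well_formed is an assumption made for this illustration; the sketch relies only on the fact that parse dies on the first error.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true when $xml parses cleanly, false otherwise.
# XML::Parser dies on the first error, so the parse is wrapped in eval.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no Style, no Handlers
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

print is_well_formed('<a><b/></a>') ? "well-formed\n" : "not well-formed\n";
print is_well_formed('<a><b></a>')  ? "well-formed\n" : "not well-formed\n";
```

The first document should be reported as well-formed and the second (a mismatched end tag) as not well-formed.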

When underlying handlers get called, they receive as their first parameter the Expat object, not the Parser object.
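A small sketch of what this means in practice: the Start handler below records the class of its first argument. The variable name $seen_class is an assumption for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $seen_class;    # class of the object handed to the handler

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $seen_class = ref $expat;    # not the class of $parser
        },
    },
);
$parser->parse('<doc/>');

print "parser object:  ", ref($parser), "\n";
print "handler object: $seen_class\n";
```

The two printed class names should differ: the handler sees an XML::Parser::Expat object rather than the XML::Parser object that was created by hand.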

XML::Simple

Since you have to install XML::Parser anyway, you might as well use XML::Simple (a nice wrapper for XML::Parser), which parses the whole document and returns a hash ref of all the data. That would cut this document down to one page of code for formatting the data as HTML. The document should at least have mentioned XML::Simple, or else done the parsing manually (using regexes).

Using Perl with XML (part 1)

http://www.devshed.com/Server_Side/Perl/PerlXML/PerlXML1/

By icarus 2002-01-15

Here's a quick list of the types of events that the parser can handle, together with a list of their key names (as expected by the setHandlers() method) and a list of the arguments that the corresponding callback function will receive.

Key        Arguments to callback                              Event
Final      parser handle                                      Document parsing completed
Start      parser handle, element name, attributes            Start tag found
End        parser handle, element name                        End tag found
Char       parser handle, character data                      Character data found
Proc       parser handle, PI target, PI data                  Processing instruction (PI) found
Comment    parser handle, comment                             Comment found
Unparsed   parser handle, entity, base, system ID,            Unparsed entity found
           public ID, notation
Notation   parser handle, notation, base, system ID,          Notation found
           public ID
XMLDecl    parser handle, version, encoding, standalone       XML declaration found
ExternEnt  parser handle, base, system ID, public ID          External entity found
Default    parser handle, data                                Default handler
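To make the less common keys concrete, here is a hedged sketch that registers XMLDecl, Proc, Comment and Final handlers and records what each one receives. The sample document and the @events array are assumptions made up for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;    # one entry per callback, in the order they fire

my $parser = XML::Parser->new(
    Handlers => {
        XMLDecl => sub {
            my ($p, $version, $encoding, $standalone) = @_;
            push @events, "XMLDecl version=$version";
        },
        Proc => sub {
            my ($p, $target, $data) = @_;
            push @events, "Proc target=$target data=$data";
        },
        Comment => sub {
            my ($p, $comment) = @_;
            push @events, "Comment text=$comment";
        },
        Final => sub {
            my ($p) = @_;
            push @events, "Final";
        },
    },
);

$parser->parse(qq{<?xml version="1.0"?>\n<?render mode="fast"?>\n<!-- note -->\n<doc/>});

print "$_\n" for @events;
```

Each handler gets the parser handle first, followed by the arguments listed in the table; arguments that are absent from the document (for example the encoding here) arrive as undef.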

Perl XML::Parser module - Mini HowTo

http://www.insite.com.br/~nferraz/projetos/xml-parser.html

Nelson Ferraz, nferraz at insite dot com dot br v0.3, Jun 14 2001

We're so used to manipulating text with regular expressions and other Perl features that we are surprised to find out that parsing an XML file isn't a trivial task.

The What's Wrong with Perl and XML? article gives us a clue when it says that "regexp processing is not fully useful when dealing with the majority of XML processing tags".

The objective of this Mini HOWTO is to show the most important concepts for using the XML::Parser module, so you'll be able to read and parse XML documents with ease.

Introducing XML::Parser

The XML::Parser module is event-oriented: it parses your XML file and, whenever it finds a new opening or closing tag, or any text between tags, it triggers an event and calls a sub in your program.

The most important thing to know is which events XML::Parser can trigger, and with which parameters, so you can use them.

Events

Here's a description of the most common events, their parameters and when they occur.

The first parameter is always an instance of Expat, a module used internally to parse the document.

It's safe to ignore the Expat parameter unless you need one of the features it provides. As Clark Cooper told me, "there are many things that become more difficult or impossible unless you use the methods attached to that object" (see "perldoc XML::Parser::Expat").
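As one example of such a method, the sketch below uses the depth method of the Expat object to print an indented outline of a document. The sample XML and the @outline array are assumptions for this illustration; inside a Start handler, depth counts the elements that are already open, so the root element sits at depth 0.

```perl
use strict;
use warnings;
use XML::Parser;

my @outline;    # element names, indented by nesting level

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            # depth is the number of currently open ancestor elements
            push @outline, ('  ' x $expat->depth) . $element;
        },
    },
);
$parser->parse('<series><article><title>T</title></article></series>');

print "$_\n" for @outline;
```

Doing the same without the Expat object would mean maintaining your own element stack in the Start and End handlers.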

Handler   Parameters                               When it occurs                           Sample
Init      (Expat)                                  just before the parsing starts
Final     (Expat)                                  just after the parsing finishes
Start     (Expat, Element [, Attr, Val [,...]])    when an XML start tag is found           <TAG attr1="val1" attr2="val2">
End       (Expat, Element)                         when an XML end tag is found             </TAG>
Char      (Expat, String)                          when non-markup is found
Comment   (Expat, Data)                            when a comment is found                  <!-- some data here... -->
Default   (Expat, String)                          if there isn't a corresponding
                                                   handler registered

Another interesting note is that empty tags, such as <foo/>, will trigger both Start and End events.
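A quick way to see this for yourself is to log both events for a document that contains an empty tag. This is only a sketch; the @seen array and the sample document are assumptions.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;    # "Start:name" / "End:name" in the order the events fire

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $element) = @_; push @seen, "Start:$element" },
        End   => sub { my ($expat, $element) = @_; push @seen, "End:$element" },
    },
);
$parser->parse('<doc><foo/></doc>');

print join(' ', @seen), "\n";    # Start:doc Start:foo End:foo End:doc
```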

How are the events handled?

Triggering an event means that a sub, in your program, will be called. In order to let XML::Parser call the correct subs when they are needed, you must set a few handlers, indicating which event will be handled by which sub.

The first step is to initialize XML::Parser:

#!/usr/bin/perl
use XML::Parser;
my $parser = new XML::Parser ();

Now we can set the handlers:

$parser->setHandlers (
          Start => \&Start_handler,
            End => \&End_handler,
        Default => \&Default_handler
);

Now we're going to read a filename from the command line:

my $filename = shift;
die "Can't find '$filename': $!\n" unless -f $filename;

And here is the line that will make everything work (BTW, make sure that you read the XML::Parser documentation, because there are many other ways of calling it!):

$parser->parsefile ($filename);

But wait!!! We can't forget to include the subs that will actually handle the events. We have defined their names, here they are:

### HANDLERS ###

sub Start_handler {
  my $p  = shift;
  my $el = shift;

  print "<$el>\n";
  while (my $key = shift) {
    my $val = shift;
    print "  $key = $val\n";
  }
  print "\n";
}

###

sub End_handler {
  my ($p,$el) = @_;
  print "</$el>\n";
}

###

sub Default_handler {
  my ($p,$str) = @_;
  print "  default handler found '$str'\n";
}

Conclusion

Although Perl really shines when we have to parse free-form text files, the XML::Parser module lets us work with XML files, which are much more structured, with ease.

Parsing XML documents with Perl

Shelley Doll | July 17, 2002

http://www.zdnet.com.au/builder/program/web/story/0,2000034810,20266751,00.htm

http://builder.com.com/5100-6371-1044612.html http://builder.com.com/5100-6371_14-1044612-2.html


This article focuses on one of the earliest and most frequently referenced core modules, XML::Parser.

When it comes to working with XML in Perl, you have almost five hundred CPAN modules to choose from, each supporting various aspects of integrating Web services. In addition, the Perl core library includes several modules to support XML.

XML::Parser lineage

The original Perl XML parser, XML::Parser::Expat, was written several years ago by Larry Wall and has since been maintained by Clark Cooper. The module is an interface to the Expat XML parser written in C by James Clark, which has been adopted by several scripting languages.

Expat is an event-based parser, meaning certain conditions trigger handling functions. For example, a start or end tag will trigger the appropriate user-defined subroutine. The XML::Parser module was built upon the Expat functionality for general use.

Note that Expat does not validate XML prior to parsing and will die when an error is encountered. But these limitations help make the XML::Parser module extremely fast.

XML::Parser in brief

Anybody can write an XML parser in Perl. After all, you're merely processing text that comes in an expected format. But since the XML::Parser module is written in C, it's much more efficient than any purely Perl implementation you could come up with. And it's already been written for you, so you can spend your time doing something more useful, as Larry Wall would put it.

XML::Parser's Expat functionality allows you to define the style of parse you want to use. The most commonly used styles are Tree and Stream. The Tree style processes your XML input and creates nested hashes and arrays that contain the elements and data from your file. You can then manipulate this structure as you'd like. The Stream style breaks the parse into stages, processed at the start of an event. To use the Stream style parse, you must define handlers when you instantiate the module and associate them with user-defined subroutines that describe what is to be done when the event is encountered.
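For comparison with the handler-based example later in this article, here is a minimal sketch of the Tree style. The sample document is an assumption; the structure shown follows the XML::Parser documentation, where each element becomes a name followed by an array reference whose first item is the attribute hash, and text nodes use the pseudo-tag 0.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse('<article id="1"><title>Hello</title></article>');

# $tree is [ 'article', [ { id => '1' }, 'title', [ {}, 0, 'Hello' ] ] ]
my ($root_name, $root_content) = @$tree;
my $attrs         = $root_content->[0];    # attribute hash of <article>
my $child_name    = $root_content->[1];    # 'title'
my $child_content = $root_content->[2];    # [ {}, 0, 'Hello' ]

print "root: $root_name (id=$attrs->{id})\n";
print "child: $child_name, text: $child_content->[2]\n";
```

The nested arrays are compact but awkward to navigate by hand, which is one reason wrapper modules exist.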

Other types of styles include Subs, which allows you to define functions specific to a type of XML tag, Debug, which displays the document to standard output, and Objects, which is similar to the Tree style but returns objects. You can also set a custom style by defining a subclass to the XML::Parser class.
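As a hedged sketch of the Subs style: for each element, XML::Parser looks in the package named by the Pkg option for a sub with the element's name (called at the start tag) and one with an underscore appended (called at the end tag). The package name TitleSubs and the @log array are assumptions for this illustration; elements without a matching sub are silently skipped.

```perl
use strict;
use warnings;
use XML::Parser;

package TitleSubs;

our @log;    # records which subs were called

# Called for every <title> start tag, with the same arguments as a Start handler
sub title {
    my ($expat, $element, %attrs) = @_;
    push @log, "start $element";
}

# Called for every </title> end tag (element name plus underscore)
sub title_ {
    my ($expat, $element) = @_;
    push @log, "end $element";
}

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'TitleSubs');
$parser->parse('<article><title>Hello</title><summary>World</summary></article>');

print "$_\n" for @TitleSubs::log;
```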

A Streamlined example

For this example, I'll be using the XML::Parser class to create a Stream style parse. I'll walk through a simple script that will parse an XML file to standard output. You can see the script (xmlparse.pl) in Listing A, and the XML file (data.xml) in Listing B. In this case, I chose not to parse the URL element since this is a command-line script. To execute the script, at the command prompt, type:

perl xmlparse.pl data.xml

The script first references the appropriate module:

use XML::Parser;

Next, it grabs the file from the command-prompt input:

my $xmlfile = shift;
die "Cannot find file \"$xmlfile\""
       unless -f $xmlfile;

The script sets some initial variables:

$count = 0;
$tag = "";

Then, it creates our parser instance:

my $parser = new XML::Parser;

Now, we define our event handlers. I included handlers for start tags, end tags, and character data. Purely for the sake of example, I also included a default handler, which receives everything not explicitly covered by the other event handler definitions. If you simply plan to discard that additional data, you can leave the default handler undefined; unhandled events are then ignored.

$parser->setHandlers( Start => \&startElement,
                         End => \&endElement,
                         Char => \&characterData,
                         Default => \&default);

The main portion of the script winds up by instructing the parser instance to stream through the XML data file:

$parser->parsefile($xmlfile);

All that's left is to define what to do in the case of each type of event.

When the script encounters a start tag, it will execute this subroutine because it was defined in the setHandlers method above. I chose to flip through and display some text for each element I'm interested in.

The variables I defined in each subroutine that follows are automatically passed by the XML::Parser module. For the start tag handler, these variables represent the parser instance, the tag name, and a list of attribute name/value pairs, which the handler collects into the %attrs hash. If the tag has no attributes, the list is empty.

sub startElement {
       my( $parseinst, $element, %attrs ) = @_;
       SWITCH: {
              if ($element eq "article") {
                    $count++;
                    $tag = "article";
                    print "Article $count:\n";
                    last SWITCH;
              }
              if ($element eq "title") {
                    print "Title: ";
                    $tag = "title";
                    last SWITCH;
              }
              if ($element eq "summary") {
                    print "Summary: ";
                    $tag = "summary";
                    last SWITCH;
              }
       }
}
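The start handler above never looks inside %attrs, because data.xml has no attributes. As a hedged sketch of how the attribute hash would be used, the example below collects href values keyed by id; the link elements and the %links hash are assumptions and not part of the article's data file.

```perl
use strict;
use warnings;
use XML::Parser;

my %links;    # id => href, filled in by the start handler

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            # The attribute name/value pairs land in %attrs
            my ($parseinst, $element, %attrs) = @_;
            $links{ $attrs{id} } = $attrs{href} if $element eq 'link';
        },
    },
);
$parser->parse(
    '<links><link id="home" href="/index.html"/><link id="faq" href="/faq.html"/></links>'
);

print "$_ => $links{$_}\n" for sort keys %links;
```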

The endElement subroutine will be called whenever an end tag is encountered in the XML data file. Here, I decided to provide some line breaks. The variables that are passed by the XML::Parser in this case are the parser instance and the tag name.

sub endElement {
       my( $parseinst, $element ) = @_;
       if ($element eq "article") {
              print "\n\n";
       } elsif ($element eq "title") {
              print "\n";
       }
}

Since we're on the command line, I used the character data handler to strip out any line and tab formatting that might have been included in the XML data file and opted to show the content if it came from a title or summary tag.

sub characterData {
       my( $parseinst, $data ) = @_;
       if (($tag eq "title") || ($tag eq "summary")) {
              $data =~ s/\n|\t//g;
              print "$data";
       }
}
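One thing to keep in mind with character data handlers is that the parser may deliver a single run of text in several calls, for instance around an entity reference. A common way to cope, sketched below under that assumption, is to accumulate the pieces in a buffer and only use the text when the end tag arrives. The variable names and the sample document are made up for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $current = '';    # name of the element we are inside, if any
my $buffer  = '';    # text collected for the current element
my @titles;          # finished, trimmed title strings

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($parseinst, $element) = @_;
            ($current, $buffer) = ($element, '');
        },
        Char => sub {
            my ($parseinst, $data) = @_;
            $buffer .= $data if $current eq 'title';    # may fire several times
        },
        End => sub {
            my ($parseinst, $element) = @_;
            if ($element eq 'title') {
                $buffer =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
                push @titles, $buffer;
            }
            $current = '';    # stop collecting until the next start tag
        },
    },
);
$parser->parse("<series><article><title>\n  Remedial XML &amp; more\n</title></article></series>");

print "$_\n" for @titles;
```

Resetting $current in the end handler also keeps whitespace between elements from being attributed to the previous tag.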

Finally, I defined a subroutine to handle any other types of elements that might be encountered. This includes character encoding definitions, document type definitions, and comments. Anything that isn't explicitly covered by my start tag, end tag, and character data event handlers gets passed here.

sub default {
       my( $parseinst, $data ) = @_;
       # you could do something here
}

Summary

Once you've become familiar with the XML::Parser's Expat functionality, you can use it as a jumping-off point to get into any of the hundreds of available CPAN XML modules. The Stream style we looked at here is only one type of parse the XML::Parser module has available, and you may find one of the others better suited for your task. Perl has offered XML capabilities almost since the first working draft was available, and it's a great implementation, whatever your needs.

Listing A

use XML::Parser;

my $xmlfile = shift;
die "Cannot find file \"$xmlfile\""
       unless -f $xmlfile;

$count = 0;
$tag = "";

my $parser = new XML::Parser;

$parser->setHandlers( Start => \&startElement,
                      End => \&endElement,
                      Char => \&characterData,
                      Default => \&default);

$parser->parsefile($xmlfile);

sub startElement {
       my( $parseinst, $element, %attrs ) = @_;
       SWITCH: {
              if ($element eq "article") {
                     $count++;
                     $tag = "article";
                     print "Article $count:\n";
                     last SWITCH;
              }
              if ($element eq "title") {
                     print "Title: ";
                     $tag = "title";
                     last SWITCH;
              }
              if ($element eq "summary") {
                     print "Summary: ";
                     $tag = "summary";
                     last SWITCH;
              }
       }
}

sub endElement {
       my( $parseinst, $element ) = @_;
       if ($element eq "article") {
              print "\n\n";
       } elsif ($element eq "title") {
              print "\n";
       }
}

sub characterData {
       my( $parseinst, $data ) = @_;
       if (($tag eq "title") || ($tag eq "summary")) {
              $data =~ s/\n|\t//g;
              print "$data";
       }
}

sub default {
       my( $parseinst, $data ) = @_;
       # do nothing, but stay quiet
}

Listing B

<?xml version="1.0" encoding="utf-8"?>

<series>
  <article>
    <url>http://builder.com.com/article.jhtml?id=u00220020327adm01.htm</url>
    <title>Remedial XML for programmers: Basic syntax</title>
    <summary>In this first installment in a three-part series, I'll introduce you to XML and its basic syntax.</summary>
  </article>

  <article>
    <url>http://builder.com.com/article.jhtml?id=u00220020401adm01.htm</url>
    <title>Remedial XML: Enforcing document formats with DTDs</title>
    <summary>To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies, data type definition (DTD).</summary>
  </article>

  <article>
    <url>http://builder.com.com/article.jhtml?id=u00320020418adm01.htm</url>
    <title>Remedial XML: Using XML Schema</title>
    <summary>In this article, we'll briefly touch on the shortcomings of DTDs and discuss the basics of a newer, more powerful standard: XML Schemas.</summary>
  </article>

  <article>
    <url>http://builder.com.com/article.jhtml?id=u00220020522adm01.htm</url>
    <title>Remedial XML: Say hello to DOM</title>
    <summary>Now it's time to put on your programmer's hat and get acquainted with Document Object Model (DOM), which provides easy access to XML documents via a tree-like set of objects.</summary>
  </article>

  <article>
    <url>http://builder.com.com/article.jhtml?id=u00220020527adm01.htm</url>
    <title>Remedial XML: Learning to play SAX</title>
    <summary>In this fifth installment in our Remedial XML series, I'll introduce you to the SAX API, and provide some links to SAX implementations in several different languages.</summary>
  </article>

</series>

XML::Parser Tutorial

http://perlmonks.thepen.com/62782.html

by OeufMayo on Mar 07, 2001

Introduction

We all agree that Perl does a really good job when it comes to text extraction, particularly with regular expressions.

XML is based on text, so one might think that it would be dead easy to take any XML input and have it converted in the way one wants.

Unfortunately, that is wrong. If you think you'll be able to parse an XML file with a homegrown parser you wrote overnight, think again, and look at the XML specs closely. It's as complex as the CGI specs, and you'll never want to waste precious time trying to do something that will surely end up wrong anyway. Most of the background discussions on why you have to use CGI.pm instead of your own CGI-parser apply here.

The aim of this tutorial is not to show you how XML should be structured and why you shouldn't parse it by hand but how to use the proper tool to do the right job. I'll focus on the most basic XML module you can find, XML::Parser. It's written by Larry Wall and Clark Cooper, and I'm sure we can trust the former to make good software (rn and patch are his most famous programs).

Okay, enough talk, let's jump into the module!

This tutorial will only show you the basics of XML parsing, using the easiest (IMHO) methods. Please refer to the perldoc XML::Parser for more detailed info. I'm aware that there are a lot of XML tools available, but knowing how to use XML::Parser can surely help you a lot when you don't have any other module to work with, and it also helped me to understand how other XML modules worked, since most of them are built on top of XML::Parser.

The data

The example I'll use for this tutorial is the Perlmonks Chatterbox ticker that some of you may have already used. It looks like this:

<CHATTER><INFO site="http://perlmonks.org" sitename="Perl Monks">
Rendered by the Chatterbox XML Ticker</INFO>
    <message author="OeufMayo" time="20010228112952">
test</message>
    <message author="deprecated" time="20010228113142">
pong</message>
    <message author="OeufMayo" time="20010228113153">
/me test again; :)</message>
    <message author="OeufMayo" time="20010228113255">
&lt;a href="#"&gt;please note the use of HTML
tags&lt;/a&gt;</message>
</CHATTER>

The code

Let's assume we want to output this file in a readable way (though it'll still be barebones). The script doesn't handle links or internal HTML entities. It only gets the CB ticker, parses it and prints it; you have to launch it again to follow the wise meditations and the brilliant rhetoric of the other fine monks present at the moment.

00001: #!/usr/bin/perl -w
00002: use strict;
00003: use XML::Parser;
00004: use LWP::Simple;  # used to fetch the chatterbox ticker
00005:
00006: my $message;      # Hashref containing infos on a message
00007:
00008: my $cb_ticker = get("http://perlmonks.org/index.pl?node=chatterbox+xml+ticker");
00009: # we should really check if it succeeded or not
00010:
00011: my $parser = new XML::Parser ( Handlers => {   # Creates our parser object
00012:                             Start   => \&hdl_start,
00013:                             End     => \&hdl_end,
00014:                             Char    => \&hdl_char,
00015:                             Default => \&hdl_def,
00016:                           });
00017: $parser->parse($cb_ticker);
00018:
00019: # The Handlers
00020: sub hdl_start{
00021:     my ($p, $elt, %atts) = @_;
00022:     return unless $elt eq 'message';  # We're only interested in what's said
00023:     $atts{'_str'} = '';
00024:     $message = \%atts;
00025: }
00026:
00027: sub hdl_end{
00028:     my ($p, $elt) = @_;
00029:     format_message($message) if $elt eq 'message' && $message && $message->{'_str'} =~ /\S/;
00030: }
00031:
00032: sub hdl_char {
00033:     my ($p, $str) = @_;
00034:     $message->{'_str'} .= $str;
00035: }
00036:
00037: sub hdl_def { }  # We just throw everything else away
00038:
00039: sub format_message { # Helper sub to nicely format what we got from the XML
00040:     my $atts = shift;
00041:     $atts->{'_str'} =~ s/\n//g;
00042:
00043:     my ($y,$m,$d,$h,$n,$s) = $atts->{'time'} =~ m/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;
00044:
00045:     # Handles the /me
00046:     $atts->{'_str'} = $atts->{'_str'} =~ s/^\/me// ?
00047:     "$atts->{'author'} $atts->{'_str'}"   :
00048:     "<$atts->{'author'}>: $atts->{'_str'}";
00049:     $atts->{'_str'} = "$h:$n " . $atts->{'_str'};
00050:     print "$atts->{'_str'}\n";
00051:     undef $message;
00052: }

Step-by-step code walkthrough:

Lines 1 to 4

Initialisation of the basics needed for this snippet, XML::Parser, of course, and LWP::Simple to get the chatterbox ticker.

Line 8

LWP::Simple gets the requested URL and puts the content of the page in the $cb_ticker scalar.

Lines 11 to 16

The most interesting part, no doubt. We create here a new XML::Parser object. The Parser can come in different styles, but when you have to deal with simple data, like the CB ticker, the Handlers way is the easiest (see also the Subs style, as it is really close to this one).

For this object, we define four handler subs, each representing a different state in the parsing process.

The 'Start' handler is called whenever a new element (or tag, HTML-wise) is found. The sub given is called with the expat object, the name of the element, and a hash containing all the attributes of this element.
The 'End' handler is called whenever an element is closed, and is called with the same parameters as the 'Start' handler, minus the attributes.
The 'Char' handler is called when the parser finds something which is not mark-up (in our case, the text enclosed in the <message> tag).
Finally, the 'Default' handler is called, well, by default, when the parser finds anything that is not covered by the three other handlers.

Line 17

The line that does all the magic, parsing and calling all your subs for you at the right moment.

Lines 20-25: the Start handler

We only want to deal with the <message> elements (those containing what is being said in the Chatterbox) so we'll happily skip every other element.

We get a hash with the attributes of the element, and we're going to use this hash to store the text to be displayed, in $atts{'_str'}.

Lines 27-30: the End handler

Once we've reached the end of a message element, we format all the info we have gathered and print it via the format_message sub.

Lines 32-35: the Char handler

This sub gets all the strings returned by the parser and appends them to the string to be finally displayed.

Line 37: the Default handler

It does nothing, but it doesn't have to figure out what to do with this!

Lines 39-52

This subroutine mangles all the info we got from the XML file, with bad regexes and all, and prints the formatted text in a hopefully readable way. Please note that XML::Parser handled all of the decoding of the &lt; and &gt; entities that were included in the original XML file.
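The entity decoding can be checked in isolation with a short sketch like the one below, which feeds a message similar to the last one in the sample data through a Char handler. The $text variable is an assumption for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';    # everything the Char handler receives, concatenated

my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($p, $str) = @_; $text .= $str },
    },
);
$parser->parse('<message>&lt;a href="#"&gt;link&lt;/a&gt;</message>');

print "$text\n";    # <a href="#">link</a>
```

The handler never sees the raw entity references, only the characters they stand for.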

Summary

We now have a complete and simple parser, ready to analyse, extract, report everything inside the Chatterbox XML ticker!

That's all for now, here are some links you may find useful:

Most of mirod's nodes (and especially his review of XML::Parser) http://perlmonks.thepen.com/mirod.html
davorg's Data Munging with Perl

subclassing vs. global variables

> This is nice, but I would rather know how to use XML::Parser by subclassing
> it. All of my attempts to do this ended up in very unclean, OOP-unfriendly
> code. I ended up with storing results in package-global variables rather
> than object attributes. This is both ugly and thread-unsafe.
>
> Is there some clean way how to subclass XML::Parser?

The problem is probably that XML::Parser is an object factory: it generates XML::Parser::Expat objects with each parse or parsefile call. The handlers then receive XML::Parser::Expat objects and not XML::Parser objects.

There is a way to store data in the XML::Parser object and to access it in the handlers though: use the 'Non-Expat-Options' argument when creating the XML::Parser:

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser(
       'Non-Expat-Options' => { my_option => "toto" },
       Handlers => { Start => \&start, }
                     );
$p->parse( '<a />');

sub start
 { my( $pe, $elt, %atts)= @_;
   print "my option: ", $pe->{'Non-Expat-Options'}->{my_option}, "\n";
 }

This is certainly ugly but it works!

Update: note that the data is still stored in the XML::Parser object though, as shown by this code:

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser(
       'Non-Expat-Options' => { my_option => "1" },
       Handlers => { Start => \&start, }
                     );
$p->parse( '<a />');
$p->parse( '<b />');

sub start
 { my( $pe, $elt, %atts)= @_;
   print "element: $elt - my option: ",
         $pe->{'Non-Expat-Options'}->{my_option}++, "\n";
   $p->parse( '<c />')
      unless( $pe->{'Non-Expat-Options'}->{my_option} > 3);
 }

Which outputs:

element: a - my option: 1
element: c - my option: 2
element: c - my option: 3
element: b - my option: 4

Comments

Why do you want to subclass it? It works much better as a "has-a" than an "is-a", unless you want to get *very* cozy with the base class implementation, which is a maze of twisty tiny packages all alike.

Just delegate the methods that you want to provide in your interface, and handle the rest. Make a hash with one of the elements being your "inherited" parser. I believe it's called the "wrapper" pattern, but I don't name my patterns — I just use them!

Randal L. Schwartz, Perl hacker

Comments

Well, but …. (there is always a 'but') :-)

Suppose I do not subclass XML::Parser. How, then, do I pass parameters to the XML::Parser handler methods and collect their results without using global variables in the XML::Parser package? The only object the handler methods receive is the expat object itself, and it has no place for any additional parameters or results.

And if I subclass XML::Parser, the only advantage I gain is using my own package namespace for global variables instead of XML::Parser's namespace. This does not look to me like a good example of object-oriented programming style.

A possible solution is the one mirod suggested, using Non-Expat-Options, but it is only a little less ugly than these two.

The best solution would be to force XML::Parser to use my custom subclass of XML::Parser::Expat instead of XML::Parser::Expat itself. Is there some way to do that?

Comments

The way to do this, without relying on the fact that the $p is a hashref, is to pass a closure as the handlers, and have an object that you created saved in the closure. This is how PerlSAX is implemented.

Witness:

my $handler = bless {}, "MyHandler";
my $p = XML::Parser->new(Handlers => {
   Start => sub { $handler->handle_start(@_) }
});

package MyHandler;

sub handle_start {
  my ($handler, $p, $element, %attribs) = @_;
  ...
}

by Anonymous Monk
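Filling in the elided body, the closure approach above might look like the following sketch. The ElementCounter class and its counts field are assumptions made up for this illustration; the point is only that results end up in an ordinary object rather than in package globals.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical handler class: keeps per-element counts in the object
package ElementCounter;

sub new {
    my ($class) = @_;
    return bless { counts => {} }, $class;
}

# Reached through the closure below: the handler object comes first,
# followed by the usual Start arguments.
sub handle_start {
    my ($self, $expat, $element, %attribs) = @_;
    $self->{counts}{$element}++;
}

package main;

my $counter = ElementCounter->new;
my $parser  = XML::Parser->new(
    Handlers => { Start => sub { $counter->handle_start(@_) } },
);
$parser->parse('<a><b/><b/><c/></a>');

print "$_: $counter->{counts}{$_}\n" for sort keys %{ $counter->{counts} };
```

Because each closure captures its own handler object, two parsers with two separate counters do not interfere with each other.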