Perl XML::Parser module - Mini HowTo 

http://www.insite.com.br/~nferraz/projetos/xml-parser.html

Nelson Ferraz, nferraz at insite dot com dot br v0.3, Jun 14 2001

We're so used to manipulating text with regular expressions and other Perl features that we are surprised to find that parsing an XML file isn't a trivial task.

The article "What's Wrong with Perl and XML?" gives us a clue when it says that "regexp processing is not fully useful when dealing with the majority of XML processing tags".

The objective of this Mini HOWTO is to show the most important concepts needed to use the XML::Parser module, so that you'll be able to read and parse XML documents with ease.

Introducing XML::Parser 

The XML::Parser module is event-oriented. This means that the module parses your XML file and, whenever it finds an opening tag, a closing tag, or any text between tags, it triggers an event and calls a sub in your program.

The most important thing to know is which events XML::Parser can trigger, and which parameters each one passes, so that you can use them.

Events 

Here's a description of the most common events, their parameters, and when they occur.

The first parameter is always an instance of Expat, a module used internally to parse the document.

It's safe to ignore the Expat parameter unless you need one of the features it provides. Keep in mind, though, what Clark Cooper told me: "there are many things that become more difficult or impossible unless you use the methods attached to that object" (see "perldoc XML::Parser::Expat").
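
As a small illustration of what the Expat object offers, the sketch below uses two of the methods documented in XML::Parser::Expat, depth and current_line, to print an indented outline of a document. The sample XML string and the @outline array are made up for this example, and the handlers are passed through the constructor's Handlers option instead of setHandlers.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, used only for this illustration.
my $xml = "<doc>\n<item>\n<name/>\n</item>\n</doc>";

my @outline;    # collected lines, so they can be inspected later

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            # depth() is the number of elements currently open;
            # inside a Start handler it does not count $element itself.
            my $indent = '  ' x $expat->depth;
            # current_line() is the input line being parsed.
            push @outline, sprintf '%s%s (line %d)',
                $indent, $element, $expat->current_line;
        },
    },
);

$parser->parse($xml);
print "$_\n" for @outline;
```

Each nested element should appear two spaces deeper than its parent. Without the Expat object, you would have to keep track of the nesting level yourself.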

Handler (parameters)                         When it occurs                       Sample
-------------------------------------------  -----------------------------------  -------------------------------
Init (Expat)                                 just before parsing starts
Final (Expat)                                just after parsing finishes
Start (Expat, Element [, Attr, Val [,...]])  when an XML start tag is found       <TAG attr1="val1" attr2="val2">
End (Expat, Element)                         when an XML end tag is found         </TAG>
Char (Expat, String)                         when non-markup text is found
Comment (Expat, Data)                        when a comment is found              <!-- some data here... -->
Default (Expat, String)                      when no other handler is registered

Note also that empty tags, such as <foo/>, trigger both Start and End events.
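
To see this behaviour for yourself, here is a minimal sketch that records the name of each event as it fires. The <root> wrapper element and the @events array are assumptions made for the example. The document is passed as a string to the parse method.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;    # names of the events, in the order they fire

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $el) = @_; push @events, "Start $el" },
        End   => sub { my ($expat, $el) = @_; push @events, "End $el" },
    },
);

# <foo/> is an empty tag inside a hypothetical <root> element.
$parser->parse('<root><foo/></root>');

print "$_\n" for @events;
# Start root
# Start foo
# End foo
# End root
```

From the handlers' point of view, <foo/> and <foo></foo> look the same.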

How are the events handled?

Triggering an event means that a sub in your program will be called. For XML::Parser to call the right subs at the right time, you must set a few handlers, indicating which sub handles which event.

The first step is to initialize XML::Parser:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
my $parser = XML::Parser->new();

Now we can set the handlers:

$parser->setHandlers (
          Start => \&Start_handler,
            End => \&End_handler,
        Default => \&Default_handler
);
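
The same event-to-sub mapping can also be given when the parser is created, through the Handlers option of the constructor. The sketch below does that with a Char handler, which is not used elsewhere in this HOWTO. The sample string and the $text variable are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';    # accumulates the character data

my $parser = XML::Parser->new(
    Handlers => {
        # Append rather than assign: one run of text may arrive
        # in several Char events.
        Char => sub { my ($expat, $string) = @_; $text .= $string },
    },
);

$parser->parse('<greeting>Hello, <b>XML</b> world!</greeting>');
print "$text\n";    # Hello, XML world!
```

Whether you use setHandlers or the constructor option is mostly a matter of taste. setHandlers is convenient when handlers need to change after the parser has been created.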

Now we're going to read a filename from the command line:

my $filename = shift or die "Usage: $0 file.xml\n";
die "Can't find '$filename': $!\n" unless -f $filename;

And here is the line that will make everything work (BTW, make sure that you read the XML::Parser documentation, because there are many other ways of calling it!):

$parser->parsefile ($filename);
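
One of those other ways is the parse method, which takes the document as a string (or an open filehandle) instead of a filename. Both parse and parsefile die when the document is not well-formed, so it may be worth wrapping the call in eval. The sketch below shows one possible way to do it. The helper name is_well_formed is hypothetical, not part of XML::Parser.

```perl
use strict;
use warnings;
use XML::Parser;

# No handlers are set: the parser only checks well-formedness.
my $parser = XML::Parser->new();

# Hypothetical helper: returns 1 if the string parses cleanly, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    # parse() dies on a parse error, so trap it with eval.
    # On failure, the error message is left in $@.
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? 1 : 0;
}

print is_well_formed('<a><b/></a>'), "\n";    # 1
print is_well_formed('<a><b></a>'),  "\n";    # 0
```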

But wait!!! We can't forget to include the subs that will actually handle the events. We have already defined their names; here they are:

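### HANDLERS ###
# Called for each start tag: print the element name and its attributes.
sub Start_handler {
  my $p  = shift;    # the Expat object (unused here)
  my $el = shift;    # the element name
  print "<$el>\n";
  # The remaining arguments are attribute name/value pairs.
  while (@_) {
    my $key = shift;
    my $val = shift;
    print "  $key = $val\n";
  }
  print "\n";
}
###
# Called for each end tag.
sub End_handler {
  my ($p,$el) = @_;
  print "</$el>\n";
}
###
# Called for anything without a handler of its own (text, comments, ...).
sub Default_handler {
  my ($p,$str) = @_;
  print "  default handler found '$str'\n";
}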

Conclusion 

Although Perl really shines when we have to parse free-form text files, the XML::Parser module lets us work just as easily with XML files, which are much more structured.