mod:XML::Parser
NAME
XML::Parser - A perl module for parsing XML documents
DESCRIPTION
This module provides ways to parse XML documents. It is built on top of
XML::Parser::Expat, which is a lower level interface to James Clark's expat
library.
Each call to one of the parsing methods creates a new instance of
XML::Parser::Expat which is then used to parse the document. Expat
options may be provided when the XML::Parser object is created. These
options are then passed on to the Expat object on each parse call. They can
also be given as extra arguments to the parse methods, in which case they
override options given at XML::Parser creation time.
The behavior of the parser is controlled either by the "Style" and/or
"Handlers" options, or by the "setHandlers" method. These all provide
mechanisms for XML::Parser to set the handlers needed by XML::Parser::Expat.
If neither "Style" nor "Handlers" are specified, then parsing just checks
the document for being well-formed.
When underlying handlers get called, they receive as their first parameter
the Expat object, not the Parser object.
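A minimal sketch of the two points above, assuming XML::Parser is installed. The helper name is_well_formed and the variable $first_arg_class are made up for illustration; parse() dies on malformed input, so the helper wraps it in eval.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: true if $xml parses without error.
# With no Style and no Handlers, the parse only checks well-formedness.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

# Record the class of the first argument a handler receives.
my $first_arg_class;
my $parser = XML::Parser->new(
    Handlers => { Start => sub { $first_arg_class = ref $_[0] } },
);
$parser->parse('<doc/>');

print is_well_formed('<a><b/></a>') ? "ok\n" : "broken\n";
print is_well_formed('<a><b></a>')  ? "ok\n" : "broken\n";
print "handlers receive a $first_arg_class object\n";
```

The second document has an unclosed <b>, so it should be reported as broken.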
XML::Simple
Since you have to install XML::Parser anyway, you might as well use
XML::Simple (a nice wrapper for XML::Parser), which will parse the whole
thing and return a hash ref of all the data. That would cut this document
down to one page about formatting the result as HTML. It should at least
have mentioned XML::Simple, or else done the parsing manually (using
regexes).
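For comparison, here is a rough sketch of the XML::Simple approach the comment describes, assuming XML::Simple is installed. The sample data is invented; ForceArray and KeyAttr are XML::Simple options, set here so the result shape stays predictable.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

my $xml = <<'XML';
<series>
  <article><title>First</title><summary>One</summary></article>
  <article><title>Second</title><summary>Two</summary></article>
</series>
XML

# ForceArray keeps <article> an array even when only one is present;
# KeyAttr => [] switches off XML::Simple's array-to-hash folding.
my $data = XMLin($xml, ForceArray => ['article'], KeyAttr => []);

for my $article (@{ $data->{article} }) {
    print "$article->{title}: $article->{summary}\n";
}
```

The whole document ends up in memory as nested hashes and arrays, which is convenient for small files but gives up the event-by-event control that XML::Parser handlers provide.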
Using Perl with XML (part 1)
http://www.devshed.com/Server_Side/Perl/PerlXML/PerlXML1/
By icarus
2002-01-15
Here's a quick list of the types of events that the parser can handle,
together with a list of their key names (as expected by the setHandlers()
method) and a list of the arguments that the corresponding callback function
will receive.
Key       | Arguments to callback                                       | Event
----------+-------------------------------------------------------------+----------------------------
Final     | parser handle                                               | Document parsing completed
Start     | parser handle, element name, attributes                     | Start tag found
End       | parser handle, element name                                 | End tag found
Char      | parser handle, CDATA                                        | CDATA found
Proc      | parser handle, PI target, PI data                           | PI found
Comment   | parser handle, comment                                      | Comment found
Unparsed  | parser handle, entity, base, system ID, public ID, notation | Unparsed entity found
Notation  | parser handle, notation, base, system ID, public ID         | Notation found
XMLDecl   | parser handle, version, encoding, standalone                | XML declaration found
ExternEnt | parser handle, base, system ID, public ID                   | External entity found
Default   | parser handle, data                                         | Default handler
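As a small illustration of three of the less common keys in the table, the sketch below registers XMLDecl, Proc and Comment handlers with setHandlers() and records what each one receives. The sample document and the @events array are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;    # readable record of each callback

my $parser = XML::Parser->new;
$parser->setHandlers(
    XMLDecl => sub {
        my ($expat, $version, $encoding, $standalone) = @_;
        push @events, "XMLDecl version=$version";
    },
    Proc => sub {
        my ($expat, $target, $data) = @_;
        push @events, "Proc target=$target data=$data";
    },
    Comment => sub {
        my ($expat, $comment) = @_;
        push @events, "Comment [$comment]";
    },
);

$parser->parse(<<'XML');
<?xml version="1.0"?>
<?render mode="fast"?>
<doc><!-- hello --></doc>
XML

print "$_\n" for @events;
```

Start and End tags produce no output here because no handler is registered for them; events without a handler are simply skipped.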
Perl XML::Parser module - Mini HowTo
http://www.insite.com.br/~nferraz/projetos/xml-parser.html
Nelson Ferraz, nferraz at insite dot com dot br
v0.3, Jun 14 2001
We're so used to manipulating text with regular expressions and other Perl
features that we are surprised to find out that parsing an XML file isn't a
trivial task.
The What's Wrong with Perl and XML? article gives us a clue when it says
that "regexp processing is not fully useful when dealing with the majority
of XML processing tags".
The objective of this Mini HOWTO is to show the most important concepts for
using the XML::Parser module, so you'll be able to read and parse XML
documents with ease.
Introducing XML::Parser
The XML::Parser module is event-oriented: the module will parse your XML
file and, whenever it finds a new opening or closing tag, or any text
between tags, an event will be triggered and a sub in your program will be
called.
The most important thing to know is which events XML::Parser can trigger and
which parameters each one passes, so you can use them.
Events
Here's a description of the most common events, their parameters and when
they occur.
The first parameter is always an instance of Expat, a module used internally
to parse the document.
It's safe to ignore the Expat parameter unless you need one of the features
provided by it. As Clark Cooper told me, "there are many things that become
more difficult or impossible unless you use the methods attached to that
object" (see "perldoc XML::Parser::Expat").
Handler | Parameters                            | When it occurs                                    | Sample
--------+---------------------------------------+---------------------------------------------------+--------------------------------
Init    | (Expat)                               | just before the parsing starts                    |
Final   | (Expat)                               | just after the parsing finishes                   |
Start   | (Expat, Element [, Attr, Val [,...]]) | when an XML start tag is found                    | <TAG attr1="val1" attr2="val2">
End     | (Expat, Element)                      | when an XML end tag is found                      | </TAG>
Char    | (Expat, String)                       | when non-markup is found                          |
Comment | (Expat, Data)                         | when a comment is found                           | <!-- some data here... -->
Default | (Expat, String)                       | if there isn't a corresponding handler registered |
Another interesting note is that empty tags, such as <foo/>, will trigger
both Start and End events.
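The following sketch shows the empty-tag behaviour, and also one of the Expat methods that make the first parameter worth keeping: depth(), which reports how many ancestor elements are currently open. The sample document and the @events array are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element, %attrs) = @_;
            # depth() counts the open ancestors of this element
            push @events, $expat->depth . " start $element";
        },
        End => sub {
            my ($expat, $element) = @_;
            push @events, $expat->depth . " end $element";
        },
    },
);

$parser->parse('<root><foo/></root>');
print "$_\n" for @events;
```

Even though <foo/> has no separate closing tag, both a start and an end event are recorded for it, nested one level below root.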
How are the events handled?
Triggering an event means that a sub, in your program, will be called. In
order to let XML::Parser call the correct subs when they are needed, you
must set a few handlers, indicating which event will be handled by which
sub.
The first step is to initialize XML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
my $parser = XML::Parser->new;
Now we can set the handlers:
$parser->setHandlers (
Start => \&Start_handler,
End => \&End_handler,
Default => \&Default_handler
);
Now we're going to read a filename from the command line:
my $filename = shift;
die "Can't find '$filename': $!\n" unless -f $filename;
And here is the line that will make everything work (BTW, make sure that you
read the XML::Parser documentation, because there are many other ways of
calling it!):
$parser->parsefile ($filename);
But wait!!! We can't forget to include the subs that will actually handle
the events. We have defined their names, here they are:
sub Start_handler {
    my $p  = shift;    # the Expat object
    my $el = shift;    # the element name
    print "<$el>\n";
    # the remaining arguments are attribute name/value pairs
    while (my $key = shift) {
        my $val = shift;
        print " $key = $val\n";
    }
    print "\n";
}
sub End_handler {
    my ($p,$el) = @_;
    print "</$el>\n";
}
sub Default_handler {
    my ($p,$str) = @_;
    print " default handler found '$str'\n";
}
Conclusion
Although Perl really shines when we have to parse free-form text files, the
XML::Parser module lets us work just as easily with XML files, which are
much more structured.
Parsing XML documents with Perl
Shelley Doll | July 17, 2002
http://www.zdnet.com.au/builder/program/web/story/0,2000034810,20266751,00.htm
http://builder.com.com/5100-6371-1044612.html
http://builder.com.com/5100-6371_14-1044612-2.html
This article focuses on one of the earliest and most frequently referenced
core modules, XML::Parser.
When it comes to working with XML in Perl, you have almost five hundred
CPAN modules to choose from, each supporting various aspects of
integrating Web services. In addition, the Perl core library includes
several modules to support XML.
XML::Parser lineage
The original Perl XML parser, XML::Parser::Expat, was written several years
ago by Larry Wall and has since been maintained by Clark Cooper. The module
is an interface to the Expat XML parser written in C by James Clark,
which has been adopted by several scripting languages.
Expat is an event-based parser, meaning certain conditions trigger handling
functions. For example, a start or end tag will trigger the appropriate
user-defined subroutine. The XML::Parser module was built upon the Expat
functionality for general use.
Note that Expat does not validate XML prior to parsing and will die when an
error is encountered. But these limitations help make the XML::Parser module
extremely fast.
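Because the parser dies on the first error, a script that must survive bad input usually wraps the parse call in eval. The sketch below assumes that pattern; ErrorContext is an XML::Parser option that adds surrounding source lines to the error message, and the broken sample document is invented.

```perl
use strict;
use warnings;
use XML::Parser;

# ErrorContext => 1 asks for one line of context around the error.
my $parser = XML::Parser->new(ErrorContext => 1);

# <b> is never closed, so the parse is expected to die.
my $ok    = eval { $parser->parse("<a>\n<b>\n</a>"); 1 };
my $error = $ok ? '' : $@;

print $ok ? "parsed\n" : "parse failed: $error\n";
```

The message in $error includes the line, column and byte position that Expat reports, which is usually enough to locate the problem in the source file.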
XML::Parser in brief
Anybody can write an XML parser in Perl. After all, you're merely processing
text that comes in an expected format. But since the XML::Parser module is
built on Expat, which is written in C, it's much more efficient than any
purely Perl implementation you could come up with. And it's already been
written for you, so you can spend your time doing something more useful, as
Larry Wall would put it.
XML::Parser's Expat functionality allows you to define the style of parse
you want to use. The most commonly used styles are Tree and Stream. The
Tree style processes your XML input and creates nested hashes and arrays
that contain the elements and data from your file. You can then manipulate
this structure as you'd like. The Stream style breaks the parse into stages,
processed at the start of an event. To use the Stream style parse, you must
define handlers when you instantiate the module and associate them with
user-defined subroutines that describe what is to be done when the event is
encountered.
Other types of styles include Subs, which allows you to define functions
specific to a type of XML tag, Debug, which displays the document to
standard output, and Objects, which is similar to the Tree style but returns
objects. You can also set a custom style by defining a subclass to the
XML::Parser class.
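To make the Tree style more concrete, here is a short sketch that parses a one-element document and unpacks the returned structure by hand. The sample XML and variable names are made up; the layout shown in the comments follows the XML::Parser documentation for the Tree style.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse('<article id="7"><title>Hello</title></article>');

# $tree is [ tag, content ]. Each content array starts with a hash of
# attributes, followed by (tag, content) pairs; the pseudo-tag 0 marks text.
my ($root_tag, $content)        = @$tree;
my $attrs                       = $content->[0];
my ($child_tag, $child_content) = @{$content}[1, 2];
my $title_text                  = $child_content->[2];   # [ {}, 0, 'Hello' ]

print "$root_tag (id=$attrs->{id}) has <$child_tag> with text '$title_text'\n";
```

Walking this structure with positional indexes gets tedious for larger documents, which is one reason the handler-based approach in the rest of the article is often preferred.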
A Streamlined example
For this example, I'll be using the XML::Parser class to create a Stream
style parse. I'll walk through a simple script that will parse an XML file
to standard output. You can see the script (xmlparse.pl) in Listing A,
and the XML file (data.xml) in Listing B. In this case, I chose not to
parse the URL element since this is a command-line script. To execute the
script, at the command prompt, type:
perl xmlparse.pl data.xml
The script first references the appropriate module:
use XML::Parser;
Next, it grabs the file from the command-prompt input:
my $xmlfile = shift;
die "Cannot find file \"$xmlfile\""
unless -f $xmlfile;
The script sets some initial variables:
my $count = 0;
my $tag = "";
Then, it creates our parser instance:
my $parser = new XML::Parser;
Now, we define our event handlers. I included handlers for start tags, end
tags, and character data. Purely for the sake of example, I also included a
default handler, which receives everything not explicitly covered by the
other event handler definitions. If you just plan to discard that additional
data, you can leave the default handler out; no definition is required.
$parser->setHandlers( Start => \&startElement,
End => \&endElement,
Char => \&characterData,
Default => \&default);
The main portion of the script winds up by instructing the parser instance
to stream through the XML data file:
$parser->parsefile($xmlfile);
All that's left is to define what to do in the case of each type of event.
When the script encounters a start tag, it will execute this subroutine
because it was defined in the setHandlers method above. I chose to flip
through and display some text for each element I'm interested in.
The variables I defined in each subroutine that follows are automatically
passed by the XML::Parser module. For the start tag handler, these variables
represent the parser instance, the tag name, and a list of attribute
name/value pairs, which the handler collects into a hash. If the tag has no
attributes, the list is empty.
sub startElement {
    my( $parseinst, $element, %attrs ) = @_;
    SWITCH: {
        if ($element eq "article") {
            $count++;
            $tag = "article";
            print "Article $count:\n";
            last SWITCH;
        }
        if ($element eq "title") {
            print "Title: ";
            $tag = "title";
            last SWITCH;
        }
        if ($element eq "summary") {
            print "Summary: ";
            $tag = "summary";
            last SWITCH;
        }
    }
}
The endElement subroutine will be called whenever an end tag is encountered
in the XML data file. Here, I decided to provide some line breaks. The
variables that are passed by the XML::Parser in this case are the parser
instance and the tag name.
sub endElement {
    my( $parseinst, $element ) = @_;
    if ($element eq "article") {
        print "\n\n";
    } elsif ($element eq "title") {
        print "\n";
    }
}
Since we're on the command line, I used the character data handler to strip
out any line and tab formatting that might have been included in the XML
data file and opted to show the content if it came from a title or summary
tag.
sub characterData {
    my( $parseinst, $data ) = @_;
    if (($tag eq "title") || ($tag eq "summary")) {
        $data =~ s/\n|\t//g;
        print "$data";
    }
}
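One caveat with a character data handler like this: the XML::Parser documentation notes that a single run of text may be delivered in several Char calls, so cleaning each chunk separately can give uneven results. A possible alternative, sketched below with invented names (%wanted, $buffer, @fields) and sample data, is to accumulate the text and clean it once at the end tag.

```perl
use strict;
use warnings;
use XML::Parser;

my %wanted = map { $_ => 1 } qw(title summary);
my $buffer;    # undef unless we are inside a wanted element
my @fields;    # [ element name, cleaned text ] pairs

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $buffer = '' if $wanted{$element};
        },
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if defined $buffer;   # just accumulate
        },
        End => sub {
            my ($expat, $element) = @_;
            return unless $wanted{$element} && defined $buffer;
            (my $clean = $buffer) =~ s/\s+/ /g;    # collapse newlines and tabs
            $clean =~ s/^ | $//g;                  # trim the ends
            push @fields, [ $element, $clean ];
            undef $buffer;
        },
    },
);

$parser->parse(<<'XML');
<series>
  <article>
    <title>Remedial XML:
        Say hello to DOM</title>
    <summary>Tom &amp; Jerry</summary>
  </article>
</series>
XML

print "$_->[0]: $_->[1]\n" for @fields;
```

Collapsing whitespace to a single space, rather than deleting newlines outright, also avoids gluing together words that were separated only by a line break.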
Finally, I defined a subroutine to handle any other types of elements that
might be encountered. This includes character encoding definitions, document
type definitions, and comments. Anything that isn't explicitly covered by my
start tag, end tag, and character data event handlers gets passed here.
sub default {
    my( $parseinst, $data ) = @_;
    # you could do something here
}
Summary
Once you've become familiar with the XML::Parser's Expat functionality, you
can use it as a jumping-off point to get into any of the hundreds of
available CPAN XML modules. The Stream style we looked at here is only one
type of parse the XML::Parser module has available, and you may find one of
the others better suited for your task. Perl has offered XML capabilities
almost since the first working draft was available, and it's a great
implementation, whatever your needs.
Listing A
use strict;
use warnings;
use XML::Parser;
my $xmlfile = shift;
die "Cannot find file \"$xmlfile\""
    unless -f $xmlfile;
my $count = 0;
my $tag = "";
my $parser = new XML::Parser;
$parser->setHandlers( Start   => \&startElement,
                      End     => \&endElement,
                      Char    => \&characterData,
                      Default => \&default);
$parser->parsefile($xmlfile);
sub startElement {
    my( $parseinst, $element, %attrs ) = @_;
    SWITCH: {
        if ($element eq "article") {
            $count++;
            $tag = "article";
            print "Article $count:\n";
            last SWITCH;
        }
        if ($element eq "title") {
            print "Title: ";
            $tag = "title";
            last SWITCH;
        }
        if ($element eq "summary") {
            print "Summary: ";
            $tag = "summary";
            last SWITCH;
        }
    }
}
sub endElement {
    my( $parseinst, $element ) = @_;
    if ($element eq "article") {
        print "\n\n";
    } elsif ($element eq "title") {
        print "\n";
    }
}
sub characterData {
    my( $parseinst, $data ) = @_;
    if (($tag eq "title") || ($tag eq "summary")) {
        $data =~ s/\n|\t//g;
        print "$data";
    }
}
sub default {
    my( $parseinst, $data ) = @_;
    # do nothing, but stay quiet
}
Listing B
<?xml version="1.0" encoding="utf-8"?>
<series>
<article>
<url>http://builder.com.com/article.jhtml?id=u00220020327adm01.htm</url>
<title>Remedial XML for programmers: Basic syntax</title>
<summary>In this first installment in a three-part series, I'll introduce you to XML and its basic syntax.</summary>
</article>
<article>
<url>http://builder.com.com/article.jhtml?id=u00220020401adm01.htm</url>
<title>Remedial XML: Enforcing document formats with DTDs</title>
<summary>To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies, data type definition (DTD).</summary>
</article>
<article>
<url>http://builder.com.com/article.jhtml?id=u00320020418adm01.htm</url>
<title>Remedial XML: Using XML Schema</title>
<summary>In this article, we'll briefly touch on the shortcomings of DTDs and discuss the basics of a newer, more powerful standard: XML Schemas.</summary>
</article>
<article>
<url>http://builder.com.com/article.jhtml?id=u00220020522adm01.htm</url>
<title>Remedial XML: Say hello to DOM</title>
<summary>Now it's time to put on your programmer's hat and get acquainted with Document Object Model (DOM), which provides easy access to XML documents via a tree-like set of objects.</summary>
</article>
<article>
<url>http://builder.com.com/article.jhtml?id=u00220020527adm01.htm</url>
<title>Remedial XML: Learning to play SAX</title>
<summary>In this fifth installment in our Remedial XML series, I'll introduce you to the SAX API, and provide some links to SAX implementations in several different languages.</summary>
</article>
</series>
XML::Parser Tutorial
http://perlmonks.thepen.com/62782.html
by OeufMayo
on Mar 07, 2001
Introduction
We all agree that Perl does a really good job when it comes to text
extraction, particularly with regular expressions.
XML is based on text, so one might think that it would be dead easy to
take any XML input and convert it any way one wants.
Unfortunately, that is wrong. If you think you'll be able to parse an XML
file with a homegrown parser you wrote overnight, think again, and look
at the XML specs closely. It's as complex as the CGI specs, and you'll never
want to waste precious time trying to do something that will surely end up
wrong anyway. Most of the background discussions on why you have to use
CGI.pm instead of your own CGI-parser apply here.
The aim of this tutorial is not to show you how XML should be structured and
why you shouldn't parse it by hand but how to use the proper tool to do the
right job. I'll focus on the most basic XML module you can find,
XML::Parser. It's written by Larry Wall and Clark Cooper, and I'm sure we
can trust the former to make good software (rn and patch are his most famous
programs).
Okay, enough talk, let's jump into the module!
This tutorial will only show you the basics of XML parsing, using the
easiest (IMHO) methods. Please refer to the perldoc XML::Parser for more
detailed info. I'm aware that there are a lot of XML tools available, but
knowing how to use XML::Parser can surely help you a lot when you don't have
any other module to work with, and it also helped me to understand how other
XML modules worked, since most of them are built on top of XML::Parser.
The data
The example I'll use for this tutorial is the Perlmonks Chatterbox ticker
that some of you may have already used. It looks like this:
<CHATTER><INFO site="http://perlmonks.org" sitename="Perl Monks">
Rendered by the Chatterbox XML Ticker</INFO>
<message author="OeufMayo" time="20010228112952">
test</message>
<message author="deprecated" time="20010228113142">
pong</message>
<message author="OeufMayo" time="20010228113153">
/me test again; :)</message>
<message author="OeufMayo" time="20010228113255">
&lt;a href="#"&gt;please note the use of HTML
tags&lt;/a&gt;</message>
</CHATTER>
The code
Let's assume we want to output this file in a readable way (though it'll
still be barebones). The script doesn't handle links or internal HTML
entities. It only fetches the CB ticker, parses it and prints it; you have
to launch it again to follow the wise meditations and the brilliant rhetoric
of the other fine monks present at the moment.
00001: #!/usr/bin/perl -w
00002: use strict;
00003: use XML::Parser;
00004: use LWP::Simple; # used to fetch the chatterbox ticker
00005:
00006: my $message; # Hashref containing infos on a message
00007:
00008: my $cb_ticker = get("http://perlmonks.org/index.pl?node=chatterbox+xml+ticker");
00009: # we should really check if it succeeded or not
00010:
00011: my $parser = new XML::Parser ( Handlers => { # Creates our parser object
00012:     Start => \&hdl_start,
00013:     End => \&hdl_end,
00014:     Char => \&hdl_char,
00015:     Default => \&hdl_def,
00016: });
00017: $parser->parse($cb_ticker);
00018:
00019: # The Handlers
00020: sub hdl_start{
00021:     my ($p, $elt, %atts) = @_;
00022:     return unless $elt eq 'message'; # We're only interested in what's said
00023:     $atts{'_str'} = '';
00024:     $message = \%atts;
00025: }
00026:
00027: sub hdl_end{
00028:     my ($p, $elt) = @_;
00029:     format_message($message) if $elt eq 'message' && $message && $message->{'_str'} =~ /\S/;
00030: }
00031:
00032: sub hdl_char {
00033:     my ($p, $str) = @_;
00034:     $message->{'_str'} .= $str;
00035: }
00036:
00037: sub hdl_def { } # We just throw everything else
00038:
00039: sub format_message { # Helper sub to nicely format what we got from the XML
00040:     my $atts = shift;
00041:     $atts->{'_str'} =~ s/\n//g;
00042:
00043:     my ($y,$m,$d,$h,$n,$s) = $atts->{'time'} =~ m/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;
00044:
00045:     # Handles the /me
00046:     $atts->{'_str'} = $atts->{'_str'} =~ s/^\/me// ?
00047:         "$atts->{'author'} $atts->{'_str'}" :
00048:         "<$atts->{'author'}>: $atts->{'_str'}";
00049:     $atts->{'_str'} = "$h:$n " . $atts->{'_str'};
00050:     print "$atts->{'_str'}\n";
00051:     undef $message;
00052: }
Step-by-step code walkthrough:
Lines 1 to 4
Initialisation of the basics needed for this snippet, XML::Parser, of
course, and LWP::Simple to get the chatterbox ticker.
Line 8
LWP::Simple gets the requested URL and puts the content of the page in the
$cb_ticker scalar.
Lines 11 to 16
The most interesting part, no doubt. We create here a new XML::Parser
object. The Parser can come in different styles, but when you have to deal
with simple data, like the CB ticker, the Handlers way is the easiest (see
also the Subs style, as it is really close to this one).
For this object, we define four handler subs, each representing a different
state in the parsing process.
- The 'Start' handler is called whenever a new element (or tag, HTML-wise)
  is found. The sub given is called with the expat object, the name of the
  element, and a hash containing all the attributes of this element.
- The 'End' handler is called whenever an element is closed, and is called
  with the same parameters as 'Start', minus the attributes.
- The 'Char' handler is called when the parser finds something which is not
  mark-up (in our case, the text enclosed in the <message> tag).
- Finally, the 'Default' handler is called, well, by default, when the
  parser finds anything not covered by the three other handlers.
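For comparison with the Handlers approach, here is a rough sketch of the Subs style mentioned above. With Style => 'Subs', the parser looks in the package named by the Pkg option for a sub named after each element, and for a sub with a trailing underscore at the end tag. The package name ChatterSubs, the @authors array and the shortened sample data are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @authors;

{
    package ChatterSubs;

    # Called for every <message> start tag.
    sub message {
        my ($expat, $element, %atts) = @_;
        push @authors, $atts{author};
    }

    # Called for every </message> end tag (note the trailing underscore).
    sub message_ {
        my ($expat, $element) = @_;
    }
}

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'ChatterSubs');
$parser->parse(<<'XML');
<CHATTER>
<message author="OeufMayo" time="20010228112952">test</message>
<message author="deprecated" time="20010228113142">pong</message>
</CHATTER>
XML

print join(', ', @authors), "\n";
```

Elements without a matching sub, such as CHATTER here, are skipped, so this style saves the "return unless" test at the top of a generic Start handler.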
Line 17
The line that does all the magic, parsing and calling all your subs for you
at the right moment.
Lines 20-25: the Start handler
We only want to deal with the <message> elements (those containing what is
being said in the Chatterbox), so we'll happily skip every other element.
We get a hash with the attributes of the element, and we're going to use
this hash to store the string that will contain the text to be displayed, in
$atts{'_str'}.
Lines 27-30: the End handler
Once we've reached the end of a message element, we format all the info we
have gathered and print it via the format_message sub.
Lines 32-35: the Char handler
This sub gets all the strings returned by the parser and appends them to the
string that will finally be displayed.
Line 37: the Default handler
It does nothing, so we don't have to figure out what to do with everything
else!
Lines 39-52
This subroutine mangles all the info we got from the XML file, with bad
regexes and all, and prints the formatted text in a hopefully readable
way. Please note that XML::Parser handled all of the decoding of the &lt;
and &gt; entities that were included in the original XML file.
Summary
We now have a complete and simple parser, ready to analyse, extract, report
everything inside the Chatterbox XML ticker!
That's all for now!
subclassing vs. global variables
> This is nice, but I would rather know how to use XML::Parser by subclassing
> it. All of my attempts to do this ended up in very unclean, OOP-unfriendly
> code. I ended up with storing results in package-global variables rather
> than object attributes. This is both ugly and thread-unsafe.
>
> Is there some clean way how to subclass XML::Parser?
The problem is probably that XML::Parser is an object factory: it generates
XML::Parser::Expat objects with each parse or parsefile call. The handlers
then receive XML::Parser::Expat objects and not XML::Parser objects.
There is a way to store data in the XML::Parser object and to access it in
the handlers though: use the 'Non-Expat-Options' argument when creating the
XML::Parser:
#!/bin/perl -w
use strict;
use XML::Parser;
my $p= new XML::Parser(
'Non-Expat-Options' => { my_option => "toto" },
Handlers => { Start => \&start, }
);
$p->parse( '<a />');
sub start
{ my( $pe, $elt, %atts)= @_;
print "my option: ", $pe->{'Non-Expat-Options'}->{my_option}, "\n";
}
This is certainly ugly but it works!
Update: note that the data is still stored in the XML::Parser object though,
as shown by this code:
#!/bin/perl -w
use strict;
use XML::Parser;
my $p= new XML::Parser(
'Non-Expat-Options' => { my_option => "1" },
Handlers => { Start => \&start, }
);
$p->parse( '<a />');
$p->parse( '<b />');
sub start
{ my( $pe, $elt, %atts)= @_;
print "element: $elt - my option: ",
$pe->{'Non-Expat-Options'}->{my_option}++, "\n";
$p->parse( '<c />')
unless( $pe->{'Non-Expat-Options'}->{my_option} > 3);
}
Which outputs:
element: a - my option: 1
element: c - my option: 2
element: c - my option: 3
element: b - my option: 4
Comments
Why do you want to subclass it? It works much better as a "has-a" than an
"is-a", unless you want to get *very* cozy from the base class
implementation, which is a maze of twisty tiny packages all alike.
Just delegate the methods that you want to provide in your interface, and
handle the rest. Make a hash with one of the elements being your "inherited"
parser. I believe it's called the "wrapper" pattern, but I don't name my
patterns — I just use them!
Randal L. Schwartz, Perl hacker
Comments
Well, but …. (there is always a 'but') :-)
Suppose I do not subclass XML::Parser. But then, how do I pass parameters to
XML::Parser handler methods and collect the results of their run without
using global variables of the XML::Parser package? The only object that I
get in the handler methods is expat itself, and there is no place for any
additional parameters/results of handler methods.
And if I subclass XML::Parser, the only advantage that I gain is using my
own package namespace for global variables instead of XML::Parser's
namespace. This does not look to me like a good example of object oriented
programming style.
A possible solution is the one mirod suggested, using Non-Expat-Options, but
it is just a little bit less ugly than these two.
The best solution would be forcing XML::Parser to use my custom subclass of
XML::Parser::Expat instead of XML::Parser::Expat itself. Is there some way
to do that?
Comments
The way to do this, without relying on the fact that the $p is a hashref, is
to pass a closure as the handlers, and have an object that you created saved
in the closure. This is how PerlSAX is implemented.
Witness:
package MyHandler;    # handle_start must live in the handler's class
sub handle_start {
    my ($handler, $p, $element, %attribs) = @_;
    ...
}
package main;
my $handler = bless {}, "MyHandler";
my $p = XML::Parser->new(Handlers => {
    Start => sub { $handler->handle_start(@_) }
});
by Anonymous Monk
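Putting the closure idea together with the "has-a" advice from earlier in the thread, a complete version might look like the sketch below. The class name ElementCounter and its methods are hypothetical, not part of XML::Parser; the point is only that results end up in object attributes rather than package globals.

```perl
use strict;
use warnings;
use XML::Parser;

{
    package ElementCounter;    # hypothetical wrapper: has a parser, is not one

    sub new {
        my ($class) = @_;
        my $self = bless { counts => {} }, $class;
        # The closure captures $self, so the handler can reach our state.
        $self->{parser} = XML::Parser->new(
            Handlers => { Start => sub { $self->handle_start(@_) } },
        );
        return $self;
    }

    sub handle_start {
        my ($self, $expat, $element, %attribs) = @_;
        $self->{counts}{$element}++;
    }

    sub parse {
        my ($self, $xml) = @_;
        $self->{parser}->parse($xml);
        return $self->{counts};
    }
}

my $counter = ElementCounter->new;
my $counts  = $counter->parse('<a><b/><b/><c/></a>');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Note that the object and the closure refer to each other, so a long-running program that creates many of these may want to break the cycle, for example with Scalar::Util's weaken.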