Use XPath to locate information in XML documents

http://builder.com.com/5100-6389-1054416.html

Baseline Inc. | October 31, 2002

XML is an excellent vehicle for packaging and exchanging data. Parsing and transforming an XML document are common tasks, but locating a specific piece of information within a document is a separate problem, and XPath fills this niche. XPath is a set of syntax rules for addressing the individual parts of an XML document. If you're familiar with XSLT, you've used XPath, perhaps without realizing it.

An industry standard

XPath is an industry standard developed by the World Wide Web Consortium (W3C). It's used in both the XSLT and XPointer standards. Native XML databases often use it to locate information as well.

XPath follows the lead of the Document Object Model (DOM) in treating each XML document as a tree of nodes. Each node is one of seven types: root, element, attribute, text, namespace, processing instruction, and comment. These are all standard aspects of any XML document. You can see several of these node types in the following sample XML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>

The root node represents the document itself and contains the document element, books. Each book is an element with a type attribute, and text nodes hold the content of elements such as title and author. So how do you easily locate individual pieces of data within the document? XPath is the answer.

Locate what you need

You locate information in an XML document by using location-path expressions. These expressions are made up of steps.

A node is the most common search target you'll encounter. Element nodes in the example books XML include book, title, and author. You use paths to locate nodes within an XML document. The slash (/) separates the steps of a path, each step selecting children of the nodes found by the step before it, and every element matching the pattern is returned. The following XPath statement returns all book elements:

//books/book

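A double slash (//) signals that every element in the XML document that matches the search criteria is returned, regardless of its location or level within the document. An absolute path built from single slashes works just as well when you know the structure. For example, you can retrieve all isbn elements: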

/books/book/isbn

The previous expression returns only the matching isbn elements from the sample XML document, not their book or books ancestors:

<isbn>0525934189</isbn>
<isbn>0140185399</isbn>
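As a rough illustration, the path lookups above can be tried with Python's standard xml.etree.ElementTree module. This is only a sketch under some assumptions: ElementTree implements a limited subset of XPath and evaluates paths relative to an element, so the absolute paths from the article are rewritten with a leading dot. The variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

# fromstring() returns the document element, here <books>.
books = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the element they are called on,
# so /books/book/isbn becomes ./book/isbn when starting at <books>.
isbns_by_path = [e.text for e in books.findall("./book/isbn")]

# .//isbn stands in for //isbn: match at any depth below <books>.
isbns_anywhere = [e.text for e in books.findall(".//isbn")]

print(isbns_by_path)
print(isbns_anywhere)
```

Both lookups yield only the isbn elements themselves; a full XPath 1.0 engine would accept the absolute forms unchanged.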

Use square brackets to narrow the search. The brackets select only those elements that have certain child nodes or particular values. The following expression locates all books with the specified title:

/books/book[title='Atlas Shrugged']

You can use the brackets to select all books with author elements as well:

/books/book[author]

The bracket notation lets you use attributes as search criteria. The @ symbol facilitates working with attributes. The following XPath locates all hardback books (all books with the type attribute value hardback):

//book[@type='hardback']

It returns the following element from the sample XML document:

<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
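The three bracket expressions above can be sketched in Python with the standard xml.etree.ElementTree module, which supports these particular predicate forms. The paths are written relative to the books element because ElementTree does not evaluate absolute paths on an element; the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

books = ET.fromstring(SAMPLE)  # the <books> document element

# book[title='Atlas Shrugged']: books whose title child has this text.
by_title = books.findall("./book[title='Atlas Shrugged']")

# book[author]: books that have an author child at all.
with_author = books.findall("./book[author]")

# //book[@type='hardback']: books whose type attribute is 'hardback'.
hardbacks = books.findall(".//book[@type='hardback']")

for book in hardbacks:
    print(book.get("type"), book.findtext("title"))
```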

The bracket notation is called a predicate in the XPath documentation. Another application of predicates is selecting an element by position, with positions counted from 1. For example, the following XPath selects the first book element in the XML document:

/books/book[1]

This expression returns the first book element from the sample XML document:

<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
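A positional predicate can be sketched the same way in Python. The example assumes the standard xml.etree.ElementTree module, which accepts an integer position and last() inside brackets; the path is relative to the books element, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

books = ET.fromstring(SAMPLE)  # the <books> document element

# book[1] selects the first book child; XPath positions start at 1, not 0.
first_books = books.findall("./book[1]")

# book[last()] selects the final book child.
last_books = books.findall("./book[last()]")

print(first_books[0].findtext("title"))
print(last_books[0].findtext("title"))
```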

Specifying elements by position, name, or attribute is great, but some situations require all elements. Thankfully, the XPath specification supports a wildcard to retrieve everything at a given step. Every child element of the document element is easily retrieved with the wildcard (*). The following sample returns all books from the sample XML document:

/books/*

You can easily combine location paths with the union operator to select a combination of elements. The following statement retrieves all hardback and paperback books, and thus every book element in the sample XML document:

//books/book[@type='hardback'] | //books/book[@type='paperback']

The pipe (|) is the union operator: it merges the node sets selected by the paths on either side of it. Selecting individual nodes from an XML document is powerful, but developers must be aware of the path to the node. In addition, XPath provides the logical operators or and and for combining conditions inside a predicate. Comparison operators are available as well: <=, <, >, >=, =, and !=. A single equal sign (=) tests equality, while the exclamation mark and equal sign (!=) test inequality.
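Python's standard xml.etree.ElementTree module does not implement the pipe or the logical operators, so the following sketch only emulates them: it merges two result lists in document order to mimic a union, and chains two predicates to require both conditions. A full XPath 1.0 engine would take the expressions directly; everything here other than the sample data is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

books = ET.fromstring(SAMPLE)  # the <books> document element

# Emulated union: run each path separately, then merge the results.
hardbacks = books.findall("./book[@type='hardback']")
paperbacks = books.findall("./book[@type='paperback']")
selected = set(hardbacks) | set(paperbacks)

# An XPath union holds each node once, in document order.
union = [b for b in books.iter("book") if b in selected]

# Chained predicates must all hold, which for these non-positional
# tests matches book[@type='hardback' and author='Ayn Rand'].
both = books.findall("./book[@type='hardback'][author='Ayn Rand']")

print([b.get("type") for b in union])
print([b.findtext("title") for b in both])
```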

Reference point

The first character in the statement determines the point of reference. Statements beginning with a forward slash (/) are absolute and start at the root of the document, while omitting the slash results in a relative reference. I've used absolute references up to this point, so here's an example of a relative reference:

book/*

The previous statement begins the search at the current reference point, which XPath calls the context node. The context node is supplied by whatever evaluates the expression; in XSLT, for example, it is the node the enclosing template is currently processing. Also, keep in mind that double forward slashes (//) retrieve every matching element regardless of location within the document.
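To see how the reference point changes the result, here is a small Python sketch using the standard xml.etree.ElementTree module, where the element a path is evaluated on plays the role of the reference point. The variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

books = ET.fromstring(SAMPLE)  # the <books> document element

# Evaluated at <books>, book/* selects the children of every book.
from_books = [e.tag for e in books.findall("book/*")]

# Evaluated at the first <book>, the same path finds nothing,
# because a book element has no book children.
first_book = books.find("book")
from_book = first_book.findall("book/*")

# From that reference point, * alone selects the book's own children.
children = [e.tag for e in first_book.findall("*")]

print(from_books)
print(from_book)
print(children)
```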

Context and parent

XPath provides a dot notation to handle selecting the current and parent elements. This is analogous to a directory listing in which a single period (.) represents the current directory and double periods (..) represent the parent directory. In XPath, the single period is used to select the current node, and double periods return the parent of the current node. So, to retrieve all child nodes of the parent of the current node, use:

../*

For example, the following XPath expression steps from each book element up to its parent, so it selects the books element that contains every book in the sample XML document:

/books/book/..
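The parent step can also be sketched in Python with the standard xml.etree.ElementTree module, which accepts .. inside a path as long as the path does not climb above the element it starts from. As before, the paths are relative to the books element and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The sample document from this article (XML declaration omitted).
SAMPLE = """\
<books>
<book type='hardback'>
<title>Atlas Shrugged</title>
<author>Ayn Rand</author>
<isbn>0525934189</isbn>
</book>
<book type='paperback'>
<title>A Burnt-Out Case</title>
<author>Graham Greene</author>
<isbn>0140185399</isbn>
</book>
</books>
"""

books = ET.fromstring(SAMPLE)  # the <books> document element

# book/.. steps down to each book and back up: one <books> element.
parents_of_books = books.findall("./book/..")

# book/title/.. steps back up from each title to its book.
parents_of_titles = books.findall("./book/title/..")

# book/title/../* selects every child of each title's parent.
siblings = [e.tag for e in books.findall("./book/title/../*")]

print([e.tag for e in parents_of_books])
print([e.get("type") for e in parents_of_titles])
print(siblings)
```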

Get what you need

The concepts I've touched on in this article are only an introduction to XPath. You can combine them and use them in an XSLT document or with XPointer. XPath provides more power via built-in functions, and it offers an unabbreviated syntax as an alternative to the abbreviated forms used here. Check out the XPath specification for more details.