XML And Database

Store XML data in a relational database

http://builder.com.com/5100-31-5075453.html

September 29, 2003

By Brian Schaffner

A common problem with XML documents is how to persist them. Storing them in a relational database is often the most logical choice because relational databases are so prevalent.

It's not a simple matter of inserting the XML document into the database; there may be additional considerations. Let's look at some techniques you can use to store XML documents in relational databases.

Document table

The simplest and easiest technique is to create a table within the database that has a single large text field where you can store the XML data. Depending on the specific database and the specific XML documents, this field might be a binary large object (BLOB). Some databases require you to store large amounts of data as a BLOB rather than text.

The advantages of this technique are that it's extremely simple to dump the data into the table and equally simple to extract it back out. There are no keys to manage for this table.

The major drawbacks are that you probably won't be able to do any useful text searching, and you may have difficulty locating a specific document, since nothing in the table uniquely identifies one.
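
As a rough illustration, here is a minimal sketch of a document table using Python's built-in sqlite3 module. The table name xml_docs, the column name body, and the sample documents are assumptions made for this example, not part of any particular product.

```python
import sqlite3

def make_document_table(conn):
    # A single text column and no key: each row is one whole XML document.
    conn.execute("CREATE TABLE xml_docs (body TEXT NOT NULL)")

def store_document(conn, xml_text):
    conn.execute("INSERT INTO xml_docs (body) VALUES (?)", (xml_text,))

def load_all_documents(conn):
    # With no key to search on, the practical way out is to read everything.
    rows = conn.execute("SELECT body FROM xml_docs ORDER BY rowid")
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
make_document_table(conn)
store_document(conn, "<Order><ItemInfo>widget</ItemInfo></Order>")
store_document(conn, "<Order><ItemInfo>gadget</ItemInfo></Order>")
docs = load_all_documents(conn)
```

Storing and reading are one statement each, but finding the "gadget" order means scanning every stored document.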

Keyed table

The next most complex solution is to use a keyed table. This is very similar to the document table approach, but this time your table has two fields: a unique key and the XML document. With this technique, you retain much of the simplicity of storing and retrieving whole XML documents. You also introduce a small amount of complexity with managing the unique keys.

A common approach to creating unique keys is to use an MD5 checksum of the XML document. Keep in mind that this approach is insufficient if you are going to have duplicate XML documents in your table, because identical documents produce identical checksums. In that case, you can add further key fields to uniquely identify each document.

Like the document table, the keyed table is easy to implement. The additional overhead of using the keys is not significant and it solves the problem of finding specific documents within the table. However, like the document table, you will still not be able to perform any useful text searches.
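
The keyed table can be sketched the same way. This example assumes SQLite, a hypothetical table named keyed_docs, and the MD5 checksum of the document text as the key; it silently skips a document that is already stored, which is only one of several ways to handle duplicates.

```python
import hashlib
import sqlite3

def md5_key(xml_text):
    # The key is the hex MD5 checksum of the document's UTF-8 bytes.
    return hashlib.md5(xml_text.encode("utf-8")).hexdigest()

def store(conn, xml_text):
    key = md5_key(xml_text)
    # Identical documents hash to the same key, so a repeat insert is ignored.
    conn.execute(
        "INSERT OR IGNORE INTO keyed_docs (doc_key, body) VALUES (?, ?)",
        (key, xml_text),
    )
    return key

def load(conn, key):
    row = conn.execute(
        "SELECT body FROM keyed_docs WHERE doc_key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyed_docs (doc_key TEXT PRIMARY KEY, body TEXT NOT NULL)")
order_xml = "<Order><ItemInfo>widget</ItemInfo></Order>"
key = store(conn, order_xml)
store(conn, order_xml)  # duplicate: same checksum, no second row
```

Calling load(conn, key) returns the original document text, and an unknown key returns None.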

Finite discrete tables

This technique is more complex, but it also gives you more flexibility. With finite discrete tables, you create a set of tables that will store a finite set of discrete XML information. What does that mean? Well, here's an example.

Imagine you have an order document. At the root of the document is the Order element, which contains CustomerInfo, ItemInfo, and ShippingInfo elements. Within the database, you create an OrderDoc table that has an ID field, a CustomerInfoId field, an ItemInfoId field, and a ShippingInfoId field. Then, you also create a CustomerInfo table, an ItemInfo table, and a ShippingInfo table.

These tables have their respective ID fields along with information about the customer, the items, and the shipping data related to the order. Within these tables, there may be additional levels of reference. For example, the CustomerInfo table might contain an AddressInfo field, which references an entry from the AddressInfo table.
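
The order example might be sketched as follows, again assuming SQLite. The table names and ID fields follow the description above; the Name, Description, and Method columns, and the child elements of the same names in the sample document, are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Table and ID field names follow the article; the Name, Description and
# Method columns are hypothetical.
SCHEMA = """
CREATE TABLE CustomerInfo (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE ItemInfo     (Id INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE ShippingInfo (Id INTEGER PRIMARY KEY, Method TEXT);
CREATE TABLE OrderDoc (
    Id             INTEGER PRIMARY KEY,
    CustomerInfoId INTEGER REFERENCES CustomerInfo(Id),
    ItemInfoId     INTEGER REFERENCES ItemInfo(Id),
    ShippingInfoId INTEGER REFERENCES ShippingInfo(Id)
);
"""

# (table, column, path inside <Order>) for each discrete component.
COMPONENTS = [
    ("CustomerInfo", "Name", "CustomerInfo/Name"),
    ("ItemInfo", "Description", "ItemInfo/Description"),
    ("ShippingInfo", "Method", "ShippingInfo/Method"),
]

def shred_order(conn, xml_text):
    """Parse one Order document and spread it across the four tables."""
    order = ET.fromstring(xml_text)
    with conn:  # one transaction: either every row is written or none is
        ids = []
        for table, column, path in COMPONENTS:
            # table and column come from the fixed list above, not user input
            cur = conn.execute(
                f"INSERT INTO {table} ({column}) VALUES (?)",
                (order.findtext(path),),
            )
            ids.append(cur.lastrowid)
        cur = conn.execute(
            "INSERT INTO OrderDoc (CustomerInfoId, ItemInfoId, ShippingInfoId)"
            " VALUES (?, ?, ?)",
            ids,
        )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
order_id = shred_order(
    conn,
    "<Order>"
    "<CustomerInfo><Name>Ada</Name></CustomerInfo>"
    "<ItemInfo><Description>widget</Description></ItemInfo>"
    "<ShippingInfo><Method>ground</Method></ShippingInfo>"
    "</Order>",
)
```

Wrapping the inserts in a single transaction is one way to avoid leaving a half-stored order behind if something fails partway through.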

Advantages and disadvantages

The advantage of this approach is that it allows you to more closely model the tables to the XML data. This allows you to perform more sophisticated queries against the data. It also makes the data more available, since you don't need an XML parser to read the information.

The downside is that this technique requires a lot more effort to develop and maintain. It means that every document has to be parsed out into discrete components and then stored in the database. If that process is not carefully managed, you could end up with some serious data integrity issues. It also means that when extracting an XML document from the database, you have to assemble the discrete components.
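
To give a feel for the reassembly step, here is a minimal sketch that rebuilds an Order document from its rows. It assumes SQLite and the same hypothetical Name, Description, and Method columns; the sample rows are inserted directly so the example stands on its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerInfo (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE ItemInfo     (Id INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE ShippingInfo (Id INTEGER PRIMARY KEY, Method TEXT);
CREATE TABLE OrderDoc (
    Id INTEGER PRIMARY KEY,
    CustomerInfoId INTEGER, ItemInfoId INTEGER, ShippingInfoId INTEGER
);
INSERT INTO CustomerInfo VALUES (1, 'Ada');
INSERT INTO ItemInfo     VALUES (1, 'widget');
INSERT INTO ShippingInfo VALUES (1, 'ground');
INSERT INTO OrderDoc     VALUES (1, 1, 1, 1);
""")

def assemble_order(conn, order_id):
    """Join the discrete rows for one order and rebuild the XML document."""
    row = conn.execute(
        """
        SELECT c.Name, i.Description, s.Method
        FROM OrderDoc o
        JOIN CustomerInfo c ON c.Id = o.CustomerInfoId
        JOIN ItemInfo     i ON i.Id = o.ItemInfoId
        JOIN ShippingInfo s ON s.Id = o.ShippingInfoId
        WHERE o.Id = ?
        """,
        (order_id,),
    ).fetchone()
    if row is None:
        return None
    order = ET.Element("Order")
    parents = ("CustomerInfo", "ItemInfo", "ShippingInfo")
    children = ("Name", "Description", "Method")
    for parent_tag, child_tag, value in zip(parents, children, row):
        parent = ET.SubElement(order, parent_tag)
        ET.SubElement(parent, child_tag).text = value
    return ET.tostring(order, encoding="unicode")

xml_text = assemble_order(conn, 1)
```

The same join, without the XML step, is an ordinary SQL query over the order data, which is where this layout pays off.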

XML:DB native XML database API and its implementation in Apache Xindice

http://builder.com.com/5100-6387-5098255.html

Peter V. Mikhalenko | November 14, 2003

The XML:DB API is designed to provide a common access mechanism to native XML databases. It lets you build applications that store, retrieve, modify, and query data held in an XML database. The API is described in IDL, which leaves implementers free to realize it in any object-oriented language, such as Java or C++.

It is designed to be vendor neutral so that it can be used with the widest possible range of databases. With a native XML solution, there is no need to convert XML data to some other data structure: you store and retrieve your data ready to use in any XML processing workflow.

On the other hand, relational databases offer high-speed data retrieval and rest on a relational theory with a strong mathematical foundation; they are a time-proven technology. However, mapping relational structures to XML can erode those performance benefits.

The XML:DB API can be considered roughly equivalent to technologies such as ODBC, JDBC, or Perl DBI.

XML:DB use cases

The API allows you to:

Retrieve a document from the database using a known ID, if you want to work with the result as a DOM Document object.
Retrieve a document from the database using a known ID, if you want to work with the result as text XML.
Retrieve a document from the database using a known ID, if you want to use a SAX content handler to handle the document.
Retrieve a binary BLOB from the database. The BLOB is identified with a known ID. The database will need to determine the data is binary and return the proper resource type.
Retrieve a document from the database using a known ID, if you want to work with the result as a DOM Node object.
Store a new DOM document in the database using a known ID.
Use a SAX ContentHandler to store a new document in the database using a known ID.
Store a new text XML document in the database using a known ID.
Remove an existing resource from the database using a known ID.
Update an existing DOM document stored in the database.
Update an existing text XML document stored in the database.
Search the collection of fields by XPath expression, working with the results as DOM Nodes.
Insert multiple DOM documents under the control of a transaction.

XUpdate update language

Besides the IDL API and its Java interfaces, the XML:DB initiative also defines XUpdate, an update language expressed as well-formed XML. XUpdate makes extensive use of the expression language defined by XPath for selecting elements to update and for conditional processing. It is a purely descriptive language whose design is modeled on the definition of XSL Transformations (XSLT).

An update is represented by an <xupdate:modifications> element in an XML document. An <xupdate:modifications> element must have a version attribute, indicating the version of XUpdate that the update requires. At present, 1.0 is the only version allowed.

This element may contain several types of child elements:

xupdate:insert-before
xupdate:insert-after
xupdate:append
xupdate:update
xupdate:remove
xupdate:rename
xupdate:variable
xupdate:value-of
xupdate:if

Inserts and appends work much like XSLT stylesheet processing. For example, to create an XML comment you would use the following:

<xupdate:comment>This is the comment</xupdate:comment>

This produces:

<!--This is the comment-->

And the update:

<xupdate:update select="/bottles/wine[2]/province">Champagne</xupdate:update>

would change the content of the selected province element, leaving the document as:

<bottles>
<wine>
<province>Beaujolais</province>
</wine>
<wine>
<province>Champagne</province>
</wine>
</bottles>
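
To see the selection logic at work without an XUpdate processor, here is a minimal sketch that applies the equivalent change with Python's ElementTree. It is not an XUpdate implementation; the helper name apply_update and the starting value Bordeaux are assumptions made for illustration, and the path is written relative to the bottles root element.

```python
import xml.etree.ElementTree as ET

def apply_update(root, path, new_text):
    """Replace the text of the first element matching path (relative to root)."""
    target = root.find(path)
    if target is None:
        raise LookupError(f"nothing selected by {path!r}")
    target.text = new_text

# Hypothetical starting document; only the second province will change.
bottles = ET.fromstring(
    "<bottles>"
    "<wine><province>Beaujolais</province></wine>"
    "<wine><province>Bordeaux</province></wine>"
    "</bottles>"
)

# Positions in XPath predicates count from 1, so wine[2] is the second wine.
apply_update(bottles, "wine[2]/province", "Champagne")
provinces = [p.text for p in bottles.iter("province")]
```

After the call, the first province is untouched and the second holds the new value.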

The following code will be intuitively clear to anyone who knows XSLT:

<xupdate:variable name="province" select="/bottles/wine[1]/province"/>

<xupdate:append select="/bottles">
  <xupdate:element name="wine">
    <xupdate:value-of select="$province"/>
  </xupdate:element>
</xupdate:append>

It binds the selected object to the variable named province and uses the value of this variable to append a new wine record.

Reasons to store data in a native XML database

One reason to store data in a native XML database is to avoid the inefficiency and wasted space that results when your data is semi-structured. That is, it has a regular structure, but that structure varies enough that mapping it to a relational database results in either a large number of columns with null values (wasted space) or a large number of tables (inefficient). Although semi-structured data can be stored in object-oriented and hierarchical databases, choosing to store it in a native XML database in the form of an XML document may be a better option.

A second reason to store data in a native XML database is retrieval speed. Depending on how the native XML database physically stores data, it might be able to retrieve data much faster than a relational database. The reason for this is that some storage strategies used by native XML databases store entire documents together physically or use physical (rather than logical) pointers between the parts of the document. This allows the documents to be retrieved either without joins or with physical joins, both of which are faster than the logical joins used by relational databases.

A third reason to store data in a native XML database is that it allows you to exploit XML-specific capabilities, such as executing XML queries. Given that few data-centric applications need this today and that relational databases are implementing XML query languages, this reason is less compelling.

Apache Xindice

Apache Xindice is a native XML database designed from the ground up to store XML data. It is especially valuable when you have very complex XML structures that would be difficult or impossible to map to a more structured database.

Xindice currently uses XPath as its query language and XML:DB XUpdate as its update language. It provides a Java implementation of the XML:DB API, and it can also be accessed from other languages using XML-RPC.

Native XML database technology is a very new area and Xindice is very much a project in development. The server currently supports storing well-formed XML documents. This means it does not have any schema that constrains what can be placed into a document collection. This makes Xindice a semi-structured database and provides tremendous flexibility in how you store your data, but it also means you give up some common database functionality such as data types.

Xindice currently offers three layers of APIs that can be used to develop applications:

The XML:DB XML Database API is used to develop Xindice applications in Java.
The CORBA API is used when accessing Xindice from a language other than Java.
The Core Server API is the internal Java API of the core database engine. This is the lowest level API and is only available to software running in the same Java VM as the database engine itself.