Indexing and query processing of XML documents
Student thesis: Master's Thesis
Related Research Unit(s)
|Award date||3 Oct 2006|
The Extensible Markup Language (XML) [Con00] is becoming the de facto standard for information representation and exchange over the Internet. Owing to its hierarchical (recursive) and self-describing syntax, XML is flexible enough to express a large variety of information. To retrieve useful information from XML, queries expressed in query language like XPath is used to specify some elements that suit a given criteria. An XPath expression is comprised of a sequence of location steps, each consisting of an axis, a node test, and possibly a predicate. An axis specifies the structural relationship between elements among two adjacent location steps. A node test gives restriction on the names or types of the elements selected in a location step. A predicate specifies further criteria. In this thesis, the indexing and query processing of XML documents are studied. First, we designed an efficient indexing structure for XML documents so that each basic XPath axis step is supported. The indexing structure is built on top of the B+-tree which is available in practically all commercial relational database systems. For most of the basic axis steps, we are able to derive theoretical worst case execution time bounds. We also perform experimental evaluation to substantiate those bounds. Besides, we also studied XML twig pattern matching algorithms and designed an enhancement to TJFast, a state-of-the-art algorithm for the problem. This reduces the CPU cost and also favors indexed inputs. Our algorithm can be shown analytically as efficient as TJFast in terms of worst case I/O, and experimentally performs significantly better.
- XML (Document markup language), Querying (Computer science), Indexing