Indexing and query processing of XML documents


Student thesis: Master's Thesis

View graph of relations


  • Chi Hang YUEN

Related Research Unit(s)


Awarding Institution
  • Chung Keung POON (Supervisor)
Award date3 Oct 2006


The Extensible Markup Language (XML) [Con00] is becoming the de facto standard for information representation and exchange over the Internet. Owing to its hierarchical (recursive) and self-describing syntax, XML is flexible enough to express a large variety of information. To retrieve useful information from XML, queries expressed in query language like XPath is used to specify some elements that suit a given criteria. An XPath expression is comprised of a sequence of location steps, each consisting of an axis, a node test, and possibly a predicate. An axis specifies the structural relationship between elements among two adjacent location steps. A node test gives restriction on the names or types of the elements selected in a location step. A predicate specifies further criteria. In this thesis, the indexing and query processing of XML documents are studied. First, we designed an efficient indexing structure for XML documents so that each basic XPath axis step is supported. The indexing structure is built on top of the B+-tree which is available in practically all commercial relational database systems. For most of the basic axis steps, we are able to derive theoretical worst case execution time bounds. We also perform experimental evaluation to substantiate those bounds. Besides, we also studied XML twig pattern matching algorithms and designed an enhancement to TJFast, a state-of-the-art algorithm for the problem. This reduces the CPU cost and also favors indexed inputs. Our algorithm can be shown analytically as efficient as TJFast in terms of worst case I/O, and experimentally performs significantly better.

    Research areas

  • XML (Document markup language), Querying (Computer science), Indexing