Reverse engineering data semantics from arbitrary XML document

以反向工程從任何 XML 文件中找出數據語意

Student thesis: Master's Thesis


  • Hoi Cheung SHIU

Related Research Unit(s)


Awarding Institution
  • Shi Piu Joseph FONG (Supervisor)
Award date: 3 Oct 2006


Extensible Markup Language (XML) has become the standard for persistent data storage and data interchange over the Internet, owing to its openness, self-descriptiveness and flexibility. As a result, more and more data are converted into or handled in XML format, and software developers increasingly need to work with XML documents. Common ways of manipulating XML documents include:
  1. extracting desired data from the documents for further processing;
  2. translating their contents into XML documents that mostly have a different structure (or schema), or even into other document formats such as plain text or HTML; and
  3. designing the necessary data structures, such as table schemas in a relational database, for storing their contents.
Given a small to medium-sized XML document, it is possible to view its contents in a text editor and understand it well enough to achieve the above goals. However, XML documents can be so large that they cannot be loaded into computer memory, which makes it impossible to understand their contents by viewing them in a text editor. Some software lets users examine an XML document, for example with XPath queries, but only if the user has already gone through the entire document to learn its structure or has studied the hard-to-understand schema document, either a Document Type Definition (DTD) [1] or the more powerful yet more complicated XML Schema Document (XSD) [2]. The situation is even worse when a huge XML document must be handled and the corresponding schema is missing: the user then has no choice but to inspect the document manually. With this scenario in mind, we propose a systematic approach to reverse engineer arbitrary XML documents into their conceptual schema, DTD Graphs.
The proposed approach not only determines the structure of the XML document but also derives the candidate data semantics among the XML element instances, treating each instance as a record in a table of a relational database. Determining candidate data semantics from XML documents is therefore an even more significant contribution of the approach. If DTDs for the XML documents exist and identify the ID/IDREF(S)-type attributes, more data semantics can be derived, because these attributes define explicit linkages among the XML element instances in the documents. The determined data semantics can also be used to verify the linkages implemented by ID/IDREF(S): if an element refers to an incorrect XML element type, an extra data semantic will be determined as a result, and such findings can be used for verification purposes. The approach is based on the idea that XML elements are connected by implicit referential linkages through the parent-children structure and by explicit ones through ID/IDREF(S). In line with the idea of attribute inheritance, implicit is-a relationships among XML elements can be determined by comparing element attributes or subelements. With these findings, an arbitrary XML document can be reverse engineered into its conceptual schema in DTD Graph format with data semantics. The data semantics that XML documents can support natively are discussed as well. All proposed algorithms use the Simple API for XML (SAX) to process XML documents, so the entire document need not be loaded into computer memory, which is especially important for handling huge XML documents.
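
As a rough illustration of the streaming idea only, and not the algorithms of the thesis, the following Python sketch uses the standard xml.sax module to record parent-child element pairs and the attribute names of each element type in a single pass, without building the document in memory. The names StructureHandler, candidate_is_a and analyse, the sample document, and the rule that an is-a candidate arises when one element type's attribute names form a proper subset of another's are all assumptions made for this example.

```python
import io
import xml.sax
from collections import defaultdict


class StructureHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects structure facts while streaming."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # names of currently open elements
        self.child_edges = set()            # (parent, child) pairs seen
        self.attributes = defaultdict(set)  # element name -> attribute names

    def startElement(self, name, attrs):
        if self.stack:
            # Implicit linkage through the parent-children structure.
            self.child_edges.add((self.stack[-1], name))
        self.attributes[name].update(attrs.getNames())
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()


def candidate_is_a(attributes):
    """Return (sub, super) pairs where the super type's attribute names
    are a non-empty proper subset of the sub type's (assumed rule)."""
    return sorted(
        (sub, sup)
        for sub, sub_attrs in attributes.items()
        for sup, sup_attrs in attributes.items()
        if sup_attrs and sup_attrs < sub_attrs
    )


def analyse(source):
    """Stream a file name or file-like object through the handler."""
    handler = StructureHandler()
    xml.sax.parse(source, handler)
    return handler


# Made-up sample document, used only to exercise the sketch.
SAMPLE = b"""<company>
  <person id="p1" name="Ann"/>
  <employee id="e1" name="Bob" salary="100"/>
</company>"""

handler = analyse(io.BytesIO(SAMPLE))
print(sorted(handler.child_edges))
print(candidate_is_a(handler.attributes))
```

Because the handler keeps only a stack of open element names and two small summary structures, its memory use depends on the nesting depth and the number of distinct element types rather than on the document size. The pairs it reports are candidates only and would still need to be confirmed against the actual data.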

    Research areas

  • XML (Document markup language), Reverse engineering, Data structures (Computer science)