Component-based approach to electronic document exchange in R&D project management


Student thesis: Master's Thesis


  • Jiazhi LIANG

Related Research Unit(s)


Awarding Institution
Award date: 16 Feb 2004


In R&D project management, electronic documents are used to store and exchange valuable information. To automate their processing, information extraction systems are needed to extract structured data from the binary content of the documents. In practice, electronic documents are encoded in multiple heterogeneous file standards, such as email, MS Word, and HTML. In principle, this calls for a "generic" algorithm that analyzes the binary content and distills the required data regardless of the file standard. However, open and proprietary file standards coexist, and the binary encoding patterns of proprietary standards are not fully accessible, which complicates the design of information extraction systems. This dissertation presents a component-based approach to the development of such systems. It focuses on architecture design rather than on the specification of particular algorithms. The architecture facilitates the reuse of pre-built software components, such as Commercial-Off-The-Shelf (COTS) components, to read and process electronic documents in different file standards. More specifically, the core of the proposed approach is a component framework, offered as a set of design guidelines to support the architectural specification of information extraction systems. The framework consists of formally defined architectural elements, together with tools and techniques that support the use of those elements. The proposed approach has been applied and validated by constructing large-scale information extraction systems at the National Natural Science Foundation of China (NSFC) and the Innovation and Technology Commission (ITC) of the Hong Kong SAR government. With this approach, extraction systems can be built effectively by reusing COTS components to support electronic document exchange in multiple file standards.
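
The architectural idea in the abstract can be illustrated with a small sketch. This is not the framework defined in the thesis. It only assumes that each file standard is handled by a separate reader component behind one common interface, and that a registry dispatches documents to the matching reader. All names (`DocumentReader`, `EmailReader`, `HtmlReader`, `ExtractionFramework`) and the extracted fields are hypothetical, and Python standard-library parsers stand in for COTS components.

```python
from abc import ABC, abstractmethod
from email import message_from_bytes
from html.parser import HTMLParser


class DocumentReader(ABC):
    """Hypothetical common interface implemented by every format-specific component."""

    @abstractmethod
    def extract(self, content: bytes) -> dict:
        """Return structured fields extracted from the raw document bytes."""


class EmailReader(DocumentReader):
    """Wraps the standard-library email parser (a stand-in for a COTS component)."""

    def extract(self, content: bytes) -> dict:
        msg = message_from_bytes(content)
        return {"title": msg.get("Subject", ""), "author": msg.get("From", "")}


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class HtmlReader(DocumentReader):
    """Wraps the standard-library HTML parser (again only a stand-in)."""

    def extract(self, content: bytes) -> dict:
        parser = _TitleParser()
        parser.feed(content.decode("utf-8", errors="replace"))
        parser.close()
        return {"title": parser.title.strip(), "author": ""}


class ExtractionFramework:
    """Hypothetical registry that routes each document to the reader for its file standard."""

    def __init__(self):
        self._readers = {}

    def register(self, file_standard: str, reader: DocumentReader) -> None:
        self._readers[file_standard] = reader

    def extract(self, file_standard: str, content: bytes) -> dict:
        if file_standard not in self._readers:
            raise ValueError(f"no reader registered for {file_standard!r}")
        return self._readers[file_standard].extract(content)


framework = ExtractionFramework()
framework.register("email", EmailReader())
framework.register("html", HtmlReader())
record = framework.extract("email", b"Subject: Progress report\nFrom: pi@example.org\n\nBody")
```

Under these assumptions, supporting a further file standard means registering one more reader, while the code that consumes the extracted fields stays unchanged.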

Research areas

  • Data processing, Project management, Document imaging systems