Software Design Patterns Classification and Selection Using Text Categorization Approach

基於文本分類方法的軟件設計模式分類與選擇

Student thesis: Doctoral Thesis

Detail(s)

Award date: 13 Sep 2017

Abstract

Software design is a challenging task in the agile software development life cycle, where the fundamental structure of software artifacts is highly prone to change over time as new features are added or existing functionality is modified. This evolution affects system-level quality attributes such as reusability, maintainability, and understandability. There are two common practical approaches to address the declining quality of software systems. The first is software refactoring, which improves the design quality of a system by improving its internal structure without changing its external behavior. The second is the use of design patterns: applying design patterns early in the design phase may help to prevent later refactoring. A design pattern is a solution to a commonly occurring problem, published by an expert in a specific domain. For example, in the domain of object-oriented development, Gamma et al. published a catalog of 23 design patterns known as the Gang-of-Four (GoF) design patterns. The classification scheme and the semantic correlation between patterns depend on the experience and knowledge of experts in the corresponding domain. Consequently, a novice developer needs considerable knowledge and effort to understand the classification scheme, the semantic correlation between patterns, and the consequences of each pattern.

Numerous software design patterns have been introduced, either as a canonical solution (the standard or original solution) or as a variant solution (an alternative solution) to a design problem. Books and online repositories/libraries are common sources of cataloged software design patterns. Because the number of design patterns in books, the literature, and online repositories is steadily increasing, it is hard to stay aware of the published design patterns and to select the right pattern for the real design problem at hand. In this regard, Unified Modeling Language (UML) based, ontology based, and text categorization based approaches have been introduced to automate the selection of the right design pattern for a given design problem. However, the emergence of new patterns, inconsistency in classification schemes, heterogeneous pattern descriptions, semi-formal specification, the multi-class problem, and the need for an adequate sample size to achieve precise learning (for individual classifier training) are the main constraints on using the existing automatic techniques to find a candidate design pattern class and suggest more appropriate pattern(s).

To address these issues, we exploit a text categorization based approach built on unsupervised learning techniques. It aims to provide a systematic way to group similar design patterns and to suggest the right design pattern(s) to developers based on the specification of a given design problem (First Contribution). The proposed approach is applied with five widely used unsupervised learning techniques, namely Fuzzy c-means, k-means, Agglomerative, and Partition Around Medoids (PAM) with Euclidean and Manhattan distance measures, in the context of four design pattern collections used in different domains and 105 real design problems. We also propose an evaluation model to assess the effectiveness of the proposed approach in terms of 1) the organization of the design patterns of a target pattern catalog according to the expert’s classification scheme, 2) the determination of the design pattern class for a given design problem, and 3) the selection of the right design patterns for a given design problem. In the automated system based on the text categorization approach, global filter-based feature selection, rather than wrapper or embedded methods, is used to reduce the undesirable effect of uninformative features and to construct a more representative feature set. The constructed feature set depends on the capacity of the global filter-based feature selection method, which is biased by its discriminative power. We leverage the Improved Global Feature Selection Scheme (IGFSS) to combine the discriminative power of two feature selection methods and propose a new feature selection method named Ensemble-IG (Second Contribution). The proposed Ensemble-IG combines the power of Odds Ratio (OR) and Information Gain (IG). Moreover, we propose a new approach to construct a more representative feature set by leveraging a powerful deep learning algorithm named Deep Belief Network (DBN) (Third Contribution).
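
The grouping-and-suggestion idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes TF-IDF weighting, plain k-means with Euclidean distance, and a fixed choice of initial centroids, and the four one-line pattern descriptions are invented for illustration only. All function names (`tfidf_vectors`, `kmeans`, `suggest`) are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lower-case and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]

def vectorize(tokens, vocab, idf):
    # TF-IDF vector over a fixed vocabulary, L2-normalised; unknown terms are ignored.
    tf = Counter(tokens)
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def tfidf_vectors(docs):
    # Build the vocabulary and smoothed IDF weights from tokenised documents.
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf, [vectorize(d, vocab, idf) for d in docs]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, init_indices, iterations=20):
    # Plain k-means; initial centroids are chosen by index (an assumption made
    # here so that the sketch is deterministic).
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def suggest(problem, names, vectors, labels, centroids, vocab, idf):
    # Map the problem to its nearest group, then rank that group's patterns.
    q = vectorize(tokenize(problem), vocab, idf)
    cluster = min(range(len(centroids)), key=lambda c: euclidean(q, centroids[c]))
    ranked = sorted((i for i, l in enumerate(labels) if l == cluster),
                    key=lambda i: euclidean(q, vectors[i]))
    return cluster, [names[i] for i in ranked]

# Invented, heavily simplified problem descriptions (illustration only).
patterns = {
    "Factory Method": "create object without specifying concrete class creation",
    "Abstract Factory": "create families of related object without concrete class",
    "Observer": "notify dependent object when state change",
    "State": "alter behavior when internal state change",
}
names = list(patterns)
vocab, idf, vectors = tfidf_vectors([tokenize(d) for d in patterns.values()])
labels, centroids = kmeans(vectors, init_indices=[0, 2])
cluster, candidates = suggest(
    "object must be notified when another object state change",
    names, vectors, labels, centroids, vocab, idf)
print(labels, cluster, candidates)
```

On this toy data the two creational descriptions and the two behavioral descriptions fall into separate groups, and the sample problem is mapped to the behavioral group with Observer ranked first. Real pattern descriptions are far longer and noisier, which is where the choice of weighting and feature selection method becomes important.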

We observed that Fuzzy c-means and Partition Around Medoids (PAM) produce better results than the other unsupervised learning techniques used in the proposed approach. However, we also observed that the learning precision of unsupervised learners remains sensitive to the weighting and feature selection methods. In many cases, Term Frequency Collection (TFC) and Term Frequency Inverse Document Frequency (TFIDF) remain the best weighting methods for unsupervised learners. We also observed a significant improvement in the learning precision of the best-performing unsupervised learner of the proposed approach when using Ensemble-IG and the deep learning based feature selection method.
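
For reference, the two filter scores named in this abstract have standard textbook definitions, given below. The notation is an assumption of this note: t is a term, c_1, ..., c_m are the classes, and a bar denotes absence of the term or the complement of the class. How Ensemble-IG combines the two scores is not stated in the abstract and is not reproduced here.

```latex
% Information Gain of term t over m classes (global score)
\mathrm{IG}(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
  + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
  + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})

% Odds Ratio of term t with respect to class c_i (one-sided local score)
\mathrm{OR}(t, c_i) = \log
  \frac{P(t \mid c_i)\,\bigl[1 - P(t \mid \bar{c}_i)\bigr]}
       {\bigl[1 - P(t \mid c_i)\bigr]\,P(t \mid \bar{c}_i)}
```

IG assigns one score per term across all classes, while OR is computed per class and is positive when the term is more indicative of membership in c_i than of non-membership.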

Several conclusions are drawn from the experimental results and described in the presented case studies. In summary, we conclude that the proposed approach has four advantages over the existing approaches. First, a semi-formal specification of design patterns is not required as a prerequisite, because the problem definition section of each design pattern (in unstructured form), rather than the solution section, is used. Second, ground-truth class label assignment is not mandatory, because the proposed approach uses unsupervised rather than supervised learning techniques. Third, no classifier needs to be trained for each design pattern class. Fourth, an adequate sample size is not required to achieve precise learning.