Text Mining in Computational Biology and Biomedicine


Student thesis: Doctoral Thesis





Award date: 15 Oct 2018


Healthcare matters to everyone throughout life, and biomedical technology plays a vital role in its development. Without biomedical knowledge, ordinary people may be misled into inappropriate therapy and put in danger. The extensive information buried in biomedical documents is valuable but time-consuming to obtain by manual reading. Biomedical text mining (also known as BioNLP) bridges this gap by using natural language processing to transform complex biomedical text into easily understood representations of knowledge (e.g., tags, summaries, and networks). The biomedical field thus becomes accessible to those without specialist biomedical training, and data analysts can more easily take part in biomedical research projects. For ordinary people, readily accessible biomedical knowledge can help improve their healthcare and guide their use of medical treatments. One difficulty in this interdisciplinary topic is developing methods that can handle the differences between biomedical text and regular text. Another challenge is identifying biomedical or healthcare problems that can make use of textual data and designing models to tackle them.

This thesis presents a brief overview of the current state of biomedical text mining, covering data sources, task types, supporting tools, challenges, and frontier problems. It then describes three practical applications in detail: biomedical document categorization, information extraction from biomedical literature, and association mining from the documents in a genomics data repository. For each application, the content includes the problem definition, feature description, novel machine-learning solutions, and experimental results.

For document categorization, we focus on a concrete problem: cancer hallmark annotation of biomedical literature. Text mining is a promising technique for discovering the vast amount of knowledge about cancer embedded in the biomedical literature. Automated annotation of cancer hallmarks could reveal the processes of cancer transformation discussed in a paper and retrieve most of the articles corresponding to a cancer hallmark of interest.
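As a rough illustration of the retrieval use case, the sketch below assumes that each article has already been annotated with a set of hallmark labels (by whatever classifier is used) and builds an inverted index from hallmark to articles. The document identifiers and the function name build_hallmark_index are hypothetical; this is not the annotation method of the thesis, only a minimal view of how multi-label annotations could be queried.

```python
from collections import defaultdict


def build_hallmark_index(annotations):
    """Map each hallmark label to the set of documents annotated with it.

    `annotations` is a dict: document id -> set of hallmark labels.
    A document may carry several labels (multi-label) or none.
    """
    index = defaultdict(set)
    for doc_id, hallmarks in annotations.items():
        for hallmark in hallmarks:
            index[hallmark].add(doc_id)
    return index


# Hypothetical toy annotations; real labels would come from a trained model.
annotations = {
    "doc1": {"sustaining proliferative signaling", "resisting cell death"},
    "doc2": {"inducing angiogenesis"},
    "doc3": {"resisting cell death"},
    "doc4": set(),  # no hallmark detected
}

index = build_hallmark_index(annotations)

# Retrieve the articles for one hallmark of interest.
articles = sorted(index["resisting cell death"])
```

Because annotation is multi-label, the same article can appear under several hallmarks, which is why the index stores sets of document identifiers rather than a single label per document.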

For information extraction, this thesis describes the extraction of bacterial gene interactions from biomedical literature. The task aims to identify relations among biological entities. Since experimentally validated gene interactions are often reported in the literature, extracting gene interaction events from scientific texts is feasible. Specifically, the task is to extract the genetic processes mentioned in a scientific corpus on one bacterium, Bacillus subtilis. We describe a deep learning method that identifies these relations automatically.

Association mining here differs substantially from the conventional data mining task. The objective is to identify associations among genes, diseases, and drugs. However, the potential relationships cannot be observed directly from either the genomics data or the description documents. We proposed a computational method that obtains gene signatures from the description documents together with the genomics data; these signatures are used to calculate the similarity among genes, diseases, and drugs. We further adopted a biological metric to validate the associations and derived the corresponding ranking scores.
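The idea of scoring associations by signature similarity can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the thesis: signatures are assumed to be sparse gene-to-weight mappings, cosine similarity stands in for whichever similarity measure is actually used, and all gene names, drug names, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse signatures (dict: gene -> weight)."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature is treated as unrelated
    return dot / (norm_a * norm_b)


def rank_associations(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, sig))
              for name, sig in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy signatures; positive = up-regulated, negative = down-regulated.
disease = {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 0.8}
drugs = {
    "drug_x": {"GENE_A": 0.9, "GENE_B": -0.4, "GENE_C": 0.7},
    "drug_y": {"GENE_A": -1.0, "GENE_C": -0.6},
    "drug_z": {"GENE_D": 1.0},
}

ranking = rank_associations(disease, drugs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting a signature that shares no genes with the query scores zero, and one with opposite regulation scores negative. How such scores should be interpreted biologically (for example, whether a reversed signature suggests a therapeutic candidate) depends on the validation metric, which this sketch does not model.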

    Research areas

  • Bioinformatics, BioNLP, Text mining, Data mining, Computational biology