Methods and Analyses on Oncovirus Integration


Student thesis: Doctoral Thesis

Award date: 14 Nov 2018


Oncoviruses are associated with many cancer types worldwide, notably human papillomavirus (HPV) in cervical cancer and hepatitis B virus (HBV) in hepatocellular carcinoma. During infection of human cells, an oncovirus expresses specific viral proteins, such as E6 and E7 of HPV, which can interact with human tumor-suppressor proteins and disrupt cellular homeostasis. In addition, segments of the viral genome frequently integrate into the host genome, inducing abnormal expression of nearby genes through cis-regulation and potentially promoting tumorigenesis. Investigating oncovirus infection and integration is therefore increasingly important for the diagnosis and treatment planning of related diseases.

Next-generation sequencing (NGS) technology makes it possible to study virus infection and integration in detail at multiple layers, e.g., the DNA and RNA levels. We constructed an automated pipeline (CanaPipe) for analyzing sequencing data from cancer samples, including whole genome sequencing (WGS), virus captured sequencing (VCS), and transcriptome sequencing (RNA-seq) data. CanaPipe consists of quality control and alignment of sequencing reads, followed by a set of modules for somatic mutation detection, covering small alterations, segmental copy number variations (CNV), structural variations (SV), oncovirus analyses, and so on. To automate the pipeline, we developed a method named FlowSmart, which integrates the analysis modules and manages hundreds of samples at multi-level priorities through a defined workflow.
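As an illustration of what a priority-aware workflow might look like, the following is a minimal sketch and not FlowSmart itself. The function name `schedule`, the module names, and the priority values are all hypothetical. It orders analysis modules so that each one runs only after its dependencies have finished, and among the modules that are ready it picks the one with the lowest priority number first.

```python
import heapq


def schedule(tasks, deps, priority):
    """Return a run order that respects dependencies.

    tasks:    list of module names (hypothetical)
    deps:     dict mapping a module to the modules it must wait for;
              every dependency is assumed to appear in `tasks`
    priority: dict mapping a module to an integer; lower runs earlier
    """
    # Unfinished dependencies for each module.
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    # Reverse lookup: which modules wait on a given module.
    dependents = {t: [] for t in tasks}
    for task, waits_for in remaining.items():
        for d in waits_for:
            dependents[d].append(task)

    # Min-heap of modules that are ready to run, keyed by priority.
    ready = [(priority.get(t, 0), t) for t in tasks if not remaining[t]]
    heapq.heapify(ready)

    order = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)
        for child in dependents[task]:
            remaining[child].discard(task)
            if not remaining[child]:
                heapq.heappush(ready, (priority.get(child, 0), child))

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical module layout: QC, then alignment, then three callers.
tasks = ["qc", "align", "snv", "cnv", "virus"]
deps = {"align": ["qc"], "snv": ["align"], "cnv": ["align"], "virus": ["align"]}
priority = {"qc": 0, "align": 0, "virus": 1, "snv": 2, "cnv": 3}
run_order = schedule(tasks, deps, priority)
```

In this toy layout the three callers all become ready once alignment finishes, and the priority values alone decide which of them runs first. The same idea could be applied per sample so that urgent samples are processed before others.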

We developed a bioinformatics package (FuseSV) that provides numerous functions widely applied in oncovirus analyses of NGS data. It includes database construction, classification of viral subtypes, detection of mutations and recombination, construction of the individual virus genome, identification of virus integrations and investigation of relevant genomic features, and construction of the local genomic map (LGM) at virus integration loci. Additional functions are provided for trans-omics analyses of allele-specific expression (ASE) and virus-host fusions. These functions are implemented as separate modules and managed in an integrated form, which facilitates future updates through collaborative development. We also set up an online platform (DoVirus) for integrated oncovirus analyses, providing further investigations and interactive visualization based on FuseSV results.

FuseSV was applied to oncovirus analyses in several collaborative projects involving hundreds of cancer patients. In 135 cervical carcinomas (CC), including squamous and adeno subtypes, we found several hotspot genes dysregulated by HPV integrations. Besides those found previously (MYC, FHIT, KLF12, KLF5, LRP1B, and LEPREL1), some genes (HMGA2, DLG2, and SEMA3D) were reported for the first time in a cervical cancer study. Furthermore, at the junction sites of HPV integrations, we found significant enrichment of micro-homologies, which implies that HPV integrations might be formed via a micro-homology-mediated DNA repair pathway. In another project involving 150 small cell cervical carcinomas (SCCC), we found four novel HPV-integrated hotspot gene families (SOX, NR4A, ANKRD, and CEA) in addition to the well-known MYC family. The copy numbers and RNA expression of these genes were dramatically affected.
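To make the notion of micro-homology at a junction concrete, here is a minimal sketch. It is not the FuseSV implementation, and the function name and coordinate convention are assumptions made for illustration. It takes a host reference sequence and a viral reference sequence, with a breakpoint in each such that the chimeric sequence is `host[:host_bp] + virus[virus_bp:]`, and counts the bases around the breakpoint that are identical in both references. Such shared bases make the exact breakpoint position ambiguous, which is the usual signature of micro-homology.

```python
def microhomology_length(host, host_bp, virus, virus_bp):
    """Count bases shared by host and virus references at a junction.

    The junction is assumed to join host[:host_bp] to virus[virus_bp:]
    (0-based, hypothetical convention). Bases are compared outward from
    the breakpoint in both directions until the first mismatch.
    """
    host, virus = host.upper(), virus.upper()

    # Extend to the left of the breakpoint.
    left = 0
    while (host_bp - 1 - left >= 0
           and virus_bp - 1 - left >= 0
           and host[host_bp - 1 - left] == virus[virus_bp - 1 - left]):
        left += 1

    # Extend to the right of the breakpoint.
    right = 0
    while (host_bp + right < len(host)
           and virus_bp + right < len(virus)
           and host[host_bp + right] == virus[virus_bp + right]):
        right += 1

    return left + right


# Toy sequences (invented for illustration): both references carry
# "TGCA" across the breakpoint.
host_ref = "AAACTGCAGGG"
virus_ref = "CCGTTGCATTT"
shared = microhomology_length(host_ref, 6, virus_ref, 6)
```

Testing for enrichment would additionally require comparing the observed lengths against a background, for example lengths computed at randomly paired host and virus positions. That step is omitted here.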

We developed a conjugate-graph algorithm to resolve the local genomic map around oncovirus integrations. This method successfully solved all simulated cases, with both simple and complex structures. In the SCCC project, we successfully constructed LGMs for all SCCC samples from WGS data. These LGMs were further classified into three distinct patterns: 1) directly duplicating an oncogene, 2) forming fusion genes, and 3) activating genes via cis-regulation by the HPV long control region. This is the first report to classify HPV integrations based on the local genomic maps they form. We also applied this method to hepatocellular carcinoma (HCC). In one patient with six HCC tumors in the liver, we used VCS and WGS data to construct the individual HBV genome and the local genomic map around HBV integrations at each tumor locus. We found that samples from multiple tumors shared two viral integration sites that could affect three host genes (CSMD2 on chr1 and MED30/EXT1 on chr8). Further study indicated that one hybrid chromosome, formed by HBV integrations between chr1 and chr8, was shared by multiple tumors, suggesting that HBV-associated multifocal HCC is monoclonal in origin.

Our software (FuseSV) and algorithm provide reliable step-wise analyses for oncovirus studies, especially of virus integrations. They give researchers insight into where and how virus integrations occur and into the functional dysregulation they induce. In future work, more oncovirus analysis modules will be developed and incorporated into FuseSV and the online platform to support deeper investigation of the mechanisms of interaction between oncoviruses and host cells.

    Research areas

  • Cancer, Oncovirus, Bioinformatics, Local genomic map, Conjugate graph