Detection of tandem repeats in DNA sequences based on parametric spectral analysis

  • Hongxia ZHOU

Student thesis: Master's Thesis

Abstract

Repetitive DNA sequences occur frequently in genomes. The repeated units may be directly adjacent to each other, as in tandem repeats, or dispersed throughout the genome, as in Long Interspersed Nuclear Elements (LINE) and Short Interspersed Nuclear Elements (SINE). A tandem repeat is an array of consecutive repeated units; such repeats have been found to be related to regulatory functions and diseases. Because tandem repeats vary between individuals, they are commonly used in human gene mapping, linkage studies, and forensic DNA fingerprinting analysis, so detecting them is of considerable significance. A number of studies have been carried out to determine whether these repeats exist in DNA sequences. Several problems have to be addressed: the structure and size of the repeated pattern are not known in advance, most tandem repeats are approximate rather than exact copies, and the vast amount of data to be processed usually requires a lot of computation. This thesis aims to develop an algorithm that addresses these problems and detects tandem repeats efficiently. The algorithm is based on signal processing techniques. The DNA sequences are first converted from character strings into numerical sequences. An autoregressive (AR) model is then used to obtain the spectrum of the DNA sequence within a sliding window. The AR model is a parametric spectral estimation method, unlike the Fourier transform, which is non-parametric. After the whole sequence has been processed, a position-frequency plane is produced. In this plane, significant spectral peaks are selected according to their signal-to-noise ratio (SNR), and candidate regions that may contain tandem repeats are identified. Finally, these regions are analyzed to determine whether they contain tandem repeats and, if so, which ones, including both exact and approximate repeats.
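The pipeline described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say which numerical mapping or AR estimator is used, so this sketch assumes binary indicator sequences for the four bases and Yule-Walker estimation, and it takes the SNR as the peak power divided by the median power. All function names, the window length, the step, and the model order are hypothetical choices made for illustration.

```python
import numpy as np

def indicator_signal(seq, base):
    """Assumed mapping: 1.0 where the sequence has `base`, 0.0 elsewhere."""
    return np.array([1.0 if c == base else 0.0 for c in seq.upper()])

def ar_yule_walker(x, order):
    """Estimate AR coefficients and noise variance from the Yule-Walker equations."""
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    # Small ridge term keeps the system solvable for constant windows.
    R = r[lags] + 1e-9 * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - np.dot(a, r[1:])
    return a, max(sigma2, 1e-12)

def ar_spectrum(a, sigma2, n_freq=256):
    """AR power spectrum sigma2 / |1 - sum_k a_k e^{-j 2 pi f k}|^2 for f in [0, 0.5]."""
    freqs = np.linspace(0.0, 0.5, n_freq)
    k = np.arange(1, len(a) + 1)
    denom = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ a
    return freqs, sigma2 / np.abs(denom) ** 2

def position_frequency_plane(seq, window=60, step=10, order=6, n_freq=256):
    """Slide a window along seq and stack the summed per-base AR spectra."""
    starts = np.arange(0, len(seq) - window + 1, step)
    freqs = np.linspace(0.0, 0.5, n_freq)
    plane = np.zeros((len(starts), n_freq))
    for i, s in enumerate(starts):
        for base in "ACGT":
            x = indicator_signal(seq[s:s + window], base)
            a, sigma2 = ar_yule_walker(x, order)
            plane[i] += ar_spectrum(a, sigma2, n_freq)[1]
    return starts, freqs, plane

def dominant_period(freqs, spectrum):
    """Return the period of the strongest non-DC peak and an assumed SNR (peak / median)."""
    k = 1 + int(np.argmax(spectrum[1:]))
    snr = spectrum[k] / np.median(spectrum)
    return int(round(1.0 / freqs[k])), float(snr)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flank = "".join(rng.choice(list("ACGT"), size=100))
    seq = flank + "ACG" * 40 + flank  # synthetic period-3 tandem repeat
    starts, freqs, plane = position_frequency_plane(seq)
    for s, row in zip(starts, plane):
        period, snr = dominant_period(freqs, row)
        print(s, period, round(snr, 1))
```

A peak at frequency f in a window suggests a candidate pattern size of about 1/f bases. Deciding whether the region really contains an exact or approximate repeat would still require the final analysis step that the abstract mentions, which this sketch does not attempt.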
Experimental results show that, in comparison with other algorithms, our method provides more detailed and reliable information. This thesis presents the tandem repeat detection methods and demonstrates their effectiveness.

Key Words: Autoregressive model; Spectrum analysis; Pattern size; Tandem repeats; DNA sequence analysis
Date of Award: 2 Oct 2007
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Hong YAN

Keywords

  • Spectrum analysis
  • DNA
  • Analysis
