Employing compact intra-genomic language models to predict
genomic sequences and characterize their entropy
Sérgio Deusdado¹, Paulo Carvalho²
¹Escola Superior Agrária
Instituto Politécnico de Bragança
P-5300 Bragança, Portugal
E-mail: sergiod@ipb.pt
²Universidade do Minho
Departamento de Informática
P-4710-057 Braga, Portugal
E-mail: pmc at di.uminho.pt
Abstract
Probabilistic language models are fundamental to understanding and
learning the profile of the underlying code and to estimating its
entropy, enabling the verification and prediction of “natural”
emanations of the language. Language models aim to capture the
salient statistical characteristics of the distribution of word
sequences; transposed to the genomic language, they allow modeling a
predictive system of the peculiarities and regularities of genomic code
under different inter- and intra-genomic conditions. In this paper, we
propose the application of compact intra-genomic language models to
predict the composition of genomic sequences, aiming to provide
valuable resources for data compression and to broaden the
perspectives of similarity analysis in genomic sequences. The obtained
results encourage further investigation and validate the use of
language models in biological sequence analysis.
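
As a rough illustration of how a compact language model can both predict a genomic sequence and characterize its entropy, the sketch below trains an adaptive order-k context model on a sequence as it reads it and reports the average cost in bits per base. It is not the model proposed in this paper: the function name `adaptive_entropy`, the add-one smoothing, and the restriction to the alphabet A, C, G, T are assumptions made only for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the input contains only these four bases


def adaptive_entropy(sequence, order=3):
    """Average bits per base under an adaptive order-k context model.

    Each base is predicted from the `order` bases preceding it, using
    counts gathered only from the part of the sequence already seen
    (an intra-genomic model). Add-one smoothing is an illustrative choice.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    total_bits = 0.0
    predicted = 0
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        symbol = sequence[i]
        ctx_counts = counts[context]
        # Smoothed probability of the observed base given its context.
        p = (ctx_counts[symbol] + 1) / (sum(ctx_counts.values()) + len(ALPHABET))
        total_bits += -math.log2(p)
        # Update the model only after predicting, as a compressor would.
        ctx_counts[symbol] += 1
        predicted += 1
    return total_bits / predicted if predicted else 0.0


if __name__ == "__main__":
    repetitive = "ACGT" * 200
    print(f"{adaptive_entropy(repetitive):.3f} bits per base")
```

A value well below 2 bits per base indicates that the sequence is predictable from its own history and therefore compressible; a value near 2 indicates that the model finds no exploitable regularity.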
IWPACBB'10, Guimarães,
Portugal, June 2010