DOI: 10.5176/2251-1911_CMCGS14.28

Authors: G. Nuel and V. Delos

Abstract: Nowadays, Next Generation Sequencing (NGS) produce huge number of reads which are combined using multiple alignment techniques to produce sequences. During this process, many sequencing errors are corrected, but the resulting sequences nevertheless contain a marginal level of uncertainty in the form of 0:1{6e6090cdd558c53a8bc18225ef4499fead9160abd3419ad4f137e902b483c465} or less of degenerated positions (like the letter ‘N’ corresponding to any nucleotide). A previous work [7] showed that these degenerated letters might lead to erroneous counts when performing pattern matching on these sequences. An algorithm based on Deterministic Finite Automata (DFA) and Markov Chain Embedding (MCE) was suggested to deal with this problem. In this paper, we introduce a new version of this algorithm which uses Nondeterministic Finite Automata (NFA) rather than DFA to perform what we call “lazy MCE”. This new approach proves itself much faster than the previous one and we illustrate its usefulness on two NGS datasets and a selection of regular expressions. A software implementing this algorithm is available: countmotif, http://www.math-info.univ-paris5.fr/~delosvin/index.php?choix=4

Keywords: Nondeterministic Finite Automaton; Deterministic Finite Automaton; Next Generation Sequencing; Pattern Matching; Lazy Determinization.

simplr_role_lock:

Price: $0.00

Loading Updating cart...
LoadingUpdating...