python - Find number of breaks in a sequence


I have a program that parses allele sequences. I am trying to write code that determines whether an allele is complete or not. To do so, I need to count the number of breaks in the reference sequence. A break is signified by a string of '-' characters. If there is more than 1 break, I want the program to report "incomplete allele."

How can I count the number of breaks in the sequence?

Here is an example of a "broken" sequence:

>dqb1*04:02:01
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--atgtcttggaagaaggctttgcggat-------ccctggaggccttcgggtagcaact
gtgacctt----gatgctggcgatgctgagcaccccggtggctgagggcagagactctcc
cgaggatttcgtgttccagtttaagggcatgtgctacttcaccaacgggaccgagcgcgt
gcggggtgtgaccagatacatctataaccgagaggagtacgcgcgcttcgacagcgacgt
gggggtgtatcgggcggtgacgccgctggggcggcttgacgccgagtactggaatagcca
gaaggacatcctggaggaggaccgggcgtcggtggacaccgtatgcagacacaactacca
gttggagctccgcacgaccttgcagcggcga-----------------------------
-----------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---gtggagcccacagtgaccatctccccatccaggacagaggccctcaaccaccacaac
ctgctggtctgctcagtgacagatttctatccagcccagatcaaagtccggtggtttcgg
aatgaccaggaggagacaactggcgttgtgtccaccccccttattaggaacggtgactgg
accttccagatcctggtgatgctggaaatgactccccagcgtggagacgtctacacctgc
cacgtggagcaccccagcctccagaaccccatcatcgtggagtggcgggctcagtctgaa
tctgcccagagcaagatgctgagtgg----cattggaggcttcgtgctggggctgatctt
cctcgggctgggccttattatc--------------catcacaggagtcagaaagggctc
ctgcactga---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------

The code I have so far is as follows:

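import re

# Collect the start position of every '-' character.
idx = []
for m in range(len(sequence)):
    for n in re.finditer('-', sequence[0]):
        idx.append(n.start())

# Attempt to detect where a run of consecutive positions ends.
counter = 0
min_val = []
for n in range(len(idx)):
    if counter == idx[n]:
        counter = counter + 1
    elif counter != 0:
        min_val.append(idx[n-1])
        counter = 0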

My reasoning behind the above code is that if I find the start positions of each '-', I can see how many times they appear within the sequence and whether they break the sequence at all. However, I know there are flaws in the above code.

It seems you can just count the occurrences of -+, i.e. a sequence of one or more - symbols. The only problem is the line breaks, which you can either incorporate into the regex, or remove by splitting and joining the string before matching.

>>> import re
>>> sequence = """>dqb1*04:02:01....."""
>>> joined = ''.join(sequence.splitlines())
>>> sum(1 for m in re.finditer("-+", joined))
7

Note: this includes the - runs at the start and end of the sequence.
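Since the question is about breaks inside the allele, the leading and trailing runs probably should not count. Here is a minimal sketch of that idea, assuming the header line starts with > and that "more than 1 break" is the threshold for an incomplete allele. The names count_internal_breaks, is_incomplete and max_breaks are made up for illustration:

```python
import re

def count_internal_breaks(record):
    """Count runs of '-' that lie strictly inside the sequence."""
    # Drop the header line (assumed to start with '>').
    body = "".join(
        ln for ln in record.splitlines() if not ln.lstrip().startswith(">")
    )
    # Remove any remaining whitespace, then trim the '-' padding at both ends.
    joined = "".join(body.split()).strip("-")
    return sum(1 for _ in re.finditer(r"-+", joined))

def is_incomplete(record, max_breaks=1):
    # Assumption taken from the question: more than one break means incomplete.
    return count_internal_breaks(record) > max_breaks

example = """>demo
----atg--cc
a---gt-----
-----------"""
print(count_internal_breaks(example))  # 2
print(is_incomplete(example))          # True
```

Stripping the '-' characters from both ends before matching means only the gaps between bases are counted, so no subtraction is needed afterwards.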

Or take the reverse approach: instead of counting the gaps, count the groups:

>>> sum(1 for m in re.finditer("[gatc]+", joined))
6
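The two counts are linked: n groups of bases are separated by n - 1 internal breaks, so the group count also gives the number of breaks without special-casing the ends. A small sketch on a toy string follows; the helper name base_groups is hypothetical, and lowercase gatc is assumed, as in the example above:

```python
import re

def base_groups(joined):
    """Return the (start, end) span of each uninterrupted run of bases."""
    return [m.span() for m in re.finditer(r"[gatc]+", joined)]

toy = "---atg--cca---gt----"
groups = base_groups(toy)
internal_breaks = max(len(groups) - 1, 0)

print(groups)           # [(3, 6), (8, 11), (14, 16)]
print(internal_breaks)  # 2
```

Keeping the spans rather than just the count also shows where each group starts and ends, which can help when reporting which part of the allele is missing.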
