python - Find number of breaks in a sequence
I have a program that parses allele sequences. I am trying to write code that determines whether an allele is complete or not. To do so, I need to count the number of breaks in the reference sequence. A break is signified by a string of '-' characters. If there is more than 1 break, I want the program to report "incomplete allele."
How can I count the number of breaks in a sequence?
Here is an example of a "broken" sequence:
>dqb1*04:02:01 ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --atgtcttggaagaaggctttgcggat-------ccctggaggccttcgggtagcaact gtgacctt----gatgctggcgatgctgagcaccccggtggctgagggcagagactctcc cgaggatttcgtgttccagtttaagggcatgtgctacttcaccaacgggaccgagcgcgt gcggggtgtgaccagatacatctataaccgagaggagtacgcgcgcttcgacagcgacgt gggggtgtatcgggcggtgacgccgctggggcggcttgacgccgagtactggaatagcca gaaggacatcctggaggaggaccgggcgtcggtggacaccgtatgcagacacaactacca gttggagctccgcacgaccttgcagcggcga----------------------------- ----------------------------------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ---gtggagcccacagtgaccatctccccatccaggacagaggccctcaaccaccacaac ctgctggtctgctcagtgacagatttctatccagcccagatcaaagtccggtggtttcgg aatgaccaggaggagacaactggcgttgtgtccaccccccttattaggaacggtgactgg accttccagatcctggtgatgctggaaatgactccccagcgtggagacgtctacacctgc cacgtggagcaccccagcctccagaaccccatcatcgtggagtggcgggctcagtctgaa tctgcccagagcaagatgctgagtgg----cattggaggcttcgtgctggggctgatctt cctcgggctgggccttattatc--------------catcacaggagtcagaaagggctc ctgcactga--------------------------------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------
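The code I have so far is as follows:

    import re

    # Collect the start position of every '-' in the first record
    idx = []
    for m in range(len(sequence)):
        for n in re.finditer('-', sequence[0]):
            idx.append(n.start())

    # Try to detect where a run of consecutive '-' positions ends
    counter = 0
    min_val = []
    for n in range(len(idx)):
        if counter == idx[n]:
            counter = counter + 1
        elif counter != 0:
            min_val.append(idx[n-1])
            counter = 0

My reasoning behind the above code is that if I find the start positions of every '-', I can see how many times they appear within the sequence, and whether they break the sequence at all. However, I know there are flaws in the above code.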
It seems you can just count the occurrences of -+, i.e. a sequence of one or more - symbols. The only problem is the line breaks, which you either have to incorporate into the regex, or split and join the string before matching.
>>> import re
>>> sequence = """>dqb1*04:02:01....."""
>>> joined = ''.join(sequence.splitlines())
>>> sum(1 for m in re.finditer("-+", joined))
7
Note: this count includes the runs of - at the start and end of the sequence.
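If the leading and trailing runs should not count as breaks, one option is to strip them before counting. The following is only a sketch: the function names, the short `example` record, and the assumption that the header is a single whitespace-free token starting with `>` are mine, and the "more than 1 break" threshold is taken from the question.

```python
import re

def count_internal_breaks(record: str) -> int:
    """Count runs of '-' that lie between two stretches of bases."""
    # Split on any whitespace so line breaks cannot cut a gap in two,
    # and drop the header token (assumed to start with '>').
    chunks = [c for c in record.split() if not c.startswith(">")]
    joined = "".join(chunks)
    # Remove leading and trailing padding so only internal gaps remain.
    core = joined.strip("-")
    return sum(1 for _ in re.finditer(r"-+", core))

def classify(record: str, max_breaks: int = 1) -> str:
    """Label a record using the question's rule: more than 1 break is incomplete."""
    if count_internal_breaks(record) > max_breaks:
        return "incomplete allele"
    return "complete allele"

# Hypothetical toy record, not real allele data.
example = """>demo*01
----atgc--
--ggta----
ccat------"""

print(count_internal_breaks(example))
print(classify(example))
```

Whether the padding at either end should count as a break depends on how the reference alignment is built, so `strip("-")` may or may not be appropriate for your data.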
Or the reverse approach: instead of counting the gaps, count the groups:
>>> sum(1 for m in re.finditer("[gatc]+", joined))
6
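For the other option mentioned earlier, handling the line breaks inside the regex instead of joining first, a pattern such as `-[-\s]*` might work: it starts at a `-` and keeps going through further `-` or whitespace characters. This is a sketch only; `count_gap_runs` and the toy record are hypothetical, and it assumes the header line itself contains no `-`.

```python
import re

def count_gap_runs(text: str) -> int:
    """Count runs of '-' while letting a run continue across whitespace."""
    # A run begins at a '-' and may continue through '-' or whitespace,
    # so a gap that wraps onto the next line is counted once.
    return sum(1 for _ in re.finditer(r"-[-\s]*", text))

# Hypothetical toy record, not real allele data.
raw = ">demo*01\n----atgc--\n--ggta----\nccat------"

# The split-and-join approach, for comparison.
joined = "".join(raw.splitlines())
join_count = sum(1 for _ in re.finditer(r"-+", joined))

print(count_gap_runs(raw), join_count)
```

Like the counts above, both numbers include the runs at the start and end of the sequence.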