python - Regex - Splitting Strings at full-stops unless it's part of an honorific

- March 15, 2012

I have a list containing possible titles:

['mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'capt.', 'lt.-col.', 'col.', 'lt.-cmdr.', 'the hon.', 'cmdr.', 'flt. lt.', 'brgdr.', 'wng. cmdr.', 'group capt.' ,'rt.', 'maj.-gen.', 'rear admrl.', 'esq.', 'mx', 'adv', 'jr.']

I need Python 2.7 code that can replace full stops (\.) with a newline (\n) unless the full stop is part of one of the above titles.

Splitting into a list of strings would be fine as well.

sample input:

modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india. the bill is set to pass.

sample output:

modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india.
the bill is set to pass.

This should do the trick. We go through the words one at a time and concatenate \n to a word if it contains a full stop and is not in the list of keywords; otherwise we concatenate a space.

Finally, the words in the sentence are joined using join(), and we use rstrip() to eliminate the newline remaining at the end of the string.

l = set(['mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'capt.', 'lt.-col.',
         'col.', 'lt.-cmdr.', 'the hon.', 'cmdr.', 'flt. lt.', 'brgdr.', 'wng. cmdr.',
         'group capt.', 'rt.', 'maj.-gen.', 'rear admrl.', 'esq.', 'mx', 'adv', 'jr.'])
s = 'modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india. the bill is set to pass.'

def split_at_period(input_str, keywords):
    final = []
    split_l = input_str.split(' ')
    for word in split_l:
        # A word containing a period that is not a known title ends a sentence.
        if '.' in word and word not in keywords:
            final.append(word + '\n')
            continue
        final.append(word + ' ')
    return ''.join(final).rstrip()

print(split_at_period(s, l))

Or as a one-liner, using a list comprehension with a conditional expression :D

print(''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip())

sample output:

modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india.
the bill is set to pass.

How does it work?

First, we split our string on the space ' ' delimiter using the split() string method, which returns the following list:

>>> ['modi', 'is', 'waiting', 'in', 'line', 'to', 'thank', 'dr.',  'manmohan', 'singh', 'for', 'preparing', 'a', 'road', 'map', 'for',  'introduction', 'of', 'gst', 'in', 'india.', 'the', 'bill', 'is',  'set', 'to', 'pass.']

We start to build a new list by iterating through the split-up list. If we see a word that contains a period but is not a keyword (e.g. india. and pass. in this case), we have to concatenate a newline \n to the word to begin a new sentence. We can then append() it to our final list and continue out of the current iteration.

If the word does not end a sentence with a period, we can just concatenate a space to rebuild the original string.

This is what final looks like before it is built into a string using join().

>>> ['modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'thank ', 'dr. ', 'manmohan ', 'singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ', 'for ', 'introduction ', 'of ', 'gst ', 'in ', 'india.\n', 'the ', 'bill ', 'is ', 'set ', 'to ', 'pass.\n']

Excellent, we have spaces and newlines where they need to be! Now we can rebuild the string. Notice, however, that the last element in the list happens to contain \n; we can clean that up by calling rstrip() on our new string.
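Since the question mentions that a list of strings would also be fine, the same word-by-word idea can be adapted to collect sentences instead of inserting newlines. This is only a minimal sketch: the function name split_into_sentences and the short keyword set are made up for illustration, and like the loop above it still cannot handle titles that contain spaces.

```python
def split_into_sentences(input_str, keywords):
    """Return a list of sentences, breaking after any word that contains
    a period and is not one of the known titles (hypothetical helper)."""
    sentences = []
    current = []  # words of the sentence being built
    for word in input_str.split(' '):
        current.append(word)
        if '.' in word and word not in keywords:
            sentences.append(' '.join(current))
            current = []
    if current:  # trailing words without a closing period
        sentences.append(' '.join(current))
    return sentences


# Illustrative subset of the titles from the question.
titles = set(['mr.', 'mrs.', 'dr.'])
print(split_into_sentences('we thank dr. singh for this. it is done.', titles))
# ['we thank dr. singh for this.', 'it is done.']
```

Joining the returned list with '\n' gives the same text as the newline version.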

The initial solution did not support spaces in keywords (a title such as 'the hon.' is split into two words before it can be compared), so I've included a new, more robust solution below:

import re

def format_string(input_string, keywords):
    # Escape each keyword so '.' matches a literal period
    # (unescaped, 'mr.' would also match 'mrs').
    regexes = '|'.join(re.escape(k) for k in keywords)  # combine keywords into one regex.
    split_list = re.split(regexes, input_string)  # split on the keys.
    removed = re.findall(regexes, input_string)  # find the removed keys.
    newly_joined = split_list + removed  # list of the right length to interleave into.
    newly_joined[::2] = split_list  # even slots: the text between keys.
    newly_joined[1::2] = removed  # odd slots: the removed keys.
    space_regex = r'\.\s*'

    for index, section in enumerate(newly_joined):
        if '.' in section and section not in removed:
            newly_joined[index] = re.sub(space_regex, '.\n', section)
    return ''.join(newly_joined).strip()
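As an alternative sketch (not the method used in the answer above), the split-and-interleave steps can be folded into a single re.sub() call with a callback: one pattern matches either a title or a full stop, and the callback leaves titles untouched while turning every other full stop into a newline. The function name break_sentences and the shortened title list are assumptions made for this example.

```python
import re


def break_sentences(text, titles):
    """Replace sentence-ending full stops with '.\\n', leaving titles intact
    (hypothetical helper, single-pass variant)."""
    escaped = [re.escape(t) for t in titles]
    # Group 1 captures a title; the second branch matches any other
    # full stop plus the whitespace that follows it.
    pattern = re.compile(r'\b(' + '|'.join(escaped) + r')|\.\s*')

    def replace(match):
        if match.group(1):          # a title: keep it exactly as written
            return match.group(1)
        return '.\n'                # an ordinary full stop: end the line

    return pattern.sub(replace, text).strip()


# Illustrative subset of the titles, including ones with spaces.
titles = ['mr.', 'mrs.', 'dr.', 'the hon.', 'flt. lt.']
print(break_sentences('we met the hon. john smith and flt. lt. jones. they left.', titles))
# we met the hon. john smith and flt. lt. jones.
# they left.
```

Because the regex engine takes the leftmost match, a title is consumed whole before its internal full stops can be seen by the second branch, which is why titles with spaces work here.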
