python - Regex - Splitting strings at full stops unless they're part of an honorific
I have a list containing possible titles:
['mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'capt.', 'lt.-col.', 'col.', 'lt.-cmdr.', 'the hon.', 'cmdr.', 'flt. lt.', 'brgdr.', 'wng. cmdr.', 'group capt.' ,'rt.', 'maj.-gen.', 'rear admrl.', 'esq.', 'mx', 'adv', 'jr.']
I need Python 2.7 code that can replace full stops (\.) with a newline (\n), unless the full stop is part of one of the above titles. Splitting into a list of strings would be fine as well.
Sample input:
modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india. the bill is set to pass.
Sample output:
modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india.
the bill is set to pass.
This should do the trick. Here we use a conditional statement to concatenate each word with \n if it contains a full stop and is not in the list of keywords. Otherwise we just concatenate a space. Finally, the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'capt.', 'lt.-col.', 'col.', 'lt.-cmdr.', 'the hon.', 'cmdr.', 'flt. lt.', 'brgdr.', 'wng. cmdr.', 'group capt.', 'rt.', 'maj.-gen.', 'rear admrl.', 'esq.', 'mx', 'adv', 'jr.'])
s = 'modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india. the bill is set to pass.'

def split_at_period(input_str, keywords):
    final = []
    split_l = input_str.split(' ')
    for word in split_l:
        # A word with a full stop that is not a title ends a sentence.
        if '.' in word and word not in keywords:
            final.append(word + '\n')
            continue
        # Otherwise keep the word on the current line.
        final.append(word + ' ')
    return ''.join(final).rstrip()

print(split_at_period(s, l))
Or as a one-liner :D
print(''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip())
Sample output:
modi is waiting in line to thank dr. manmohan singh for preparing a road map for introduction of gst in india.
the bill is set to pass.
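The question also says that a list of strings would be acceptable. A minimal sketch of that variant follows, assuming the same word-by-word rule; the name `split_sentences` and the shortened `titles` set are illustrative only and not part of the original answer.

```python
def split_sentences(text, keywords):
    """Return a list of sentences, ending one after any word that
    contains a full stop unless that word is a known title."""
    sentences = []
    current = []
    for word in text.split(' '):
        current.append(word)
        if '.' in word and word not in keywords:
            sentences.append(' '.join(current))
            current = []
    if current:  # trailing words with no closing full stop
        sentences.append(' '.join(current))
    return sentences

# Shortened, hypothetical keyword set for the example.
titles = set(['mr.', 'mrs.', 'dr.', 'prof.'])
text = ('modi is waiting in line to thank dr. manmohan singh for preparing '
        'a road map for introduction of gst in india. the bill is set to pass.')
sentences = split_sentences(text, titles)
```

Here `sentences` holds two strings, one per sentence, with `dr.` left inside the first.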
How does it work?
Firstly, we split our string on the space ' ' delimiter using the split() string function, returning the following list:
>>> ['modi', 'is', 'waiting', 'in', 'line', 'to', 'thank', 'dr.', 'manmohan', 'singh', 'for', 'preparing', 'a', 'road', 'map', 'for', 'introduction', 'of', 'gst', 'in', 'india.', 'the', 'bill', 'is', 'set', 'to', 'pass.']
We then start to build a new list by iterating through the split-up list. If we see a word that contains a period but is not a keyword (ex: india. and pass. in this case), we have to concatenate a newline \n to the word to begin a new sentence. We can then append() it to our final list and continue out of the current iteration.
If the word does not end a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built into a string using join().
>>> ['modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'thank ', 'dr. ', 'manmohan ', 'singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ', 'for ', 'introduction ', 'of ', 'gst ', 'in ', 'india.\n', 'the ', 'bill ', 'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have our spaces and newlines where they need to be! Now we can rebuild the string. Notice, however, that the last element in the list also happens to contain a \n; we can clean that up by calling rstrip() on our new string.
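One limitation of the word-by-word approach is easiest to see with a concrete case. Because the string is split on spaces, a keyword such as 'flt. lt.' never appears as a single word, so both of its halves are treated as sentence endings. The sketch below repeats the function so it runs on its own; the input sentence and the two-entry keyword set are made up for illustration.

```python
def split_at_period(input_str, keywords):
    # Same word-by-word logic as above, repeated so this snippet runs alone.
    final = []
    for word in input_str.split(' '):
        if '.' in word and word not in keywords:
            final.append(word + '\n')
        else:
            final.append(word + ' ')
    return ''.join(final).rstrip()

# Hypothetical input containing a keyword with a space in it.
keywords = set(['dr.', 'flt. lt.'])
result = split_at_period('thank flt. lt. smith for coming. bye.', keywords)
# 'flt.' and 'lt.' are looked up separately, so neither matches 'flt. lt.'
# and a newline is inserted after each of them.
```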
The initial solution did not support spaces in keywords, so I've included a new, more robust solution below:
import re

def format_string(input_string, keywords):
    # Combine keywords into one regex; escape them so '.' matches literally.
    regexes = '|'.join(re.escape(k) for k in keywords)
    split_list = re.split(regexes, input_string)  # Split on keys.
    removed = re.findall(regexes, input_string)  # Find removed keys.
    newly_joined = split_list + removed  # Interleave removed and split.
    newly_joined[::2] = split_list
    newly_joined[1::2] = removed
    space_regex = r'\.\s*'
    for index, section in enumerate(newly_joined):
        # Only rewrite full stops in the text between keywords.
        if '.' in section and section not in removed:
            newly_joined[index] = re.sub(space_regex, '.\n', section)
    return ''.join(newly_joined).strip()
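As a possible variation on the function above (not the original author's code), re.split() keeps the matched text in its result when the pattern contains a capturing group, which removes the need for the separate findall() and interleaving steps. The sketch below assumes the keywords should match literally, hence re.escape(), and tries longer keywords first; the name `format_string_grouped`, the short keyword list, and the sample sentence are all hypothetical.

```python
import re

def format_string_grouped(input_string, keywords):
    # Longest keywords first, escaped so '.' is a literal full stop.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(k) for k in ordered) + ')'
    # With one capturing group, odd indexes hold the matched keywords
    # and even indexes hold the text between them.
    parts = re.split(pattern, input_string)
    for index in range(0, len(parts), 2):
        parts[index] = re.sub(r'\.\s*', '.\n', parts[index])
    return ''.join(parts).strip()

# Hypothetical keyword list, including one keyword with a space.
keywords = ['dr.', 'flt. lt.', 'lt.-col.', 'col.']
result = format_string_grouped(
    'thank flt. lt. smith and dr. jones. the bill is set to pass.', keywords)
```

In this example the multi-word title 'flt. lt.' stays on one line, and only the full stops after 'jones' and 'pass' are treated as sentence endings. Note that keywords are matched anywhere in the text, including inside longer words, which is the same behaviour as the function above.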