java - How to dynamically assert on word boundaries in raw text while preprocessing a document?
Firstly, I have researched this and found that the closest matches deal with word boundaries in sentences or, at most, suggest the use of tokenizers, which is not what I am looking for. My query is as follows:
My current task is preprocessing unstructured data. One step in my pipeline, the conversion of PDF to TXT files, gives out a few sentences like this:
s e ar c h t h s s t r ing def e c t
What I want:
search string defect
All I'm looking for is a few possible approaches to such scenarios in NLP. Thanks in advance!
Use this file as the word list.
from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
# The file is expected to list words from most to least frequent.
with open("words-by-frequency.txt") as f:
    words = f.read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string without spaces."""

    # Find the best match for the first i characters, assuming cost has
    # been built for the first i-1 characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

# Remove the stray spaces first, then let the word list decide where they belong.
s = 's e ar c h t h s s t r ing def e c t'.replace(' ','')
print(infer_spaces(s))
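If you want to try the idea without downloading a word list, here is a minimal self-contained sketch of the same dynamic-programming approach. The tiny in-memory list `WORDS`, its ordering, and the names `WORD_COST`, `MAX_LEN` and `segment` are assumptions made up for illustration; a real run would need a proper frequency-ordered list. Unlike the code above, this sketch stores the chosen word length while building the table and returns `None` when the text cannot be split into known words.

```python
from math import log

# Hypothetical stand-in for a real frequency file: most frequent word first.
WORDS = ["the", "a", "this", "is", "string", "search", "defect", "in"]

# Zipf-style cost: the rarer the word (higher rank), the higher its cost.
WORD_COST = {w: log((rank + 1) * log(len(WORDS))) for rank, w in enumerate(WORDS)}
MAX_LEN = max(len(w) for w in WORDS)


def segment(text):
    """Split a space-free string into the lowest-cost sequence of known words."""
    # best[i] = (cost, length of last word) for the cheapest split of text[:i]
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min(
            (best[i - k][0] + WORD_COST.get(text[i - k:i], float("inf")), k)
            for k in range(1, min(i, MAX_LEN) + 1)
        ))

    # An infinite total cost means no split into known words exists.
    if best[-1][0] == float("inf"):
        return None

    # Walk backwards through the stored word lengths to rebuild the words.
    out = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        out.append(text[i - k:i])
        i -= k
    return " ".join(reversed(out))


fragments = "s e ar c h s t r ing def e c t"
print(segment(fragments.replace(" ", "")))  # search string defect
```

With a realistic word list, single letters and short fragments are usually present as entries, so most inputs get some segmentation; the quality then depends on how well the frequency ranking matches your documents.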