java - How to dynamically assert on word boundaries in raw text while pre processing a document?

- August 15, 2014

Firstly, I have researched this, and the matches I found deal closely with word boundaries in sentences or, at most, suggest the use of tokenizers, which is not what I am looking for. My query is as follows:

My current task is related to preprocessing unstructured data. Following the pipeline, the conversion of PDF to TXT files gives out a few sentences like this:

s e ar c h t h s s t r ing def e c t

What I want:

search string defect

All I'm looking for is a few possible approaches to such kinds of scenarios in NLP. Thanks in advance!

Use this file as the word list.

    from math import log

    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    # The file is expected to list words from most to least frequent.
    words = open("words-by-frequency.txt").read().split()
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)

    def infer_spaces(s):
        """Uses dynamic programming to infer the location of spaces in a string
        without spaces."""

        # Find the best match for the i first characters, assuming cost has
        # been built for the i-1 first characters.
        # Returns a pair (match_cost, match_length).
        def best_match(i):
            candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
            return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

        # Build the cost array.
        cost = [0]
        for i in range(1,len(s)+1):
            c,k = best_match(i)
            cost.append(c)

        # Backtrack to recover the minimal-cost string.
        out = []
        i = len(s)
        while i>0:
            c,k = best_match(i)
            assert c == cost[i]
            out.append(s[i-k:i])
            i -= k

        return " ".join(reversed(out))

    # Drop the spurious spaces first, then re-infer the word boundaries.
    s = 's e ar c h t h s s t r ing def e c t'.replace(' ','')
    print(infer_spaces(s))
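To try the idea without the word-list file, here is a minimal self-contained sketch of the same dynamic-programming approach. The six-word list `WORDS`, the helper names `build_costs` and `segment`, and the input string are all hypothetical and chosen only for illustration; a real run would need a large frequency-ordered list.

```python
from math import log

# Hypothetical, tiny word list ordered from most to least frequent.
# This is an assumption for the demo; a real list would be far larger.
WORDS = ["the", "for", "this", "search", "string", "defect"]

def build_costs(words):
    """Zipf-style cost: the lower a word's frequency rank, the higher its cost."""
    n = len(words)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(words)}

def segment(text, words=WORDS):
    """Split a space-free string into the minimum-cost sequence of known words."""
    wordcost = build_costs(words)
    maxword = max(len(w) for w in words)
    unknown = float("inf")  # cost of any piece that is not in the word list

    # best[i] = (cost of the cheapest split of text[:i], length of its last piece)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min(
            (best[i - k][0] + wordcost.get(text[i - k:i], unknown), k)
            for k in range(1, min(i, maxword) + 1)
        ))

    # Walk back from the end, peeling off the last piece each time.
    out = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        out.append(text[i - k:i])
        i -= k
    return " ".join(reversed(out))

# Remove the stray spaces first, then re-infer the boundaries.
raw = "s e ar c h t h i s s t r ing f o r def e c t"
print(segment(raw.replace(" ", "")))  # search this string for defect
```

Unlike the code above, this sketch stores the length of the last piece next to each cost, so the backtracking step does not have to recompute the best match. If the text contains a fragment that the list cannot cover, no finite-cost split exists from that point on and the result degrades to single-character pieces there, so the quality of the word list matters a great deal.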
