C++ - Splitting text into a list of words with ICU


I'm working on a text tokenizer. ICU is one of the few C++ libraries that have this feature, and probably the best maintained one, so I'd like to use it. I've found the docs for BreakIterator, but there's one problem with it: how do I leave the punctuation out?

    #include "unicode/brkiter.h"
    #include "unicode/bytestream.h"
    #include <QFile>
    #include <QString>
    #include <vector>

    using namespace icu; // recent ICU versions no longer import the namespace by default

    std::vector<QString> listWordBoundaries(const UnicodeString& s)
    {
        UErrorCode status = U_ZERO_ERROR;
        BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
        if (U_FAILURE(status))
            return {};

        std::vector<QString> words;

        bi->setText(s);
        for (int32_t p = bi->first(), prevBoundary = 0; p != BreakIterator::DONE; prevBoundary = p, p = bi->next())
        {
            const auto word = s.tempSubStringBetween(prevBoundary, p);
            char buffer[16384];
            // toUTF8() takes a non-const ByteSink&, so the sink has to be a named object.
            CheckedArrayByteSink sink(buffer, sizeof(buffer));
            word.toUTF8(sink);
            // The sink does not null-terminate the buffer, so pass the byte count explicitly.
            words.emplace_back(QString::fromUtf8(buffer, sink.NumberOfBytesAppended()));
        }

        delete bi;

        return words;
    }

    int main(int /*argc*/, char* /*argv*/[])
    {
        QFile f("e:\\words.txt");
        f.open(QFile::ReadOnly);

        // Note: as posted, the output path is the same as the input path.
        QFile result("e:\\words.txt");
        result.open(QFile::WriteOnly);

        const QByteArray strData = f.readAll();
        for (const QString& word : listWordBoundaries(UnicodeString::fromUTF8(StringPiece(strData.data(), strData.size()))))
        {
            result.write(word.toUtf8());
            result.write("\n");
        }

        return 0;
    }

Naturally, the resulting file looks like this:

“ come  outside .  best  if    not  wake  him . ” 

What I need is just the words. How can this be done?

The Qt library includes several useful methods for checking a character's properties: see QChar.

Indeed, you can create a QString variable from the buffer and check the properties you need before inserting it into the output vector.

For example:

    auto token = QString::fromUtf8(buffer);
    // Keep the token only if it is non-empty and does not start with punctuation.
    if (token.length() > 0 && token.data()[0].isPunct() == false) {
        words.push_back(std::move(token));
    }

With that code you can access the first character of the string and check whether it is a punctuation mark or not.

For something more robust, the check can be expressed as a function:

    // Returns true if the token is empty or contains any punctuation or whitespace.
    bool isInBlacklist(const QString& str) {
        const auto len = str.length();
        if (len == 0) return true;
        for (int i = 0; i < len; ++i) {
            const QChar c = str.at(i);
            if (c.isPunct() == true || c.isSpace() == true) {
                return true;
            }
        }
        return false;
    }

If that function returns true, the token shouldn't be inserted into the vector.
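
To show how such a filter plugs into the token list, here is a minimal sketch that uses only the standard library. It is a stand-in, not the Qt version: the names isInBlacklistAscii and keepWords are hypothetical, and the check is ASCII-only, so non-ASCII punctuation such as the curly quotes in the sample output would not be caught. QChar::isPunct() handles those in the real code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the QString-based check above.
// Returns true when the token is empty or contains any ASCII punctuation
// or whitespace character.
// Assumption: ASCII only. Bytes of multi-byte UTF-8 characters are not
// classified as punctuation here (in the default "C" locale).
bool isInBlacklistAscii(const std::string& token) {
    if (token.empty()) return true;
    return std::any_of(token.begin(), token.end(), [](unsigned char c) {
        return std::ispunct(c) != 0 || std::isspace(c) != 0;
    });
}

// Copies only the tokens that pass the check, preserving their order.
std::vector<std::string> keepWords(const std::vector<std::string>& tokens) {
    std::vector<std::string> words;
    for (const auto& token : tokens) {
        if (!isInBlacklistAscii(token)) {
            words.push_back(token);
        }
    }
    return words;
}
```

Given the tokens "Come", " ", "outside", ".", "him", keepWords returns "Come", "outside", "him". Note that a rule of this shape rejects a token if any character is punctuation, so a token such as "don't" is dropped as well. If that is not wanted, the check would need to look for at least one letter or digit instead.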

