C++ - Splitting text into a list of words with ICU
I'm working on a text tokenizer. ICU is one of the few C++ libraries that have this feature, and probably the best maintained one, so I'd like to use it. I've found the docs for BreakIterator, but there's one problem with it: how do I leave the punctuation out?
    #include "unicode/brkiter.h"
    #include "unicode/bytestream.h"
    #include <QFile>
    #include <QString>
    #include <vector>

    // Recent ICU releases no longer import the icu namespace by default.
    using namespace icu;

    std::vector<QString> listWordBoundaries(const UnicodeString& s)
    {
        UErrorCode status = U_ZERO_ERROR;
        BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);

        std::vector<QString> words;
        bi->setText(s);
        // Walk the boundaries; each [prevBoundary, p) range is one token.
        for (int32_t p = bi->first(), prevBoundary = 0; p != BreakIterator::DONE; prevBoundary = p, p = bi->next())
        {
            const auto word = s.tempSubStringBetween(prevBoundary, p);
            char buffer[16384];
            // toUTF8() does not null-terminate the buffer, so pass the byte count explicitly.
            CheckedArrayByteSink sink(buffer, 16384);
            word.toUTF8(sink);
            words.emplace_back(QString::fromUtf8(buffer, sink.NumberOfBytesAppended()));
        }

        delete bi;
        return words;
    }

    int main(int /*argc*/, char* /*argv*/[])
    {
        QFile f("e:\\words.txt");
        f.open(QFile::ReadOnly);
        // Read the input before the output file is opened for writing.
        const QByteArray strData = f.readAll();

        QFile result("e:\\words.txt");
        result.open(QFile::WriteOnly);
        for (const QString& word : listWordBoundaries(UnicodeString::fromUTF8(StringPiece(strData.data(), strData.size()))))
        {
            result.write(word.toUtf8());
            result.write("\n");
        }
        return 0;
    }
Naturally, the resulting file looks like this:
“ come outside . best if not wake him . ”
What I need is just the words. How can this be done?
The Qt library includes several useful methods for checking a character's properties: see QChar.
Indeed, you can create a QString variable from the buffer and check the properties you need before inserting it into the output vector.
For example:
    auto token = QString::fromUtf8(buffer);
    if (token.length() > 0 && token.data()[0].isPunct() == false)
    {
        words.push_back(std::move(token));
    }
With this code you access the first character of the string and check whether it is a punctuation mark or not.
For something more robust, I would express it as a function:
    bool isInBlacklist(const QString& str)
    {
        const auto len = str.length();
        if (len == 0) return true;
        for (int i = 0; i < len; ++i)
        {
            // Reject the token as soon as any character is punctuation or whitespace.
            const QChar c = str.at(i);
            if (c.isPunct() || c.isSpace())
            {
                return true;
            }
        }
        return false;
    }
If this function returns true, the token should not be inserted into the vector.
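The filtering step can be tried without Qt or ICU installed. What follows is only a minimal sketch of the same idea, not the code from the answer: `isInBlacklistAscii` and `keepWords` are hypothetical names, and `std::ispunct` / `std::isspace` stand in for `QChar::isPunct` / `QChar::isSpace`. Unlike the QChar methods, they only recognise ASCII punctuation and whitespace in the default locale, so a typographic quote such as “ would slip through this version.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stdlib stand-in for the QString-based check: returns true when
// the token is empty or contains any ASCII punctuation or whitespace byte.
bool isInBlacklistAscii(const std::string& token) {
    if (token.empty()) return true;
    for (unsigned char c : token) {
        if (std::ispunct(c) || std::isspace(c)) return true;
    }
    return false;
}

// Hypothetical helper: keeps only the tokens that pass the check above.
std::vector<std::string> keepWords(const std::vector<std::string>& tokens) {
    std::vector<std::string> words;
    for (const auto& token : tokens) {
        if (!isInBlacklistAscii(token)) words.push_back(token);
    }
    return words;
}
```

Given tokens such as "come", " ", "outside" and ".", `keepWords` would keep only "come" and "outside". Because the check looks at every character, this sketch also rejects a token like "don't", which may or may not be what you want.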