unicode - How to remove strange characters using gsub in R?
I'm trying to clean up text loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kinds of strange characters like:
"the way talk family......i ass beat death....but kno cray cray & leave @ 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
"the way talk family......i ass beat death....but kno cray cray & leave @ \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the Unicode literals at the end: \u009f, \u0098, etc.
I can't find the right command and regular expression to get rid of these. I've tried:
gsub('[^[:punct:][:alnum:][\s]]', '', text)
I also tried specifying the Unicode characters, but I believe they're getting interpreted as text:
gsub('\u009', '', text) # unchanged
The easiest way to get rid of these characters is to convert from UTF-8 to ASCII:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
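As a rough, self-contained illustration of the conversion above, here is a minimal sketch. The sample string `text` and the variable names are made up for the example (they are not from the original post), and it assumes the input is valid UTF-8. The second call is a regex-based alternative, not the method from the answer: it drops everything outside the ASCII range with gsub and perl = TRUE.

```r
# Hypothetical sample: ASCII text followed by two emoji, written as escapes
# so the script itself stays pure ASCII.
text <- "leave @ \U0001F61C\U0001F61D ok"

# Approach from the answer: convert UTF-8 to ASCII and replace every
# byte that cannot be represented with the empty string.
ascii_only <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")

# Alternative sketch (an assumption, not from the answer): remove any
# character outside the ASCII range with a PCRE character class.
regex_only <- gsub("[^\\x01-\\x7F]", "", text, perl = TRUE)

print(ascii_only)
print(regex_only)
```

Both calls discard the characters entirely, so this is only appropriate when losing emoji and accented letters is acceptable. The exact behavior of iconv for unconvertible input can vary with the platform's iconv implementation, so it is worth checking the result on your own data.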