How to remove strange characters using gsub in R?


I'm trying to clean text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters, like:

"the way talk family......i ass beat death....but kno cray cray & leave @ 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

"the way talk family......i ass beat death....but kno cray cray & leave @ \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode literals at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the Unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # unchanged
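
For what it's worth, '\u009' is probably not the pattern it looks like: R reads up to four hex digits after \u, so '\u009' is U+0009, a tab, and the text contains no tabs. Below is a minimal sketch of a gsub call that does strip these characters, assuming it is acceptable to drop everything outside printable ASCII. The sample string and the name `cleaned` are made up for illustration.

```r
# Hypothetical sample resembling the text above: two C1 control
# characters (U+009F, U+0098) followed by a smiley (U+263A).
text <- "but kno cray cray & leave @ \u009f\u0098\u263a"

# '\u009' parses as U+0009 (tab), which is why the earlier call changed nothing.
utf8ToInt("\u009")  # 9

# Remove every character outside the printable ASCII range 0x20-0x7E.
# perl = TRUE is used so that the \x escapes are handled by PCRE.
cleaned <- gsub("[^\\x20-\\x7E]", "", text, perl = TRUE)
cleaned
```

Note that this character class also removes tabs and newlines, so it would need widening if those should survive.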

The easiest way to get rid of these characters is to convert from UTF-8 to ASCII:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='') 
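
Here is a minimal sketch of that conversion on a small sample, assuming the input really is UTF-8 and that dropping every non-ASCII character is acceptable. The sample vector is invented here; `combined_doc` is the name used above. With sub = "", bytes that cannot be represented in ASCII are replaced by nothing. Some iconv implementations may substitute a placeholder character instead, so it is worth checking the result on your platform.

```r
# Hypothetical two-line document: one line ends in a smiley (U+263A),
# the other is already plain ASCII.
combined_doc <- c("leave @ \u263a", "plain ascii line")

# Convert UTF-8 to ASCII, substituting nothing for unconvertible bytes.
combined_doc <- iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "")

# Sanity check: only ASCII code points should remain in every element.
all(unlist(lapply(combined_doc, utf8ToInt)) < 128)
```

Keep in mind that this also deletes legitimate non-ASCII text such as accented letters, so it fits best when the goal is ASCII-only output.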
