java - Bullets in document getting as a question mark in GATE NLP
I am new to GATE NLP. I have a document that contains bullets. When I load it into GATE, the bullets are detected as an unknown type of symbol and printed that way. I tried setting the encoding to UTF-8. I also tried loading the document programmatically, and then the bullets get detected as ?.
Can anyone explain this to me?
Example:
promoted senior member technical in 2.5 years of experience.
Here the unknown symbol is what the GATE Developer UI shows, and ? is the symbol shown when I loaded the document programmatically.
In my experience, doc and docx files do not produce such characters. Bullets are either missing (when the text is formatted as a bullet list) or printed as • (when the text contains raw bullet characters).
See the related question: Parsing either font style or block of paragraph in GATE
PDF files do produce "-bullet characters" in the GATE document. This may be related to PDF or Apache PDFBox issues, see e.g. this one.
These characters do have a Unicode value. In XML, they are encoded as, for example, . In that case, my advice is to trace such characters (they may have different Unicode values depending on the original bullet character) and replace them with a printable one (e.g. •).
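The trace-and-replace advice above could be sketched roughly as follows. This is only a minimal sketch under one assumption that the answer does not confirm: that the stray bullets fall in the Unicode Private Use Area (for instance U+F0B7, a code point often seen for Symbol-font bullets). The class name `BulletCleaner` and its methods are hypothetical helpers, not part of GATE.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of GATE: finds Private Use Area characters
// in extracted text and swaps them for a printable bullet.
class BulletCleaner {
    // U+2022 BULLET, the printable replacement suggested in the answer.
    static final int PRINTABLE_BULLET = 0x2022;

    // Assumption: the problematic bullets are Private Use Area code points.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // "Trace" step: count each private-use code point found in the text.
    static Map<Integer, Integer> tracePrivateUse(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(BulletCleaner::isPrivateUse)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    // "Replace" step: substitute every private-use code point with U+2022.
    static String replacePrivateUse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isPrivateUse(cp) ? PRINTABLE_BULLET : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text with an assumed private-use bullet (U+F0B7).
        String raw = "\uF0B7 first item\n\uF0B7 second item";
        tracePrivateUse(raw).forEach((cp, n) ->
            System.out.printf("U+%04X occurs %d time(s)%n", cp, n));
        String cleaned = replacePrivateUse(raw);
        System.out.println(cleaned.length() == raw.length());
    }
}
```

Running the trace step first is useful because, as noted above, different source documents may use different code points for their bullets; if they turn out not to be private-use characters, the `isPrivateUse` check would need to be adapted.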
Concerning the ? characters: these are caused by a Java environment that doesn't support these characters. See e.g.: Why unicode characters appears question mark in console?
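To illustrate how a ? can appear, here is a small sketch using only the Java standard library. It shows that encoding a bullet with a charset that cannot represent it (US-ASCII is used here purely as an example) yields ?, and that writing through a `PrintStream` configured for UTF-8 preserves it. The class name `Utf8Console` is hypothetical, and whether a terminal then renders the bullet correctly still depends on the terminal's own encoding and font.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows where "?" comes from and one way to avoid it.
class Utf8Console {
    // Wraps a stream so that text written to it is encoded as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Encodes and decodes text with US-ASCII; unmappable characters become '?'.
    static String asciiRoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String bullet = "\u2022 item";
        // The bullet cannot be represented in US-ASCII, so it is replaced.
        System.out.println(asciiRoundTrip(bullet));
        // Writing the same text as UTF-8 bytes keeps the bullet intact.
        PrintStream out = utf8(System.out);
        out.println(bullet);
    }
}
```

An alternative that needs no code change is to start the JVM with `-Dfile.encoding=UTF-8`, although how that setting affects console output varies between Java versions.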