java - Bullets in a document shown as question marks in GATE NLP
I am new to GATE NLP. I have a document that contains bullets. When I load it in GATE, the bullets are detected as an unknown type and an odd symbol is printed. I tried setting the encoding to UTF-8, and I also tried loading the document programmatically, but then the bullets show up as ?.
Can anyone explain this to me?
Example:
promoted senior member technical in 2.5 years of experience.
Here an odd symbol is shown in the GATE Developer UI, and a ? symbol is shown when I load the document "programmatically".
In my experience, doc and docx files do not produce such characters. Bullets are either missing (text formatted as a bullet list) or printed as • (text containing raw bullet characters).
See the related question: parsing either font style or block of paragraph in gate
PDF files do produce such "bullet characters" in the GATE document. This may be related to PDF or Apache PDFBox issues, see e.g. this one.
These characters do have a Unicode value, and in XML they appear in encoded form. In that case, my advice is to trace such characters (they may have different Unicode values depending on the original bullet character) and replace them with a printable one (e.g. •).
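The trace-and-replace advice could be sketched roughly as follows. This is only a minimal sketch under one assumption: that the stray bullets land in the Unicode Private Use Area, which is common for symbol-font bullets but is not confirmed by the question. The class name BulletCleaner, its method names, and the sample code point U+F0B7 are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BulletCleaner {

    /** Lists every Private Use Area code point in the text as "U+XXXX" so it can be inspected. */
    static List<String> trace(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    /** Replaces every Private Use Area code point with the standard bullet U+2022. */
    static String replacePrivateUse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.getType(cp) == Character.PRIVATE_USE) {
                out.append('\u2022');      // printable bullet
            } else {
                out.appendCodePoint(cp);   // keep everything else unchanged
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: U+F0B7 stands in for whatever character the PDF actually produced.
        String sample = "\uF0B7 promoted senior member technical";
        System.out.println(trace(sample));
        // Printing the bullet correctly still depends on the console encoding.
        System.out.println(replacePrivateUse(sample));
    }
}
```

In GATE the input text would presumably come from the loaded document, for example via doc.getContent().toString(). If the bullets turn out to have other Unicode values, the filter condition would need to be adjusted to match what trace-style inspection reveals.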
Concerning the ? characters: they are caused by a Java environment that doesn't support these characters. See e.g.: Why do Unicode characters appear as question marks in the console?
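As a rough illustration of where the ? can come from, the sketch below encodes a bullet with a charset that cannot represent it and then prints through a stream that is explicitly UTF-8. The class name ConsoleEncodingDemo and the helper roundTrip are hypothetical, and whether this matches the asker's actual setup is an assumption.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {

    /** Encodes the text with the given charset and decodes it again, as a console stream would. */
    static String roundTrip(String text, Charset charset) {
        // getBytes substitutes the charset's replacement byte for unmappable characters.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u2022 item";

        // US-ASCII cannot represent the bullet, so it is replaced during encoding.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "? item"

        // Wrapping System.out in a UTF-8 PrintStream avoids the lossy default encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

Even with the UTF-8 stream, the terminal or IDE console itself has to be set to UTF-8 and use a font that contains the character, otherwise the output can still look wrong.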