xml - Improper UTF-8 and LibXML::Reader
I have a large XML file from a remote source. It declares itself as 'utf8', but file reports it as us-ascii.
    <?xml version="1.0" encoding="utf-8"?>...
file -bi <file> indicates application/xml; charset=us-ascii, while Encode::Guess indicates utf8.
Edit: here is the code that reads in the file, which is the output of an LWP get... I have tried to force the encoding here, but that led to other errors about wide chars.
    my $fh = IO::File->new;
    $fh->open( '<' . $filename );
    my $content = join '', <$fh>;
I am using XML::LibXML::Reader:
    my $reader = XML::LibXML::Reader->new( string => $content )
        or die qq(cannot read content: $!);
    while ( $reader->nextElement( $template->{'item'} ) ) {
        my $copy = $reader->copyCurrentNode(1);
        my $test = $copy->findvalue('description');
        ... # other stuff with $copy
    }
This works fine through most of the contents. However, there is what looks like invalid UTF-8 or malformed data that gives an error halfway through.
(Note: with XML::Bare the whole XML is processed 'fine', as it is more forgiving, but the file is on the limit of memory size, so I need an XML parser that uses less memory.)
    Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
    Bytes: 0x1A 0x73 0x20 0x73
If I look in vim at the point after the last success, I can see
    ^Z or <^Z> 26, Hex 1a, Octal 032 (from :ascii in vim)
I have looked here on SO to try to ensure the input is at least valid UTF-8, since I can't get the origin fixed, and I am trying...
    use Encode qw( encode decode );
    my $octets = decode( 'utf-8', $content, Encode::FB_DEFAULT );
    $content = encode( 'utf-8', $octets, Encode::FB_CROAK );
But I still get the same error. I'm happy to skip the parts that are invalid UTF-8, but the whole parser dies, and I can't see a way to carry on processing afterwards (which I believe is what is supposed to happen with XML parsing).
My question is: what is the best way to guarantee UTF-8 (assuming I can't get the file changed), or is there a method I should use to get around the error? (I could regex that particular char out, but I'm assuming there may be other similar issues later, and it feels clunky.)
The error message is misleading; the problem has nothing to do with encoding[1]. In fact, the error I receive is the following[2]:
    :1: parser error : PCDATA invalid Char value 26
From the XML spec,
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
U+001A may not legally appear in XML files, not even as a character reference (&#x1A;). As the spec puts it:
    Characters referred to using character references must match the production for Char.
If the file contains binary data, the binary portions should be encoded (e.g. using base64).
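If the source cannot be fixed, one workaround is to drop every character that fails the Char production before handing the content to the parser. The following is only a minimal sketch: strip_invalid_xml_chars is a hypothetical helper name, not part of XML::LibXML, and it assumes that silently discarding the offending characters is acceptable for your data. Applied to decoded text it covers the whole production; applied to undecoded UTF-8 bytes it only catches the illegal control characters below 0x20, which is the case reported here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove every character
# that falls outside the XML 1.0 Char production quoted above.
sub strip_invalid_xml_chars {
    my ($text) = @_;

    # Negated class of the legal ranges:
    #   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

    return $text;
}

# Sample input containing the offending U+001A character.
my $dirty = "it\x{1A}s fine";
my $clean = strip_invalid_xml_chars($dirty);

print "$clean\n";    # prints "its fine"
```

In the question's code this would be called on $content before XML::LibXML::Reader->new. It is still a regex, but it targets the whole class of illegal characters rather than one specific byte.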
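[1] The bytes 1A, 20, and 73 are all less than 80 (hex), so each one is a valid single-byte UTF-8 sequence.
[2] I tested using XML::LibXML rather than XML::LibXML::Reader, but I suspect the relevant difference is a difference in the version of XML::LibXML or libxml2.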
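The point about the bytes being valid UTF-8 can be checked with Encode itself, which also shows why the decode/encode round trip in the question changes nothing. This is a small sketch using only the four bytes from the error message; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# The four bytes reported in the parser error.
my $bytes = "\x1A\x73\x20\x73";

# FB_CROAK makes decode die on malformed UTF-8; LEAVE_SRC keeps $bytes intact.
my $chars = decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );

# Re-encoding yields exactly the same bytes, so the round trip is a no-op.
my $round_trip = encode( 'UTF-8', $chars );

print $round_trip eq $bytes ? "unchanged\n" : "changed\n";    # prints "unchanged"
```

Since decode does not die, the input is well-formed UTF-8; the parser rejects U+001A because XML forbids the character, not because of how it is encoded.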