xml - Improper UTF-8 and LibXML::Reader
I have a large XML file from a remote source. Its declaration says 'utf-8', but `file` reports it as us-ascii.

    <?xml version="1.0" encoding="utf-8"?>...

`file -bi <file>` indicates `application/xml; charset=us-ascii`, while Encode::Guess indicates utf8.

Edit: here is the code that reads in the file, which is the output of an LWP get. I have tried to force the encoding here, but that gives other errors about wide chars.
    my $fh = IO::File->new;
    $fh->open( '<' . $filename ) or die qq(cannot open $filename: $!);
    my $content = join '', <$fh>;

I am using XML::LibXML::Reader:

    my $reader = XML::LibXML::Reader->new( string => $content )
        or die qq(cannot read content: $!);
    while ( $reader->nextElement( $template->{'item'} ) ) {
        $copy = $reader->copyCurrentNode(1);
        $test = $copy->findvalue('description');
        ... # other stuff with $copy
    }

This works fine through most of the contents. However, there is what looks like invalid UTF-8 or malformed data that gives an error halfway through.
(Note: with XML::Bare the whole XML is processed 'fine', as it is more forgiving, but the file is on the limit of memory size, so I need an XML parser with a smaller memory footprint.)
    Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
    Bytes: 0x1A 0x73 0x20 0x73

If I look in vim at the point after the last success, `:ascii` shows

    ^Z or <^Z> 26, Hex 1a, Octal 032

I have looked here on SO for ways to ensure the content is at least valid UTF-8, since I can't get the origin fixed, and I am trying...
    use Encode qw( encode decode );
    $octets  = decode( 'utf-8', $content, Encode::FB_DEFAULT );
    $content = encode( 'utf-8', $octets,  Encode::FB_CROAK );

But I still get the same error. I am happy to skip the parts that are invalid UTF-8, but the whole parser dies, and I can't see a way to carry on processing later (which I believe is what is supposed to happen with XML parsing).
My question is: what is the best way to guarantee UTF-8 (assuming I can't get the file changed), or is there a method I should use to get around the error? (I could regex the particular char out, but I'm assuming there may be other similar issues later, and it feels clunky.)
The error message is misleading; the problem has nothing to do with encoding[1]. In fact, the error I receive is the following[2]:

    :1: parser error : PCDATA invalid Char value 26

From the XML spec,
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
U+001A may not legally appear in XML files, not even as a character reference (&#x1A;).
    Characters referred to using character references must match the production for Char.
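The Char production above can be turned into a regex check. The following is only a minimal sketch, not something taken from XML::LibXML: the helper names `find_invalid_xml_chars` and `strip_invalid_xml_chars` and the sample string are hypothetical. It reports where characters outside the production occur and can drop them before the content is handed to the parser. Applied to undecoded UTF-8 bytes it still catches control characters such as 0x1A, since those are single bytes in UTF-8, but it only sees the full production if the text has been decoded first.

```perl
use strict;
use warnings;

# Matches any single character OUTSIDE the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
my $NOT_XML_CHAR =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: list [offset, code point] for each illegal character.
sub find_invalid_xml_chars {
    my ($text) = @_;
    my @hits;
    while ( $text =~ /($NOT_XML_CHAR)/g ) {
        push @hits, [ pos($text) - 1, ord $1 ];
    }
    return @hits;
}

# Hypothetical helper: return a copy of the text without illegal characters.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/$NOT_XML_CHAR//g;
    return $text;
}

# Made-up sample that embeds the same 0x1A byte as in the error message.
my $sample = "<description>it\x1As s</description>";

for my $hit ( find_invalid_xml_chars($sample) ) {
    printf "offset %d: U+%04X\n", @$hit;
}

my $clean = strip_invalid_xml_chars($sample);
```

The cleaned string could then be passed to `XML::LibXML::Reader->new( string => $clean )` as in the question. Silently dropping characters loses data, so this is a workaround for a source that cannot be fixed rather than a substitute for well-formed input.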
If the file is meant to contain binary data, the binary portions should be encoded (e.g. using base64).
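For the producer side, a rough sketch of that base64 suggestion using the core MIME::Base64 module might look like this. The variable names and the sample payload are made up for illustration, and it assumes the producer of the file can be changed, which the question says is not the case here.

```perl
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 decode_base64 );

# Hypothetical binary payload containing the problematic 0x1A byte.
my $binary = "it\x1As s";

# Encode before embedding in the XML; the empty second argument
# suppresses the line breaks encode_base64 would otherwise insert.
my $encoded = encode_base64( $binary, '' );
my $xml     = "<description>$encoded</description>";

# The consumer decodes the element text after parsing.
my $decoded = decode_base64($encoded);
```

Base64 output uses only letters, digits, `+`, `/` and `=`, all of which fall inside the Char production, so the resulting document no longer contains U+001A.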
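[1] 1A, 20 and 73 are all less than 80, so each is a valid single-byte UTF-8 sequence.

[2] I tested using XML::LibXML rather than XML::LibXML::Reader, but I suspect the relevant difference is a difference in the version of XML::LibXML or libxml2.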