xml - Improper UTF-8 and LibXML::Reader


I have a large XML file from a remote source. Its declaration says 'utf-8', but `file` reports us-ascii.

    <?xml version="1.0" encoding="utf-8"?>...

    file -bi <file> reports application/xml; charset=us-ascii
    Encode::Guess reports utf8

Edit: There is code that reads in the file, which is the output of an LWP get. I have tried to force the encoding here, but that gives other errors about wide characters.

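    use IO::File;

    my $fh = IO::File->new;
    # Missing statement terminator added; die if the file cannot be opened.
    $fh->open( '<' . $filename ) or die qq(cannot open $filename: $!);
    $content = join '', <$fh>;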

I am using XML::LibXML::Reader:

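    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( string => $content )
        or die qq(cannot read content: $!);

    # Walk over each element whose name is stored in $template->{'item'}.
    while ( $reader->nextElement( $template->{'item'} ) ) {
        my $copy = $reader->copyCurrentNode(1);    # deep copy of the current node
        my $test = $copy->findvalue('description');
        ...    # other stuff with $copy
    }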

This works fine through part of the contents. However, there is what looks like invalid UTF-8 or malformed data, which gives an error halfway through.
(Note: XML::Bare processes the whole XML 'fine' because it is more forgiving, but the file is at the limit of memory size, so I need an XML parser with a smaller memory footprint.)

    Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
    Bytes: 0x1A 0x73 0x20 0x73

If I look in vim at the point after the last success, I can see:

    ^Z  or <^Z>  26,  Hex 1a,  Octal 032    (from :ascii in vim)

I have looked here on SO to try to ensure the content is at least valid UTF-8, since I can't get the origin fixed, and I have been trying...

    use Encode qw( encode decode );

    $octets  = decode( 'UTF-8', $content, Encode::FB_DEFAULT );
    $content = encode( 'UTF-8', $octets, Encode::FB_CROAK );

But I still get the same error. I am happy to skip the parts that are invalid UTF-8, but the whole parser dies, and I can't see a way to carry on processing later (which I believe is what is supposed to happen with XML parsing).

My question is: what is the best way to guarantee UTF-8 (assuming I can't get the file changed), or is there a method I should use to get around the error? (I could regex that particular character out, but I'm assuming there may be other similar issues later, so it feels clunky.)

The error message is misleading; the problem has nothing to do with encoding[1]. In fact, the error I receive is the following[2]:

    :1: parser error : PCDATA invalid Char value 26

From the XML spec,

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

U+001A may not legally appear in XML files, not even as a character reference (&#x1A;).

Characters referred to using character references must match the production for Char.
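As a rough illustration of the Char production quoted above, the sketch below drops every character that does not match it before the text is handed to a parser. The sub name strip_invalid_xml_chars is hypothetical, and the sketch assumes the text has already been decoded into Perl characters. Silently discarding characters is a workaround, not a fix for the source data, so whether it is acceptable depends on what the file is supposed to contain.

```perl
use strict;
use warnings;

# Hypothetical helper: remove characters outside the XML 1.0 Char production.
# Assumes $text holds decoded characters, not raw bytes.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

# Made-up sample containing U+001A between "it" and "s".
my $dirty = "it\x{1A}s a test";
my $clean = strip_invalid_xml_chars($dirty);
print "$clean\n";    # prints "its a test"
```

Tab, line feed and carriage return pass through unchanged because they are listed explicitly in the production.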

If the file is to contain binary data, the binary portions should be encoded (e.g. using base64).
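For whoever produces such a file, a minimal sketch of the base64 approach using the core MIME::Base64 module might look like this. The payload, the element name blob and the encoding attribute are all invented for the example; nothing in the original file is known to use them.

```perl
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 decode_base64 );

# Hypothetical payload containing bytes (0x1A, 0x00) that XML does not allow.
my $binary = "\x1A\x00\xFF raw bytes";

# The empty second argument suppresses the line breaks encode_base64 adds by default.
my $encoded = encode_base64( $binary, '' );

# The element and attribute names are made up for this example.
my $xml = qq(<item><blob encoding="base64">$encoded</blob></item>);
print "$xml\n";

# The consumer reverses the encoding after parsing.
my $decoded = decode_base64($encoded);
print "round trip ok\n" if $decoded eq $binary;
```

The base64 alphabet consists only of ASCII letters, digits, '+', '/' and '=', so the encoded text always satisfies the Char production.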


  1. 0x1A, 0x20 and 0x73 are all less than 0x80, so each is a plain ASCII byte and valid UTF-8 on its own.

  2. I tested using XML::LibXML rather than XML::LibXML::Reader, but I suspect the relevant difference is a difference in the version of XML::LibXML or libxml2.
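The point of footnote 1 can be checked with a short sketch: the four bytes from the error message decode as UTF-8 without complaint, which is also why the decode/encode round trip in the question changes nothing. The variable names here are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The four bytes reported in the libxml2 error message.
my $bytes = "\x1A\x73\x20\x73";

# decode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $bytes;

# FB_CROAK makes decode() die on malformed UTF-8; it does not die here.
my $chars = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ) };

print defined $chars ? "valid UTF-8\n" : "invalid UTF-8: $@";

# The decoded string still contains U+001A, which XML forbids.
print "contains U+001A\n" if defined $chars && $chars =~ /\x{1A}/;
```

Both messages are printed: the data is valid UTF-8, yet it is still not valid XML.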

