This simple code segment shows an issue I am having with JSON::XS encoding in Perl:

#!/usr/bin/perl
use strict;
use warnings;
use JSON::XS; 
use utf8;
binmode STDOUT, ":encoding(utf8)";
my (%data);
$data{code} = "Gewürztraminer";
print "data{code} = " . $data{code} . "\n";
my $json_text = encode_json \%data;
print $json_text . "\n";

The output this yields is:

johnnyb@boogie:~/Projects/repos > ./jsontest.pl 
data{code} = Gewürztraminer
{"code":"Gewürztraminer"}

Now if I comment out the binmode line above I get:

johnnyb@boogie:~/Projects/repos > ./jsontest.pl 
data{code} = Gew�rztraminer
{"code":"Gewürztraminer"}

What is happening here? Note that I am trying to fix this behavior in a Perl CGI script in which binmode cannot be used, but the JSON stream it returns always contains the "Ã¼" characters shown above. How do I debug this? What am I missing?

encode_json (short for JSON::XS->new->utf8->encode) already encodes its result as UTF-8. Printing that result to STDOUT, to which you've added an encoding layer, encodes it a second time. Effectively, you are doing encode_utf8(encode_utf8($unencoded_json)).
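
The double encoding can be reproduced without JSON at all. The following is a minimal sketch using the core Encode module; the variable names and the hex_bytes helper are made up for illustration, and the second encode_utf8 call stands in for what the :encoding(utf8) layer does on print.

```perl
use strict;
use warnings;
use utf8;                                 # The source contains a literal "ü".
use Encode qw( encode_utf8 );

# Hypothetical helper: render a byte string as hex so the result
# does not depend on how the terminal interprets the bytes.
sub hex_bytes { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

my $text  = "Gewürztraminer";             # Character string.
my $once  = encode_utf8($text);           # What encode_json does to the data.
my $twice = encode_utf8($once);           # What the encoding layer then does.

print length($text),  " characters\n";
print length($once),  " bytes after one encode:  ", hex_bytes(substr($once,  3, 2)), "\n";
print length($twice), " bytes after two encodes: ", hex_bytes(substr($twice, 3, 4)), "\n";
```

The two bytes C3 BC are the UTF-8 encoding of ü. After the second pass they become C3 83 C2 BC, which a UTF-8 terminal displays as Ã¼.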

Solution 1

use open ':std', ':encoding(utf8)';  # Defaults
binmode STDOUT;                      # Override defaults
print encode_json(\%data);

Solution 2

use open ':std', ':encoding(utf8)';    # Defaults
print JSON::XS->new->encode(\%data);   # Or to_json from JSON.pm

Solution 3

The following works with any encoding on STDOUT by using \u escapes for non-ASCII:

print JSON::XS->new->ascii->encode(\%data);
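
To see what the ascii option produces, here is a small sketch. It assumes JSON::PP, the pure-Perl core module that offers the same new/ascii/encode interface, as a stand-in so the example runs where JSON::XS is not installed; swapping in JSON::XS should behave the same way.

```perl
use strict;
use warnings;
use utf8;                                 # The source contains a literal "ü".
use JSON::PP ();                          # Assumption: stand-in for JSON::XS.

my %data = ( code => "Gewürztraminer" );

# With ->ascii, every non-ASCII character is written as a \uXXXX escape,
# so the result is the same string whether or not it is encoded afterwards.
my $json = JSON::PP->new->ascii->encode(\%data);

print "$json\n";
```

Because the output is pure ASCII, an extra UTF-8 encoding layer on STDOUT leaves it unchanged.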

A complete CGI script along the lines of Solution 1 (STDOUT left in binary mode, UTF-8 produced by encode_json):

use utf8;                      # Encoding of source code.
use open ':encoding(UTF-8)';   # Default encoding of file handles.

BEGIN {
   binmode STDIN;                       # Usually does nothing on non-Windows.
   binmode STDOUT;                      # Usually does nothing on non-Windows.
   binmode STDERR, ':encoding(UTF-8)';  # For text sent to the log file.
}

use CGI      qw( -utf8 );
use JSON::XS qw( encode_json );

my $cgi = CGI->new();
my $data = { code => "Gewürztraminer" };
print $cgi->header('application/json');
print encode_json($data);

Yes - that does work and, as stated below, I think I am encoding data that is already UTF8. Not sure how to get around that. Unfortunately STDOUT really has no meaning in a CGI script (?) so I'm not sure I can use the method above. – Omortis Jul 2, 2015 at 1:04

You're not making any sense. You wouldn't have the problem if you didn't use STDOUT. And yes, CGI does use STDOUT. – ikegami Jul 2, 2015 at 1:21

I have not been able to figure out how to use a CGI object as a filehandle to hand to binmode. The encoded json data must be printed to the CGI object to be returned to the jQuery making the AJAX request. If you know how to do this I am all ears and eyes. – Omortis Jul 2, 2015 at 1:46

What are you talking about? We've already established that CGI uses STDOUT, so you use exactly the code I posted. – ikegami Jul 2, 2015 at 2:40

@Omortis CGI.pm doesn't send anything anywhere. Your code needs to print stuff. CGI.pm just gives you crappy old functions that generate HTML for you. Also keep in mind that CGI.pm is not in core any more. :) – simbabque Jul 2, 2015 at 7:46

JSON::XS encodes its output into octets: encode_json returns the UTF-8 encoded byte representation of the text, not a Unicode character string. For more details see perlunicode. In short, the content of $json_text is ready to be transmitted through an I/O handle in binary mode. If you assign $data{code} from a string literal under use utf8;, the scalar holds a string of Unicode characters. (It happens to be stored internally as UTF-8, but that is an implementation detail you should not rely on. The use utf8; pragma means only that the source code is encoded as UTF-8, nothing else.) If you want to print both scalars through a UTF-8 encoding I/O handle, you have to convert $json_text back into a character string first:

use strict;
use warnings;
use JSON::XS; 
use utf8;
binmode STDOUT, ":encoding(utf8)";
my (%data);
$data{code} = "Gewürztraminer";
print "data{code} = " . $data{code} . "\n";
my $json_text = encode_json \%data;
utf8::decode($json_text);
print $json_text . "\n";
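
The difference between the octets and the decoded character string shows up in their lengths. A minimal sketch, again assuming JSON::PP as a core stand-in for JSON::XS (both export encode_json); the $octets and $chars names are illustrative:

```perl
use strict;
use warnings;
use utf8;                                 # The source contains a literal "ü".
use JSON::PP qw( encode_json );           # Assumption: stand-in for JSON::XS.

my %data = ( code => "Gewürztraminer" );

my $octets = encode_json(\%data);         # UTF-8 encoded bytes.
my $chars  = $octets;
utf8::decode($chars)                      # Decodes in place; false on invalid UTF-8.
    or die "encode_json output was not valid UTF-8";

printf "octets: %d, characters: %d\n", length($octets), length($chars);
```

The octet string is one element longer than the character string because ü occupies two bytes in UTF-8 but counts as a single character once decoded.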

Or use it as intended and print the encoded string through an I/O handle in binary mode:

my $json_text = encode_json \%data;
binmode STDOUT;
print $json_text . "\n";

You can run

print((utf8::is_utf8($json_text) ? "UTF8" : "OCTETS") . "\n");

to see what is inside.

Rather wasteful to ask to UTF-8 encode it only to follow up with a decode and yet another encode! – ikegami Jul 1, 2015 at 20:24

Never ever use is_utf8. If you need to force the use of one of the two storage formats, use utf8::upgrade or utf8::downgrade, neither of which are needed here. – ikegami Jul 2, 2015 at 2:39

@Omortis: You read data from DBI? Here we go. Check if your data read from DBI is flagged internally as Unicode using is_utf8 and if not then use utf8::upgrade. It is a usual bug in DBI drivers. Then use the approach in @ikegami's answer and choose whether you would like to have IO in binary or utf8 mode. – Hynek -Pichi- Vychodil Jul 2, 2015 at 8:33
