
Unicode characters

I'm having trouble returning unicode data in the JSON response. Eg
{
  "_links" : {
    "collection" : {
      "href" : "http://localhost/api/users"
    },
    "self" : {
      "href" : "http://localhost/api/users/ARIEL"
    }
  },
  "person" : "??? Liu"
}

The three question marks ??? in the 'person' data should represent three Chinese characters.

I am ensuring that UTF8 mode is on in HTTP_MCP, and I can see the UTF8 data showing properly in variables as I step through the code.

Are unicode characters even allowed in JSON format? - or do they need to be encoded at some stage?

Cheers, M@

Comments

  • Matt,

    This is not a JSON problem. It's an OECGI issue. I had an extended discussion with Bryan Shumsky about this last year when we first ran into a similar problem. I won't bore you with the intricate details, but it boils down to two issues:
    1. Engines that are started by the OECGI are in ANSI mode. This is why you have to use UTF8 functions in your code to handle your data.
    2. The CGI protocol is byte-based and doesn't understand wide characters.

    Hence, even if the first item were resolved, the second item is still a show-stopper. The workaround, as it were, is to Base64 encode the data and decode it in the client code. Not ideal, but it will get you there.
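The Base64 round trip described above is language-agnostic; here is a minimal sketch in Python for illustration (the real server side would be BASIC+ calling whatever encoder the stack provides, so the calls here are stand-ins, not the actual implementation):

```python
import base64
import json

# Server side (sketch): build the JSON, then Base64-encode its UTF-8 bytes
# so only 7-bit ASCII travels through the byte-based CGI pipe.
payload = json.dumps({"person": "柳茹楠 Liu"}, ensure_ascii=False)
wire = base64.b64encode(payload.encode("utf-8")).decode("ascii")

# Client side: decode back to UTF-8 text, then parse as JSON.
decoded = json.loads(base64.b64decode(wire).decode("utf-8"))
print(decoded["person"])  # 柳茹楠 Liu
```

The wide characters never touch the transport; they are reconstituted only after the client decodes the payload.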
  • Ahh, I see. Could we /u escape any unicode characters? Eg
    {
      "_links" : {
        "collection" : {
          "href" : "http://localhost/api/users"
        },
        "self" : {
          "href" : "http://localhost/api/users/ARIEL"
        }
      },
      "person" : "/u67F3/u8339/u6960 Liu"
    }

    Would a JSON parser on the client naturally interpret that?
    Cheers, M@
  • Matt,

    I don't know what any given JSON parser might do, but I am dubious that any would treat that as anything other than plain text. How would you expect it to know the difference?

    I think you will need to encode the entire payload and then decode it before handing it off to your JSON parser.
  • I would hope so, according to http://www.json.org/
    A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.



    Actually, that looks like I could just convert those UTF8 characters to UTF16?
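Regarding whether a client-side parser would naturally interpret the escapes: standard JSON parsers do, provided the escape uses a backslash. A quick Python illustration of both forms (any conforming parser behaves the same way):

```python
import json

# A conforming parser interprets \uXXXX escapes (note the backslash):
doc = json.loads('{"person": "\\u67F3\\u8339\\u6960 Liu"}')
print(doc["person"])  # 柳茹楠 Liu

# A forward slash is not an escape, so /uXXXX stays literal text:
literal = json.loads('{"person": "/u67F3 Liu"}')
print(literal["person"])  # /u67F3 Liu
```

Note that characters outside the Basic Multilingual Plane need a surrogate pair of two \uXXXX escapes, though the characters here are all within the BMP.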
  • Matt,

    Certainly JSON itself can handle Unicode characters, assuming they could be passed through successfully (but OECGI doesn't...so we move on).

    I wasn't being precise enough in my response because I was focused on your example. First, you were using a forward slash and I think you need to use a backslash. Second, you were trying to encode parts of the data rather than the entire data.

    Otherwise, I think it ought to work.
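Encoding the entire payload as suggested above is exactly what ASCII-safe JSON serialization does. A Python sketch for illustration (json.dumps escapes every non-ASCII character when ensure_ascii is left at its default):

```python
import json

data = {"Name": "Māori", "person": "柳茹楠 Liu"}

# ensure_ascii=True (the default) escapes every non-ASCII character,
# so the whole payload is 7-bit ASCII and survives a byte-based pipe.
ascii_json = json.dumps(data, ensure_ascii=True)
print(ascii_json)  # {"Name": "M\u0101ori", "person": "\u67f3\u8339\u6960 Liu"}
```

The resulting string contains nothing above 0x7F, so an ANSI-only transport cannot mangle it.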
  • edited March 2018
    Ah, of course - my mistakes, thanks. So it seems \u encoding of unicode characters is the way to go if OECGI can only work with ANSI.

    I'd like to return a \u encoded JSON response. HTTP_JSON_SERVICES() uses the SETVALUE and STRINGIFY services in SRP_JSON(), but it seems SETVALUE is also ANSI only. Eg,

    Call SRP_JSON( handle, 'SETVALUE', 'Name', 'Māori', 'STRING')
    json = SRP_JSON( handle, 'STRINGIFY') ;* = {"Name":"M?ori"}


    Trying to \u escape the text myself, but STRINGIFY will just return double-slashes as it encodes the '\' literally:
    Call SRP_JSON( handle, 'SETVALUE', 'Name', 'M\u0101ori', 'STRING')
    json = SRP_JSON( handle, 'STRINGIFY') ;* = {"Name":"M\\u0101ori"}

    I suppose I could then swap '\\u' for '\u' in the result.

    The PARSE service seems to interpret a \u encoded string and store the unicode characters internally:
    Call SRP_JSON( handle, 'PARSE', '{"Name":"M\u0101ori"}' )
    json = SRP_JSON( handle, 'STRINGIFY') ;* = {"Name":"Māori"}


    Would it be possible to enhance SETVALUE (and NEW, ADDVALUE) to retain Unicode characters in UTF8 mode? And allow STRINGIFY to optionally \u encode them? That would save me from swapping '\\u' for '\u' in the response :).

    Cheers, M@
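The double-backslash swap described above can be sketched in Python for illustration (the literal strings mimic what a \u-unaware stringifier emits; the actual fix would be done in BASIC+):

```python
import json

# What a \u-unaware stringifier emits: the backslash itself got escaped.
stringified = '{"Name":"M\\\\u0101ori"}'    # i.e. {"Name":"M\\u0101ori"}

# Naive post-pass: turn the doubled backslash back into a real escape.
# Caution: this would also corrupt data that genuinely contains "\u".
fixed = stringified.replace('\\\\u', '\\u')  # i.e. {"Name":"M\u0101ori"}

print(json.loads(fixed)["Name"])  # Māori
```

The caveat in the comment is why a swap like this is fragile; having the stringifier emit real escapes itself is the cleaner path.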
  • Matt,

    I believe I have doubly good news for you. First, we have an update to the SRP Utilities that will fix SRP JSON so that it will support Unicode characters properly when using the SETVALUE method. This ended up being a problem with the underlying data type we were using in the low-level API calls. This also affects (or fixes) other utilities such as SRP List and SRP HashTable...even though no one has, as of yet, reported problems with those utilities. If you would like to test, you can download the latest pre-release from this link:

    SRP Utilities RDK 1.5.7 RC2

    The other bit of good news is that we think we were wrong in our assertion that OECGI does not support Unicode. Well, perhaps it technically doesn't, but we no longer believe it interferes with wide characters. The upshot of this is that you don't have to worry about encoding your characters using \u#### in the response. Just send your characters as they exist in the database and you should be good to go. Consequently, we will not be working on support for the backslash encoding in SRP JSON since it is moot.

    I will likely document our findings in an upcoming blog article, but I'll provide here a short version of why we were confused about OECGI and Unicode support. It might be helpful to you and others. We have a web app that has Unicode data in the database. We noticed problems with the representation in the client UI right away. For instance, we would see the following:
    T�te-�-T�te
    But in the database it looked like this:
    Tête-à-Tête
    So you can imagine why we automatically assumed this was a Unicode issue. We tried turning on UTF8 mode at the beginning of our API but to no avail. We still got the funny characters coming back. I consulted with Bryan and he assumed that this must be a problem with the OECGI protocol itself...although he probably relied upon my test results rather than any internal testing on his end.

    So I had hoped that the latest SRP JSON would fix the above issue. It didn't. However, I discovered that other data which included Unicode did come through properly. I did a little more testing by copying the above data and pasting it in another record. In the original record it still didn't work but in the new record it worked as expected. Very strange that the same data would have different results, even in the same API response. I had to get Kevin to help me figure this out.

    By inspecting the data in the Debugger, Kevin was able to figure out that the special characters (e.g., ê) in the original record must have been saved as ANSI. We were only seeing a single byte, such as 0xEA (Char(234)), which displays as ê. We discovered that saving the record again with OI in UTF8 mode properly re-encodes the data as 0xC3 0xAA. Afterwards, the data comes through as expected. It never occurred to us that the customer's data might have extended characters that were entered before UTF8 support was added to OI. However, upon reflection, this makes total sense, as this application has been around for a few decades. Either way, the fix to SRP JSON was still necessary since it was definitely not storing Unicode data properly.

    Please let us know how your testing goes.
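The byte-level difference Kevin spotted can be reproduced outside OI; here is a Python sketch for illustration, using Latin-1 as a stand-in for the ANSI code page:

```python
# One character, two storage encodings (Latin-1 standing in for ANSI):
ansi = "ê".encode("latin-1")   # b'\xea'      -> the single byte 0xEA
utf8 = "ê".encode("utf-8")     # b'\xc3\xaa'  -> the two bytes 0xC3 0xAA

# Re-saving the old ANSI byte while the engine is in UTF8 mode amounts to:
repaired = ansi.decode("latin-1").encode("utf-8")
print(repaired == utf8)  # True
```

Records written before UTF8 support hold the one-byte form, which is why the same character behaved differently in old and new records.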
  • Hi Don,

    Great! My unit testing of SRP_JSON with UTF8 data is working well. The final response is the UTF8 text one would expect.

    It took me a while to work out that I need to include 'charset=utf-8' in the Content-Type header to get this showing properly in the response :).

    Cheers, M@
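The header fix Matt describes can be sketched as follows (Python for illustration; the surrounding response plumbing is assumed, but the header values are the standard ones):

```python
body = '{"person": "柳茹楠 Liu"}'.encode("utf-8")

# Declaring the charset tells the client how to decode the body bytes;
# without it, some clients fall back to a legacy encoding and show mojibake.
headers = [
    ("Content-Type", "application/json; charset=utf-8"),
    ("Content-Length", str(len(body))),
]
```

Strictly speaking, JSON is defined to be UTF-8, but stating the charset explicitly removes any guesswork by intermediaries and older clients.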
  • Matt,

    I'm relieved and glad to hear that this is working for you. However, I am surprised you needed to set the Content-Type response header that way. I have not needed to do that myself, although I think it is probably best practice to do what you are doing.
  • Hi Don,

    Yes, I couldn't figure out why OECGI appeared to be working in UTF8 mode for you and not me - maybe there are some OECGI settings I'm missing, like a specific UTF8 port number (you may have seen my last posting on the Rev forum to Bryan).

    Before setting the Content-Type header, the response I was getting for 'Tête-à-Tête' was
    Tête-à-Tête
    This is equivalent to the result of the Ansi_Utf8( 'Tête-à-Tête') function. Putting a UTF8 string through an ANSI-to-UTF8 conversion doesn't make sense to me!?

    Cheers, M@
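The garbling Matt shows is the classic double-encoding signature, reproducible in Python for illustration (Latin-1 standing in for the ANSI code page):

```python
good = "Tête-à-Tête"

# The UTF-8 bytes of the string, re-read as ANSI (Latin-1 here),
# yield exactly the doubled-up characters shown above:
shown = good.encode("utf-8").decode("latin-1")
print(shown.startswith("TÃª"))  # True
```

In other words, somewhere in the pipeline the already-UTF-8 bytes were treated as ANSI and converted to UTF-8 a second time, which is why the output matched Ansi_Utf8 applied to the good string.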
  • Matt,

    Yes, I saw the post to Bryan. Are you setting UTF8 mode in your HTTP service before handling your data?
  • Yes, I call SetUTF8( 1) in HTTP_MCP(), actually. Everything seems hunky-dory in UTF8 right up to the Return Response in HTTP_MCP().