Decoding byte-escaped \xFF strings in PHP
Friday, 31 Jul 2009
This week I came across a multi-byte (UTF-8) string that had been encoded byte by byte using the \x escape, for example:
His nickname was \xE2\x80\x98the Angel\xE2\x80\x99, which is kind of a clich\xC3\xA9 in my opinion.
This represents the following string in UTF-8:
His nickname was ‘the Angel’, which is kind of a cliché in my opinion.
As you can see, each multi-byte character is encoded byte by byte using the \x escape. I was surprised to find that there is no PHP function for decoding this, and that you're consistently pointed in the direction of utf8_decode, which really doesn't help here. The solution turned out to be simpler to describe than to implement:
Replace all \x patterns by their byte representations through a regular expression, which I did with the following code:
preg_replace("#(\\\x[0-9A-F]{2})#e", "chr(hexdec('\\1'))", $string)
The most surprising thing (and the hardest to figure out) for me was the \\\x in the regex, which still makes no sense to me. I was simply expecting \\x to do the job. Any light on this is appreciated.
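One likely explanation for the \\\x puzzle is that the pattern is unescaped twice: first by PHP's double-quoted string parser, then by PCRE. In a double-quoted string, \\ collapses to a single backslash and \x followed by a non-hex character is left alone, so "\\\x" reaches PCRE as \\x, which means a literal backslash followed by x. With "\\x", PCRE receives only \x, which it appears to treat as its own hex escape rather than as a literal backslash. The sketch below is a rough illustration of that, and of the same replacement written with preg_replace_callback, since the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. The function name decode_hex_escapes is made up for this example, and accepting lowercase hex digits is an addition that the original pattern did not have.

```php
<?php
// Hypothetical helper (the name is not from the original post).
// Decodes literal \xNN sequences into the raw bytes they stand for.
function decode_hex_escapes(string $input): string
{
    // Single-quoted string: each '\\' becomes one backslash, so PCRE
    // receives \\x([0-9A-Fa-f]{2}): a literal backslash, "x", two hex digits.
    $pattern = '/\\\\x([0-9A-Fa-f]{2})/';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // $m[1] holds only the two hex digits, e.g. "E2".
            return chr(hexdec($m[1]));
        },
        $input
    );
}

// The double-quoted pattern from the post and a single-quoted equivalent
// produce the same characters for PCRE to parse.
$same = "#(\\\x[0-9A-F]{2})#" === '#(\\\\x[0-9A-F]{2})#';
var_dump($same);

// Single quotes keep the \x sequences as literal text, as in the input data.
$encoded = 'His nickname was \xE2\x80\x98the Angel\xE2\x80\x99, '
         . 'which is kind of a clich\xC3\xA9 in my opinion.';
echo decode_hex_escapes($encoded), "\n";
```

Writing the pattern in single quotes keeps the two escaping layers easier to count, because PHP then only interprets \\ and \' and passes everything else through unchanged.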
Hope this helps anyone facing the same decoding issue!


