Decoding byte-escaped \xFF strings in PHP

Friday, 31 Jul 2009
This week I came across a multi-byte (UTF-8) string in which every byte had been escaped individually with \x, for example:
His nickname was \xE2\x80\x98the Angel\xE2\x80\x99,
which is kind of a clich\xC3\xA9 in my opinion.
In UTF-8, this represents the string:
His nickname was ‘the Angel’,
which is kind of a cliché in my opinion.
As you can see, each byte of a multi-byte character is encoded separately with the \x escape. I was surprised not to find a PHP function that decodes this; searches consistently point you to utf8_decode, which doesn't help here because it converts UTF-8 to ISO-8859-1 and leaves the \x sequences untouched. The solution turned out to be simpler in concept than in implementation:

Replace every \xNN sequence with the byte it represents, using a regular expression. I did that with the following code:
preg_replace_callback("#\\\x([0-9A-Fa-f]{2})#", function ($m) { return chr(hexdec($m[1])); }, $string)
The most surprising thing (and hardest to figure out) for me was the \\\x in the regex, which still makes no sense to me. I was simply expecting \\x to do the job. Any light on this is appreciated.
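To try the replacement on the sample above, here is a minimal self-contained sketch. It assumes PHP 7 or later and uses preg_replace_callback (the /e modifier was removed in PHP 7.0). The function name decode_hex_escapes is made up for illustration, and the pattern is written in single quotes, so it needs four backslashes.

```php
<?php
// Hypothetical helper name; it is not part of PHP itself.
function decode_hex_escapes(string $input): string
{
    // Single-quoted pattern: '\\\\x' reaches PCRE as \\x,
    // which matches a literal backslash followed by "x".
    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',
        function (array $m): string {
            // $m[1] holds only the two hex digits, e.g. "E2".
            return chr(hexdec($m[1]));
        },
        $input
    );
}

// Single quotes keep each \x sequence as literal text,
// the way it would arrive from a file or an API response.
$escaped = 'His nickname was \xE2\x80\x98the Angel\xE2\x80\x99,' . "\n"
         . 'which is kind of a clich\xC3\xA9 in my opinion.';

$decoded = decode_hex_escapes($escaped);
echo $decoded, "\n";
```

For input like this, which contains nothing but \xNN escapes, the built-in stripcslashes() should give the same result. It also interprets other C-style escapes such as \n and \t, so it is only a drop-in alternative when that is acceptable.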

Hope this helps anyone facing the same decoding issue!
Posted in: php encodings

Comments (1)

10-02-2010, 04:07
awm
The \\\x makes sense. In order to match \x, the regular expression needs to be \\x (so that \x is not interpreted), and "\\\x" will evaluate to \\x when the string is parsed. "\\\\x" will also work.
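The two layers of escaping described in this comment can be checked directly. The sketch below is only an illustration (the variable names are made up): it compares what PHP's double-quoted string parser hands to PCRE for two, three and four backslashes. It also shows that the three-backslash form only works while the character after \x is not a hex digit.

```php
<?php
// PHP's double-quoted string parser runs first; PCRE only sees the result.
$three = "\\\x[0-9A-F]{2}";   // "\\" -> "\", then "\x[" is not a valid PHP hex escape, so it is kept
$four  = "\\\\x[0-9A-F]{2}";  // "\\" -> "\" twice, followed by a plain "x"
$two   = "\\x[0-9A-F]{2}";    // "\\" -> "\", so PCRE receives \x, its own hex-escape syntax

$subject = '\xE2';            // single quotes: a literal backslash, "x", "E", "2"

var_dump($three === $four);                      // bool(true): both are \\x[0-9A-F]{2}
var_dump(preg_match("#$three#", $subject));      // int(1): \\ matches the literal backslash
var_dump(preg_match("#$two#", $subject) === 1);  // bool(false): no literal backslash is matched

// Caveat: if a hex digit follows, PHP consumes "\x41" itself and yields "A".
$risky = "\\\x41";
var_dump($risky === '\A');                       // bool(true)
```

Writing four backslashes, or using a single-quoted pattern, avoids depending on what follows the x.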