I discovered recently the importance of proper collation of database tables. I inherited a proprietary CMS to manage. The default collation was latin1_swedish_ci. Apparently it’s because “The bloke who wrote it was co-head of a Swedish company“. The problem occurred when a form we had on our site began getting submissions with foreign characters. The database collation couldn’t accept the characters and was saving them as question marks (?).
“Serialization is the process of translating data structures or object state into a format that can be stored.” For example the array:
$returnValue = serialize(array('hello', 'world'));
This is what the above string means:
- There is an array that is 2 in length. a:2.
- The first item in the array has a key that is an integer with the value of 0. i:0.
- The value for that item is a string that is 5 characters long, which is “hello”. s:5.
- The second item in the array has a key that is an integer with the value of 1. i:1.
- The value for that item is a string that is 5 characters long, which is “world”. s:5.
An unserialize offset error can occur when the string count in the serialized data does not match the length of the string being saved. so in the above example that would look like this:
Notice the number ’4′, while there are really 5 characters in the world ‘hello’.
So the question is, why would the offset happen when a ? replaces a foreign character?
To understand why, you need to dig into how UTF-8 works and things will become clear.
The UTF-8 value of ‘?’ is ’3f’, while the value for ‘Æ’ is ‘c3 86′.
'?' translates into
'Æ' translates into
s:2:"Æ";. Notice the 2 replacing the 1 in the string length. So basically, what’s happening is that when php serializes the data it is storing the foreign character as a double the length but when it’s passed to MySQL, when the table isn’t formatted for UTF-8, the database converts the character to a ?, which is then stored as a single character. But the serialization length is not updated, so when you go and unserialize the data there is an offset error.
How to resolve the problem
There are several articles that provide solutions. The most popular is to use the base64_encode() function around the serialized data. This will prevent the data from getting corrupted since base64 converts the data to ASCII which any collation can take.
//to safely serialize
$safe_string_to_store = base64_encode(serialize($multidimensional_array));
$array_restored_from_db = unserialize(base64_decode($encoded_serialized_string));
If you don’t have access to your database, or don’t want to fool with it, this is a great solution. You can also set your table collation to utf8_general_ci or utf8_unicode_ci and that should solve your problem as well (that’s what we did).
But what if you already have bad data in your database, like we had, and you’re getting the horrid ‘Notice: unserialize() [function.unserialize]: Error at Offset’ error. When you get this notice, chances are you’re not getting all your data either…
Here’s what you do:
$fixed_serialized_data = preg_replace_callback ( '!s:(\d+):"(.*?)";!',
return ($match == strlen($match)) ? $match : 's:' . strlen($match) . ':"' . $match . '";';
This will search out the strings, recount the length, and replace the string length with the correct value. Unfortunately it cannot recover what the original foreign character was, but at least the rest of your data will load.
I got the original code from StackOverflow, but since PHP 5.5 the /e modifier in preg_replace() has been deprecated completely and the original preg_match statement suggested will error out. So I rewrote it with preg_replace_callback().