PHP(6): Character encoding and other new features

April 13, 2010

mb_convert_encoding - Convert character encoding

The Internet is converging on UTF-8 as a common, universal character encoding.

Set your Apache environment to UTF-8 by adding ‘AddDefaultCharset utf-8’ to your .htaccess. If you do not use Apache, add ‘default_charset = "utf-8"’ to your php.ini. You only need one of them (not both); PHP will use the Apache setting where needed. And, of course, your HTML header: ‘<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />’.
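If you cannot touch either config file, the same declaration can be made from inside a script. A minimal sketch, assuming it runs before any output is sent; the explicit header() call is an addition for illustration, not something the settings above require:

```php
<?php
// Per-script equivalent of the php.ini default_charset setting.
ini_set('default_charset', 'UTF-8');

// Send the charset explicitly as well; header() only works before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo ini_get('default_charset'), "\n";
```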

PHP 5 started handling the problem of differing character sets with a few mb_ (multi-byte) functions, and PHP 6 will get serious about it. All 128 ASCII characters fit in 1 byte: 1 character = 1 byte. Now what? If you know about IBM’s dumb idea of “code pages” to handle different languages, you know people using two different code pages could not communicate with each other, and you know some form of Unicode is the only answer. The internet, the world, is finally about to settle on UTF-8.

the problem is serious

Ever see those “smart quotes”? Those are DUMB(!) quotes! Non-ASCII characters! You can end up with UTF-8 encoded characters in your Latin-1 table, or vice versa, which means “you’ve just landed in character set hell!”

Having mismatched character sets in your tables means you can potentially create SQL dumps of your data which are not restorable. This can happen silently: for example, mysqldump will create the dump with no errors or warnings, but you won’t know that it is full of unreadable garbage and, perhaps, not even restorable.

In Sam Ruby’s i18n Survival Guide, he recommends using the string Iñtërnâtiônàlizætiøn for testing. Counted by eye, you can see it contains 20 characters:
Iñtërnâtiônàlizætiøn
12345678901234567890
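
To see what PHP makes of the test string, compare strlen with mb_strlen. A quick sketch, assuming the script file is saved as UTF-8, the accented letters are single precomposed characters, and the mbstring extension is loaded:

```php
<?php
$s = 'Iñtërnâtiônàlizætiøn';

// strlen counts bytes: 13 ASCII letters plus 7 accented letters at 2 bytes each.
echo strlen($s), "\n";              // 27

// mb_strlen counts characters once it is told the encoding.
echo mb_strlen($s, 'UTF-8'), "\n";  // 20
```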

Take a Russian name like Aleksandra in the Latin alphabet, or Александра in the Russian alphabet. It looks like ten characters either way, but PHP’s strlen function counts bytes, not characters, and reports 20 for the second one when it is encoded as UTF-8.
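
Counting is not the only thing that goes wrong. Byte-based functions like substr can cut a character in half. A small sketch, assuming a UTF-8 source file and the mbstring extension:

```php
<?php
$name = 'Александра';

// substr works on bytes: 5 bytes is two whole letters plus half of the third.
$broken = substr($name, 0, 5);
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)

// mb_substr works on characters and keeps the string valid.
$first5 = mb_substr($name, 0, 5, 'UTF-8');
echo $first5, "\n";                              // Алекс
```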

Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no.

Unicode reserves room for over a million characters (code points up to U+10FFFF), so a single character can need more than two bytes. In UTF-8, a character takes between one and four bytes.
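
UTF-8 spends between one and four bytes on a character. A short sketch that prints the byte count for one character of each size; the bytes are written as hex escapes so the result does not depend on how the script file is saved:

```php
<?php
// One character each, from different parts of Unicode.
$samples = array(
    "A",                  // U+0041, plain ASCII
    "\xC3\x9F",           // U+00DF, German sharp s
    "\xE2\x82\xAC",       // U+20AC, euro sign
    "\xF0\x9D\x84\x9E",   // U+1D11E, musical G clef
);

foreach ($samples as $ch) {
    // strlen gives the UTF-8 byte count; bin2hex shows the bytes themselves.
    printf("%d byte(s): %s\n", strlen($ch), bin2hex($ch));
}
// Byte counts printed: 1, 2, 3, 4
```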

“PHP’s latest features, including core support for Unicode, make it even easier for you to write feature-filled PHP applications.” – IBM {http://www.ibm.com/developerworks/opensource/library/os-php-future/}

Features that are security risks will be removed in PHP 6, including:
* magic_quotes
* register_globals
* register_long_arrays {all $HTTP_*_VARS, ex: $HTTP_SERVER_VARS, $HTTP_GET_VARS, etc.}
* safe_mode
* the ereg functions
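
If your code still leans on these, the replacements are already there in PHP 5. A rough sketch of two common swaps; is_digits and get_param are hypothetical helper names made up for this example:

```php
<?php
// Instead of ereg('^[0-9]+$', $value): preg_match with delimiters.
// \z anchors at the very end, so a trailing newline does not slip through.
function is_digits($value) {
    return preg_match('/^[0-9]+\z/', $value) === 1;
}

// Instead of register_globals or $HTTP_GET_VARS: read the superglobal explicitly.
function get_param(array $source, $name, $default = '') {
    return isset($source[$name]) ? (string) $source[$name] : $default;
}

$page = get_param($_GET, 'page', '1');
var_dump(is_digits($page));
```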

mbstring predecessor

The mbstring extension was meant to provide a mechanism to override a large number of PHP’s string functions. But it was not enabled by default, and third-hand reports say it used to be pretty unstable. Analysis: it was a learning tool for the PHP project. Hopefully, it will enable them to do much better with PHP 6. PHP 6 should have native understanding of Unicode and default to UTF-8 for output, as well as a bunch of other stuff, building on the International Components for Unicode (ICU) project.
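
Since mbstring is optional, it is worth checking for it before relying on it. A minimal sketch, assuming you want UTF-8 as the default for every mb_ call:

```php
<?php
// mbstring is optional, so check before relying on it.
if (!extension_loaded('mbstring')) {
    fwrite(STDERR, "mbstring is not available\n");
    exit(1);
}

// Make UTF-8 the default for mb_* calls that omit the encoding argument.
mb_internal_encoding('UTF-8');

// "\xC3\xB1" is the letter n with tilde and "\xC3\xBA" is u with acute in UTF-8.
$word = "\xC3\xB1and\xC3\xBA";
echo mb_strtoupper($word), "\n";   // same word with every letter upper-cased
```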

mb_convert_encoding – Convert character encoding
Converts the character encoding of the string str to to_encoding, optionally from from_encoding.

string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )

/* Convert: string, TO, FROM */
$str = mb_convert_encoding($str, "UTF-8", "ISO-8859-1");

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "UTF-8", "auto");

$string = mb_convert_encoding($string, "ASCII", "HTML-ENTITIES");
Note that mb_convert_encoding($val, 'HTML-ENTITIES') does not escape ', ", <, >, or &.
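
In other words, HTML-ENTITIES conversion is not an escaping function. For markup characters, htmlspecialchars is the usual tool. A minimal sketch:

```php
<?php
$raw = '<a href="x">Tom & Jerry\'s</a>';

// ENT_QUOTES escapes both quote styles; the third argument names the input encoding.
$safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#039;s&lt;/a&gt;
```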

// Convert string from ISO-8859-1 to UTF-8
$string = mb_convert_encoding($string, "UTF-8", "ISO-8859-1");

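// Detect the source encoding (strict mode), then convert to UTF-8.
// Inside a class, declare this as a public method.
function encodeToUtf8($string) {
    return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}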
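
Detection is a guess, since many byte strings are valid in more than one encoding. A variant that avoids guessing, sketched here under the assumption that anything which is not valid UTF-8 comes from one known legacy encoding; to_utf8 and its $fallback parameter are hypothetical names, not part of mbstring:

```php
<?php
function to_utf8($string, $fallback = 'ISO-8859-1') {
    // Already valid UTF-8: leave it alone, so nothing gets double-encoded.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise treat the bytes as the assumed legacy encoding.
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";        // "cafe" with e-acute, in ISO-8859-1
$utf8   = "caf\xC3\xA9";    // the same word in UTF-8

echo bin2hex(to_utf8($latin1)), "\n";  // 636166c3a9
echo bin2hex(to_utf8($utf8)), "\n";    // 636166c3a9
```

The trade-off: a Latin-1 string that happens to be valid UTF-8 will be passed through untouched, so this only suits data where that is unlikely.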

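To check by hand whether a string is well-formed UTF-8, try the following function, which follows the byte layout of RFC 3629. It is a structural check only: it validates lead and continuation bytes, but does not catch every overlong sequence or UTF-16 surrogate, so compare it against mb_check_encoding() before relying on it.

function check_utf8($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c > 127) {                                // non-ASCII: must be a lead byte
            if ($c > 244) return false;                // 0xF5-0xFF never appear in UTF-8
            elseif ($c > 239) $bytes = 4;
            elseif ($c > 223) $bytes = 3;
            elseif ($c > 193) $bytes = 2;              // 0xC0 and 0xC1 are always overlong
            else return false;                         // stray continuation byte
            if (($i + $bytes) > $len) return false;    // sequence runs past the end
            while ($bytes > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false; // continuation must be 0x80-0xBF
                $bytes--;
            }
        }
    }
    return true;
}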
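
For comparison, two built-in ways to ask the same question. This is a sketch of alternatives, assuming the mbstring and PCRE extensions are available; is_valid_utf8 and is_valid_utf8_pcre are hypothetical wrapper names:

```php
<?php
function is_valid_utf8($str) {
    // mbstring's own validator.
    return mb_check_encoding($str, 'UTF-8');
}

function is_valid_utf8_pcre($str) {
    // With the /u modifier, preg_match returns false on invalid UTF-8 input.
    return preg_match('//u', $str) === 1;
}

$tests = array(
    'valid'     => "caf\xC3\xA9",  // complete 2-byte sequence
    'truncated' => "caf\xC3",      // lead byte with nothing after it
    'overlong'  => "\xC0\xAF",     // "/" encoded in 2 bytes, forbidden by RFC 3629
);

foreach ($tests as $label => $bytes) {
    printf("%-9s mb=%s pcre=%s\n", $label,
        var_export(is_valid_utf8($bytes), true),
        var_export(is_valid_utf8_pcre($bytes), true));
}
```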
