Monday, April 30, 2012

UTF-8 Regular Expressions in PHP

PHP itself doesn't know about character sets: its string functions treat every byte as one character. The PCRE engine, however, understands UTF-8. There's also mb_ereg_match(), but I prefer the PCRE functions (preg_...). Here's a piece of code to check whether your PHP was compiled with PCRE UTF-8 support:

$str = 'ありがとう';
echo "strlen('$str') = " . strlen($str) . "\n";
echo "preg_match_all('/./', '$str', \$matches) = " .
  preg_match_all('/./', $str, $matches) . "\n";
echo "preg_match_all('/(*UTF8)./u', '$str', \$matches) = " .
  preg_match_all('/(*UTF8)./u', $str, $matches) . "\n";

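With (*UTF8) at the start of the regular expression and the /u modifier, this reports the correct length of 5 characters. (In PHP, /u on its own already switches PCRE into UTF-8 mode; (*UTF8) asks for the same thing from inside the pattern.)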

strlen('ありがとう') = 15
preg_match_all('/./', 'ありがとう', $matches) = 15
preg_match_all('/(*UTF8)./u', 'ありがとう', $matches) = 5
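
Two details of this approach are easy to trip over: `.` does not match a newline unless you add the /s modifier, and with /u the preg_ functions return false when the subject is not valid UTF-8. Here is a minimal sketch of a character counter that handles both. It assumes PHP 7.1 or later for the type hints, and utf8_char_count() is a hypothetical helper name, not a built-in function:

```php
<?php
// Hypothetical helper (not part of PHP): count UTF-8 characters with PCRE.
// Returns null when the input is not valid UTF-8.
function utf8_char_count(string $str): ?int
{
    // /u turns on UTF-8 mode; /s lets . match newlines as well.
    $count = preg_match_all('/./su', $str);

    // With /u, preg_match_all() returns false for invalid UTF-8 input.
    return $count === false ? null : $count;
}

var_dump(utf8_char_count('ありがとう'));    // int(5)
var_dump(utf8_char_count("あり\nがとう"));  // int(6), the newline is counted
var_dump(utf8_char_count("\xC3\x28"));     // NULL, invalid byte sequence
```

Returning null instead of false keeps the "invalid input" case from being mistaken for a count of zero in a loose comparison.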

You can also use Unicode character properties, for example to match only letters (in any language):

// The WRONG way to do it, only works for ASCII:
preg_match_all('/[a-zA-Z]/', $str, $matches);

// This way it works with any language:
preg_match_all('/(*UTF8)\p{L}/u', $str, $matches);

You can see other Unicode character properties in the PHP Manual.
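
To make the difference concrete, here is a small sketch that pulls whole words out of mixed-script text with \p{L}+ and then narrows the match to one script with \p{Hiragana}. The sample string and variable names are made up for illustration:

```php
<?php
// Illustrative sample text mixing Latin, Cyrillic and Japanese, plus digits.
$text = 'Hello мир ありがとう 123';

// \p{L}+ matches runs of letters in any script; digits and spaces are skipped.
preg_match_all('/\p{L}+/u', $text, $words);
print_r($words[0]);  // Hello, мир, ありがとう

// A script property such as \p{Hiragana} restricts the match to one script.
$hiraganaCount = preg_match_all('/\p{Hiragana}/u', $text);
echo $hiraganaCount . "\n";  // 5
```

The same pattern without /u would fail to treat the multi-byte characters as single letters, so the modifier is needed whenever you use \p{...} on UTF-8 text.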
