Tuesday, January 10, 2012

Unicode-friendly PHP and MySQL

SkyHi @ Tuesday, January 10, 2012
Nowadays, full Unicode support is a must-have for good web applications; shuffling text around as single-byte Latin characters isn’t enough, even if you’re only targeting English speakers.
PHP’s UTF-8 support still isn’t tightly integrated, but it’s good enough if you’re careful. However, I’ve encountered a lot of conflicting information and examples that didn’t work for me, so here’s a summary of what I’m doing to make everything UTF-8-friendly (please note that this may not work for you, usual disclaimers, etc.).

Pages

The web pages need to use UTF-8, declared via an HTTP header:
Content-type: text/html; charset=utf-8
This may already be the default for your server setup, or can be specified via .htaccess or header(). You should also declare the encoding within the page’s markup:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

String handling

PHP’s standard string functions operate on bytes rather than characters, so they miscount multibyte UTF-8 text and can split a character in half. The mbstring extension is commonly installed and provides multibyte-aware equivalents, so use it if possible. Configure it at the start of your code:
mb_language('uni');
mb_internal_encoding('UTF-8');
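To make the difference concrete, here is a small sketch comparing the byte-oriented functions with their mbstring counterparts on a word containing one two-byte character. It assumes mbstring is loaded; the sample string and variable names are purely illustrative.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

// "naïve" written with explicit bytes so the example does not depend on
// how this file is saved: ï is the two-byte sequence C3 AF in UTF-8.
$word = "na\xC3\xAFve";

$byteLength = strlen($word);     // counts bytes: 6
$charLength = mb_strlen($word);  // counts characters: 5

// Cutting at a byte offset can split a character in half...
$byteCut = substr($word, 0, 3);
// ...whereas mb_substr() cuts on character boundaries.
$charCut = mb_substr($word, 0, 3);

printf("%d bytes, %d characters\n", $byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // false: ends mid-character
var_dump(mb_check_encoding($charCut, 'UTF-8')); // true
```

The byte-cut string is exactly the kind of broken sequence that later shows up as a replacement character in the browser.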
You can clean up invalid UTF-8 sequences in user input (although guaranteeing 100% validity requires some extra filtering) using:
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
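As a rough illustration of that cleanup step, the sketch below wraps it in a helper and then checks the result. The name cleanUtf8() is hypothetical, and stripping control characters is just one possible form of extra filtering, not something this post prescribes.

```php
<?php
// Sketch only: assumes mbstring is loaded. cleanUtf8() is a hypothetical helper.
function cleanUtf8($str)
{
    // Re-encoding UTF-8 to UTF-8 makes mbstring substitute invalid sequences.
    $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    // Optional extra filtering: drop ASCII control characters except
    // tab, line feed and carriage return.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $str);
}

// Valid "café", then a stray 0xFF byte, then a NUL byte.
$dirty = "caf\xC3\xA9 \xFF ok\x00";
$clean = cleanUtf8($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8')); // false
var_dump(mb_check_encoding($clean, 'UTF-8')); // true
```

Running the check afterwards with mb_check_encoding() is a cheap way to confirm that nothing invalid slipped through before the string reaches the database.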
If mbstring isn’t available, use the iconv functions or one of the various string handling libraries. For regular expressions, the u modifier makes the standard preg_ functions treat both pattern and subject as UTF-8. Watch out for single-byte functions such as wordwrap() and chunk_split() (you’ll have to create or find alternatives).
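The following sketch shows the effect of the u modifier and one way a character-aware replacement for wordwrap() might look. mbWordwrap() is a hypothetical helper written for this example, not a built-in; it wraps only on spaces and never splits a long word.

```php
<?php
// Sketch only: assumes mbstring is loaded. mbWordwrap() is hypothetical.
function mbWordwrap($text, $width = 75, $break = "\n")
{
    $lines = array();
    $line = '';
    // The u modifier makes the pattern treat $text as UTF-8.
    foreach (preg_split('/ +/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $candidate = ($line === '') ? $word : $line . ' ' . $word;
        if ($line !== '' && mb_strlen($candidate, 'UTF-8') > $width) {
            $lines[] = $line;   // current line is full, start a new one
            $line = $word;
        } else {
            $line = $candidate;
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode($break, $lines);
}

$char = "\xC3\xA9"; // "é" as UTF-8 bytes
// Without u, "." matches a single byte, so a two-byte character fails ^.$
var_dump(preg_match('/^.$/', $char));  // int(0)
var_dump(preg_match('/^.$/u', $char)); // int(1)

// "été à la école" wrapped at 6 characters (not 6 bytes).
echo mbWordwrap("\xC3\xA9t\xC3\xA9 \xC3\xA0 la \xC3\xA9cole", 6), "\n";
```

Measuring candidate lines with mb_strlen() is the whole point of the helper: the first line, "été à", is five characters but eight bytes, so a byte-based wrap at width 6 would break it too early.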

MySQL

You can often get away with stuffing Unicode into non-Unicode fields, as many popular web apps still do, but it’s better to abandon older versions of MySQL and ditch the hacks.
Make sure all databases and tables use the character set utf8 and collation utf8_unicode_ci (or utf8_general_ci, which is slightly faster but ‘less correct’). The collation specifies how strings are compared and sorted, allowing for alternative representations of characters (watch out if you’re expecting exact string matches). Note that MySQL’s utf8 stores at most three bytes per character, so characters outside the Basic Multilingual Plane need utf8mb4 (MySQL 5.5.3 and later). If you export the database from your admin tool you should see everything set to utf8, e.g.:
CREATE DATABASE `db` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
USE db;

CREATE TABLE `tbl` (
  `id` mediumint(8) unsigned NOT NULL auto_increment,
  `sometext` varchar(100) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1 ;
To get PHP and MySQL talking in UTF-8, articles usually advise sending a SET NAMES 'utf8' query immediately after connecting to the database, and I’ve seen mention of also using SET CHARACTER SET, but this is what worked for me:
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'
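For context, here is a minimal sketch of issuing that statement right after connecting. It assumes the mysqli extension, which this post does not specify; connectUtf8() and its parameters are hypothetical. The set_charset() call is included because the PHP manual recommends it over a bare SET NAMES query, since it also tells the client library which character set is in use.

```php
<?php
// Sketch only: assumes the mysqli extension. connectUtf8() is a hypothetical
// helper and the connection details are supplied by the caller.
function connectUtf8($host, $user, $password, $database)
{
    $db = new mysqli($host, $user, $password, $database);
    if ($db->connect_error) {
        throw new RuntimeException('Connection failed: ' . $db->connect_error);
    }
    // Informs both the server and the client library of the charset.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    // The statement from the text, which also fixes the connection collation.
    if (!$db->query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'")) {
        throw new RuntimeException('SET NAMES failed: ' . $db->error);
    }
    return $db;
}
```

Doing this in one place, immediately after connecting, means no query can run before the connection encoding is settled.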

Email

If you want to send HTML or attachments, save yourself endless headaches by using a good library, but mb_send_mail() is adequate for plain text UTF-8 emails. Like mail(), it forces you to construct additional headers to set the return address, so make sure anything going into them is rigorously validated to avoid email injection. Here’s a cut-down (no filtering/validation) function as a starting point:
function utf8Email($toEmail, $toName, $fromEmail, $fromName, $subject, $message)
{
 $toName = mb_encode_mimeheader($toName, 'UTF-8', 'Q', "\n");
 // PHP won't allow line breaks in the To: field, so only
 // include characters that fit into the first encoded line
 $n = strpos($toName, "\n");
 if ($n !== FALSE) $toName = substr($toName, 0, $n);
 
 $fromName = mb_encode_mimeheader($fromName, 'UTF-8', 'Q', "\n");
 
 $headers = 'From: "'.$fromName.'" <'.$fromEmail.'>'."\n";
 $headers .= 'Reply-To: '.$fromEmail;

 return @mb_send_mail('"'.$toName.'" <'.$toEmail.'>', $subject, $message, $headers);
}
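Since the function above does no validation, here is a hedged sketch of the sort of check that could run before calling it. Both helper names are hypothetical, and this is a starting point rather than a complete defence against email injection.

```php
<?php
// Sketch only: hypothetical helpers for checking values before they are
// placed in mail headers.
function isSafeHeaderValue($value)
{
    // A CR or LF would let the value start a new header line.
    return preg_match('/[\r\n]/', $value) === 0;
}

function isSafeEmailAddress($email)
{
    return isSafeHeaderValue($email)
        && filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

// An address carrying an injected Bcc header is rejected.
$injected = "attacker@example.com\nBcc: victim@example.com";
var_dump(isSafeEmailAddress($injected));            // bool(false)
var_dump(isSafeEmailAddress('alice@example.com'));  // bool(true)
```

In practice you would check $toEmail and $fromEmail with isSafeEmailAddress(), and the names with isSafeHeaderValue(), and refuse to send if any check fails.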
Most PHP developers seem to be either unaware of Unicode or scared of it, but once every aspect is UTF-8-friendly you can stop worrying about encoding hacks and unusual characters; it all just works.
