Of Slugs and Permalinks and PHP Redux
Around three years ago I posted a simple slugification function that used iconv() to coerce a string into URL-friendly ASCII. It had a couple of drawbacks in that it dropped characters rather than transliterating them (oops) and it fell flat on its face if iconv() wasn’t available.
Thus, I present here an improved version.
/**
 * Clean up a string and make it suitable for inclusion in a URL.
 *
 * @param string $str The (UTF-8) string to be slugified
 * @return string A sanitized, URL-safe version of $str
 */
function slugify($str)
{
    if (function_exists('iconv')) {
        // Silence iconv() notices and switch to a UTF-8 locale,
        // remembering the old settings so they can be restored.
        $old_level = error_reporting(0);
        $old_locale = setlocale(LC_ALL, '0');
        setlocale(LC_ALL, 'en_US.UTF-8');
        // Drop invalid UTF-8 sequences, then transliterate to ASCII.
        $slug = iconv('UTF-8', 'UTF-8//IGNORE', $str);
        $slug = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $slug);
        setlocale(LC_ALL, $old_locale);
        error_reporting($old_level);
    } else {
        // Fallback: collapse whitespace, then strip everything except
        // ASCII letters, digits, and spaces.
        $slug = preg_replace('/\s+/', ' ', $str);
        $slug = preg_replace("/[^a-z0-9 ]/i", "", $slug);
    }
    // Shared tail: non-word runs to spaces, trim, lowercase, hyphenate.
    $slug = preg_replace('/\W+/', ' ', $slug);
    $slug = trim($slug);
    $slug = strtolower($slug);
    $slug = preg_replace('/\ +/', '-', $slug);
    return $slug;
}
The basic mechanism is to take a string, reduce it to ASCII (transliterating where possible), strip punctuation, collapse whitespace, and replace the whitespace with hyphens. Ideally, this turns “I lìke: Icé Crëam!” into “i-like-ice-cream”.
The non-iconv() path doesn’t quite live up to that goal. It collapses whitespace and then strips anything that isn’t an alphanumeric ASCII character or an ASCII space. This means it turns the previous example into “I lke Ic Cram”. I’m not too worried about that, as it’s intended solely as a fallback. I added it because I had a development machine without iconv() and a server with it, and didn’t want to waste time installing it on the development machine at that moment.
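To see the fallback in isolation, here is a minimal sketch of just the two preg_replace() calls from the else branch. The helper name ascii_fallback is made up for illustration and is not part of the function above.

```php
<?php
// Hypothetical helper: the same two steps as the else branch of slugify().
function ascii_fallback(string $str): string
{
    // Collapse every run of whitespace into a single space.
    $out = preg_replace('/\s+/', ' ', $str);
    // Remove every byte that is not an ASCII letter, digit, or space.
    // Without the /u modifier the pattern works byte by byte, so each
    // byte of a multi-byte UTF-8 character is dropped.
    return preg_replace('/[^a-z0-9 ]/i', '', $out);
}

echo ascii_fallback("I lìke: Icé Crëam!"), "\n"; // I lke Ic Cram
```

The accented letters vanish outright rather than being replaced by their unaccented forms, which is exactly the weakness described above.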
The iconv() path is much nicer, and I’ll take it step by step. First, error reporting is disabled, because iconv() loves to spew errors even when everything is operating as expected. The old reporting level is recorded so that it can be restored once iconv() is done spewing errors. Next, the current locale is set to a UTF-8 value. The function assumes its input is UTF-8, and this ensures iconv()’s transliteration works properly. As with the error level, the current locale is recorded for later restoration.
Now comes the heart of the function, in which the input string is first stripped of invalid UTF-8 byte sequences (just in case $str is composed of random gibberish), then bounced down from UTF-8 to ASCII. It’s the second iconv() call, with the ‘TRANSLIT’ option, that makes up the real magic, transliterating accented characters into non-accented versions. That is, it turns “ß” into “ss” and “é” into “e”. Unfortunately it doesn’t do much for non-European scripts, merely converting the characters into question marks. At this point the example has been transformed into “I like: Ice Cream!”.
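The save-and-restore dance can be pulled out into a small wrapper. This is only a sketch: the name with_quiet_utf8_locale is invented here, and the try/finally is my addition so that the settings are restored even if the callback throws.

```php
<?php
// Hypothetical wrapper: run $fn with error reporting off and a UTF-8
// locale active, then put both settings back the way they were.
function with_quiet_utf8_locale(callable $fn)
{
    $old_level  = error_reporting(0);      // returns the previous level
    $old_locale = setlocale(LC_ALL, '0');  // '0' only queries the locale
    setlocale(LC_ALL, 'en_US.UTF-8');      // returns false if not installed
    try {
        return $fn();
    } finally {
        if ($old_locale !== false) {
            setlocale(LC_ALL, $old_locale);
        }
        error_reporting($old_level);
    }
}

// Usage: the callback sees a reporting level of 0.
$inside = with_quiet_utf8_locale(function () {
    return error_reporting();
});
```

Note that setlocale() quietly returns false when 'en_US.UTF-8' isn’t installed on the machine, in which case the locale is left unchanged and transliteration may behave differently.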
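The two iconv() calls can be tried on their own with something like the sketch below. The helper name to_ascii is hypothetical, and it uses the @ operator instead of toggling error_reporting() purely for brevity. I’m deliberately not showing an expected result, since what comes out depends on the iconv implementation and the active locale.

```php
<?php
// Hypothetical helper isolating the two conversions from slugify().
// Returns the ASCII string, or false if iconv() is missing or fails.
function to_ascii(string $str)
{
    if (!function_exists('iconv')) {
        return false;
    }
    // Pass 1: UTF-8 to UTF-8, discarding invalid byte sequences.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
    if ($clean === false) {
        return false;
    }
    // Pass 2: transliterate to ASCII, discarding whatever cannot be mapped.
    return @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $clean);
}

// Inspect what the local iconv makes of the running example.
var_dump(to_ascii("I lìke: Icé Crëam!"));
```

If the output is full of question marks, the likely culprit is the locale: the call is being made outside a UTF-8 locale, which is why the full function sets one first.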
After the string has been converted, the error reporting level and locale are restored.
The remainder of the function is the same for both paths from this point. All runs of non-word characters (i.e., whitespace, control characters, punctuation) are replaced with spaces (I like Ice Cream ). Then leading and trailing whitespace is stripped (I like Ice Cream), and the string is lowercased (i like ice cream). Finally, the remaining runs of whitespace are replaced by hyphens (i-like-ice-cream or i-lke-ic-cram for the non-iconv() path) and the string is returned to the caller.
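The shared tail is easy to run by itself. Here is a minimal sketch, with normalize_slug being a name I’ve made up for the last four statements of the function.

```php
<?php
// Hypothetical helper: the final four steps of slugify().
function normalize_slug(string $slug): string
{
    $slug = preg_replace('/\W+/', ' ', $slug); // non-word runs become one space
    $slug = trim($slug);                       // drop leading/trailing spaces
    $slug = strtolower($slug);                 // lowercase
    return preg_replace('/ +/', '-', $slug);   // remaining spaces become hyphens
}

echo normalize_slug("I like: Ice Cream!"), "\n"; // i-like-ice-cream
echo normalize_slug("I lke Ic Cram"), "\n";      // i-lke-ic-cram
```

One thing worth knowing: \W treats the underscore as a word character, so underscores pass through untouched and “snake_case title” comes out as “snake_case-title”.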
My original function was adapted from an early version of Rick Olson’s permalink_fu.rb, but there has been significant divergence since.
