PHP DOMDocument class and HTML entities

Posted: October 19, 2010 in PHP
The DOMDocument class sometimes behaves strangely. It can drop named entities such as &ldquo; and even some valid UTF-8 characters (the same may happen with other encodings). Supplying your own DTD would probably fix this, but there is a simpler way. Every named HTML entity has an equivalent numeric character reference (for example, &ldquo; is &#8220;), and DOMDocument exports the text correctly if you replace the named entities with the matching numeric references. I keep a small list of them.
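To illustrate the idea, here is a minimal sketch. It assumes the markup is parsed with loadXML() and no DTD, in which case libxml typically rejects a named entity as undefined, while a numeric reference is always accepted. The sample fragment and the three-entry map are made up for the demonstration.

```php
<?php
// Illustrative fragment and map; both are assumptions, not taken from a real page.
$fragment = '<p>&ldquo;M&uuml;nchen&rdquo;</p>';
$numeric = [
    '&ldquo;' => '&#8220;',
    '&rdquo;' => '&#8221;',
    '&uuml;'  => '&#252;',
];

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Attempt 1: named entities, no DTD. The parser has no definition for them.
$raw = new DOMDocument();
$rawLoaded = $raw->loadXML($fragment);
$rawErrors = libxml_get_errors();
libxml_clear_errors();

// Attempt 2: the same fragment with numeric character references.
$doc = new DOMDocument();
$fixedLoaded = $doc->loadXML(strtr($fragment, $numeric));

var_dump($rawLoaded, count($rawErrors), $fixedLoaded);

// The text content is the decoded UTF-8 string, curly quotes included.
echo $doc->documentElement->textContent, "\n";
```

strtr() is enough here because every key ends with a semicolon, so one entity name can never match inside a longer one.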

The following function helps you avoid losing characters:

function parseEntities($string) {
    $entities = array (
        "auml" => "#228;",
        "ouml" => "#246;",
        "uuml" => "#252;",
        "szlig" => "#223;",
        "Auml" => "#196;",
        "Ouml" => "#214;",
        "Uuml" => "#220;",
        "nbsp" => "#160;",
        "Agrave" => "#192;",
        "Egrave" => "#200;",
        "Eacute" => "#201;",
        "Ecirc" => "#202;",
        "egrave" => "#232;",
        "eacute" => "#233;",
        "ecirc" => "#234;",
        "agrave" => "#224;",
        "iuml" => "#239;",
        "ugrave" => "#249;",
        "ucirc" => "#251;",
        "ccedil" => "#231;",
        "AElig" => "#198;",
        "aelig" => "#230;",
        "OElig" => "#338;",
        "oelig" => "#339;",
        "angst" => "#8491;",
        "cent" => "#162;",
        "copy" => "#169;",
        "Dagger" => "#8225;",
        "dagger" => "#8224;",
        "deg" => "#176;",
        "emsp" => "#8195;",
        "ensp" => "#8194;",
        "ETH" => "#208;",
        "eth" => "#240;",
        "euro" => "#8364;",
        "half" => "#189;",
        "laquo" => "#171;",
        "ldquo" => "#8220;",
        "lsquo" => "#8216;",
        "mdash" => "#8212;",
        "micro" => "#181;",
        "middot" => "#183;",
        "ndash" => "#8211;",
        "not" => "#172;",
        "numsp" => "#8199;",
        "para" => "#182;",
        "permil" => "#8240;",
        "puncsp" => "#8200;",
        "raquo" => "#187;",
        "rdquo" => "#8221;",
        "rsquo" => "#8217;",
        "reg" => "#174;",
        "sect" => "#167;",
        "THORN" => "#222;",
        "thorn" => "#254;",
        "trade" => "#8482;"
     );

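    foreach ($entities as $ent => $repl) {
        // Require the closing semicolon so that "&not" inside "&notin;" or
        // "&reg" inside a URL such as "?a=1&region=2" is left untouched.
        $string = preg_replace('/&' . $ent . ';/', '&' . $repl, $string);
    }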

    return $string;
}

The list does not cover every entity, but new ones can be added to the array without changing any other code.
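If maintaining the list by hand becomes tedious, the numeric codes can also be derived from PHP's own entity table. The sketch below is one possible alternative, not the approach used above: namedToNumericEntities() is a hypothetical helper name, and it assumes PHP 7.2 or later with the mbstring extension (for mb_ord()) and UTF-8 input.

```php
<?php
// Hypothetical helper: converts every named entity that PHP knows into
// numeric character references, without a hand-written list.
function namedToNumericEntities(string $html): string
{
    // Entities predefined in XML are understood by every parser; keep them.
    $keep = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($keep): string {
            if (in_array($m[1], $keep, true)) {
                return $m[0];
            }

            // Let PHP resolve the name using its HTML5 entity table.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // Unknown name: leave it as it is.
            }

            // A few HTML5 entities expand to more than one code point,
            // so emit one numeric reference per character.
            $out = '';
            foreach (preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char) {
                $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
            }
            return $out;
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo namedToNumericEntities('&ldquo;Gr&uuml;&szlig;e&rdquo; &amp; &nosuchentity;'), "\n";
// prints: &#8220;Gr&#252;&#223;e&#8221; &amp; &nosuchentity;
```

The trade-off is that the result depends on the entity table shipped with the PHP version in use, whereas the explicit array makes every mapping visible and easy to review.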
