PHP: html_entity_decode

Description

string html_entity_decode ( string string [, int quote_style [, string charset]] )

html_entity_decode() is the opposite of htmlentities(): it converts all HTML entities in string to their corresponding characters.

The optional quote_style argument specifies how 'single' and "double" quotes are handled. It takes one of the following three constants; the default is ENT_COMPAT:

Table 1. quote_style constants

Constant name	Description
`ENT_COMPAT`	Double quotes are converted; single quotes are left unchanged.
`ENT_QUOTES`	Both double and single quotes are converted.
`ENT_NOQUOTES`	Both double and single quotes are left unchanged.
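
As a brief illustration of the three constants, the following sketch decodes the same string with each of them. The sample string and the variable names are made up for this example, and the charset is passed explicitly (an assumption made here) so that the result does not depend on the default of the PHP version in use.

```php
<?php
// Sample input: double quotes as &quot;, single quotes as &#039;.
$encoded = '&quot;double&quot; and &#039;single&#039;';

// ENT_COMPAT: only double quotes are decoded.
$compat = html_entity_decode($encoded, ENT_COMPAT, 'ISO-8859-1');

// ENT_QUOTES: both kinds of quotes are decoded.
$quotes = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// ENT_NOQUOTES: neither kind of quote is decoded.
$noquotes = html_entity_decode($encoded, ENT_NOQUOTES, 'ISO-8859-1');

echo $compat, "\n";   // "double" and &#039;single&#039;
echo $quotes, "\n";   // "double" and 'single'
echo $noquotes, "\n"; // &quot;double&quot; and &#039;single&#039;
```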

The optional third argument charset defines the character set used in the conversion. The default character set is ISO-8859-1.

As of PHP 4.3.0, the following character sets are supported.

Table 2. Supported character sets

Charset	Aliases	Description
ISO-8859-1	ISO8859-1	Western European, Latin-1.
ISO-8859-15	ISO8859-15	Western European, Latin-9. Adds the euro sign and the French and Finnish letters missing in Latin-1 (ISO-8859-1).
UTF-8		ASCII-compatible multi-byte 8-bit Unicode.
cp866	ibm866, 866	Cyrillic character set used in DOS. Supported as of PHP 4.3.2.
cp1251	Windows-1251, win-1251, 1251	Cyrillic character set used in Windows. Supported as of PHP 4.3.2.
cp1252	Windows-1252, 1252	Western European character set used in Windows.
KOI8-R	koi8-ru, koi8r	Russian character set. Supported as of PHP 4.3.2.
BIG5	950	Traditional Chinese, used mainly in Taiwan.
GB2312	936	Simplified Chinese, the national standard character set.
BIG5-HKSCS		Extended Big5, used in Hong Kong.
Shift_JIS	SJIS, 932	Japanese character set.
EUC-JP	EUCJP	Japanese character set.

Note: Character sets not listed above are not supported, and ISO-8859-1 is used in their place.

Example 1. Decoding HTML entities

<?php
$orig = "I'll \"walk\" the dog now";

$a = htmlentities($orig);
$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the dog now
echo $b; // I'll "walk" the dog now

// in versions before PHP 4.3.0 you can do this instead:
function unhtmlentities($string)
{
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

$c = unhtmlentities($a);

echo $c; // I'll "walk" the dog now
?>

Note: It may seem strange that the result of trim(html_entity_decode('&nbsp;')); is not an empty string. The reason is that '&nbsp;' is converted not to the character with ASCII code 32 (which trim() strips), but to the character with code 160 (0xa0) in the default ISO-8859-1 character set.
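
The effect described in the note can be checked with a short sketch like the one below. The variable names are illustrative only, and the extra characters passed to trim() are one possible workaround, not something the manual prescribes.

```php
<?php
// In ISO-8859-1, &nbsp; decodes to the single byte 0xA0.
$latin1 = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1');
echo bin2hex($latin1), "\n";      // a0
echo strlen(trim($latin1)), "\n"; // 1: trim() does not strip 0xA0

// Possible workaround: list 0xA0 explicitly among the characters to strip.
$trimmed = trim($latin1, " \t\n\r\0\x0B\xA0");
echo strlen($trimmed), "\n";      // 0

// In UTF-8, the same entity decodes to the two bytes 0xC2 0xA0,
// so it is replaced with a plain space before trimming.
$utf8 = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');
$trimmedUtf8 = trim(str_replace("\xC2\xA0", ' ', $utf8));
echo strlen($trimmedUtf8), "\n";  // 0
```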

See also htmlentities(), htmlspecialchars(), get_html_translation_table(), and urldecode().

html_entity_decode

nycolhas at hotmail dot com
05-Apr-2006 11:24


This function might be useful for people who want to capitalize a string that contains HTML entities.

<?php
// Decode the entities, uppercase the plain text, then encode it again.
function htmlstrtoupper($string) {
    return htmlentities(strtoupper(html_entity_decode($string)));
}
?>

buraks78 at gmail dot com
07-Feb-2006 03:19


The "unhtmlentities" function defined above fails to decode single quotes properly. The issue can be solved by putting double quotes around the backreference, that is, by replacing chr(\\1) with chr("\\1"):

function unhtmlentities($string)
{
   // replace numeric entities
   // (the /e modifier is only available before PHP 7.0)
   $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
   $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
   // replace literal entities
   $trans_tbl = get_html_translation_table(HTML_ENTITIES);
   $trans_tbl = array_flip($trans_tbl);
   return strtr($string, $trans_tbl);
}
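
The /e modifier used above was removed in PHP 7.0, so on current versions the same idea has to be written with preg_replace_callback(). The following is a minimal sketch under that assumption; the name unhtmlentities_cb and the explicit ISO-8859-1 charset are choices made for this example and are not part of the original note.

```php
<?php
function unhtmlentities_cb($string)
{
    // Hexadecimal numeric entities, e.g. &#x41;
    $string = preg_replace_callback(
        '~&#x([0-9a-f]+);~i',
        function ($m) { return chr(hexdec($m[1])); },
        $string
    );
    // Decimal numeric entities, e.g. &#39;
    $string = preg_replace_callback(
        '~&#([0-9]+);~',
        function ($m) { return chr((int) $m[1]); },
        $string
    );
    // Named entities, via the flipped translation table.
    // chr() yields single bytes only, so code points above 255 are not handled.
    $trans_tbl = array_flip(
        get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'ISO-8859-1')
    );
    return strtr($string, $trans_tbl);
}

echo unhtmlentities_cb('It&#39;s &#x41; &amp; B'), "\n"; // It's A & B
```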

hurricane at cyberworldz dot org
22-Dec-2005 08:33


I shortened the function replace_num_entity a bit to make it cleaner and easier to understand. Maybe now someone will see the problem it possibly has... (as mentioned below)

<?php
function replace_num_entity($ord) {
    $ord = $ord[1];
    // hexadecimal (&#x...;) or decimal (&#...;) notation
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match)) $ord = hexdec($match[1]);
        else $ord = intval($ord);
    $no_bytes = 0;
    $byte = array();
    // number of bytes needed for the UTF-8 sequence
    if ($ord < 128) return chr($ord);
    if ($ord < 2048) $no_bytes = 2;
        else if ($ord < 65536) $no_bytes = 3;
        else if ($ord < 1114112) $no_bytes = 4;
        else return;
    // (mask, marker) for the leading byte
    switch($no_bytes) {
        case 2: $prefix = array(31, 192); break;
        case 3: $prefix = array(15, 224); break;
        case 4: $prefix = array(7, 240);
    }
    // split the code point into 6-bit groups, last group first
    for ($i=0; $i < $no_bytes; ++$i)
        $byte[$no_bytes-$i-1] = (($ord & (63 * pow(2,6*$i))) / pow(2,6*$i)) & 63 | 128;
    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];
    $ret = '';
    for ($i=0; $i < $no_bytes; ++$i) $ret .= chr($byte[$i]);
    return $ret;
}
?>

loufoque
08-Oct-2005 01:15


If you want to decode NCRs (numeric character references) to UTF-8, use this function instead of chr().

function utf8_chr($code)
{
    // 1 byte: plain ASCII
    if($code<128) return chr($code);
    // 2 bytes: 110xxxxx 10xxxxxx
    else if($code<2048) return chr(($code>>6)+192).chr(($code&63)+128);
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    else if($code<65536) return chr(($code>>12)+224).chr((($code>>6)&63)+128).chr(($code&63)+128);
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($code<2097152) return chr(($code>>18)+240).chr((($code>>12)&63)+128)
                                  .chr((($code>>6)&63)+128).chr(($code&63)+128);
}

emilianomartinezluque at yahoo dot com
25-Sep-2005 05:22


I've been using the great replace_num_entity function posted below, but there seem to be some problems with the 128 to 160 character range. For example, try:

<?php header("Content-type: text/html; charset=utf-8"); ?>
<html><body>
<?php
for($x=128; $x<161; $x++) {
      echo('&#' . $x . '; -- ' . preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', '&#' . $x . ';') . '<br />');
}
?>
</body></html>

I really don't know the reason for this (since according to the UTF-8 specs the function should have worked), but I wrote a modified version of the function to address it. Hope it helps.



function replace_num_entity($ord)
{
    $ord = $ord[1];
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
    {
        $ord = hexdec($match[1]);
    }
    else
    {
        $ord = intval($ord);
    }

    $no_bytes = 0;
    $byte = array();

    // Hard-coded UTF-8 byte sequences for the code points 128 to 160.
    static $special = array(
        128 => array(226, 130, 172),
        129 => array(239, 191, 189),
        130 => array(226, 128, 154),
        131 => array(198, 146),
        132 => array(226, 128, 158),
        133 => array(226, 128, 166),
        134 => array(226, 128, 160),
        135 => array(226, 128, 161),
        136 => array(203, 134),
        137 => array(226, 128, 176),
        138 => array(197, 160),
        139 => array(226, 128, 185),
        140 => array(197, 146),
        141 => array(239, 191, 189),
        142 => array(197, 189),
        143 => array(239, 191, 189),
        144 => array(239, 191, 189),
        145 => array(226, 128, 152),
        146 => array(226, 128, 153),
        147 => array(226, 128, 156),
        148 => array(226, 128, 157),
        149 => array(226, 128, 162),
        150 => array(226, 128, 147),
        151 => array(226, 128, 148),
        152 => array(203, 156),
        153 => array(226, 132, 162),
        154 => array(197, 161),
        155 => array(226, 128, 186),
        156 => array(197, 147),
        157 => array(239, 191, 189),
        158 => array(197, 190),
        159 => array(197, 184),
        160 => array(194, 160),
    );

    if (isset($special[$ord]))
    {
        $ret = '';
        foreach ($special[$ord] as $b)
        {
            $ret .= chr($b);
        }
        return $ret;
    }

    if ($ord < 128)
    {
        return chr($ord);
    }
    elseif ($ord < 2048)
    {
        $no_bytes = 2;
    }
    elseif ($ord < 65536)
    {
        $no_bytes = 3;
    }
    elseif ($ord < 1114112)
    {
        $no_bytes = 4;
    }
    else
    {
        return;
    }

    switch($no_bytes)
    {
        case 2:
        {
            $prefix = array(31, 192);
            break;
        }
        case 3:
        {
            $prefix = array(15, 224);
            break;
        }
        case 4:
        {
            $prefix = array(7, 240);
        }
    }

    for ($i = 0; $i < $no_bytes; $i++)
    {
        $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
    }

    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

    $ret = '';
    for ($i = 0; $i < $no_bytes; $i++)
    {
        $ret .= chr($byte[$i]);
    }

    return $ret;
}

florianborn (at) yahoo (dot) de
20-Jul-2005 03:43


Note that

<?php
 echo urlencode(html_entity_decode("&nbsp;"));
?>

will output "%A0" instead of "+".
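
Which bytes end up in the URL depends on the charset used for decoding. The sketch below passes the charset explicitly (an assumption made here so that the result does not rely on the default) and shows both the ISO-8859-1 case from the note and the UTF-8 case.

```php
<?php
// ISO-8859-1: &nbsp; becomes the single byte 0xA0.
$latin1 = urlencode(html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1'));
echo $latin1, "\n"; // %A0

// UTF-8: &nbsp; becomes the two bytes 0xC2 0xA0.
$utf8 = urlencode(html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8'));
echo $utf8, "\n";   // %C2%A0

// An ordinary space is what urlencode() turns into "+".
$space = urlencode(' ');
echo $space, "\n";  // +
```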

gaui at gaui dot is
04-Jul-2005 05:15


if( !function_exists( 'html_entity_decode' ) )
{
    function html_entity_decode( $given_html, $quote_style = ENT_QUOTES ) {
        $trans_table = array_flip(get_html_translation_table( HTML_SPECIALCHARS, $quote_style ));
        $trans_table['&#39;'] = "'";
        return ( strtr( $given_html, $trans_table ) );
    }
}

marius (at) hot (dot) ee
08-Apr-2005 06:40


To convert HTML entities into Unicode (UTF-8) characters, use the following:

        $trans_tbl = get_html_translation_table(HTML_ENTITIES);
        $ttr = array();
        // map each entity to the UTF-8 form of its Latin-1 character
        foreach($trans_tbl as $k => $v)
        {
            $ttr[$v] = utf8_encode($k);
        }

        $text = strtr($text, $ttr);
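
On PHP versions where get_html_translation_table() accepts a charset argument, the table can be requested in UTF-8 directly, which avoids the utf8_encode() step. This is a sketch under that assumption; the variable names and the sample text are invented for the example.

```php
<?php
// Ask for the entity table in UTF-8 and flip it: entity => character.
$entity_to_utf8 = array_flip(
    get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8')
);

$text = 'caf&eacute; &amp; cr&egrave;me';
$decoded = strtr($text, $entity_to_utf8);

echo $decoded, "\n"; // café & crème (as UTF-8 bytes)
```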

php dot net at c dash ovidiu dot tk
18-Mar-2005 12:37


Quick & dirty code that translates numeric entities to UTF-8.

<?php

    function replace_num_entity($ord)
    {
        $ord = $ord[1];
        if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        {
            $ord = hexdec($match[1]);
        }
        else
        {
            $ord = intval($ord);
        }

        $no_bytes = 0;
        $byte = array();

        if ($ord < 128)
        {
            return chr($ord);
        }
        elseif ($ord < 2048)
        {
            $no_bytes = 2;
        }
        elseif ($ord < 65536)
        {
            $no_bytes = 3;
        }
        elseif ($ord < 1114112)
        {
            $no_bytes = 4;
        }
        else
        {
            return;
        }

        switch($no_bytes)
        {
            case 2:
            {
                $prefix = array(31, 192);
                break;
            }
            case 3:
            {
                $prefix = array(15, 224);
                break;
            }
            case 4:
            {
                $prefix = array(7, 240);
            }
        }

        for ($i = 0; $i < $no_bytes; $i++)
        {
            $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
        }

        $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

        $ret = '';
        for ($i = 0; $i < $no_bytes; $i++)
        {
            $ret .= chr($byte[$i]);
        }

        return $ret;
    }

    $test = 'This is a &#269;&#x5d0; test&#39;';

    echo $test . "<br />\n";
    echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);

?>

Silvan
28-Jan-2005 07:33


Passing NULL or FALSE as a string will generate a '500 Internal Server Error' (or break the script when inside a function). 



So always test your string first before passing it to html_entity_decode().
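
One way to follow this advice is a small wrapper that checks the argument before decoding. The helper name safe_entity_decode and the choice to return an empty string for non-string input are assumptions made for this sketch.

```php
<?php
function safe_entity_decode($value)
{
    // Anything that is not a string (NULL, FALSE, arrays, ...) is not decoded.
    if (!is_string($value)) {
        return '';
    }
    return html_entity_decode($value, ENT_QUOTES, 'ISO-8859-1');
}

var_dump(safe_entity_decode(null));       // string(0) ""
var_dump(safe_entity_decode(false));      // string(0) ""
var_dump(safe_entity_decode('a &lt; b')); // string(5) "a < b"
```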

daniel at brightbyte dot de
13-Nov-2004 06:12


This function seems to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character encodings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting back to UTF-8. But that's quite ugly and destroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    $text= html_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
    # the /e modifier used below is only available before PHP 7.0
    $text= preg_replace('/&#(\d+);/me',"chr(\\1)",$text); #decimal notation
    $text= preg_replace('/&#x([a-f0-9]+);/mei',"chr(0x\\1)",$text);  #hex notation
    return $text;
}
?>



HTH
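
For comparison, current PHP releases decode both UTF-8 input and numeric entity references directly, so the workaround may no longer be needed there. The sketch below assumes such a release; the sample string is made up for this example.

```php
<?php
// &#269; (decimal), &#x5d0; (hexadecimal) and &eacute; (named) in one string.
$text = '&#269; &#x5d0; &eacute;';
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Print the resulting UTF-8 bytes in hexadecimal.
echo bin2hex($decoded), "\n"; // c48d20d79020c3a9
```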

aidan at php dot net
14-Sep-2004 12:57


This functionality is now implemented in the PEAR package PHP_Compat.



More information about using this function without upgrading your version of PHP can be found at the link below:



http://pear.php.net/package/PHP_Compat