PHP: converting an encoding to UTF-8

Author: | 16.12.2019

(PHP 4 >= 4.0.6, PHP 5, PHP 7)

mb_convert_encoding: Convert character encoding

Description

Converts the character encoding of val to to_encoding. The optional from_encoding parameter may also be given. If val is an array, all of its string values are converted recursively.

Parameters

val: The string or array to convert.

to_encoding: The encoding that the data in val is converted to.

from_encoding: The source encoding of the string. It may be an array or a comma-separated list of encodings. If from_encoding is not specified, the internal encoding is used.

Return Values

The converted string or array.

Examples

Example #1 mb_convert_encoding() example

<?php
/* Convert from the internal encoding to SJIS */
$str = mb_convert_encoding($str, "SJIS");

/* Convert EUC-JP to UTF-7 */
$str = mb_convert_encoding($str, "UTF-7", "EUC-JP");

/* Detect the encoding among JIS, eucjp-win, sjis-win, then convert to UCS-2LE */
$str = mb_convert_encoding($str, "UCS-2LE", "JIS, eucjp-win, sjis-win");

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "EUC-JP", "auto");
?>

See Also

  • mb_detect_order() - Set/get the character encoding detection order

Changelog

Version Description
7.2.0 This function now also accepts an array as val. Previously, only strings were supported.
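
The array form added in 7.2.0 can be illustrated with a short sketch. The sample data and the variable names `$row` and `$utf8Row` are assumptions made for the example; the bytes are ISO-8859-1, as they might arrive from a legacy source.

```php
<?php
// Assumed sample data: ISO-8859-1 bytes (0xE9 is "é", 0xEF is "ï").
$row = [
    'name' => "Caf\xE9",
    'tags' => ["na\xEFve", "plain"],
];

// Since PHP 7.2 an array can be passed directly; nested strings are converted too.
$utf8Row = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-1');

echo $utf8Row['name'], PHP_EOL;    // the "é" is now the two bytes C3 A9
echo $utf8Row['tags'][0], PHP_EOL;
```

On PHP versions before 7.2 the same effect would need an explicit loop over the array.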

User Contributed Notes 32 notes

For my last project I needed to convert several CSV files from Windows-1250 to UTF-8. After several days of searching I found a function that partially solved my problem, but it still did not convert all the characters. So I made this:

function w1250_to_utf8($text) {
    // map based on:
    // http://konfiguracja.c0.pl/iso02vscp1250en.html
    // http://konfiguracja.c0.pl/webpl/index_en.html#examp
    // http://www.htmlentities.com/html/entities/
    $map = array(
        chr(0x8A) => chr(0xA9),
        chr(0x8C) => chr(0xA6),
        chr(0x8D) => chr(0xAB),
        chr(0x8E) => chr(0xAE),
        chr(0x8F) => chr(0xAC),
        chr(0x9C) => chr(0xB6),
        chr(0x9D) => chr(0xBB),
        chr(0xA1) => chr(0xB7),
        chr(0xA5) => chr(0xA1),
        chr(0xBC) => chr(0xA5),
        chr(0x9F) => chr(0xBC),
        chr(0xB9) => chr(0xB1),
        chr(0x9A) => chr(0xB9),
        chr(0xBE) => chr(0xB5),
        chr(0x9E) => chr(0xBE),
        chr(0x80) => '&euro;',
        chr(0x82) => '&sbquo;',
        chr(0x84) => '&bdquo;',
        chr(0x85) => '&hellip;',
        chr(0x86) => '&dagger;',
        chr(0x87) => '&Dagger;',
        chr(0x89) => '&permil;',
        chr(0x8B) => '&lsaquo;',
        chr(0x91) => '&lsquo;',
        chr(0x92) => '&rsquo;',
        chr(0x93) => '&ldquo;',
        chr(0x94) => '&rdquo;',
        chr(0x95) => '&bull;',
        chr(0x96) => '&ndash;',
        chr(0x97) => '&mdash;',
        chr(0x99) => '&trade;',
        chr(0x9B) => '&rsaquo;',
        chr(0xA6) => '&brvbar;',
        chr(0xA9) => '&copy;',
        chr(0xAB) => '&laquo;',
        chr(0xAE) => '&reg;',
        chr(0xB1) => '&plusmn;',
        chr(0xB5) => '&micro;',
        chr(0xB6) => '&para;',
        chr(0xB7) => '&middot;',
        chr(0xBB) => '&raquo;',
    );
    return html_entity_decode(mb_convert_encoding(strtr($text, $map), 'UTF-8', 'ISO-8859-2'), ENT_QUOTES, 'UTF-8');
}
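
Where the iconv extension is available, the same conversion can often be done without a hand-written map. This is a minimal sketch, assuming the system iconv library recognises the charset name `CP1250` (GNU libc and libiconv both do); the function name `cp1250_to_utf8` is hypothetical and not part of the note above.

```php
<?php
// Hypothetical helper: Windows-1250 (CP1250) bytes to UTF-8 via iconv.
// Assumes the local iconv implementation recognises the name "CP1250".
function cp1250_to_utf8(string $text): string
{
    $converted = iconv('CP1250', 'UTF-8', $text);
    if ($converted === false) {
        // iconv() signals failure with false; surface it instead of losing data.
        throw new RuntimeException('Input could not be converted from CP1250');
    }
    return $converted;
}

// 0x8A is "Š" and 0x9E is "ž" in Windows-1250.
echo cp1250_to_utf8("\x8Aime\x9E"), PHP_EOL;
```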

I’ve been trying to find the charset of a Norwegian (with a lot of ø, æ, å) txt file written on a Mac; I found it this way:

<?php
$text = "A strange string to pass, maybe with some ø, æ, å characters.";

foreach (mb_list_encodings() as $chr) {
    echo mb_convert_encoding($text, 'UTF-8', $chr) . " : " . $chr . "\n";
}
?>

The line that looks good gives you the encoding it was written in.

Hope this can help someone

aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:

mbstring.substitute_character = "none"

in your php.ini. Be sure to include the quotes around none. Or at run-time with

ini_set('mbstring.substitute_character', "none");

Hey guys. For everybody who’s looking for a function that converts an ISO string to UTF-8 or a UTF-8 string to ISO, here’s your solution:

public function encodeToUtf8($string) {
    return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}

public function encodeToIso($string) {
    return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}

For me these functions are working fine. Give it a try

many people below talk about using
<?php
mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
?>
to convert non-ASCII text into HTML-readable entities. Because my webserver was out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would convert it to nasty latin1 automatically and not leave it in its beautiful UTF-8 glory.

So [insert korean characters here] turned into .

I found myself needing to pass by reference (which of course is deprecated/nonexistent in recent versions of PHP)
so instead of
<?php
mb_convert_encoding(&$s, 'HTML-ENTITIES', 'UTF-8');
?>
which worked perfectly until I upgraded, I had to use
<?php
call_user_func_array('mb_convert_encoding', array(&$s, 'HTML-ENTITIES', 'UTF-8'));
?>

Hope it helps someone else out

My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)

Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding.

<?php
function convert_to($source, $target_encoding)
{
    // detect the character encoding of the incoming file
    $encoding = mb_detect_encoding($source, "auto");

    // escape all of the question marks so we can remove artifacts from
    // the unicode conversion process
    $target = str_replace("?", "[question_mark]", $source);

    // convert the string to the target encoding
    $target = mb_convert_encoding($target, $target_encoding, $encoding);

    // remove any question marks that have been introduced because of illegal characters
    $target = str_replace("?", "", $target);

    // replace the token string "[question_mark]" with the symbol "?"
    $target = str_replace("[question_mark]", "?", $target);

    return $target;
}
?>

Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)
-A

To add to the Flash conversion comment below, here’s how I convert back from what I’ve stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:

function htmltoflash($htmlstr)
{
return str_replace("
","
",
str_replace(" ",">",
mb_convert_encoding(html_entity_decode($htmlstr),
"UTF-8","ISO-8859-1"))));
}

Why did you use the PHP HTML encode functions? mbstring has its own encoding, which is (as far as I tested it) much more useful:

$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

instead of ini_set(), you can try this

For those wanting to convert from $set to MacRoman, use iconv():

<?php
$string = iconv('UTF-8', 'macintosh', $string);
?>

(‘macintosh’ is the IANA name for the MacRoman character set.)

If you are trying to generate a CSV (with extended chars) to be opened in Excel for Mac, the only thing that worked for me was:
<?php mb_convert_encoding($CSV, 'Windows-1252', 'UTF-8'); ?>

I also tried this:

<?php
// Fields separated OK, chars wrong
iconv('MACINTOSH', 'UTF8', $CSV);
// Fields separated wrong, chars OK
chr(255) . chr(254) . mb_convert_encoding($CSV, 'UCS-2LE', 'UTF-8');
?>

But the first one didn’t show extended chars correctly, and the second one didn’t separate fields correctly

Clean a string for use as a filename by simply replacing all unwanted characters with an underscore (ASCII converts to 7-bit). It removes slightly more chars than necessary. Hope it’s useful.
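
The code for this note is not shown here, so the following is only a sketch of the idea it describes, not the author's function. The name `filename_safe` and the set of allowed characters are assumptions; the sketch also assumes UTF-8 input and the default mbstring substitute character.

```php
<?php
// Hypothetical helper: reduce a UTF-8 string to a conservative ASCII filename.
function filename_safe(string $name): string
{
    // Characters with no ASCII equivalent become the mbstring substitute
    // character ("?" unless mbstring.substitute_character was changed).
    $ascii = mb_convert_encoding($name, 'ASCII', 'UTF-8');

    // Keep letters, digits, dot, underscore and hyphen; replace everything else.
    return preg_replace('/[^A-Za-z0-9._-]/', '_', $ascii);
}

echo filename_safe("r\xC3\xA9sum\xC3\xA9 2019.pdf"), PHP_EOL;
```

As the note says, this removes slightly more than strictly necessary: spaces and accented letters all collapse to underscores.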

// convert UTF8 to DOS = CP850
//
// $utf8_text=UTF8-Formatted text;
// $dos=CP850-Formatted text;

$dos = mb_convert_encoding($utf8_text, "CP850", mb_detect_encoding($utf8_text, "UTF-8, CP850, ISO-8859-15", true));

Another sample of recoding without MultiByte enabling.
(Russian koi->win, if input in win-encoding already, function recode() returns unchanged string)

// 0 - win
// 1 - koi
function detect_encoding($str) {
    $win = 0;
    $koi = 0;

    if ($win < $koi) {
        return 1;
    } else return 0;
}

// recodes koi to win
function koi_to_win($string) {

$kw = array( 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 , 158 , 159 , 160 , 161 , 162 , 163 , 164 , 165 , 166 , 167 , 168 , 169 , 170 , 171 , 172 , 173 , 174 , 175 , 176 , 177 , 178 , 179 , 180 , 181 , 182 , 183 , 184 , 185 , 186 , 187 , 188 , 189 , 190 , 191 , 254 , 224 , 225 , 246 , 228 , 229 , 244 , 227 , 245 , 232 , 233 , 234 , 235 , 236 , 237 , 238 , 239 , 255 , 240 , 241 , 242 , 243 , 230 , 226 , 252 , 251 , 231 , 248 , 253 , 249 , 247 , 250 , 222 , 192 , 193 , 214 , 196 , 197 , 212 , 195 , 213 , 200 , 201 , 202 , 203 , 204 , 205 , 206 , 207 , 223 , 208 , 209 , 210 , 211 , 198 , 194 , 220 , 219 , 199 , 216 , 221 , 217 , 215 , 218 );
$wk = array( 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 , 158 , 159 , 160 , 161 , 162 , 163 , 164 , 165 , 166 , 167 , 168 , 169 , 170 , 171 , 172 , 173 , 174 , 175 , 176 , 177 , 178 , 179 , 180 , 181 , 182 , 183 , 184 , 185 , 186 , 187 , 188 , 189 , 190 , 191 , 225 , 226 , 247 , 231 , 228 , 229 , 246 , 250 , 233 , 234 , 235 , 236 , 237 , 238 , 239 , 240 , 242 , 243 , 244 , 245 , 230 , 232 , 227 , 254 , 251 , 253 , 255 , 249 , 248 , 252 , 224 , 241 , 193 , 194 , 215 , 199 , 196 , 197 , 214 , 218 , 201 , 202 , 203 , 204 , 205 , 206 , 207 , 208 , 210 , 211 , 212 , 213 , 198 , 200 , 195 , 222 , 219 , 221 , 223 , 217 , 216 , 220 , 192 , 209 );

    $end = strlen($string);
    $pos = 0;
    do {
        $c = ord($string[$pos]);
        if ($c > 128) {
            $string[$pos] = chr($kw[$c - 128]);
        }
        $pos++;
    } while ($pos < $end);

    return $string;
}

function recode($str) {

    $enc = detect_encoding($str);
    if ($enc == 1) {
        $str = koi_to_win($str);
    }

    return $str;
}

Is it possible to convert a file to UTF-8 on my end?

If I have access to the file after it is submitted with

Note: the user may upload a CSV file in any encoding; I usually run into unknown 8-bit encodings.

But the problem is that this code removes special characters, such as single quotes.

I’ve included this as additional information. Thanks to anyone who can help!

Solution

Try this.
The example I used was something I did in a test environment, so you may need to adjust the code a little.

I had a text file containing the following data:

Then I had a form that took the file and ran the following code:

The neatify_files function is something I wrote to make the layout of the $_FILES array more logical.

The form is a standard form that simply POSTs the data to the server.
Note: using $_SERVER["PHP_SELF"] is a security risk, see here.

When the data is posted, I store the file in a variable. Obviously, if you use the multiple attribute, your code will not look quite like this.

$handle holds the entire contents of the text file, opened read-only; hence the "r" argument.

$enc uses the mb_detect_encoding function to determine the encoding (duh).
At first I had trouble getting the right encoding. Setting encoding_list to only UTF-8 and strict to true resolved this.

If the encoding is UTF-8, I simply print the string; if not, I convert it to UTF-8 using the iconv function.
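
The answer's own code is not shown here, so the steps it describes can only be sketched. This is a minimal version under stated assumptions: the helper name `to_utf8` and its `$fallbackCharset` parameter are hypothetical, and `CP1252` is just one plausible guess for an unknown 8-bit source. The caller still has to pick the legacy charset.

```php
<?php
// Hypothetical helper following the steps described above.
// $fallbackCharset is an assumption: the real source charset must be known or guessed.
function to_utf8(string $data, string $fallbackCharset = 'CP1252'): string
{
    // Strict detection against a one-item list: yields "UTF-8" only when
    // the bytes form valid UTF-8, otherwise false.
    $enc = mb_detect_encoding($data, 'UTF-8', true);
    if ($enc === 'UTF-8') {
        return $data;
    }

    $converted = iconv($fallbackCharset, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallbackCharset");
    }
    return $converted;
}

// 0x92 is the curly apostrophe in Windows-1252; it survives as U+2019.
echo to_utf8("It\x92s fine"), PHP_EOL;
```

Converting from a named charset, rather than stripping bytes, is what keeps characters such as typographic quotes from disappearing.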

Other solutions

Before you can convert it to UTF-8, you need to know which character set it is in.
If you cannot figure that out, there is no sane way to convert it to UTF-8.
However, an insane way to convert it to UTF-8 when the encoding cannot be determined
is simply to remove any bytes that are not valid in UTF-8; you
may be able to use that as a fallback.

Warning: untested code (I am suddenly in a hurry), but it might look something like this:
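
The answer's code is not shown here. One way such a last-resort fallback could look is sketched below: it keeps only byte sequences that form well-formed UTF-8 and drops everything else. The function name `strip_invalid_utf8` is hypothetical, and this is not the answer's original code.

```php
<?php
// Hypothetical helper: keep only well-formed UTF-8 sequences, drop other bytes.
function strip_invalid_utf8(string $bytes): string
{
    $pattern = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    /x';

    // Every match is one valid character; bytes between matches are skipped.
    preg_match_all($pattern, $bytes, $matches);
    return implode('', $matches[0]);
}

echo strip_invalid_utf8("ab\xFFc"), PHP_EOL; // the stray 0xFF byte is dropped
```

This loses information by design, so it only makes sense when the source encoding truly cannot be determined.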

You can convert the file’s text to binary data using the following

After converting the data to binary, you simply pass the text to the PHP function mb_convert_encoding($fileText, "UTF-8");

(PHP 4 >= 4.0.5, PHP 5, PHP 7)

iconv: Convert a string to the requested character encoding

Description

Converts the character set of the string str from in_charset to out_charset.

Parameters

in_charset: The input charset.

out_charset: The output charset.

If you append the string //TRANSLIT to out_charset, transliteration is activated. This means that when a character cannot be represented in the target charset, it is approximated by one or several similar-looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are discarded. Otherwise, an E_NOTICE is generated and the function returns FALSE.

Whether and how //TRANSLIT works depends on the system’s iconv() implementation (see ICONV_IMPL). Some implementations are known to ignore //TRANSLIT, so the conversion is likely to fail for characters that are illegal in out_charset.

str: The string to be converted.

Return Values

Returns the converted string, or FALSE on failure.

Changelog

Version Description
5.4.0 Since this version, the function returns FALSE on illegal characters unless //IGNORE is specified in the output charset. Before that, it returned a partial string.
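
Because failure is reported as FALSE rather than as an exception, the return value has to be compared with `===`. This is a small sketch of that check; the wrapper name `iconv_or_null` is hypothetical.

```php
<?php
// Hypothetical wrapper: turn iconv()'s false-on-failure into null.
function iconv_or_null(string $inCharset, string $outCharset, string $text): ?string
{
    // "@" silences the E_NOTICE that accompanies a failed conversion.
    $result = @iconv($inCharset, $outCharset, $text);

    // Strict comparison matters: an empty string is a valid result, false is not.
    return $result === false ? null : $result;
}

var_dump(iconv_or_null('ISO-8859-1', 'UTF-8', "caf\xE9"));   // converted string
var_dump(iconv_or_null('UTF-8', 'UTF-16LE', "\xFF"));        // invalid UTF-8 input
```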

Examples

Example #1 iconv() example

<?php
$text = "This is the Euro symbol '€'.";

echo 'Original : ', $text, PHP_EOL;
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain    : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
?>

The above example will output something similar to:

User Contributed Notes 39 notes

The "//ignore" option doesn’t work with recent versions of the iconv library. So if you’re having trouble with that option, you aren’t alone.

That means you can’t currently use this function to filter invalid characters. Instead it silently fails and returns an empty string (or you’ll get a notice but only if you have E_NOTICE enabled).

This has been a known bug with a known solution since at least 2009, but no one seems to be willing to fix it (PHP must pass the -c option to iconv). It’s still broken as of the latest release, 5.4.3.

ini_set('mbstring.substitute_character', "none");
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

That will strip invalid characters from UTF-8 strings (so that you can insert it into a database, etc.). Instead of "none" you can also use the value 32 if you want it to insert spaces in place of the invalid characters.
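
The same workaround can be scoped to a single call, so the setting does not leak into the rest of the script. This is a sketch using `mb_substitute_character()`, the run-time accessor for that directive; the helper name `drop_invalid_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: drop bytes that are not valid UTF-8.
function drop_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // invalid sequences are dropped, not replaced
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore it for the rest of the script
    return $clean;
}

echo drop_invalid_utf8("abc\xFFdef"), PHP_EOL;
```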

Please note that iconv('UTF-8', 'ASCII//TRANSLIT', ...) doesn’t work properly when locale category LC_CTYPE is set to C or POSIX. You must choose another locale, otherwise all non-ASCII characters will be replaced with question marks. This is at least true with glibc 2.5.

Example:
<?php
setlocale(LC_CTYPE, 'POSIX');
echo iconv('UTF-8', 'ASCII//TRANSLIT', "Žluťoučký kůň\n");
// ?lu?ou?k? k??

setlocale(LC_CTYPE, 'cs_CZ');
echo iconv('UTF-8', 'ASCII//TRANSLIT', "Žluťoučký kůň\n");
// Zlutoucky kun
?>

Interestingly, setting different target locales results in different, yet appropriate, transliterations. For example:

<?php
//some German
$utf8_sentence = 'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz';

//UK
setlocale(LC_ALL, 'en_GB');

//transliterate
$trans_sentence = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8_sentence);

//gives [Weiss, Goldmann, Gobel, Weiss, Gothe, Goethe und Gotz]
//which is our original string flattened into 7-bit ASCII as
//an English speaker would do it (ie. simply remove the umlauts)
echo $trans_sentence . PHP_EOL;

//Germany
setlocale(LC_ALL, 'de_DE');

$trans_sentence = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8_sentence);

//gives [Weiss, Goldmann, Goebel, Weiss, Goethe, Goethe und Goetz]
//which is exactly how a German would transliterate those
//umlauted characters if forced to use 7-bit ASCII!
//(because really ä = ae, ö = oe and ü = ue)
echo $trans_sentence . PHP_EOL;
?>

To test different conversions between charsets (when we don’t know the source charset or which destination charset is suitable), here is an example:

<?php
$tab = array("UTF-8", "ASCII", "Windows-1252", "ISO-8859-15", "ISO-8859-1", "ISO-8859-6", "CP1256");
$chain = "";
foreach ($tab as $i)
{
    foreach ($tab as $j)
    {
        $chain .= " $i$j " . iconv($i, $j, "$my_string");
    }
}

echo $chain;
?>

Then, after displaying, use the $i$j pair whose output displays correctly.
NB: you can add other charsets to $tab to test other cases.

If you are getting question marks in your iconv output when transliterating, be sure to ‘setlocale’ to something your system supports.

Some PHP CMSs default setlocale to ‘C’, which can be a problem.

Use the "locale" command to get a list.

<?php
setlocale(LC_CTYPE, 'en_AU.utf8');
$str = iconv('UTF-8', 'ASCII//TRANSLIT', "Côte d'Ivoire");
?>

Like many other people, I have encountered massive problems when using iconv() to convert between encodings (from UTF-8 to ISO-8859-15 in my case), especially on large strings.

The main problem here is that when your string contains illegal UTF-8 characters, there is no really straightforward way to handle those. iconv() simply (and silently!) terminates the string when encountering the problematic characters (also if using //IGNORE), returning a clipped string. The

<?php
$newstring = html_entity_decode(htmlentities($oldstring, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'ISO-8859-15');
?>

workaround suggested here and elsewhere will also break when encountering illegal characters, at least dropping a useful note ("htmlentities(): Invalid multibyte sequence in argument in...")

I have found a lot of hints, suggestions and alternative methods (it’s scary, and in my opinion not a good sign, how many ways PHP natively provides to convert the encoding of strings), but none of them really worked, except for this one:

<?php
$newstring = mb_convert_encoding($oldstring, 'ISO-8859-15', 'UTF-8');
?>

There may be situations when a new version of a web site, all in UTF-8, has to display some old data remaining in the database with ISO-8859-1 accents. The problem is iconv("ISO-8859-1", "UTF-8", $string) should not be applied if $string is already UTF-8 encoded.

I use this function that doesn’t need any extension:

function convert_utf8($string) {
    if (strlen(utf8_decode($string)) == strlen($string)) {
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

I have not tested it extensively, hope it may help.

For those who have trouble displaying UCS-2 data in a browser, here’s a simple function that converts UCS-2 to HTML Unicode entities:

<?php
function ucs2html($str) {
    $str = trim($str); // if you are reading from file
    $len = strlen($str);
    $html = '';
    for ($i = 0; $i < $len; $i += 2)
        $html .= '&#' . hexdec(dechex(ord($str[$i + 1])) .
            sprintf("%02s", dechex(ord($str[$i])))) . ';';
    return($html);
}
?>

In my case, I had to change:
<?php
setlocale(LC_CTYPE, 'cs_CZ');
?>
to
<?php
setlocale(LC_CTYPE, 'cs_CZ.UTF-8');
?>
Otherwise it returns question marks.

When I asked my Linux system for the locale (with the locale command) it returned "cs_CZ.UTF-8", so maybe there is a correlation.

iconv (GNU libc) 2.6.1
glibc 2.3.6

Here is how to convert UCS-2 numbers to UTF-8 numbers in hex:

<?php
function ucs2toutf8($str)
{
    $utf8 = "";
    for ($i = 0; $i < strlen($str); $i += 4)
    {
        $substring1 = $str[$i] . $str[$i + 1];
        $substring2 = $str[$i + 2] . $str[$i + 3];

        if ($substring1 == "00")
        {
            $byte1 = "";
            $byte2 = $substring2;
        }
        else
        {
            $substring = $substring1 . $substring2;
            // intdiv() keeps the value an integer, as dechex() expects
            $byte1 = dechex(192 + intdiv(hexdec($substring), 64));
            $byte2 = dechex(128 + (hexdec($substring) % 64));
        }
        $utf8 .= $byte1 . $byte2;
    }
    return $utf8;
}

echo strtoupper(ucs2toutf8("06450631062D0020"));
?>

Input:
06450631062D
Output:
D985D8B1D8AD
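
Where mbstring is available, the same hex-to-hex conversion can be sketched without manual bit arithmetic, and it then also covers code points that need three UTF-8 bytes. The function name `ucs2hex_to_utf8hex` is hypothetical, and the input is assumed to be big-endian UCS-2 written as hex digits, as in the note above.

```php
<?php
// Hypothetical helper: hex-encoded UCS-2 (big-endian) to hex-encoded UTF-8.
function ucs2hex_to_utf8hex(string $hex): string
{
    $bytes = hex2bin($hex);   // raw UCS-2BE bytes
    if ($bytes === false) {
        throw new InvalidArgumentException('Input is not valid hex');
    }
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UCS-2BE');
    return strtoupper(bin2hex($utf8));
}

echo ucs2hex_to_utf8hex('06450631062D'), PHP_EOL; // D985D8B1D8AD
```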

I have used iconv to convert from cp1251 into UTF-8. I spent a day investigating why a string with a Russian capital ‘Р’ (sounds similar to ‘r’) at the end could not be inserted into a database.

The problem is not in iconv. But ‘Р’ in cp1251 is chr(208) and ‘Р’ in UTF-8 is chr(208).chr(160). chr(160) is one of the space symbols that match ‘\s’ in regex. So it can be taken by a greedy ‘+’ or ‘*’ operator. In that case, you lose ‘Р’ in your string.

For example, ‘ГР ‘ (Russian, UTF-8). Function preg_match. Regex is ‘(.+?)[\s]*’. Then ‘(.+?)’ matches ‘Г’.chr(208) and ‘[\s]*’ matches chr(160).’ ‘.

Although it is not a bug in iconv, it looks very much like one. That’s why I put this comment here.
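
A common way to avoid this kind of split is the `u` pattern modifier, which makes PCRE treat both pattern and subject as UTF-8. The sketch below is an illustration, not the commenter's code: the pattern is anchored with `^` and `$`, which is an assumption added so that the lazy group has something to stop at.

```php
<?php
// "ГР " in UTF-8: Г = D0 93, Р = D0 A0, followed by an ASCII space.
$text = "\xD0\x93\xD0\xA0 ";

// With /u, "." and "\s" operate on whole characters, so the 0xA0 byte
// inside "Р" can never be consumed on its own as whitespace.
preg_match('/^(.+?)\s*$/u', $text, $m);

echo bin2hex($m[1]), PHP_EOL; // d093d0a0: both letters intact
```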

Here is how to convert UTF-8 numbers to UCS-2 numbers in hex:

<?php
function utf8toucs2($str)
{
    $ucs2 = "";
    for ($i = 0; $i < strlen($str); $i += 2)
    {
        $substring1 = substr($str, $i, 2);
        $substring2 = substr($str, $i + 2, 2);

        if (hexdec($substring1) < 127)
            $results = "00" . $substring1;
        else
        {
            $results = dechex((hexdec($substring1) - 192) * 64 + (hexdec($substring2) - 128));
            // left-pad to four hex digits
            $results = str_pad($results, 4, "0", STR_PAD_LEFT);
            $i += 2;
        }
        $ucs2 .= $results;
    }
    return $ucs2;
}

echo strtoupper(utf8toucs2("D985D8B1D8AD")) . "\n";
echo strtoupper(utf8toucs2("456725")) . "\n";
?>

I just found out today that the Windows and *NIX versions of PHP use different iconv libraries and are not very consistent with each other.

Here is a repost of my earlier code that now works on more systems. It converts as much as possible and replaces the rest with question marks:

<?php
if (!function_exists('utf8_to_ascii')) {
    setlocale(LC_CTYPE, 'en_AU.utf8');
    if (@iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", 'é') === false) {
        // PHP is probably using the glibc library (*NIX)
        function utf8_to_ascii($text) {
            return iconv("UTF-8", "ASCII//TRANSLIT", $text);
        }
    }
    else {
        // PHP is probably using the libiconv library (Windows)
        function utf8_to_ascii($text) {
            if (is_string($text)) {
                // Includes combinations of characters that present as a single glyph
                $text = preg_replace_callback('/\X/u', __FUNCTION__, $text);
            }
            elseif (is_array($text) && count($text) == 1 && is_string($text[0])) {
                // IGNORE characters that can't be TRANSLITerated to ASCII
                $text = iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text[0]);
                // The documentation says that iconv() returns false on failure but it returns ''
                if ($text === '' || !is_string($text)) {
                    $text = '?';
                }
                elseif (preg_match('/\w/', $text)) { // If the text contains any letters...
                    $text = preg_replace('/\W+/', '', $text); // ...then remove all non-letters
                }
            }
            else { // $text was not a string
                $text = '';
            }
            return $text;
        }
    }
}
?>

I don’t know whether it’s a feature or not, but it works for me (PHP 5.0.4).

I tested it to convert from windows-1251 (stored in the DB) to UTF-8 (which I use for web pages).
BTW, I convert each array I fetch from the DB with array_walk_recursive.

Here is an example how to convert windows-1251 (windows) or cp1251(Linux/Unix) encoded string to UTF-8 encoding.

function cp1251_utf8($sInput)
{
    $sOutput = "";

    for ($i = 0; $i < strlen($sInput); $i++)
    {
        $iAscii = ord($sInput[$i]);
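
The rest of the function above is not shown. Where mbstring is available, a Windows-1251 to UTF-8 conversion of fetched rows can be sketched in a few lines. This is an alternative, not a completion of the function above; the `$rows` data is invented sample input, and `array_walk_recursive` is used as the earlier note suggests.

```php
<?php
// Assumed sample data: "Привет" as Windows-1251 bytes, nested like a fetched row set.
$rows = [
    ['title' => "\xCF\xF0\xE8\xE2\xE5\xF2", 'id' => 7],
];

// Convert every string leaf in place; non-string values are left untouched.
array_walk_recursive($rows, function (&$value) {
    if (is_string($value)) {
        $value = mb_convert_encoding($value, 'UTF-8', 'Windows-1251');
    }
});

echo $rows[0]['title'], PHP_EOL;
```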

Be aware that iconv in PHP uses system implementations of locales and languages; what works under Linux normally doesn’t on Windows.

Also, you may notice that on recent versions of Linux (Debian, Ubuntu, CentOS, etc.) the //TRANSLIT option doesn’t work, since most distros don’t include the intl packages (for example, php5-intl and icuxx, where xx is a number, in Debian) by default. This is because the intl package conflicts with another package needed for international DNS resolution.

The problem is that the configuration depends on the sysadmin of the machine where you’re hosted, so iconv is pretty much useless by default, depending on what configuration is used by your distro or the machine’s admin.

iconv with //IGNORE works as expected: it will skip the character if this one does not exist in the $out_charset encoding.

If a character is missing from the $in_charset encoding (eg byte x81 from CP1252 encoding), then iconv will return an error, whether with //IGNORE or not.

For transcoding values in an Excel-generated CSV the following seems to work:

<?php
$value = iconv('Windows-1252', 'UTF-8//TRANSLIT', $value);
?>

Note an important difference between iconv() and mb_convert_encoding() — if you’re working with strings, as opposed to files, you most likely want mb_convert_encoding() and not iconv(), because iconv() will add a byte-order marker to the beginning of (for example) a UTF-32 string when converting from e.g. ISO-8859-1, which can throw off all your subsequent calculations and operations on the resulting string.

In other words, iconv() appears to be intended for use when converting the contents of files — whereas mb_convert_encoding() is intended for use when juggling strings internally, e.g. strings that aren’t being read/written to/from files, but exchanged with some other media.
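
If iconv() has to be used for in-memory strings anyway, one way to sidestep the byte-order-mark question is to name the byte order explicitly in the target charset. A minimal sketch, assuming the local iconv library knows the name `UTF-32BE`:

```php
<?php
// Naming the byte order explicitly ("UTF-32BE" rather than "UTF-32")
// leaves no room for a byte-order mark in the result.
$viaIconv = iconv('ISO-8859-1', 'UTF-32BE', 'A');
$viaMb    = mb_convert_encoding('A', 'UTF-32BE', 'ISO-8859-1');

echo strlen($viaIconv), ' ', strlen($viaMb), PHP_EOL; // one character, four bytes each
echo bin2hex($viaIconv), PHP_EOL;
```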

‘" to the output.
This function will strip out these extra characters:
<?php
setlocale(LC_ALL, 'en_US.UTF8');
function clearUTF($s)
{
    $r = '';
    $s1 = @iconv('UTF-8', 'ASCII//TRANSLIT', $s);
    $j = 0;
    for ($i = 0; $i < strlen($s1); $i++) {
        $ch1 = $s1[$i];
        $ch2 = @mb_substr($s, $j++, 1, 'UTF-8');
if ( strstr ( ‘`^

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}
