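English words can be counted directly with PHP's built-in str_word_count function. That function is of no help with Chinese text, however, and cannot count Chinese characters accurately.
The solution is to implement Chinese character counting, and combined Chinese and English word counting, yourself, based on how Chinese characters are encoded. For the encodings, refer to the Unicode code charts and to the GB2312 row-cell codes, code table, and encoding rules.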
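To make the limitation concrete, here is a minimal sketch that runs str_word_count on an English phrase and, separately, counts Chinese characters by Unicode range. The sample strings and variable names are made up for illustration, and the script is assumed to be saved as UTF-8.

```php
<?php
// Sample strings are illustrative; this file is assumed to be saved as UTF-8.
$english = "The quick brown fox";
$chinese = "統計漢字個數";

// str_word_count splits on non-letter bytes, which suits English text.
$englishWords = str_word_count($english);

// Count Han characters by Unicode range instead. The /u flag makes PCRE
// read the subject as UTF-8 code points rather than single bytes.
$hanCount = preg_match_all('/[\x{4e00}-\x{9fff}]/u', $chinese, $matches);

echo "English words: " . $englishWords . "\n";   // English words: 4
echo "Han characters: " . $hanCount . "\n";      // Han characters: 6
```

What str_word_count returns for the Chinese string depends on the bytes and the locale, which is why the sketch does not rely on it.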
For GB2312-encoded strings, use the following functions:
<?php
define( "GB2312_CHINESE_PATTERN", "/[\xb0-\xfe][\xa0-\xfe]/" );
define( "GB2312_SYMBOL_PATTERN", "/[\xa1-\xa3][\xa0-\xfe]/" );
// count only chinese words
function str_gb2312_chinese_word_count($str = ""){
$str = preg_replace(GB2312_SYMBOL_PATTERN, "", $str);
return preg_match_all(GB2312_CHINESE_PATTERN, $str, $arr);
}
// count both chinese and english
function str_gb2312_mix_word_count($str = ""){
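// replace with a space rather than "", so English words on either side of a removed character do not merge
$str = preg_replace(GB2312_SYMBOL_PATTERN, " ", $str);
return str_gb2312_chinese_word_count($str) + str_word_count(preg_replace(GB2312_CHINESE_PATTERN, " ", $str));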
}
?>
For UTF-8 encoded strings, use the following functions:
<?php
define( "UTF8_CHINESE_PATTERN", "/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u" );
define( "UTF8_SYMBOL_PATTERN", "/[\x{ff00}-\x{ffef}\x{2000}-\x{206F}]/u" );
// count only chinese words
function str_utf8_chinese_word_count($str = ""){
$str = preg_replace(UTF8_SYMBOL_PATTERN, "", $str);
return preg_match_all(UTF8_CHINESE_PATTERN, $str, $arr);
}
// count both chinese and english
function str_utf8_mix_word_count($str = ""){
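// replace with a space rather than "", so English words on either side of a removed character do not merge
$str = preg_replace(UTF8_SYMBOL_PATTERN, " ", $str);
return str_utf8_chinese_word_count($str) + str_word_count(preg_replace(UTF8_CHINESE_PATTERN, " ", $str));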
}?>
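The two versions do the same thing; they differ only in being implemented for different character encodings, so choose the one that matches your page encoding. Each provides two functions: one counts only Chinese characters, and the other counts Chinese and English words together (number of Chinese characters plus number of English words). Neither Chinese nor English punctuation is included in the count.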
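As a worked example of the mixed count, the sketch below condenses the UTF-8 approach into one function and runs it on a sample string. The name count_mixed_words_utf8 and the sample text are hypothetical, and the function is a variant for illustration rather than the exact code from this post: it replaces removed characters with a space so that English words on either side are still counted separately.

```php
<?php
// Hypothetical helper for illustration; it reuses the UTF-8 ranges from this
// post but is not the post's exact function.
function count_mixed_words_utf8(string $text): int
{
    $han    = '/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u';
    $symbol = '/[\x{ff00}-\x{ffef}\x{2000}-\x{206f}]/u';

    // Replace punctuation with a space so neighbouring English words stay apart.
    $text = preg_replace($symbol, ' ', $text);

    // Every Han character counts as one word.
    $hanCount = preg_match_all($han, $text);

    // Blank out the Han characters, then let str_word_count handle the rest.
    $latinCount = str_word_count(preg_replace($han, ' ', $text));

    return $hanCount + $latinCount;
}

// 4 Han characters + "PHP", "count", "words" = 7
echo count_mixed_words_utf8("PHP統計漢字,count words") . "\n";
```

With an empty-string replacement instead of a space, "hello世界world" would collapse to "helloworld" and the two English words would be counted as one.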
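Special note: if Chinese punctuation is not removed first, the count will be wrong. For example, under GB2312 the two Chinese punctuation marks ":‘" are represented by the bytes a3baa1ae; the middle part, baa1, happens to be the GB2312 code for the character "骸", so it would be counted as one Chinese character and throw off the total.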
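The byte-level effect described in the note can be reproduced with a few lines. This is only a sketch of that one case: the variable names are made up, and the patterns are copied from the GB2312 constants in this post.

```php
<?php
// Bytes from the note above: A3BA is the full-width colon and A1AE is the
// left single quotation mark in GB2312.
$punct = "\xa3\xba\xa1\xae";

// Same patterns as the GB2312 constants defined in this post.
$chinesePattern = "/[\xb0-\xfe][\xa0-\xfe]/";
$symbolPattern  = "/[\xa1-\xa3][\xa0-\xfe]/";

// Without stripping punctuation, the match starts at the second byte and
// pairs BA with A1, which is a valid Chinese character code.
$naive = preg_match_all($chinesePattern, $punct, $m);

// Stripping punctuation first leaves nothing for the pattern to pair up.
$clean = preg_match_all($chinesePattern, preg_replace($symbolPattern, "", $punct), $m2);

echo "Without stripping: " . $naive . " (" . bin2hex($m[0][0]) . ")\n"; // 1 (baa1)
echo "With stripping: " . $clean . "\n";                                // 0
```

The sketch covers only this one input. Neither pattern is anchored to character boundaries, so other byte sequences may still pair up across two characters.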
For usage, refer to the following test page:
<?php
define( "GB2312_CHINESE_PATTERN", "/[\xb0-\xfe][\xa0-\xfe]/" );
define( "GB2312_SYMBOL_PATTERN", "/[\xa1-\xa3][\xa0-\xfe]/" );
// count only chinese words
function str_gb2312_chinese_word_count($str = ""){
$str = preg_replace(GB2312_SYMBOL_PATTERN, "", $str);
return preg_match_all(GB2312_CHINESE_PATTERN, $str, $matches);
}
// count both chinese and english
function str_gb2312_mix_word_count($str = ""){
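// replace with a space rather than "", so English words on either side of a removed character do not merge
$str = preg_replace(GB2312_SYMBOL_PATTERN, " ", $str);
return str_gb2312_chinese_word_count($str) + str_word_count(preg_replace(GB2312_CHINESE_PATTERN, " ", $str));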
}
define( "UTF8_CHINESE_PATTERN", "/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u" );
define( "UTF8_SYMBOL_PATTERN", "/[\x{ff00}-\x{ffef}\x{2000}-\x{206F}]/u" );
// count only chinese words
function str_utf8_chinese_word_count($str = ""){
$str = preg_replace(UTF8_SYMBOL_PATTERN, "", $str);
return preg_match_all(UTF8_CHINESE_PATTERN, $str, $matches);
}
// count both chinese and english
function str_utf8_mix_word_count($str = ""){
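// replace with a space rather than "", so English words on either side of a removed character do not merge
$str = preg_replace(UTF8_SYMBOL_PATTERN, " ", $str);
return str_utf8_chinese_word_count($str) + str_word_count(preg_replace(UTF8_CHINESE_PATTERN, " ", $str));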
}
// convert a string to hex-coding form
function binhex($str) {
$hex = "";
// $str[$i] replaces the $str{$i} offset syntax, which PHP 8 no longer accepts
for ($i = 0; $i < strlen($str); $i++) {
$hex .= sprintf("%02x", ord($str[$i]));
}
return $hex;
}
// isset avoids an undefined-index notice when no text was submitted
$text = isset($_REQUEST["text"]) ? $_REQUEST["text"] : "";
echo "Text: " . $text . "\n";
echo "Hex : " . binhex($text) . "\n";
// use one of the following two lines according to the page encoding
echo "Word count: " . str_gb2312_mix_word_count($text);
// echo "Word count: " . str_utf8_mix_word_count($text);
?>