
[PHP] Counting Chinese and English Words (GB2312/UTF-8 Encoding)

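English words can be counted directly with PHP's native str_word_count function. That function is of no help with Chinese characters, however, and cannot count them accurately.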
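As a small illustration of that limitation, the sketch below calls str_word_count on an English sentence and on a UTF-8 Chinese string. The sample strings are made up for this example. No output is claimed for the Chinese case, because str_word_count decides what counts as a letter from the current locale.

```php
<?php
// str_word_count() treats runs of letters (plus ' and -) as words.
$english = "Hello world from PHP";
echo str_word_count($english), "\n"; // 4

// Four Chinese characters in UTF-8, with no ASCII letters at all.
// What is printed here depends on the locale, but it is not a reliable
// count of the four characters, which is the problem the article addresses.
$chinese = "統計漢字";
echo str_word_count($chinese), "\n";
```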

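The solution is to implement the Chinese character count, and the combined Chinese and English word count, yourself, based on the encoding rules for Chinese characters. For the encodings, refer to the Unicode code charts and to the GB2312 row-cell (quwei) codes, code table, and encoding rules.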

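For GB2312-encoded strings, use the following functions:

<?php

// Two-byte GB2312 Chinese characters (lead byte, then trail byte).
define("GB2312_CHINESE_PATTERN", "/[\xb0-\xfe][\xa0-\xfe]/");
// Two-byte GB2312 punctuation and symbols (lead bytes 0xA1-0xA3).
define("GB2312_SYMBOL_PATTERN", "/[\xa1-\xa3][\xa0-\xfe]/");

// count only chinese words
function str_gb2312_chinese_word_count($str = "") {
    // Strip Chinese symbols first so their bytes cannot pair up into a false match.
    $str = preg_replace(GB2312_SYMBOL_PATTERN, "", $str);
    return preg_match_all(GB2312_CHINESE_PATTERN, $str, $arr);
}

// count both chinese and english
function str_gb2312_mix_word_count($str = "") {
    // Replace with a space rather than "" so that English words separated
    // only by Chinese symbols or characters are not glued into one word.
    $str = preg_replace(GB2312_SYMBOL_PATTERN, " ", $str);
    return str_gb2312_chinese_word_count($str)
        + str_word_count(preg_replace(GB2312_CHINESE_PATTERN, " ", $str));
}

?>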

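For UTF-8 encoded strings, use the following functions:

<?php

// CJK Unified Ideographs (U+4E00-U+9FFF) and CJK Compatibility Ideographs (U+F900-U+FAFF).
define("UTF8_CHINESE_PATTERN", "/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u");
// Halfwidth and Fullwidth Forms (U+FF00-U+FFEF) and General Punctuation (U+2000-U+206F).
define("UTF8_SYMBOL_PATTERN", "/[\x{ff00}-\x{ffef}\x{2000}-\x{206F}]/u");

// count only chinese words
function str_utf8_chinese_word_count($str = "") {
    $str = preg_replace(UTF8_SYMBOL_PATTERN, "", $str);
    return preg_match_all(UTF8_CHINESE_PATTERN, $str, $arr);
}

// count both chinese and english
function str_utf8_mix_word_count($str = "") {
    // Replace with a space rather than "" so that English words separated
    // only by Chinese symbols or characters are not glued into one word.
    $str = preg_replace(UTF8_SYMBOL_PATTERN, " ", $str);
    return str_utf8_chinese_word_count($str)
        + str_word_count(preg_replace(UTF8_CHINESE_PATTERN, " ", $str));
}

?>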
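The UTF-8 pattern above lists two code point ranges by hand. As a possible alternative, and not the article's method, PCRE also offers the Unicode script property \p{Han}, which covers ideographs outside those two ranges, such as CJK Extension A starting at U+3400. The helper name utf8_han_count below is hypothetical, and the sample strings are made up.

```php
<?php
// Hypothetical helper: counts characters that belong to the Unicode Han script.
function utf8_han_count(string $text): int
{
    // preg_match_all() returns the number of matches, or false on error
    // (for example, when $text is not valid UTF-8).
    $count = preg_match_all('/\p{Han}/u', $text);
    return $count === false ? 0 : $count;
}

echo utf8_han_count("你好, world"), "\n";          // 2
echo utf8_han_count("PHP 統計中英文單詞數"), "\n";  // 8

// U+3400 lies outside the hand-written ranges but is still a Han ideograph.
$extA = "\u{3400}";
echo preg_match_all('/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u', $extA), "\n"; // 0
echo utf8_han_count($extA), "\n";                                            // 1
```

Note that \p{Han} is broader than the article's ranges, so the two approaches can give different totals on the same text.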

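The two listings do the same job; they differ only in the character encoding they target, so choose the one that matches your page encoding. Each provides two functions: one counts only Chinese characters, and the other counts Chinese and English words together (number of Chinese characters + number of English words). Neither Chinese nor English punctuation marks are included in the count.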
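When the English words in a string are separated only by Chinese characters, the English part of the combined count depends on what the Chinese characters are replaced with before str_word_count runs. The sketch below uses the article's UTF-8 pattern; the constant and function names are hypothetical and the sample string is made up.

```php
<?php
// Same ranges as the article's UTF8_CHINESE_PATTERN, under a hypothetical name.
const HAN_PATTERN = '/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u';

// Hypothetical helper: count English words after swapping every Chinese
// character for $replacement.
function english_words_after_removal(string $text, string $replacement): int
{
    return str_word_count(preg_replace(HAN_PATTERN, $replacement, $text));
}

$text = "hello你好world";

$joined = english_words_after_removal($text, "");  // counts "helloworld"
$split  = english_words_after_removal($text, " "); // counts "hello  world"

echo $joined, "\n"; // 1
echo $split, "\n";  // 2
```

Replacing with a space keeps the two English words apart, which is usually what a mixed count is meant to report.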

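Special note: if Chinese punctuation is not removed first, the count goes wrong. For example, in GB2312 the two Chinese punctuation marks ":‘" are encoded as the bytes a3baa1ae, and the middle part, baa1, happens to be the GB2312 code for the character "骸". It would be counted as one Chinese character, producing a wrong count.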
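The misalignment described here comes from matching byte pairs without regard to character boundaries. A different technique, sketched below and not taken from the article, is to walk the string one character at a time so that a trail byte is never read as a lead byte. The function name gb2312_count_hanzi_aligned is hypothetical, and the sketch assumes strict GB2312 ranges: Chinese characters have lead bytes 0xB0-0xF7 and trail bytes 0xA1-0xFE.

```php
<?php
// Hypothetical helper: count GB2312 Chinese characters while staying
// aligned to two-byte character boundaries.
function gb2312_count_hanzi_aligned(string $bytes): int
{
    $count = 0;
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $lead = ord($bytes[$i]);
        // ASCII, an out-of-range byte, or a dangling last byte: step one byte.
        if ($lead < 0xA1 || $lead > 0xFE || $i + 1 >= $len) {
            $i++;
            continue;
        }
        $trail = ord($bytes[$i + 1]);
        // Not a valid trail byte: treat the lead byte as stray and move on.
        if ($trail < 0xA1 || $trail > 0xFE) {
            $i++;
            continue;
        }
        // Assumed hanzi rows: 0xB0-0xF7. Lower rows hold symbols and other scripts.
        if ($lead >= 0xB0 && $lead <= 0xF7) {
            $count++;
        }
        $i += 2; // Consume the whole two-byte character.
    }
    return $count;
}

$punct = "\xa3\xba\xa1\xae"; // the two punctuation marks from the note above
$naive = preg_match_all("/[\xb0-\xfe][\xa0-\xfe]/", $punct);

echo $naive, "\n";                                           // 1 (false match on ba a1)
echo gb2312_count_hanzi_aligned($punct), "\n";               // 0
echo gb2312_count_hanzi_aligned("\xc4\xe3\xba\xc3"), "\n";   // 2
```

The trade-off is a hand-written loop in place of two regular expressions, in exchange for not relying on punctuation being stripped first.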

The following test page shows how the functions are used:

<?php

define("GB2312_CHINESE_PATTERN", "/[\xb0-\xfe][\xa0-\xfe]/");
define("GB2312_SYMBOL_PATTERN", "/[\xa1-\xa3][\xa0-\xfe]/");

// count only chinese words
function str_gb2312_chinese_word_count($str = "") {
    $str = preg_replace(GB2312_SYMBOL_PATTERN, "", $str);
    return preg_match_all(GB2312_CHINESE_PATTERN, $str, $arr);
}

// count both chinese and english
function str_gb2312_mix_word_count($str = "") {
    // Replace with a space rather than "" so English words are not glued together.
    $str = preg_replace(GB2312_SYMBOL_PATTERN, " ", $str);
    return str_gb2312_chinese_word_count($str)
        + str_word_count(preg_replace(GB2312_CHINESE_PATTERN, " ", $str));
}

define("UTF8_CHINESE_PATTERN", "/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u");
define("UTF8_SYMBOL_PATTERN", "/[\x{ff00}-\x{ffef}\x{2000}-\x{206F}]/u");

// count only chinese words
function str_utf8_chinese_word_count($str = "") {
    $str = preg_replace(UTF8_SYMBOL_PATTERN, "", $str);
    return preg_match_all(UTF8_CHINESE_PATTERN, $str, $arr);
}

// count both chinese and english
function str_utf8_mix_word_count($str = "") {
    // Replace with a space rather than "" so English words are not glued together.
    $str = preg_replace(UTF8_SYMBOL_PATTERN, " ", $str);
    return str_utf8_chinese_word_count($str)
        + str_word_count(preg_replace(UTF8_CHINESE_PATTERN, " ", $str));
}

// convert a string to hex-coding form
function binhex($str) {
    $hex = "";
    // $str{$i} was removed in PHP 8, so index with square brackets.
    for ($i = 0; $i < strlen($str); $i++) {
        $hex .= sprintf("%02x", ord($str[$i]));
    }
    return $hex;
}

// isset() avoids a warning when the "text" parameter is missing.
$text = isset($_REQUEST["text"]) ? $_REQUEST["text"] : "";

echo "Text: " . $text . "\n";
echo "Hex : " . binhex($text) . "\n";

// use one of the following two lines according to the page encoding
echo "Word count: " . str_gb2312_mix_word_count($text);
// echo "Word count: " . str_utf8_mix_word_count($text);

?>
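For the hex dump alone, PHP's built-in bin2hex function gives the same lowercase two-digits-per-byte form as the binhex helper in the test page. A minimal sketch, using the punctuation bytes from the GB2312 example as input:

```php
<?php
// The two fullwidth punctuation marks from the GB2312 example, as raw bytes.
$bytes = "\xa3\xba\xa1\xae";

echo bin2hex($bytes), "\n"; // a3baa1ae
```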

Unless otherwise noted, all articles on this blog are original.