
04. Language detection for microblog messages

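The idea is to wrap Google's language-detection AJAX web service: pass in a piece of text and get back its language. I picked this approach up from rssmeme.com, and in testing it works reasonably well. It can be used to detect the language of microblog messages, such as Chinese, Japanese, or Korean. Note, however, that Google resets the connection when requests arrive too frequently, so this web service is not suited to large volumes of densely packed requests.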
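Since bursts of requests get their connections reset, a caller may want to space requests out. The following is only a minimal sketch, assuming a fixed minimum interval between calls is enough. The `Throttle` class and its parameters are hypothetical names introduced here, and the right interval is not something Google documents in this post.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested offline
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each detection request keeps the request rate below one per `min_interval` seconds.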

Visit the link

<a href="http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&amp;q=hello+world">http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&amp;q=hello+world</a>

and you will see that the result returned is a JSON string:

{"responseData": {"language":"en","isReliable":false,"confidence":0.114892714}, "responseDetails": null, "responseStatus": 200}

Remember to include the version parameter, v=1.0. Without it, the following JSON is returned:

{"responseData": null, "responseDetails": "invalid version", "responseStatus": 400}
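The request URL and the two kinds of response can be sketched in a few lines. This is a minimal sketch, assuming the service uses the camelCase key names `responseData`, `responseDetails`, and `responseStatus`. The function names `build_detect_url` and `parse_detect_response` are illustrative and not part of any Google library.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the URL shown in this post.
DETECT_URL = "http://ajax.googleapis.com/ajax/services/language/detect"


def build_detect_url(text, version="1.0"):
    """Return the GET URL for one detection request; the v parameter is required."""
    return DETECT_URL + "?" + urlencode({"v": version, "q": text})


def parse_detect_response(body):
    """Return the responseData dict, or raise ValueError if the request failed."""
    result = json.loads(body)
    if result.get("responseStatus") != 200 or result.get("responseData") is None:
        # On failure the service explains itself in responseDetails.
        raise ValueError("detection failed: %s" % result.get("responseDetails"))
    return result["responseData"]
```

Raising on a non-200 status keeps an "invalid version" reply from being mistaken for a detection result.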

For example, the microblog message sent for detection is:

After URL-encoding, it is submitted to Google, and the result returned is:

{"responseData": {"language":"ja","isReliable":true,"confidence":0.88555187}, "responseDetails": null, "responseStatus": 200}

This way, result['responseData']['language'] gives the language code.

If this code is not "zh-CN", the message is not in Simplified Chinese.
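The check on the language code can be sketched as a small function that works on the JSON text directly. This is a sketch under assumptions: `is_simplified_chinese` is a hypothetical name, the keys are assumed to be camelCase, and the comparison is case-insensitive so that both `zh-cn` and `zh-CN` match.

```python
import json


def is_simplified_chinese(body):
    """Return True if the response body reports zh-CN, False otherwise."""
    data = json.loads(body).get("responseData")
    if not data:
        # A failed request carries null responseData; treat it as "not Chinese".
        return False
    return data.get("language", "").lower() == "zh-cn"
```

Fed the Japanese response shown above, this returns False.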

Demo:

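import json
from urllib.parse import urlencode  # Python 3 location of urlencode

import httplib2

try:
    from base import easyjson  # the author's own JSON helper, if present
except ImportError:
    easyjson = None  # fall back to the standard json module


class detect(object):
    # Endpoint taken from the URL shown earlier in this post.
    google_api_prefix = "http://ajax.googleapis.com/ajax/services/language/detect"

    def __init__(self, httplib2_inst=None):
        """An httplib2 instance can be passed in, so a proxy can be set up outside to get past the firewall."""
        self.http = httplib2_inst or httplib2.Http()

    def post_sentence(self, q):
        # The version parameter v=1.0 is required by the service.
        return self._fetch(self.google_api_prefix, {'v': "1.0", 'q': q})

    def _fetch(self, url, params):
        request = url + "?" + urlencode(params)
        resp, content = self.http.request(request, "GET")
        if easyjson is not None:
            return easyjson.parse_json_func(content)
        return json.loads(content.decode("utf-8"))

    def detectzhcn(self, text):
        """Return True if the text is detected as zh-CN, otherwise False."""
        data = self.post_sentence(text)['responseData']
        if data:
            # Compare case-insensitively so both "zh-cn" and "zh-CN" match.
            return data['language'].lower() == 'zh-cn'
        return False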
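Because the HTTP object is passed in from outside, the detection logic can also be exercised without touching the network. The following is a minimal sketch under assumptions: `FakeHttp` is a hypothetical stand-in that mimics only the `request(url, method)` call returning a `(response, content)` pair, and `detect_language` is an illustrative function rather than the class in the demo.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the URL shown in this post.
DETECT_URL = "http://ajax.googleapis.com/ajax/services/language/detect"


class FakeHttp:
    """Hypothetical stand-in for an HTTP client; returns a canned body."""

    def __init__(self, body):
        self.body = body
        self.requested = []  # (url, method) pairs seen, for inspection

    def request(self, url, method):
        self.requested.append((url, method))
        return {"status": "200"}, self.body.encode("utf-8")


def detect_language(http, text):
    """Return the detected language code, or None if detection failed."""
    url = DETECT_URL + "?" + urlencode({"v": "1.0", "q": text})
    _resp, content = http.request(url, "GET")
    data = json.loads(content.decode("utf-8")).get("responseData")
    return data["language"] if data else None
```

Swapping `FakeHttp` for a real client object with the same `request` method, for example one configured with a proxy, leaves `detect_language` unchanged.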