
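Forcing the response character encoding when using Apache HttpClient 4.0

A couple of days ago a script of mine that calls an HTTP service broke. On closer inspection, it turned out the service was sending these response headers: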

HTTP/1.1 200 OK
Server: xxxxxxxxxx
Content-Type: text/html; charset=utf-8
Connection: close
Content-Length:2014
           

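The response headers declare a Content-Type with charset=utf-8, but the response body is actually GBK-encoded. That mismatch is what broke my request script.

The script depends on Apache HttpClient as follows: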
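
To make the mismatch concrete, here is a minimal sketch in plain Java (JDK only, no HttpClient involved). It assumes the JDK in use ships the GBK charset; the class name, method name, and sample text are made up for illustration.

```java
import java.nio.charset.Charset;

class CharsetMismatchDemo {
    // Decodes raw body bytes with the named charset (hypothetical helper).
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Sample text "中文", written with escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        // What the server actually sends: GBK-encoded bytes.
        byte[] body = original.getBytes(Charset.forName("GBK"));

        // Trusting the declared charset=utf-8 garbles the text.
        System.out.println(decode(body, "UTF-8").equals(original)); // prints "false"
        // Decoding with the real encoding recovers it.
        System.out.println(decode(body, "GBK").equals(original));   // prints "true"
    }
}
```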

pom.xml:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.0</version>
</dependency>
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpcore</artifactId>
  <version>4.0.1</version>
</dependency>
           

The original script used [url=http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/DefaultHttpClient.html]DefaultHttpClient[/url] to send the request and, with the help of [url=http://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/util/EntityUtils.html]EntityUtils[/url], implemented its own [url=http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/ResponseHandler.html]ResponseHandler[/url], similar to [url=http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/BasicResponseHandler.html]BasicResponseHandler[/url]. It looked roughly like this:

import org.apache.http.client.HttpResponseException
import org.apache.http.client.ResponseHandler
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.util.EntityUtils

def httpClient = new DefaultHttpClient()

// Builds a ResponseHandler that returns the response body as a String.
// charset is only a default: a charset declared in Content-Type takes precedence.
def makeResponseHandler(charset) {
  return { response ->
      def statusLine = response.statusLine
      // Treat any status code of 300 or above as a failure, as BasicResponseHandler does.
      if (statusLine.statusCode >= 300) {
        throw new HttpResponseException(statusLine.statusCode, statusLine.reasonPhrase)
      }

      def entity = response.entity
      entity ? EntityUtils.toString(entity, charset) : null
  } as ResponseHandler
}

// requestUrl is assumed to be defined earlier in the script.
def httpGet = new HttpGet(requestUrl)
def responseBody = httpClient.execute(httpGet, makeResponseHandler('GBK'))
           

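Originally, the responses from that HTTP service carried no Content-Type header, so calling [url=http://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/util/EntityUtils.html#toString(org.apache.http.HttpEntity)]EntityUtils.toString(entity, defaultCharset)[/url] this way was enough to control the character encoding used to parse the response body.

The problem is that the service now sends a wrong Content-Type, and EntityUtils.toString(entity, defaultCharset) gives the charset in Content-Type priority over defaultCharset. As a result, the script above can no longer force the character encoding.

So what to do? The most direct approach is of course to get hold of the response body as a byte array yourself and then decode it however you like: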
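
The precedence rule described above can be sketched in plain Java. This is not HttpClient's actual source, only a simplified stand-in: the class and method names are hypothetical, and details such as quoted parameter values are ignored.

```java
import java.util.Locale;

class CharsetPrecedenceSketch {
    // Returns the charset parameter of Content-Type if one is declared;
    // the caller's default is used only as a fallback.
    static String resolveCharset(String contentType, String defaultCharset) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = p.substring("charset=".length()).trim();
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return defaultCharset;
    }

    public static void main(String[] args) {
        // No Content-Type header: the default wins.
        System.out.println(resolveCharset(null, "GBK"));                       // prints "GBK"
        // Declared charset present: the default is ignored.
        System.out.println(resolveCharset("text/html; charset=utf-8", "GBK")); // prints "utf-8"
    }
}
```

With the old header-less responses the 'GBK' default applied; once the service declared charset=utf-8, the same default no longer had any effect.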

import org.apache.http.client.HttpResponseException
import org.apache.http.client.ResponseHandler
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.util.EntityUtils

def httpClient = new DefaultHttpClient()

// Builds a ResponseHandler that always decodes the body with the given charset,
// ignoring whatever charset the Content-Type header declares.
def makeResponseHandler(charset) {
  return { response ->
      def statusLine = response.statusLine
      if (statusLine.statusCode >= 300) {
        throw new HttpResponseException(statusLine.statusCode, statusLine.reasonPhrase)
      }

      def entity = response.entity
      // Take the raw bytes and decode them ourselves.
      // An explicit null check keeps an empty body as "" instead of null.
      entity == null ? null : new String(EntityUtils.toByteArray(entity), charset)
  } as ResponseHandler
}

// requestUrl is assumed to be defined earlier in the script.
def httpGet = new HttpGet(requestUrl)
def responseBody = httpClient.execute(httpGet, makeResponseHandler('GBK'))
           

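I wonder whether there is a better way; I'm still not familiar enough with HttpClient.

Ideally, of course, the team providing the HTTP service would fix the response header. But that means going through all sorts of tedious procedures, and a tool I'm working on couldn't wait, so I had to hack around it =_=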
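
For reference, the forced-decoding step itself does not depend on HttpClient. Below is a minimal JDK-only sketch of the same idea applied to a plain InputStream; the class and method names are hypothetical, and it again assumes the JDK provides the GBK charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ForcedDecodeSketch {
    // Reads the stream to the end and decodes it with the caller's charset,
    // never consulting any charset the response may have declared.
    static String readAll(InputStream in, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Stand-in for a GBK-encoded response body.
        byte[] gbkBody = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        String text = readAll(new ByteArrayInputStream(gbkBody), "GBK");
        System.out.println(text.equals("\u4e2d\u6587")); // prints "true"
    }
}
```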