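Coo coo coo~
As everyone knows, HttpClient is a powerful tool for writing crawlers in Java.
In my own development I usually use HttpClient for fetching, logging in, going through proxies and so on, and jsoup, XPath, and regular expressions to handle the parsing.
Enough talk, straight to the code.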
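import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// logger is assumed to be a class-level logger (e.g. log4j) defined elsewhere
public static String getPageContent(String url) {
    // Create a client, similar to opening a browser
    DefaultHttpClient httpClient = new DefaultHttpClient();
    // Create a GET request, similar to typing an address into the browser's address bar
    HttpGet httpGet = new HttpGet(url);
    String content = "";
    try {
        // Similar to pressing Enter in the address bar: fetch the page
        HttpResponse response = httpClient.execute(httpGet);
        // Read the returned content
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            content = EntityUtils.toString(entity, "utf-8");
            EntityUtils.consume(entity); // release the content stream
        }
    } catch (Exception e) {
        logger.error("Failed to fetch page content", e);
    } finally {
        // Always shut the connection manager down, even when the request fails
        httpClient.getConnectionManager().shutdown();
    }
    return content;
}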
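That is a bare-bones version of fetching a page with HttpClient. It uses DefaultHttpClient (deprecated since HttpClient 4.3), whose connection manager has to be shut down by hand; otherwise the connection stays open and later requests can conflict with it.
Of course, you can also use CloseableHttpClient statichttpClient = HttpClients.createDefault(); which is more convenient because the client can simply be closed with close().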
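CloseableHttpClient implements java.io.Closeable, so it can be placed in a try-with-resources statement and closed automatically. The sketch below illustrates only that closing guarantee, using a hypothetical stand-in class `DemoClient` (not part of HttpClient) so that it runs with the JDK alone; the class and method names are assumptions made for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

class TryWithResourcesDemo {

    // Hypothetical stand-in for CloseableHttpClient; it only records whether close() ran.
    static class DemoClient implements Closeable {
        boolean closed = false;

        String execute(String url) throws IOException {
            if (url.isEmpty()) {
                throw new IOException("empty url");
            }
            return "content of " + url;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses the client inside try-with-resources: close() runs whether execute() returns or throws.
    static String fetch(DemoClient client, String url) {
        try (DemoClient c = client) {
            return c.execute(url);
        } catch (IOException e) {
            return null; // a real crawler would log the failure here
        }
    }
}
```

With a real client the same shape would presumably be `try (CloseableHttpClient c = HttpClients.createDefault()) { ... }`, which removes the need for an explicit finally block.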
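So is there anything wrong with getPageContent above? No.
And yet there is. Why? Because it never sets any request headers, and many websites will block such requests outright.
So what do we do?
We can change it to this: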
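import java.io.IOException;
import java.util.Date;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// logger is assumed to be a class-level logger (e.g. log4j) defined elsewhere
public static String getPageContent_addHeader(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        // Send browser-like headers so the request is less likely to be blocked
        httpget.addHeader("Accept", Accept);
        httpget.addHeader("Accept-Charset", Accept_Charset);
        httpget.addHeader("Accept-Encoding", Accept_EnCoding);
        httpget.addHeader("Accept-Language", Accept_Language);
        httpget.addHeader("User-Agent", User_Agent);
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                System.out.println(status);
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Print the failure time and throw, instead of calling System.exit(0),
                    // which would kill the whole JVM and make the throw unreachable
                    System.out.println(new Date());
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        return httpclient.execute(httpget, responseHandler);
    } catch (Exception e) {
        logger.error("Failed to fetch page content: " + url, e);
    } finally {
        try {
            httpclient.close();
        } catch (IOException e) {
            logger.error("httpclient was not closed properly", e);
        }
    }
    return null;
}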
The request headers that were added are defined as follows:
private static final String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
private static final String Accept = "text/html";
private static final String Accept_Charset = "utf-8";
private static final String Accept_EnCoding = "gzip";
private static final String Accept_Language = "en-US,en";
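The claim that sites block requests lacking browser-like headers can be tried out locally. The sketch below is not the HttpClient code from this post: it swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) and a throwaway com.sun.net.httpserver.HttpServer so that it runs without extra dependencies. The "picky" server rule (reject any User-Agent that does not start with "Mozilla/") and the names `HeaderBlockingDemo`, `startPickyServer`, and `statusFor` are assumptions made up for illustration; real sites apply their own, usually more elaborate, checks.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class HeaderBlockingDemo {

    // Starts a local server on a free port that mimics a site rejecting non-browser clients.
    static HttpServer startPickyServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla/");
                byte[] body = (looksLikeBrowser ? "hello" : "blocked").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 403, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends a GET to the local server and returns the status code.
    // When userAgent is null, the JDK client's own default User-Agent is sent.
    static int statusFor(int port, String userAgent) {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/"));
        if (userAgent != null) {
            builder.header("User-Agent", userAgent);
        }
        try {
            return HttpClient.newHttpClient()
                    .send(builder.build(), HttpResponse.BodyHandlers.ofString())
                    .statusCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling statusFor(port, null) against this server should come back as 403, while passing the User_Agent string defined above should come back as 200; remember to call server.stop(0) afterwards so the server thread does not keep the JVM alive.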