
Five Ways to Implement a Web Crawler (Part 3: Writing a Crawler with HttpClient)

Coo, coo, coo~

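As everyone knows, HttpClient is a go-to tool for Java crawlers.

In my own projects I generally use HttpClient for the fetching side (requests, logins, proxies and so on) and jsoup, XPath, and regular expressions for the parsing side.

Enough talk, straight to the code.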

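public static String getPageContent(String url) {
		// Create a client, much like opening a browser
		DefaultHttpClient httpClient = new DefaultHttpClient();
		// Create a GET request, much like typing an address into the address bar
		HttpGet httpGet = new HttpGet(url);
		String content = "";
		try {
			// Like pressing Enter in the address bar: send the request and get the page
			HttpResponse response = httpClient.execute(httpGet);
			// Read the response body
			HttpEntity entity = response.getEntity();
			if (entity != null) {
				// utf-8 is used only when the response does not declare a charset
				content = EntityUtils.toString(entity, "utf-8");
				EntityUtils.consume(entity);// make sure the content stream is closed
			}
		} catch (Exception e) {
			logger.error("Failed to fetch page content: " + url, e);
		} finally {
			// Always release the connection, even when the request failed
			httpClient.getConnectionManager().shutdown();
		}
		return content;
	}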
           

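This is a bare-bones HttpClient fetch. It uses DefaultHttpClient (deprecated since HttpClient 4.3), so the connection has to be shut down manually; otherwise the next connection will conflict.

Of course, you can also use CloseableHttpClient statichttpClient = HttpClients.createDefault(); which is more convenient.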
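Part of that convenience is that a closeable client can sit in a try-with-resources block and get closed automatically, even when the request throws. Here is a minimal sketch of that behaviour using only the JDK. TrackedClient and TryWithResourcesDemo are hypothetical stand-ins invented for illustration, not part of HttpClient; the stand-in just records what happens to it.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a closeable client; it only records events. */
final class TrackedClient implements Closeable {
    final List<String> events = new ArrayList<>();

    /** Pretends to fetch a page; an empty URL simulates a failed request. */
    String fetch(String url) throws IOException {
        events.add("fetch " + url);
        if (url.isEmpty()) {
            throw new IOException("empty url");
        }
        return "body of " + url;
    }

    @Override
    public void close() {
        events.add("close");
    }
}

final class TryWithResourcesDemo {
    /** Runs one fetch and returns the events the client recorded. */
    static List<String> run(String url) {
        TrackedClient client = new TrackedClient();
        // The resource is closed when the try block exits, before any catch clause runs
        try (TrackedClient c = client) {
            c.fetch(url);
        } catch (IOException e) {
            client.events.add("error " + e.getMessage());
        }
        return client.events;
    }

    public static void main(String[] args) {
        System.out.println(run("http://example.com"));
        System.out.println(run(""));
    }
}
```

CloseableHttpClient implements java.io.Closeable, so declaring it in the try header the same way should remove the need for a hand-written finally block.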

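Is there anything wrong with the code above? No.

And yet there is. Why? Because it sets no request headers, and many sites will simply block requests like that.

So what do we do?

We can change it to this: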

public static String getPageContent_addHeader(String url) {
		CloseableHttpClient httpclient = HttpClients.createDefault();

		try {
			HttpGet httpget = new HttpGet(url);
			// Send browser-like headers so the request is less likely to be rejected
			httpget.addHeader("Accept", Accept);
			httpget.addHeader("Accept-Charset", Accept_Charset);
			httpget.addHeader("Accept-Encoding", Accept_EnCoding);
			httpget.addHeader("Accept-Language", Accept_Language);
			httpget.addHeader("User-Agent", User_Agent);
			ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

				public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
					int status = response.getStatusLine().getStatusCode();
					System.out.println(status);
					if (status >= 200 && status < 300) {
						HttpEntity entity = response.getEntity();
						// utf-8 is used only when the response does not declare a charset
						return entity != null ? EntityUtils.toString(entity, "utf-8") : null;
					} else {
						// Print when the failure happened, then let the caller deal with it.
						// (Calling System.exit(0) here would kill the JVM before the throw.)
						System.out.println(new Date());
						throw new ClientProtocolException("Unexpected response status: " + status);
					}
				}
			};
			return httpclient.execute(httpget, responseHandler);
		} catch (Exception e) {
			logger.error("Failed to fetch page content: " + url, e);
		} finally {
			try {
				httpclient.close();
			} catch (IOException e) {
				logger.error("httpclient did not close cleanly", e);
			}
		}
		return null;
	}
           

getPageContent_addHeader adds a few request headers, defined as follows:

	private static final String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
	private static final String Accept = "text/html";
	private static final String Accept_Charset = "utf-8";
	private static final String Accept_EnCoding = "gzip";
	private static final String Accept_Language = "en-US,en";
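
If the list of headers keeps growing, one option is to keep them in a single map and apply them in a loop instead of writing one addHeader call per header. A rough sketch under that assumption follows; the BrowserHeaders class and its defaults() method are hypothetical names made up for illustration, and the values are the ones used in this post.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the browser-like request headers used in this post. */
final class BrowserHeaders {
    /** Returns the headers in insertion order as a read-only map. */
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html");
        headers.put("Accept-Charset", "utf-8");
        headers.put("Accept-Encoding", "gzip");
        headers.put("Accept-Language", "en-US,en");
        headers.put("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
                + "(KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        return Collections.unmodifiableMap(headers);
    }

    public static void main(String[] args) {
        // Print each header the way it would appear on the wire
        defaults().forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

With the HttpGet from earlier, something like BrowserHeaders.defaults().forEach(httpget::addHeader); should then replace the five separate addHeader calls.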