Five Ways to Implement a Web Crawler (Part 3: Writing a Crawler with HttpClient)

2023-07-05 10:06:31

Coo coo coo~

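As everyone knows, HttpClient is one of the sharpest tools for writing crawlers in Java.

In my own projects I generally use HttpClient for the fetching side (requests, logins, proxies and so on) and jsoup, XPath, or regular expressions for parsing.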
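To make the parsing half of that split a little more concrete, here is a minimal sketch of pulling link targets out of already-fetched HTML with a regular expression. The class name `LinkExtractor`, the method `extractLinks`, and the sample HTML are hypothetical, made up for illustration. A regex like this only copes with simple, well-formed markup; for real pages jsoup or XPath is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient or jsoup.
final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input standing in for a page returned by the fetching code.
        String html = "<p><a href=\"/a.html\">A</a> "
                + "<A class='x' HREF='http://example.com/b'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

The string passed to `extractLinks` would normally be the page content returned by the fetching method.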

Without further ado, here is the fetching code.

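import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public static String getPageContent(String url) {
		// Create a client, much like opening a browser
		DefaultHttpClient httpClient = new DefaultHttpClient();
		// Create a GET request, much like typing a URL into the address bar
		HttpGet httpGet = new HttpGet(url);
		String content = "";
		try {
			// Much like pressing Enter in the address bar: fetch the page
			HttpResponse response = httpClient.execute(httpGet);
			// Read the response body
			HttpEntity entity = response.getEntity();
			if (entity != null) {
				content = EntityUtils.toString(entity, "utf-8");
				EntityUtils.consume(entity); // close the content stream
			}
		} catch (Exception e) {
			logger.error("Failed to fetch page content", e);
		} finally {
			// Always release the connection, even when the request fails
			httpClient.getConnectionManager().shutdown();
		}
		return content;
	}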

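That is a bare-bones fetch with HttpClient. It uses DefaultHttpClient (deprecated since HttpClient 4.3), whose connection has to be shut down by hand; otherwise the connection is not released and the next request can conflict with it.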

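Of course you can also use CloseableHttpClient httpClient = HttpClients.createDefault(); which is more convenient, since the client can then be closed with close() or a try-with-resources block.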

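Is there anything wrong with the code above? No.

And yet there is. Why? Because it never sets any request headers, and many sites will block requests like that outright.

So what do we do?

We can change it to this: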

import java.io.IOException;
import java.util.Date;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public static String getPageContent_addHeader(String url) {
		CloseableHttpClient httpclient = HttpClients.createDefault();

		try {
			HttpGet httpget = new HttpGet(url);
			// Send browser-like headers; the constants are defined further down
			httpget.addHeader("Accept", Accept);
			httpget.addHeader("Accept-Charset", Accept_Charset);
			httpget.addHeader("Accept-Encoding", Accept_EnCoding);
			httpget.addHeader("Accept-Language", Accept_Language);
			httpget.addHeader("User-Agent", User_Agent);
			ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

				@Override
				public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
					int status = response.getStatusLine().getStatusCode();
					System.out.println(status);
					if (status >= 200 && status < 300) {
						HttpEntity entity = response.getEntity();
						// Fall back to UTF-8 when the response declares no charset
						return entity != null ? EntityUtils.toString(entity, "UTF-8") : null;
					} else {
						// Non-2xx response: record the time and fail this request
						System.out.println(new Date());
						throw new ClientProtocolException("Unexpected response status: " + status);
					}
				}
			};
			return httpclient.execute(httpget, responseHandler);
		} catch (Exception e) {
			logger.error("Request failed: " + url, e);
		} finally {
			try {
				httpclient.close();
			} catch (IOException e) {
				logger.error("httpclient was not closed cleanly", e);
			}
		}
		return null;
	}

The request headers added above are defined as follows:

	private static final String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
	private static final String Accept = "text/html";
	private static final String Accept_Charset = "utf-8";
	private static final String Accept_EnCoding = "gzip";
	private static final String Accept_Language = "en-US,en";
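If the list of headers grows, one option is to keep them in a map and apply them in a loop rather than with one call per header. The sketch below is only an illustration of that idea. The class `BrowserHeaders` and its methods are hypothetical, and to stay dependency-free it uses the JDK's built-in `java.net.http.HttpRequest` builder (Java 11+) rather than Apache HttpClient. With Apache HttpClient the loop body would instead call `httpget.addHeader(name, value)`, as in the method above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the header values are the ones used in this article.
final class BrowserHeaders {
    // Browser-like default headers, kept in insertion order.
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html");
        headers.put("Accept-Charset", "utf-8");
        headers.put("Accept-Encoding", "gzip");
        headers.put("Accept-Language", "en-US,en");
        headers.put("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
                + "(KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        return headers;
    }

    // Builds a GET request carrying every default header. Nothing is sent here.
    static HttpRequest buildGet(String url) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        defaults().forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) {
        // example.com is a placeholder URL.
        HttpRequest request = buildGet("http://example.com/");
        System.out.println(request.headers().map());
    }
}
```

Keeping the headers in one place also makes it easier to swap in a different User-Agent for a site that rejects the default one.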
