Making HttpClient detect a page's declared encoding automatically, so that fetched pages are decoded without garbled text...

2023-07-19 12:58:55

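// How to build the HttpMethod is not shown here, since there are plenty of examples online. This post only shows how to make HttpClient decode a fetched page correctly whatever encoding the page declares.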

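	/**
	 * Reads the response body and decodes it as HTML text.
	 * @param method the HTTP method, already executed
	 * @return the page content, decoded with the detected charset
	 * @throws UnsupportedEncodingException if the detected charset is not supported
	 * @throws IOException if the response body cannot be read
	 */
	private static String readInputStream(HttpMethod method) throws Exception{
		// Start from the charset in the Content-Type response header.
		// getResponseCharSet() is declared on HttpMethodBase, the common parent of
		// GetMethod and PostMethod, so one cast covers both (and other method types).
		String charset = "UTF-8";
		if(method instanceof HttpMethodBase){
			charset = ((HttpMethodBase)method).getResponseCharSet();
		}
		byte[] bytes = method.getResponseBody();
		// getResponseBody() returns null when the response has no body
		if(bytes == null){
			return "";
		}
		// Provisional decode, only used to read the ASCII <meta> tags
		String body = new String(bytes,"UTF-8");
		// A charset declared in the page overrides the one from the header
		charset = getCharSetByBody(body,charset);
		return new String(bytes,charset);
	}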
	
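	/**
	 * Looks for a charset declared in the page's meta tags.
	 * @param html the provisionally decoded page
	 * @param charset the fallback charset, returned when the page declares none
	 * @return the declared charset, or the fallback
	 */
	private static String getCharSetByBody(String html,String charset){
		Document document = parseJSoupDocumentFromHtml(html, Constants.parseBaseUri);
		Elements elements = document.select("meta");
		for(Element metaElement : elements){
			// attr() returns "" when the attribute is absent, so this comparison is null-safe
			if("content-type".equalsIgnoreCase(metaElement.attr("http-equiv"))){
				String found = getCharSet(metaElement.attr("content"));
				// Keep the fallback when the tag carries no charset, so the caller never gets null
				if(StringUtils.isNotBlank(found)){
					charset = found.trim();
				}
				break;
			}
		}
		return charset;
	}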
	
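	/**
	 * Extracts the charset from a Content-Type value with a regular expression.
	 * @param content for example "text/html; charset=gb2312"
	 * @return the charset name, or null when none is declared
	 */
	private static String getCharSet(String content){
		// Case-insensitive; the capture stops at ';', whitespace or a quote,
		// so trailing spaces and quotes do not end up in the charset name
		String regex = "charset\\s*=\\s*[\"']?([^;\\s\"']+)";
		Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
		Matcher matcher = pattern.matcher(content);
		if(matcher.find())
			return matcher.group(1);
		else
			return null;
	}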
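
The same idea can be sketched with nothing but the JDK, which makes it easy to try without HttpClient or jsoup on the classpath. This is only an illustration under stated assumptions: `CharsetSniffer`, `detectCharset` and `decode` are hypothetical names that do not appear in the code above, the 4096-byte scan window is an arbitrary choice, and the regular expression additionally accepts the HTML5 form `<meta charset="...">`, which the jsoup-based version above does not look for.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient or jsoup.
final class CharsetSniffer {

    // Matches <meta charset="gbk"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=gbk">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: a charset declaration sits near the top of the page.
    private static final int SCAN_LIMIT = 4096;

    /** Returns the charset declared in the markup, or the fallback if none is usable. */
    public static String detectCharset(byte[] bytes, String fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // is readable here whatever the real encoding of the page is.
        int len = Math.min(bytes.length, SCAN_LIMIT);
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (matcher.find()) {
            String name = matcher.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the page with the declared charset, or with headerCharset if none is declared. */
    public static String decode(byte[] bytes, String headerCharset) {
        return new String(bytes, Charset.forName(detectCharset(bytes, headerCharset)));
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta charset=\"utf-8\"></head>"
                + "<body>\u4e2d\u6587</body></html>").getBytes(StandardCharsets.UTF_8);
        // The header fallback is ISO-8859-1, but the page itself declares utf-8.
        System.out.println(detectCharset(page, "ISO-8859-1"));
        System.out.println(decode(page, "ISO-8859-1"));
    }
}
```

Checking `Charset.isSupported` before decoding means a page that declares a misspelled or unknown charset falls back to the header value instead of throwing.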
