HttpClient: automatically read the page's declared encoding and decode with it, so that pages of any encoding can be fetched without garbled text...

2023-07-19 12:58:55

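// Creating the HttpMethod is not shown here, since examples are easy to find online. This post only shows how to make HttpClient decode fetched pages correctly whatever their encoding.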

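	/**
	 * Reads the response body and decodes it with the page's own encoding.
	 * The charset from the Content-Type response header is the default;
	 * a charset declared in a meta tag of the page overrides it.
	 * @param method an HttpMethod that has already been executed
	 * @return String the decoded HTML
	 * @throws Exception if the body cannot be read or the charset is not supported
	 */
	private static String readInputStream(HttpMethod method) throws Exception{
		String charset = "UTF-8";
		// getResponseCharSet() is declared on HttpMethodBase, the parent of
		// both GetMethod and PostMethod, so one cast covers either case
		if(method instanceof HttpMethodBase){
			charset = ((HttpMethodBase)method).getResponseCharSet();
		}
		byte[] bytes = method.getResponseBody();
		if(bytes == null){
			return "";	// no response body available
		}
		// Provisional decode, used only to find the ASCII meta tag
		String body = new String(bytes,"UTF-8");
		String detected = getCharSetByBody(body,charset);
		if(StringUtils.isNotBlank(detected)){
			charset = detected;	// keep the header charset when nothing was found
		}
		return new String(bytes,charset);
	}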
	
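	/**
	 * Looks for a charset declared in the page's meta tags.
	 * @param html the provisionally decoded page
	 * @param charset the charset to return when the page declares none
	 * @return the declared charset, or the given default
	 */
	private static String getCharSetByBody(String html,String charset){
		// parseJSoupDocumentFromHtml and Constants.parseBaseUri are defined elsewhere in the author's project
		Document document = parseJSoupDocumentFromHtml(html, Constants.parseBaseUri);
		Elements elements = document.select("meta");
		for(Element metaElement : elements){
			if(metaElement!=null && "content-type".equalsIgnoreCase(metaElement.attr("http-equiv"))){
				String declared = getCharSet(metaElement.attr("content"));
				if(StringUtils.isNotBlank(declared)){
					// only override the default when a charset was actually found
					charset = declared.trim();
				}
				break;
			}
		}
		return charset;
	}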
	
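	/**
	 * Extracts the charset from a Content-Type value with a regular expression,
	 * e.g. "text/html; charset=gb2312"
	 * @param content the content attribute of the meta tag
	 * @return the charset name, or null if none is present
	 */
	private static String getCharSet(String content){
		if(content == null)
			return null;
		// case-insensitive; skips an optional quote and stops at whitespace, ';' or a quote
		String regex = "charset\\s*=\\s*[\"']?([^\\s;\"']+)";
		Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
		Matcher matcher = pattern.matcher(content);
		if(matcher.find())
			return matcher.group(1);
		else
			return null;
	}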
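
The same idea can be tried without HttpClient or jsoup on the classpath. The sketch below is not the code from this post: it swaps the jsoup lookup for a single regular expression and uses only the JDK. The class name `MetaCharsetSniffer`, the methods `detect` and `decode`, and the 4096-byte scan limit are all assumptions made for illustration. It also accepts the HTML5 short form `<meta charset="...">`, which the `http-equiv` check above does not look at.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient or jsoup.
class MetaCharsetSniffer {
    // Matches a <meta ...> tag containing charset=..., which covers both
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    // and the HTML5 short form <meta charset="utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed scan limit: meta tags normally sit near the top of the page.
    private static final int SCAN_LIMIT = 4096;

    /** Returns the charset named in a meta tag, or fallback if none is usable. */
    static Charset detect(byte[] bytes, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable whatever the real encoding of the page is.
        int len = Math.min(bytes.length, SCAN_LIMIT);
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the whole body with the detected charset. */
    static String decode(byte[] bytes, Charset fallback) {
        return new String(bytes, detect(bytes, fallback));
    }
}
```

With the code from this post, the call would look something like `MetaCharsetSniffer.decode(method.getResponseBody(), StandardCharsets.UTF_8)`, passing the charset from the response header as the fallback where one is known. A regular expression is cruder than an HTML parser (it would also match a meta tag inside a comment, for example), so jsoup remains the safer choice when it is already a dependency.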
