Requirement: fetch and parse HTML from a given URL
1. Connections kept timing out during development:
String url = "http://www.xinhuanet.com";
Document doc = Jsoup.connect(url).get();
java.net.ConnectException: Connection timed out: connect
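I am a beginner, and it took a hint from a senior colleague to learn that the company routes all external traffic through a proxy.
2. Accessing the external network through a proxy with HttpURLConnection
With that pointer and a lot of searching, the task became: fetch the resource with HttpURLConnection through the proxy, then parse it with jsoup.
HttpURLConnection supports both http and https (for https URLs, set https.proxyHost and https.proxyPort in the same way).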
String url = "http://www.xinhuanet.com";
System.getProperties().put("proxySet", "true");
//Set the proxy IP here, not a domain name
System.getProperties().setProperty("http.proxyHost", "proxy IP");
System.getProperties().setProperty("http.proxyPort", "port number");
URL url2 = new URL(url);
HttpURLConnection conn = (HttpURLConnection)url2.openConnection();
conn.setRequestMethod("GET");
int status = conn.getResponseCode();
System.out.println(status);
The proxy IP can be found in the browser's proxy settings.
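The system properties above apply to every connection in the JVM. As a hedged alternative, the standard java.net.Proxy class can scope the proxy to a single connection. The sketch below is only an illustration: the class name PerConnectionProxy, its helper methods, and any address passed in are placeholders, not values from the setup described here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Hypothetical helper class; the names are illustrative only.
final class PerConnectionProxy {

    // Describe an HTTP proxy by IP and port (a literal IP avoids a DNS lookup).
    static Proxy buildProxy(String proxyIp, int proxyPort) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyIp, proxyPort));
    }

    // Create a connection object that will be routed through the given proxy.
    // Nothing is sent until the caller asks for the response.
    static HttpURLConnection open(String url, Proxy proxy) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000);
        return conn;
    }
}
```

A call such as PerConnectionProxy.open(url, PerConnectionProxy.buildProxy("10.0.0.1", 8080)).getResponseCode() would then go through that proxy only, leaving other connections untouched (10.0.0.1:8080 is a made-up address).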

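If the proxy server requires a login, a username and password must be supplied as well. (Without them the proxy returns status 407, Proxy Authentication Required.)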
java.io.IOException: Server returned HTTP response code: 407 for URL: http://www.xinhuanet.com
Below is the code with the proxy login added:
String url = "http://www.xinhuanet.com";
System.getProperties().put("proxySet", "true");
//Set the proxy IP here, not a domain name
System.getProperties().setProperty("http.proxyHost", "proxy IP");
System.getProperties().setProperty("http.proxyPort", "port number");
URL url2 = new URL(url);
HttpURLConnection conn = (HttpURLConnection)url2.openConnection();
conn.setRequestMethod("GET");
//Set your username and password, e.g. username=admin, password=123456
String authentication = "admin:123456";
//The credentials must be Base64-encoded (here with sun.misc.BASE64Encoder)
String encodedLogin = new BASE64Encoder().encode(authentication.getBytes());
conn.setRequestProperty("Proxy-Authorization", "Basic " + encodedLogin);
int status = conn.getResponseCode();
System.out.println(status);
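Instead of building the Proxy-Authorization header by hand, the JDK's java.net.Authenticator can supply the credentials when the proxy answers with 407. This is a minimal sketch of that alternative, not the approach used in this post; the class name ProxyCredentials and the install helper are hypothetical. Note that recent JDKs disable Basic authentication for HTTPS tunneling by default (the jdk.http.auth.tunneling.disabledSchemes property), so the sketch is aimed at plain http URLs.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Hypothetical Authenticator that answers proxy challenges only.
final class ProxyCredentials extends Authenticator {
    private final String user;
    private final char[] password;

    ProxyCredentials(String user, String password) {
        this.user = user;
        this.password = password.toCharArray();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Hand the credentials to the proxy, never to the origin server.
        if (getRequestorType() == RequestorType.PROXY) {
            return new PasswordAuthentication(user, password);
        }
        return null;
    }

    // Register once, before opening any connection.
    static void install(String user, String password) {
        Authenticator.setDefault(new ProxyCredentials(user, password));
    }
}
```

Calling ProxyCredentials.install("admin", "123456") once at startup would replace the manual setRequestProperty("Proxy-Authorization", ...) line.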
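3. Fixing Eclipse's refusal to use BASE64Encoder directly
Right-click the project name ----> Build Path ----> Configure Build Path
On the Libraries tab, expand JRE System Library, then select Access rules ----> Edit ----> Add
Under Resolution choose Accessible, and under Rule Pattern enter ** . Click OK to save.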
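On Java 8 and later, the public java.util.Base64 class can stand in for sun.misc.BASE64Encoder, which makes the access-rule change unnecessary (and sun.misc.BASE64Encoder is no longer available in newer JDKs). A minimal sketch, assuming the same admin/123456 example credentials; the class name BasicProxyAuth and the method headerValue are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that builds the value of a Proxy-Authorization header.
final class BasicProxyAuth {

    // Returns "Basic " followed by base64("user:password").
    static String headerValue(String user, String password) {
        String credentials = user + ":" + password;
        byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }
}
```

Usage: conn.setRequestProperty("Proxy-Authorization", BasicProxyAuth.headerValue("admin", "123456")); which sends the value Basic YWRtaW46MTIzNDU2.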
The code from the previous section, cleaned up:
//When going through a proxy server, add this block -START-
static{
System.getProperties().put("proxySet", "true");
System.getProperties().setProperty("http.proxyHost", "proxy IP");
System.getProperties().setProperty("http.proxyPort", "proxy port");
}
//When going through a proxy server, add this block -END-
//Fetch the page and return it as a jsoup Document
public Document getDoc(String strUrl){
Document doc = null;
InputStream in = null;
try {
URL url = new URL(strUrl);
HttpURLConnection conn = (HttpURLConnection)url.openConnection();
//If the proxy requires a login, add this code
/*conn.setRequestProperty("Proxy-Authorization", "Basic " +
new BASE64Encoder().encode("username:password".getBytes()));*/
//This header is required: many sites reject non-browser clients and return 403 (Forbidden) without it
conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
conn.setRequestMethod("GET");
conn.setRequestProperty("Content-type", "text/html");
conn.setRequestProperty("Connection", "close");
conn.setUseCaches(false);
conn.setConnectTimeout(5 * 1000);
//getEncode is commented out below, so pass null as the charset: jsoup then detects it from the BOM or a <meta> tag and falls back to UTF-8
in = conn.getInputStream();
doc = Jsoup.parse(in, null, strUrl);
} catch (Exception e) {
e.printStackTrace();
}finally{
if(null != in){
try {
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return doc;
}
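getDoc sets only a connect timeout, so a server that accepts the connection and then stalls could block the read indefinitely. The sketch below shows one way to prepare the connection with both timeouts; the class name ConnectionSetup and the method prepare are hypothetical, and the User-Agent string is the one used above. No request is sent until the caller reads the response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper that applies the request settings from getDoc.
final class ConnectionSetup {
    static final String USER_AGENT =
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";

    // Build a configured, not yet connected, GET request.
    static HttpURLConnection prepare(String strUrl, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(strUrl).openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setUseCaches(false);
            conn.setConnectTimeout(timeoutMillis); // limit on establishing the connection
            conn.setReadTimeout(timeoutMillis);    // limit on each blocking read
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```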
//Some HTML pages are UTF-8 and some are GBK; if the response specifies an encoding, use it, otherwise default to UTF-8
/*public String getEncode(String headerField) {
String encode = "utf-8";
if(null == headerField || "".equals(headerField)){
return encode;
}
headerField = headerField.toLowerCase();
if(headerField.contains("charset=") && !headerField.contains("charset=utf-8")){
if(headerField.contains("charset=gbk")){
encode = "gbk";
}else if(headerField.contains("charset=gb2312")){
encode = "gb2312";
}else if(headerField.contains("charset=iso-8859-1")){
encode = "iso-8859-1";
}
}
return encode;
}*/
//Addendum: detecting the encoding this way in getEncode does not work, so the method has been commented out
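The post does not say why getEncode failed. One possible reason (an assumption) is that many servers omit charset from the Content-Type header and declare it only in a <meta> tag, in which case defaulting to UTF-8 garbles GBK pages. A hedged sketch of a more forgiving variant: it returns the charset from the header when one is present and supported, and null otherwise, so the result can be passed straight to Jsoup.parse(in, charset, strUrl), which accepts null and then detects the encoding itself. The class name CharsetSniffer and the method charsetFromContentType are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper; returns null when the header gives no usable charset.
final class CharsetSniffer {

    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // A Content-Type looks like: text/html; charset=GBK
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (!p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                continue;
            }
            String name = p.substring("charset=".length()).trim();
            // Some servers quote the value: charset="utf-8"
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            try {
                if (!name.isEmpty() && Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: treat it as missing.
            }
            return null;
        }
        return null;
    }
}
```

Unlike the fixed list of gbk, gb2312 and iso-8859-1, this accepts any charset the JVM supports and leaves the unknown case to jsoup.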