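I was recently given an assignment to compile some statistics, so I decided to crawl housing prices for Chongqing, using data from Anjuke. Since I also wanted to practice multithreading, I ended up crawling the data for every city. Full code: GitHub
Examining the page structure
Start with the page structure. Anjuke groups its listings by city, and each city is divided into districts, so I planned to crawl the data one district at a time. First, here is the structure of the page that lists all cities:
<!-- The data we need is in the a tags below -->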
<html>
<body>
<div class="content">
<div class="city-itm">
<div class="letter_city">
<ul>
<li>
<div class="city_list">
<a href="" class="hot"></a>
<a href="" class="hot"></a>
</div>
</li>
<li>
<div class="city_list">
<a href="" class="hot"></a>
<a href="" class="hot"></a>
</div>
</li>
</ul>
</div>
</div>
</div>
</body>
</html>
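Open any city to find all of its districts. What we need are the "find homes by district" links under the new-homes section of the page. Note that some cities have no new homes; opening them shows only popular listings. Here is the structure of a city page that does have districts:
<!-- The data we need is in the a tags under "by district" -->
<!-- For cities without new-home districts, the content under the class="clearfix" tag is different -->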
<html>
<body>
<div id="content">
<div class="left-cont fl">
<div class="buy-house tab-contents" id="content_Rd1">
<div class="clearfix">
<!-- This is the second-hand housing list -->
<div class="details float_l"></div>
<!-- This is the second-hand housing list -->
<div class="details float_l">
<!-- By district -->
<div class="areas">
<a> </a>
</div>
<!-- By price -->
<div class="prices"></div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
Next, the structure of a district page:
<!-- If this is the first page and there are more pages, those links must be collected as well -->
<html>
<body>
<div id="container">
<div class="list-contents">
<div class="list-results">
<div class="key-list">
<!-- This is the list of developments -->
<div class="item-mod"></div>
</div>
<!-- This is the pagination tag -->
<div class="list-page">
<div class="pagination">
<a></a>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
Then the structure of a single development. This is still the same page; we only need to look inside the div with class="item-mod". The sample values are shown as they appear on the page: a development name, its address, the tags 在售 (on sale) and 住宅 (residential), and a price string meaning "from 10 million yuan per unit".
<div class="item-mod" data-link="" data-soj="" rel="">
<a class="pic"><img></a>
<div class="info">
<a class="lp-name">
<h3><span class="items-name">融创白象街</span></h3>
</a>
<a class="address">
<span>[ 渝中 解放碑 ] 凯旋路16号</span>
</a>
<a class="tags-wrap">
<div class="tag-panel">
<!-- The two tags may not both be present, but class="status-icon wuyetp" is always there -->
<i class="status-icon onsale">在售</i>
<i class="status-icon wuyetp">住宅</i>
</div>
</a>
</div>
<a class="favor-pos">
<p class="price">最低<span>1000</span>万元/套起</p>
</a>
</div>
Writing the code
First, write a method that simulates an HTTP request:
public class HttpUtils {
public static String CreatHttpGet(String url) {
HttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet(url);
httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2)");
httpget.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
HttpResponse response;
String result = null;
try {
response = httpclient.execute(httpget);
int statusCode = response.getStatusLine().getStatusCode();
if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY) || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
|| (statusCode == HttpStatus.SC_SEE_OTHER) || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
String newUri = response.getLastHeader("Location").getValue();
httpclient = HttpClients.createDefault();
httpget = new HttpGet(newUri);
response = httpclient.execute(httpget);
}
HttpEntity entity = response.getEntity();
if (entity != null) {
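// Read the response body into a byte array first, because the stream may need to be used twice.
byte[] bytes = EntityUtils.toByteArray(entity);
String charSet = EntityUtils.getContentCharSet(entity);
// getContentCharSet returns null when the response declares no charset; fall back to UTF-8.
if (charSet == null) {
charSet = "UTF-8";
}
result = new String(bytes, charSet);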
}
} catch (ClientProtocolException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
httpclient.getConnectionManager().shutdown();
return result;
}
}
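The charset handling in `CreatHttpGet` depends on the server declaring an encoding. As a minimal sketch of that step using only the JDK, the helper below picks the charset out of a `Content-Type` header value and falls back to UTF-8 when none is usable. The class name `CharsetResolver` and the method `resolve` are hypothetical and not part of the original project.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original project.
final class CharsetResolver {
    private CharsetResolver() {}

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset resolve(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the default instead.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=ISO-8859-1"));
        System.out.println(resolve("text/html"));
    }
}
```

With a helper like this, the bytes could be decoded with `new String(bytes, CharsetResolver.resolve(headerValue))`, which never receives a null charset.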
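Writing the page-parsing methods
The page data can be extracted with Pattern, which is what I used at first. It worked on the mock strings I tested with, but ran into problems on real pages. After searching on Baidu I switched to Jsoup, which makes parsing pages very convenient.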
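The post does not say exactly what went wrong with Pattern, so the following is only one plausible failure mode, sketched with hypothetical names (`RegexLinkDemo`, `extract`) and placeholder example.com URLs. A regular expression written against tidy test markup stops matching as soon as the real page orders or quotes its attributes differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo for illustration; not part of the original project.
final class RegexLinkDemo {
    // Matches only <a href="..." class="hot">, in exactly this attribute order and quoting.
    private static final Pattern HOT_LINK =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s+class=\"hot\"\\s*>");

    /** Collects the href values of all anchors that match HOT_LINK. */
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HOT_LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Markup shaped like the hand-written test string: the link is found.
        String tidy = "<a href=\"https://example.com/a\" class=\"hot\">A</a>";
        // Same link with swapped attributes and single quotes: nothing is found.
        String real = "<a class=\"hot\" href='https://example.com/b'>B</a>";
        System.out.println(extract(tidy));
        System.out.println(extract(real));
    }
}
```

An HTML parser such as Jsoup works on the parsed element tree instead, so attribute order and quoting style do not matter.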
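With the parsing approach settled, I planned the threading as follows. One thread fetches all the cities. A thread pool fetches each city's districts: it takes a link from the city list, downloads the page, and parses out the district links. A second thread pool fetches the developments: it takes a link from the district-link list and parses out that district's developments. Writing the data to the database also uses a thread pool: once all the developments of a district have been collected, a task is added to that pool.
At first I simply started 5 threads, each of which kept taking entries from the linked list until it was empty and then exited. That was not the behavior I wanted, so I changed it to submit one task to the thread pool for every link obtained. The main code is shown here; the full code is on GitHub.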
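The "one task per link" idea can be sketched without any networking. Note that this is not the author's implementation: the listings in this post use a `LinkedList` guarded by `wait`/`notifyAll` and boolean flags, whereas this sketch swaps in a `BlockingQueue` with an end-of-input sentinel. All names (`PipelineSketch`, `run`, `POISON`) are hypothetical, and the "work" is just recording the URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch for illustration; not part of the original project.
final class PipelineSketch {
    // Sentinel telling the consumer that no more URLs will arrive.
    private static final String POISON = "__END__";

    /** Submits one pool task per queued URL and returns the URLs the tasks handled. */
    static List<String> run(List<String> cityUrls, int poolSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);

        // Producer: stands in for the thread that discovers links.
        Thread producer = new Thread(() -> {
            for (String url : cityUrls) {
                queue.add(url);
            }
            queue.add(POISON);
        });
        producer.start();

        try {
            while (true) {
                String url = queue.take(); // blocks until a URL is available
                if (POISON.equals(url)) {
                    break;
                }
                // Real code would download and parse the page inside this task.
                pool.execute(() -> processed.add(url));
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> done = run(Arrays.asList("https://example.com/c1", "https://example.com/c2"), 5);
        System.out.println(done.size() + " pages processed");
    }
}
```

The sentinel plays the role of the `cityFlag`/`districtFlag` booleans, and `awaitTermination` replaces the polling loop around `isTerminated()`.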
Searching for cities
public class CitySearchThread extends Thread {
public void run() {
String result = null;
result = HttpUtils.CreatHttpGet("https://www.anjuke.com/sy-city.html");
if (result != null) {
try {
Document doc = Jsoup.parse(result);
Element cityItmElement = doc.getElementsByClass("city-itm").first();
Element lettercityElement = cityItmElement.getElementsByClass("letter_city").first();
Element ulElement = lettercityElement.getElementsByTag("ul").first();
Elements liElements = ulElement.getElementsByTag("li");
// NOTE: the numeric literals in this listing were lost in the source; the 0 and 1 values below are assumed.
for (int i = 0; i < liElements.size() - 1; i++) {
Element cityListElement = liElements.get(i).getElementsByClass("city_list").first();
Elements cityElements = cityListElement.getElementsByTag("a");
for (int j = 0; j < cityElements.size(); j++) {
Element cityElement = cityElements.get(j);
String url = cityElement.absUrl("href");
synchronized (DataList.cityList) {
DataList.cityList.add(url);
// Wake the waiting consumers when the list goes from empty to non-empty.
if (DataList.cityList.size() == 1) {
DataList.cityList.notifyAll();
}
}
}
}
System.out.println("City search finished");
}catch (Exception e) {
System.out.println("The page was blocked");
}finally {
// Mark the search of all cities as finished
DataList.cityFlag = false;
}
}else {
// Mark the search of all cities as finished
DataList.cityFlag = false;
System.out.println("City search finished");
}
}
}
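The `DataList` class that the threads share is not shown in this post. Judging only from the call sites (`getFirst`, `removeFirst`, `synchronized` on the lists, and the two boolean flags), it might look roughly like the sketch below. The field types, the initial values, and the `volatile` modifiers are all assumptions; the real class is in the repository.

```java
import java.util.LinkedList;

// Hypothetical reconstruction inferred from the call sites; not the author's actual class.
final class DataList {
    private DataList() {}

    // City links waiting to be turned into district searches. Guarded by synchronized (cityList).
    public static final LinkedList<String> cityList = new LinkedList<>();

    // District links waiting to be turned into development searches. Guarded by synchronized (districtList).
    public static final LinkedList<String> districtList = new LinkedList<>();

    // true while the city search may still add links. It is written outside the lock in the
    // producer, so it is declared volatile here to make the write visible to other threads.
    public static volatile boolean cityFlag = true;

    // true while the district search may still add links.
    public static volatile boolean districtFlag = true;
}
```

`LinkedList.getFirst()` throws `NoSuchElementException` on an empty list, which is why the consumer threads wrap the call in a try/catch and treat a null `url` as "nothing available yet".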
Searching for districts
public class DistrictSearchThread extends Thread {
public void run() {
ExecutorService pool = Executors.newFixedThreadPool(5); // the pool size was lost in the source; 5 is assumed
boolean flag = true;
while (flag) {
String url = null;
synchronized (DataList.cityList) {
try {
url = DataList.cityList.getFirst();
DataList.cityList.removeFirst();
} catch (Exception e) {
}
if (url == null) {
if (DataList.cityFlag == false) {
flag = false;
} else {
try {
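DataList.cityList.wait(1000); // timed wait, so the thread cannot wait forever at the end with nobody left to wake it (the timeout value was lost in the source; 1000 ms is assumed)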
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
pool.execute(new SearchUtil(url));
}
pool.shutdown();
while (true) {
if (pool.isTerminated()) {
DataList.districtFlag = false;
System.out.println("District search finished");
break;
}
try {
Thread.sleep(1000); // the polling interval was lost in the source; 1000 ms is assumed
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
class SearchUtil extends Thread {
private String url;
public SearchUtil(String url) {
super();
this.url = url;
}
@Override
public void run() {
String result = null;
// If the consumer was woken up without data, url is null
if (url != null) {
result = HttpUtils.CreatHttpGet(url);
if (result != null) {
Document doc = Jsoup.parse(result);
try {
Element contentElement = doc.getElementById("content_Rd1");
// The index was lost in the source; 1 is assumed, i.e. the second "details float_l" block, which holds the district links.
Element detailsfloat_lElement = contentElement.getElementsByClass("details float_l").get(1);
Element areasElement = detailsfloat_lElement.getElementsByClass("areas").first();
Elements aElements = areasElement.getElementsByTag("a");
for (int i = 0; i < aElements.size(); i++) {
Element cityElement = aElements.get(i);
String districtUrl = cityElement.absUrl("href");
synchronized (DataList.districtList) {
DataList.districtList.add(districtUrl);
if (DataList.districtList.size() == 1) {
DataList.districtList.notifyAll();
}
}
}
} catch (Exception e) {
System.out.println(Thread.currentThread().getName() + "===== Area: " + url + " has no relevant data");
}
}
}
}
}
}
Fetching and saving developments
public class HouseSearchThread extends Thread {
// The pool sizes were lost in the source; 5 is assumed for both.
ExecutorService pool = Executors.newFixedThreadPool(5);
ExecutorService pool1 = Executors.newFixedThreadPool(5);
@Override
public void run() {
boolean flag = true;
while (flag) {
String url = null;
synchronized (DataList.districtList) {
try {
url = DataList.districtList.getFirst();
DataList.districtList.removeFirst();
} catch (Exception e) {
}
if (url == null) {
if (DataList.districtFlag == false) {
flag = false;
} else {
try {
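DataList.districtList.wait(1000); // timed wait so the thread re-checks districtFlag even if nobody notifies it (timeout assumed)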
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
pool.execute(new SearchUtil(url));
}
pool.shutdown();
while (true) {
if (pool.isTerminated()) {
System.out.println("Development search finished");
break;
}
try {
Thread.sleep(1000); // the polling interval was lost in the source; 1000 ms is assumed
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
pool1.shutdown();
while (true) {
if (pool1.isTerminated()) {
System.out.println("Developments saved");
break;
}
try {
Thread.sleep(1000); // the polling interval was lost in the source; 1000 ms is assumed
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.out.println("Done");
}
class SearchUtil extends Thread {
private String url;
public SearchUtil(String url) {
this.url = url;
}
@Override
public void run() {
String baseUrl = url;
String result = null;
if (url != null) {
result = HttpUtils.CreatHttpGet(url);
if (result != null) {
LinkedList<Elements> resultList = new LinkedList<Elements>();
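Map<String, Object> map = ExcisionUtil.roughExcisionAPage(result);// Parse the district's first page for its developments and the page count
Integer page = (Integer) map.get("page");
resultList.add((Elements) map.get("list"));
// Fetch the remaining pages of this district
if (page != null) {
for (; page > 1; page--) { // the bound was lost in the source; 1 is assumed, since page 1 has already been parsed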
String nextUrl = baseUrl + "p" + page + "/";
result = HttpUtils.CreatHttpGet(nextUrl);
resultList.add(ExcisionUtil.roughExcisionNPage(result));
}
}
// Save the data
Iterator<Elements> it = resultList.iterator();
while (it.hasNext()) {
Elements ele = it.next();
if (ele != null) {
LinkedList<House> houseList = ExcisionUtil.exactExcision(ele);
pool1.execute(new AddUtil(houseList));
}
}
}
}
}
}
class AddUtil extends Thread{
private LinkedList<House> houseList;
public AddUtil(LinkedList<House> houseList) {
this.houseList = houseList;
}
@Override
public void run() {
SqlSession sqlSession = MyBatisUtil.getSqlSession(true);
HouseMapper mapper = sqlSession.getMapper(HouseMapper.class);
Iterator<House> it = houseList.iterator();
while(it.hasNext()) {
House house = it.next();
mapper.addHouse(house);
}
sqlSession.close();
}
}
}
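The paging loop in `SearchUtil` builds the address of each further page by appending `p<page>/` to the district URL. A small standalone sketch of that URL scheme follows; the class name `PageUrls`, the method `remaining`, and the example.com address used with it are hypothetical, and the trailing-slash handling is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
final class PageUrls {
    private PageUrls() {}

    /** Returns the URLs of pages 2..pageCount; page 1 is the base URL itself. */
    static List<String> remaining(String baseUrl, int pageCount) {
        // The original concatenates directly, so it relies on baseUrl ending with "/".
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        List<String> urls = new ArrayList<>();
        for (int page = 2; page <= pageCount; page++) {
            urls.add(base + "p" + page + "/");
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : remaining("https://example.com/loupan/district/", 3)) {
            System.out.println(url);
        }
    }
}
```

Computing the list up front would also make it easy to submit each page as its own pool task instead of downloading the pages one after another inside a single task.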
Parsing pages to extract developments
public class ExcisionUtil {
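/**
* Extracts the elements that contain the developments, together with the page count.
* @param result HTML of the first page of a district
* @return a map with "list" (the development elements) and "page" (the page count)
*/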
public static Map<String,Object> roughExcisionAPage(String result){
HashMap<String, Object> map = new HashMap<String, Object>();
if(result != null) {
Document doc = Jsoup.parse(result);
try {
Element elementRoot = doc.getElementById("container");
Element listElement = elementRoot.getElementsByClass("list-contents").first();
//Element element2 = elements.first();
Element keyListElement = listElement.getElementsByClass("key-list").first();
Elements houseElements = keyListElement.getElementsByClass("item-mod");
Element listPageElement = listElement.getElementsByClass("list-page").first();
Elements pageElement = listPageElement.getElementsByTag("a");
map.put("list", houseElements);
map.put("page", pageElement.size() + 1); // the addend was lost in the source; 1 is assumed
}catch(Exception e) {
}
}
return map;
}
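/**
* Rough extraction of the development elements, without the page count.
* @param result HTML of a follow-up page of a district
* @return the development elements, or null if parsing failed
*/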
public static Elements roughExcisionNPage(String result){
Elements houseElements = null;
if(result != null) {
Document doc = Jsoup.parse(result);
try {
Element elementRoot = doc.getElementById("container");
Element listElement = elementRoot.getElementsByClass("list-contents").first();
Element keyListElement = listElement.getElementsByClass("key-list").first();
houseElements = keyListElement.getElementsByClass("item-mod");
}catch(Exception e) {
}
}
return houseElements;
}
/**
* Extracts the details of each development.
* @param elements the development elements of one page
* @return the parsed developments
*/
public static LinkedList<House> exactExcision(Elements elements){
LinkedList<House> resultList = new LinkedList<House>();
for(int i = 0;i<elements.size();i++) {
String name = null;
String address = null;
String state = null;
String describe = null;
String price = null;
Element element = elements.get(i);
Element infoElement = element.getElementsByClass("infos").first();
try {
Element nameElement = infoElement.getElementsByClass("lp-name").first();
Element h3Element = nameElement.getElementsByTag("h3").first().getAllElements().first();
name = h3Element.text();
}catch(Exception e) {
}
try {
Element addressElement = infoElement.getElementsByClass("address").first();
Element spanElement = addressElement.getElementsByTag("span").first();
address = spanElement.text();
}catch(Exception e) {
}
Element stateElement = null;
Element describeElement = null;
try {
Element tagswrapElement = infoElement.getElementsByClass("tags-wrap").first();
Element tagpanelElement = tagswrapElement.getElementsByClass("tag-panel").first();
Elements sdElements = tagpanelElement.getElementsByTag("i");
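// The literals were lost in the source; 1, 2 and get(1) are inferred from the page structure.
// One tag: only the property type is present.
if(sdElements.size() == 1) {
describeElement = sdElements.first();
}
// Two tags: sale status first, property type second.
if(sdElements.size() == 2) {
stateElement = sdElements.first();
describeElement = sdElements.get(1);
}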
if(stateElement != null) {
state = stateElement.text();
}
if(describeElement != null) {
describe = describeElement.text();
}
}catch(Exception e) {
}
try {
Element favorposElement = element.getElementsByClass("favor-pos").first();
Element pElement = favorposElement.getElementsByTag("p").first();
price = pElement.text();
}catch(Exception e) {
}
if(name != null) {
House house = new House(name, address, state, describe, price);
resultList.add(house);
}
}
return resultList;
}
}
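The `House` class is not shown in this post either. The only thing known from the listing is the constructor call `new House(name, address, state, describe, price)`; everything else in this sketch (immutable fields, getter names, `toString`) is an assumption made for illustration.

```java
// Hypothetical reconstruction; only the constructor argument order comes from the listing.
final class House {
    private final String name;      // development name
    private final String address;   // address text
    private final String state;     // sale status tag, may be null
    private final String describe;  // property type tag, may be null
    private final String price;     // price text as shown on the page, may be null

    House(String name, String address, String state, String describe, String price) {
        this.name = name;
        this.address = address;
        this.state = state;
        this.describe = describe;
        this.price = price;
    }

    public String getName() { return name; }
    public String getAddress() { return address; }
    public String getState() { return state; }
    public String getDescribe() { return describe; }
    public String getPrice() { return price; }

    @Override
    public String toString() {
        return "House{name=" + name + ", address=" + address + ", state=" + state
                + ", describe=" + describe + ", price=" + price + "}";
    }

    public static void main(String[] args) {
        System.out.println(new House("Sample name", "Sample address", null, "Residential", null));
    }
}
```

Keeping the price as the raw page text mirrors `exactExcision`, which stores `pElement.text()` unchanged; turning it into a number would be a separate step.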
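I ran into a number of problems while writing this. Many aspects of multithreading had been vague to me before, and finishing the project improved both my technique and my understanding. I will write up the details in a later post.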