網站采集器介紹

常用的網絡采集器主要分為桌面版和伺服器版：桌面版是基于window等平台，通過本地帶寬來進行資料采集與釋出程式，主要代表有“火車頭網站采集器”和 “editortools”；伺服器版是采用php或asp程式設計，運作于linux或windows主機，通過伺服器帶寬來進行資料采集與釋出程式，主要代表有“小蜜蜂網站采集器”。兩大類采集器孰優孰劣不言而喻

php采集程式中常用的函數

<?php

//去除html标記

function text2html($txt){

$txt = str_replace(" ","　",$txt);

$txt = str_replace("<","&lt;",$txt);

$txt = str_replace(">","&gt;",$txt);

$txt = preg_replace("/[\r\n]{1,}/isu","<br/>\r\n",$txt);

return $txt;

}

//清除html标記

function clearhtml($str){

$str = str_replace('<','&lt;',$str);

$str = str_replace('>','&gt;',$str);

return $str;

//相對路徑轉化成絕對路徑

function relative_to_absolute($content, $feed_url) {

preg_match('/(http|https|ftp):\/\//', $feed_url, $protocol);

$server_url = preg_replace("/(http|https|ftp|news):\/\//", "", $feed_url);

$server_url = preg_replace("/\/.*/", "", $server_url);

if ($server_url == '') {

return $content;

}

if (isset($protocol[0])) {

$new_content = preg_replace('/href="\//', 'href="'.$protocol[0].$server_url.'/', $content);

$new_content = preg_replace('/src="\//', 'src="'.$protocol[0].$server_url.'/', $new_content);

} else {

$new_content = $content;

return $new_content;

//取得所有連結

function get_all_url($code){

preg_match_all('/<a\s+href=["|\']?([^>"\' ]+)["|\']?\s*[^>]*>([^>]+)<\/a>/i',$code,$arr);

return array('name'=>$arr[2],'url'=>$arr[1]);

//擷取指定标記中的内容

function get_tag_data($str, $start, $end){

if ( $start == '' || $end == '' ){

return;

$str = explode($start, $str);

$str = explode($end, $str[1]);

return $str[0];

//html表格的每行轉為csv格式數組

function get_tr_array($table) {

$table = preg_replace("'<td[^>]*?>'si",'"',$table);

$table = str_replace("</td>",'",',$table);

$table = str_replace("</tr>","{tr}",$table);

//去掉 html 标記

$table = preg_replace("'<[\/\!]*?[^<>]*?>'si","",$table);

//去掉空白字元

$table = preg_replace("'([\r\n])[\s]+'","",$table);

$table = str_replace(" ","",$table);

$table = explode(",{tr}",$table);

array_pop($table);

return $table;

//将html表格的每行每列轉為數組，采集表格資料

function get_td_array($table) {

$table = preg_replace("'<table[^>]*?>'si","",$table);

$table = preg_replace("'<tr[^>]*?>'si","",$table);

$table = preg_replace("'<td[^>]*?>'si","",$table);

$table = str_replace("</td>","{td}",$table);

$table = explode('{tr}', $table);

foreach ($table as $key=>$tr) {

$td = explode('{td}', $tr);

array_pop($td);

$td_array[] = $td;

return $td_array;

//傳回字元串中的所有單詞 $distinct=true 去除重複

function split_en_str($str,$distinct=true) {

preg_match_all('/([a-za-z]+)/',$str,$match);

if ($distinct == true) {

$match[1] = array_unique($match[1]);

sort($match[1]);

return $match[1];

網站采集器介紹

繼續閱讀

spark/scala關于【資源檔案】加載方法概述外部檔案加載方案測試資源檔案打包入jar包中小結

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

samba伺服器的功能

AOP程式設計_Android優雅權限架構(1)概念基礎，2021金三銀四前言正文大綱正文

php 去掉字元串的最後一個字元及截取原字元串1,2,3,4,5,6,

Effective Java 8:通用程式設計

【Linux】UDP廣播封包接收速率問題

php——水印

OOM三種類型

工廠模式-三種類型

【遞歸】高效率求2的n次幂

win10本地scala和spark安裝安裝scala安裝spark

Linux裝置模型（中）之上層容器

scala (3) Function 和 Method

PowerPC平台 Linux移植三