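Fetching web page content in ASP.NET is very convenient, and it also solves the encoding problems that plagued us in classic ASP.
Three classes are needed: WebRequest, WebResponse, and StreamReader.
The namespace of WebRequest and WebResponse is:
System.Net
The namespace of StreamReader is:
System.IO
The Encoding class used below lives in System.Text.
Core code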
WebRequest request = WebRequest.Create("http://www.cftea.com/");
WebResponse response = request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"));
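- Create is a static method of the WebRequest class; its argument is the URL of the page to fetch.
- Encoding specifies the character encoding. The Encoding class exposes widely used encodings such as ASCII, UTF32, and UTF8 as properties, but it has no gb2312 property, so we call GetEncoding to obtain the gb2312 encoding.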
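To make the GetEncoding step concrete, here is a minimal sketch that decodes a gb2312 byte sequence into a string. The class name Gb2312Demo and the method name Decode are hypothetical, chosen only for illustration. The sketch assumes a modern .NET runtime (.NET Core / .NET 5+), where code-page encodings must be registered first; on the classic .NET Framework the registration line should not be needed.

```csharp
using System;
using System.Text;

public static class Gb2312Demo
{
    // Hypothetical helper: decode raw bytes that are known to be gb2312.
    public static string Decode(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where gb2312 is only
        // available after the code-pages provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // There is no Encoding.GB2312 property, so look the encoding up by name.
        Encoding gb2312 = Encoding.GetEncoding("gb2312");
        return gb2312.GetString(bytes);
    }
}
```

For example, Gb2312Demo.Decode(new byte[] { 0xD6, 0xD0, 0xCE, 0xC4 }) decodes the gb2312 bytes of the two characters 中文.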
Examples
using System;
using System.IO;
using System.Net;
using System.Text;

private static string getContent(string Url)
{
    // Create the HttpWebRequest
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
    // Set the request timeout (in milliseconds)
    request.Timeout = 30000;
    request.Headers.Set("Pragma", "no-cache");
    // The using blocks release the response and the reader even if an exception is thrown;
    // any exception propagates to the caller, as in the original try/catch { throw; }
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream streamReceive = response.GetResponseStream())
    using (StreamReader streamReader = new StreamReader(streamReceive, Encoding.GetEncoding("GB2312")))
    {
        return streamReader.ReadToEnd();
    }
}
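The examples here hardcode gb2312. As a hedged sketch that is not part of the original article, the encoding could instead be taken from the charset the server reports (available as HttpWebResponse.CharacterSet), falling back to a default when the name is missing or unknown. The class name EncodingHelper and the method name ResolveEncoding are hypothetical, and the sketch assumes the server reports its charset correctly, which is not always the case.

```csharp
using System;
using System.Text;

public static class EncodingHelper
{
    // Hypothetical helper: map a charset name to an Encoding, returning the
    // fallback when the name is empty or is not a recognized encoding name.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // Some servers wrap the charset value in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }
}
```

Inside getContent this could be used as EncodingHelper.ResolveEncoding(response.CharacterSet, Encoding.GetEncoding("GB2312")) in place of the fixed encoding.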
private string GetUrl(string url)
{
    try
    {
        WebRequest request = WebRequest.Create(url);
        // The using blocks close the response and the reader; no explicit Close/Dispose is needed
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            return reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        // On failure, the error message is returned in place of the page content
        return ex.Message;
    }
}
private string GetPostContent(string strUrl)
{
    string strMsg = string.Empty;
    try
    {
        // Form data to post (empty in this example)
        string data = "";
        byte[] requestBuffer = Encoding.GetEncoding("gb2312").GetBytes(data);
        WebRequest request = WebRequest.Create(strUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = requestBuffer.Length;
        // Write the request body; the using block closes the stream
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(requestBuffer, 0, requestBuffer.Length);
        }
        // Read the response body as gb2312 text
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            strMsg = reader.ReadToEnd();
        }
    }
    catch
    {
        // Errors are swallowed here; an empty string is returned on failure
    }
    return strMsg;
}
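In GetPostContent the data string is left empty. The following is a minimal sketch of how a form body for the application/x-www-form-urlencoded content type might be built, percent-encoding each name and value with the same encoding the server expects. The class name FormBody and the method name Build are hypothetical; the sketch assumes HttpUtility from System.Web is available (it is in ASP.NET projects).

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Web;

public static class FormBody
{
    // Hypothetical helper: join name/value pairs as name1=value1&name2=value2,
    // URL-encoding every name and value with the given character encoding.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields, Encoding encoding)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            string name = HttpUtility.UrlEncode(field.Key, encoding);
            string value = HttpUtility.UrlEncode(field.Value, encoding);
            parts.Add(name + "=" + value);
        }
        return string.Join("&", parts);
    }
}
```

The result could then be assigned to data before it is converted to bytes. Because the encoded string contains only ASCII characters, the subsequent GetBytes call produces the same bytes whichever encoding is used there.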
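A problem you will commonly run into is the following error; the fix is described below.
The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF
In essence, Microsoft's implementation does not tolerate server responses that violate the RFC 822 requirement that HTTP header lines end with CRLF.
One solution is to add the following inside the <configuration> element of the application's configuration file (app.config) or web.config: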
<system.net>
<settings>
<httpWebRequest useUnsafeHeaderParsing="true" />
</settings>
</system.net>
This lets the system tolerate header lines that end with only CR or only LF.