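HttpWebRequest: page source read into a string appears incomplete
While reading the source of some websites, I found that the content I got for certain pages was incomplete, even though the same pages open normally in a browser.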
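1 So the problem is not on the server side.
2 In Fiddler the response also looked incomplete and garbled, but after choosing "No Compression" in the Transformer tab it displayed correctly. So the data received is complete, and the problem lies in the later processing.
3 Debugging in C# showed the string cut off. Copying it into Notepad++ revealed \0\0 at the point where it was cut. That explains it: \0 is the null character, which many tools treat as the end of a string.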
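A small sketch of the effect in step 3, using a made-up string rather than a real page: a .NET string stores its length explicitly, so the text after \0 is still in memory even when a debugger view or a pasted copy appears to stop there. The class name and sample text are assumptions for illustration only.

```csharp
using System;

class NullCharDemo
{
    static void Main()
    {
        // Hypothetical content with two embedded NUL characters.
        string text = "abc\0\0def";

        // The string is not actually truncated: all 8 characters are present.
        Console.WriteLine(text.Length);        // 8
        Console.WriteLine(text.IndexOf('\0')); // 3

        // Removing the NUL characters leaves only the visible text.
        string cleaned = text.Replace("\0", "");
        Console.WriteLine(cleaned);            // abcdef
        Console.WriteLine(cleaned.Length);     // 6
    }
}
```

Checking Length and IndexOf('\0') in this way is a quick test of whether a string that looks truncated really is.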
The fix is as follows:
// Requires: using System.IO; using System.Net; using System.Text;
// strUrl, strMsg and encoder are declared outside this snippet.
try
{
    strUrl = "http://www.xxx.com";
    CookieContainer cc = new CookieContainer();
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);
    request.Method = "GET";
    request.CookieContainer = cc;
    request.KeepAlive = true;
    request.ContentType = "text/html";
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
    request.Headers.Add("x-requested-with:XMLHttpRequest");
    request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    // Let the framework decompress gzip/deflate response bodies.
    // (DecompressionMethods.None is 0, so OR-ing it in has no effect.)
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    request.Headers.Add("Accept-Encoding", "gzip, deflate");
    // Only relevant if Method is changed to "POST" above.
    if (request.Method == "POST")
    {
        request.ContentType = "application/x-www-form-urlencoded";
    }
    // using blocks close the response and reader even if reading throws.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    // For GB2312 pages, pass Encoding.GetEncoding("gb2312") instead of encoder.
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoder))
    {
        strMsg = reader.ReadToEnd();
        // \0 is the null character, which many tools treat as the end of a string.
        strMsg = strMsg.Replace("\0", "");
    }
}
catch
{
    // Exceptions are swallowed here, so strMsg is left unchanged when the request fails.
}
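The two ideas in the fix (decompress the body, then strip \0) can also be tried without any network access. The following is only a sketch under assumptions: BodyDecoder, Gzip and ReadBody are hypothetical names, the sample HTML is made up, and GZipStream stands in for what AutomaticDecompression does inside HttpWebRequest.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class BodyDecoder
{
    // Builds a gzip-compressed body in memory, standing in for a server response.
    static byte[] Gzip(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gz.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return buffer.ToArray();
        }
    }

    // Decompresses the body, decodes it, and removes NUL characters.
    public static string ReadBody(Stream body, Encoding encoding)
    {
        using (var gz = new GZipStream(body, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, encoding))
        {
            return reader.ReadToEnd().Replace("\0", "");
        }
    }

    static void Main()
    {
        byte[] compressed = Gzip("<p>abc\0\0def</p>");
        using (var body = new MemoryStream(compressed))
        {
            Console.WriteLine(ReadBody(body, Encoding.UTF8)); // <p>abcdef</p>
        }
    }
}
```

Keeping the read logic in a helper that takes a Stream makes it possible to test the decoding and \0 cleanup separately from the request itself.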