天天看點

使用libsvm對政府網站通路資料的分析和預測

創新性

選取題材,政府網站通路資料。想通過使用人工智能的方法,根據已有資料建立模型,對新的通路資料進行鑒别,區分網站通路是否為正常通路,進而進行一些業務處理。更進一步,預測哪些是危險通路(網絡攻擊),進而能及時做出一些政策應對,保障政府網站的安全,保障正常網民的通路。

實驗完整性

資料描述

相關約定

  • 此處正常通路,定義為網民的正常通路,其餘皆為垃圾通路
  • 垃圾通路可能來自網站掃描(死鍊檢測等)、網頁抓取、惡意攻擊等,都不算是正常的通路資料
  • 以一個月某ip通路是否超過3000次(30天,每天100次)為臨界值,判定是否為正常有效的通路,高于3000,認為是垃圾通路。

資料及來源

  • 選取的是某市政府網站通路資料。資料庫總資料量6000多萬,其中2019年3月份資料320萬左右。
  • 現在隻取3月1日-10日的資料中,session id尾号為1的,4萬2000多條
    使用libsvm對政府網站通路資料的分析和預測
    使用libsvm對政府網站通路資料的分析和預測

資料處理

  • 把通路記錄的一些次元,通過sql腳本資料處理,轉為數字,友善構造測試樣本
  • 原始資料:導出的mysql腳本 huainan_guest_statistic190310.sql
  • 超過3000的ip:ip3000.xls
  • 處理過程:資料處理1.sql 資料處理2.sql
-- 根據統計出的一月通路量超過3000的認為是垃圾資料,對資料加标簽
UPDATE cms_site_chart_main_copy1 set IS_GOOD = 1;
UPDATE cms_site_chart_main_copy1 set IS_GOOD = 0 where ip in ('45.194.128.134','220.178.96.24','121.41.29.248','183.166.59.35','120.78.131.13','59.111.104.146','183.166.61.6','183.166.61.1','203.207.121.14','60.171.44.29','58.242.245.178','59.111.104.158','183.166.61.13','203.207.121.3','203.207.121.13','218.22.156.219','183.166.61.2','203.207.121.8','59.111.104.147','112.26.205.196','43.249.175.170','114.255.59.101','218.22.163.50','183.166.61.7','203.207.121.9','203.207.121.1','119.23.27.3','59.111.104.49','210.45.123.30','183.166.61.10','203.207.121.5','203.207.121.2','59.111.104.115','59.111.104.137','183.166.59.149','60.166.58.227','59.111.104.134','183.166.61.3','60.21.221.203','119.23.40.231','59.111.104.138','59.111.104.75','203.207.121.11','183.166.61.14','59.111.104.135','59.111.104.44','183.166.61.5','220.178.199.10','59.111.104.132','120.77.240.164','59.111.104.14','59.111.104.148','221.231.138.216','183.166.79.204','203.207.121.7','58.242.245.179','183.166.61.8','58.242.245.180','220.178.96.26','183.166.24.222','220.178.96.25','218.22.161.170','111.39.184.24','59.111.104.127');

-- 處理is_pc字段,轉為0,1
UPDATE cms_site_chart_main_copy1 set IS_PC = 1 where IS_PC = 'true';
UPDATE cms_site_chart_main_copy1 set IS_PC = 0 where IS_PC = 'false' or IS_PC is null or IS_PC = '';

-- 處理國家字段,轉為數字
SELECT COUNTRY,count(1) FROM cms_site_chart_main_copy1 GROUP BY COUNTRY;
UPDATE cms_site_chart_main_copy1 set COUNTRY = '-1' where COUNTRY is null or COUNTRY = '';		
UPDATE cms_site_chart_main_copy1 set COUNTRY = '0' where COUNTRY = '未配置設定或者内網IP';
UPDATE cms_site_chart_main_copy1 set COUNTRY = '1' where COUNTRY = '中國';
UPDATE cms_site_chart_main_copy1 set COUNTRY = '2' where COUNTRY = '俄羅斯';		
UPDATE cms_site_chart_main_copy1 set COUNTRY = '3' where COUNTRY = '加拿大';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '4' where COUNTRY = '印度';	
UPDATE cms_site_chart_main_copy1 set COUNTRY = '5' where COUNTRY = '巴拿馬';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '6' where COUNTRY = '新加坡';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '7' where COUNTRY = '日本';
UPDATE cms_site_chart_main_copy1 set COUNTRY = '8' where COUNTRY = '高棉';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '9' where COUNTRY = '比利時';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '10' where COUNTRY = '沙特阿拉伯';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '11' where COUNTRY = '泰國';		
UPDATE cms_site_chart_main_copy1 set COUNTRY = '12' where COUNTRY = '愛爾蘭';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '13' where COUNTRY = '美國';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '14' where COUNTRY = '英國';	
UPDATE cms_site_chart_main_copy1 set COUNTRY = '15' where COUNTRY = '荷蘭';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '16' where COUNTRY = '菲律賓';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '17' where COUNTRY = '越南';			
UPDATE cms_site_chart_main_copy1 set COUNTRY = '18' where COUNTRY = '南韓';		
UPDATE cms_site_chart_main_copy1 set COUNTRY = '19' where COUNTRY = '香港';	
UPDATE cms_site_chart_main_copy1 set COUNTRY = '20' where COUNTRY = '馬來西亞';		

-- 處理語言
SELECT `LANGUAGE`,count(1) FROM cms_site_chart_main_copy1 GROUP BY `LANGUAGE` HAVING count(1) > 10;
UPDATE cms_site_chart_main_copy1 set `LANGUAGE` = '-1' where `LANGUAGE` is null or `LANGUAGE` = '' or `LANGUAGE` not in ('en-US','zh-CN','zh-Hans-CN');	
UPDATE cms_site_chart_main_copy1 set `LANGUAGE` = '0' where `LANGUAGE` = 'en-US';
UPDATE cms_site_chart_main_copy1 set `LANGUAGE` = '1' where `LANGUAGE` = 'zh-CN';
UPDATE cms_site_chart_main_copy1 set `LANGUAGE` = '2' where `LANGUAGE` = 'zh-Hans-CN';

-- 處理省份 
SELECT PROVINCE,count(1) FROM cms_site_chart_main_copy1 GROUP BY `PROVINCE` HAVING count(1) > 20;
UPDATE cms_site_chart_main_copy1 set PROVINCE = '-1' where PROVINCE is null or PROVINCE = '';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '0' where PROVINCE = '上海市';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '2' where PROVINCE = '内蒙古自治區';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '2' where PROVINCE = '北京市';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '3' where PROVINCE = '四川省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '4' where PROVINCE = '天津市';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '5' where PROVINCE = '安徽省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '6' where PROVINCE = '山東省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '7' where PROVINCE = '山西省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '8' where PROVINCE = '廣東省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '9' where PROVINCE = '江蘇省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '10' where PROVINCE = '江西省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '11' where PROVINCE = '河北省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '12' where PROVINCE = '河南省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '13' where PROVINCE = '浙江省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '14' where PROVINCE = '湖北省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '15' where PROVINCE = '湖南省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '16' where PROVINCE = '福建省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '17' where PROVINCE = '貴州省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '18' where PROVINCE = '遼甯省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '19' where PROVINCE = '重慶市';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '20' where PROVINCE = '陝西省';
UPDATE cms_site_chart_main_copy1 set PROVINCE = '21' where PROVINCE = '黑龍江省';

-- 處理資訊來源
SELECT SOURCE_TYPE,count(1) FROM cms_site_chart_main_copy1 GROUP BY `SOURCE_TYPE` HAVING count(1) > 20;
UPDATE cms_site_chart_main_copy1 set SOURCE_TYPE = '0' where SOURCE_TYPE = '其他外部連結';
UPDATE cms_site_chart_main_copy1 set SOURCE_TYPE = '1' where SOURCE_TYPE = '标簽或浏覽器輸入位址';
UPDATE cms_site_chart_main_copy1 set SOURCE_TYPE = '2' where SOURCE_TYPE = '站内連結';

-- 處理作業系統
SELECT OS,count(1) FROM cms_site_chart_main_copy1 GROUP BY `OS` HAVING count(1) > 20;
UPDATE cms_site_chart_main_copy1 set OS = '0' where OS = 'Linux';
UPDATE cms_site_chart_main_copy1 set OS = '1' where OS = 'Mac';
UPDATE cms_site_chart_main_copy1 set OS = '2' where OS = 'Windows';
UPDATE cms_site_chart_main_copy1 set OS = '3' where OS = 'Windows 10';
UPDATE cms_site_chart_main_copy1 set OS = '4' where OS = 'Windows 2003';
UPDATE cms_site_chart_main_copy1 set OS = '5' where OS = 'Windows 7';
UPDATE cms_site_chart_main_copy1 set OS = '6' where OS = 'Windows 8';
UPDATE cms_site_chart_main_copy1 set OS = '7' where OS = 'Windows XP';
UPDATE cms_site_chart_main_copy1 set OS = '8' where OS = 'Windows+7';
UPDATE cms_site_chart_main_copy1 set OS = '9' where OS = 'Windows7';
UPDATE cms_site_chart_main_copy1 set OS = '10' where OS = 'X11';

-- 導出結果和需要的次元
SELECT IS_GOOD,COUNTRY,PROVINCE,SOURCE_TYPE,OS,COLOR_DEPTH,`LANGUAGE`,IS_PC FROM cms_site_chart_main_copy1
           
UPDATE cms_site_chart_main_copy1 set IS_GOOD = -1 where IS_GOOD = 0;
UPDATE cms_site_chart_main_copy1 set IS_GOOD = '+1' where IS_GOOD = 1;

-- 查出少于20的,舍棄
SELECT PROVINCE,count(1) FROM cms_site_chart_main_copy1 GROUP BY `PROVINCE` HAVING count(1) <= 20;
DELETE FROM cms_site_chart_main_copy1 WHERE PROVINCE in ('雲南省','吉林省','甯夏回族自治區','廣西壯族自治區','新疆維吾爾自治區','海南省','甘肅省','西藏自治區','青海省','香港特别行政區')


SELECT CONCAT(IS_GOOD,' ', '1:',COUNTRY,' 2:',PROVINCE,' 3:',SOURCE_TYPE,' 4:',OS,' 5:',COLOR_DEPTH,' 6:',`LANGUAGE`,' 7:',IS_PC)  FROM cms_site_chart_main_copy1 LIMIT 40000;

SELECT CONCAT(IS_GOOD,' ', '1:',COUNTRY,' 2:',PROVINCE,' 3:',SOURCE_TYPE,' 4:',OS,' 5:',COLOR_DEPTH,' 6:',`LANGUAGE`,' 7:',IS_PC) FROM cms_site_chart_main_copy1 ORDER BY ID desc LIMIT 1000;
           
  • 處理之後的資料:mysql腳本檔案 cms_site_chart_main_copy1.sql 和excel檔案的處理好的資料.xls
  • 處理之後的資料檔案:

    訓練資料:trainfile

    測試資料:testfile

    使用libsvm對政府網站通路資料的分析和預測

實驗條件

  • Centos7
  • Windows7
  • Mysql
  • libsvm
  • Python
  • gnuplot

實驗過程

支援向量機 SVM(多組實驗)

使用libsvm庫(中國 台灣 林智仁教授封裝的SVM庫)

支援向量機(support vector machine,SVM)是目前最好的有監督學習算法之一。

使用libsvm對政府網站通路資料的分析和預測

實驗結果

cd /home/yyl/libsvm-3.24

./svm-train trainfile

./svm-predict -q testfile trainfile.model outputs1

使用libsvm對政府網站通路資料的分析和預測

實驗分析

使用libsvm對政府網站通路資料的分析和預測
使用libsvm對政府網站通路資料的分析和預測

PPT展示

下載下傳位址:https://download.csdn.net/download/u010882234/12122680

參考

  • https://www.cnblogs.com/mfmdaoyou/p/7258707.html
  • https://www.csie.ntu.edu.tw/~cjlin/
  • http://blog.sciencenet.cn/blog-713101-797917.html
  • http://www.360doc.com/content/18/0516/15/54605916_754421513.shtml
  • https://www.cnblogs.com/leezx/p/5668702.html

繼續閱讀