python for machine-learning

df.columns

Returns the column labels of a DataFrame; the index, which is printed as the first column, is not included.
pd.merge

Works like a join between database tables: data1 and data2 are the two tables to merge, how is the join type, and on is the join key.
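A minimal sketch of how `how` and `on` change the result. The two tables and their values are made up for illustration; only `pd.merge` itself comes from the note above.

```python
import pandas as pd

# Hypothetical tables; names and values are illustrative only.
data1 = pd.DataFrame({"Country": ["A", "B", "C"], "Income": [10, 20, 30]})
data2 = pd.DataFrame({"Country": ["B", "C", "D"], "Region": ["North", "South", "South"]})

# Inner join: keep only countries present in both tables.
inner = pd.merge(data1, data2, how="inner", on="Country")

# Left join: keep every row of data1; a missing Region becomes NaN.
left = pd.merge(data1, data2, how="left", on="Country")

print(inner)
print(left)
```

The inner join drops country A, while the left join keeps it with an empty Region.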
np.round

Rounds data to a given number of decimal places.
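A small sketch of `np.round` with made-up values. One detail worth knowing is that NumPy rounds exact halves to the nearest even number, which can be surprising.

```python
import numpy as np

values = np.array([3.14159, 2.71828, 1.5, 2.5])  # illustrative values
rounded = np.round(values, 2)                    # keep two decimal places

# Exact halves are rounded to the nearest even integer.
halves = np.round(np.array([0.5, 1.5, 2.5]))

print(rounded)
print(halves)
```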
str(yr)

Converts a number directly to a string.
df.boxplot('Income', by='Region', rot=90)  # rot: label rotation angle

Draws a box plot.
X = scipy.stats.norm(loc=diff, scale=1)

Normal distribution: loc is the mean and scale is the standard deviation.
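A short sketch of what the frozen distribution object can do. The values of `diff` and `a` are assumptions chosen for illustration; `sf` is the survival function that appears later in these notes as `X.sf(a)`.

```python
from scipy import stats

diff = 0.5                          # assumed mean, for illustration
X = stats.norm(loc=diff, scale=1)   # "frozen" normal distribution

a = 2.0                             # assumed threshold
upper_tail = X.sf(a)                # survival function: P(X > a)
lower_tail = X.cdf(a)               # cumulative distribution: P(X <= a)

print(upper_tail, lower_tail)
```

The two tail probabilities always add up to 1, so `sf` is simply `1 - cdf`, computed in a numerically safer way.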
plt.legend(["a={}".format(a) for a in a_values], loc=0)

Useful whenever the legend labels of a plot need to include variable values.
plt.yscale('log')
merged = df.groupby('Region', as_index=False).mean()

Calling groupby on its own has no visible effect; combine it with an aggregation such as mean.
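To make the point concrete, here is a minimal sketch with a made-up table: `groupby` only returns a lazy grouping object, and the aggregation is what produces a new table.

```python
import pandas as pd

# Illustrative data, not from the original notes.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Income": [10.0, 20.0, 30.0, 50.0],
})

grouped = df.groupby("Region", as_index=False)  # GroupBy object, nothing computed yet
merged = grouped.mean()                         # one row per region with the mean Income

print(type(grouped))
print(merged)
```

With `as_index=False` the group key stays a normal column, which is convenient when the result is merged with another table afterwards.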
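population.columns = ['Country'] + list(population.columns)[1:]

Reorganizes the header column names. In actual use the call to list raised an error; according to posts online, Jupyter sometimes needs to be refreshed for it to work.
http://www.cnblogs.com/txw1958/archive/2011/12/21/2295698.html

N ways to fetch web resources in Python 3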
source.count(bytes('Soup', 'UTF-8'))
X.sf(a)
Basic usage of subplots:
x2 = np.arange(35, 71, 1)
fig, ax = plt.subplots(2, 1)
ax[0].vlines(x2/100, 0, binom.pmf(x2, N, thep), colors='b', lw=5, alpha=0.5)
ax[1].vlines(x[1:], 0, y, lw=5, colors=dark2_colors[0])
ax[0].set_xlim(0.35, 0.75)
ax[1].set_xlim(0.35, 0.75)
plt.show()
plt.xticks(rotation=90)

Rotates the x-axis tick labels by 90 degrees, which helps when the labels are long.
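if l is not None and l[:4] == 'http'

This condition filters web links. In practice many entries are empty (None), so the check is very handy.
[l for l in link_list if l is not None and l.startswith('http')]

Python's for syntax is elegant and concise, but mastering it takes a lot of practice.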
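A runnable sketch of the link filter, written both as a plain loop and as a list comprehension. The contents of `link_list` are invented for illustration.

```python
# Illustrative input: a mix of absolute links, a missing value and other strings.
link_list = [
    "http://example.com",
    None,
    "/relative/path",
    "https://example.org",
    "mailto:someone@example.com",
]

# Plain loop version.
links_loop = []
for l in link_list:
    if l is not None and l[:4] == "http":
        links_loop.append(l)

# Equivalent list comprehension.
links = [l for l in link_list if l is not None and l.startswith("http")]

print(links)
```

The `is not None` test has to come first, because calling `startswith` or slicing on `None` raises an error.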
Some sites block crawlers when you fetch web resources, so the crawler has to disguise itself as a browser:
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
source = urllib.request.urlopen(req).read()
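A small sketch that builds such a request and inspects the header without sending anything over the network. The URL is a placeholder, not a site from the notes.

```python
import urllib.request

url = "http://example.com/"  # placeholder URL, assumed for illustration
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())

# Sending the request would then be: urllib.request.urlopen(req).read()
```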
Switching between multiple Python versions in Jupyter takes two commands:

pip2 install ipython
ipython2 kernelspec install-self

Two ways to plot

1.
data_to_plot = ranking.overall
plt.bar(data_to_plot.index, data_to_plot)
plt.show()
2.
ranking_categories_weighted.head().plot(kind='bar')

Using legend

ax = ranking_categories_weighted.head().plot(kind='bar', legend=False)
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(, ))

plt.show()

Writing math formulas in Jupyter

http://blog.csdn.net/winnerineast/article/details/52274556

http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Typesetting%20Equations.html
A web data processing workflow

import requests
from pattern import web  # assumed source of Element and plaintext (the pattern library)

URL = "http://www.pollster.com/08USPresGEMvO-2.html"
html = requests.get(URL).text
dom = web.Element(html)
rows = dom.by_tag('tr')
table = []
for row in rows:
    table_row = []
    data = row.by_tag('td')
    for value in data:
        table_row.append(web.plaintext(value.content))
    table.append(table_row)

The re regular expression module

http://www.cnblogs.com/huxi/archive////.html

df2 = new_df.iloc[keep]
A density plot is also called a KDE plot; pass kind='kde' to plot to produce one.

diamonds.boxplot('price', by='color')

by sets the x axis and price is the y axis.
Ways to generate random numbers

np.random.randint(a, b, N)
np.random.rand(n, m)
np.random.randn(n, m)
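A brief sketch of what each of the three calls returns. The arguments and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed, only so that repeated runs give the same numbers

ints = np.random.randint(1, 7, 10)  # 10 integers from 1 to 6 (upper bound excluded)
unif = np.random.rand(2, 3)         # 2x3 array, uniform on [0, 1)
norm = np.random.randn(2, 3)        # 2x3 array, standard normal

print(ints.shape, unif.shape, norm.shape)
```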

Some array operations

z.reshape((,))
z.flatten()  # flattens an array (converts a higher dimensional array into a vector)
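A minimal sketch of both operations on a made-up array. The shape `(2, 3)` is an assumption for illustration, since the shape in the note above is not shown.

```python
import numpy as np

z = np.arange(6)        # 0, 1, 2, 3, 4, 5
m = z.reshape((2, 3))   # same data arranged as 2 rows and 3 columns
flat = m.flatten()      # back to one dimension, as a copy

flat[0] = 99            # modifying the copy leaves m untouched

print(m)
print(flat)
```

`flatten` always returns a copy, so it is safe to modify the result without affecting the original array.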

Complete Python 2 code for downloading a zip file and processing it

zip_folder = requests.get('http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip').content
zip_files = StringIO()
zip_files.write(zip_folder)
csv_files = ZipFile(zip_files)
teams = csv_files.open('Teams.csv')
teams = read_csv(teams)
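In Python 3 the same idea needs `io.BytesIO` instead of `StringIO`, because the downloaded content is bytes. The sketch below avoids the network by building a tiny zip archive in memory as a stand-in for the download; the CSV contents are invented, and the standard `csv` module replaces `read_csv` to keep the example dependency-free.

```python
import csv
import io
from zipfile import ZipFile

# Stand-in for requests.get(url).content: a small zip built in memory.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("Teams.csv", "yearID,teamID,W\n2013,OAK,96\n2013,BOS,97\n")
zip_bytes = buffer.getvalue()

# Python 3: wrap the bytes in BytesIO and open the archive from memory.
csv_files = ZipFile(io.BytesIO(zip_bytes))
with csv_files.open("Teams.csv") as raw:
    text = io.TextIOWrapper(raw, encoding="utf-8")  # decode bytes to text
    teams = list(csv.DictReader(text))

print(teams)
```

With pandas installed, the file object returned by `csv_files.open` can be passed to `pd.read_csv` directly.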

Basic construction of a DataFrame

data=pd.DataFrame({'level':['a','b','c','b','a'],
               'num':[,,,,]})

grouped = df.groupby("playerID", as_index=False)
# print(grouped.head())
rookie_idx = grouped["yearID"].aggregate({'min_index':f})['min_index'].values
# Get the first row that appears in each group
rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]
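The helper `f` is not shown in the notes. Assuming it returns the index of the smallest yearID within each group, recent pandas versions can express the same step with `idxmin`. The table below is made up for illustration.

```python
import pandas as pd

# Illustrative data: two players with two seasons each.
df = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p2"],
    "yearID": [2001, 2000, 2005, 2006],
    "AB": [400, 300, 500, 450],
    "H": [100, 80, 150, 120],
})

# For each player, the index label of the row with the earliest year.
rookie_idx = df.groupby("playerID")["yearID"].idxmin().values

# Select those rows and keep only the columns of interest.
rookie = df.loc[rookie_idx][["playerID", "AB", "H"]]

print(rookie)
```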

Markdown formatting in Jupyter

https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
tab = tab.dropna()

Drops every row of the table that contains a missing value, which is quite handy.
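A short sketch of `dropna` on a made-up table, including the `subset` option for cases where only some columns must be complete.

```python
import numpy as np
import pandas as pd

# Illustrative table with missing values in both columns.
tab = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

clean = tab.dropna()               # drop rows with any missing value
only_x = tab.dropna(subset=["x"])  # drop rows only when x is missing

print(clean)
print(only_x)
```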
1. Fetching an online CSV file

url_exprs = "https://raw.githubusercontent.com/cs109/2014_data/master/exprs_GSE5859.csv"
exprs = pd.read_csv(url_exprs, index_col=)

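sklearn includes many machine learning models; some examples follow.

from sklearn.feature_selection import SelectKBest, f_regression
# sklearn.cross_validation was removed in scikit-learn 0.20; newer versions provide these names in sklearn.model_selection
from sklearn.cross_validation import cross_val_score, KFold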
from sklearn.linear_model import LinearRegression


selector = SelectKBest(f_regression, k=).fit(x, y)
best_features = np.where(selector.get_support())[]
print(best_features)

xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)

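Incorrect cross-validation (the features were already selected on the full data set above, so the test folds leak into the selection):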
for train, test in KFold(len(y), ):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]
    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)

    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')


plt.xlabel("Predicted")
plt.ylabel("Observed")

Correct cross-validation (the features are selected inside each training fold):
scores = []
for train, test in KFold(len(y), n_folds):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    b = SelectKBest(f_regression, k)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    clf.fit(xtrain, ytrain)    
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))
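The two loops above can be compared side by side with the newer `sklearn.model_selection` API. This is a sketch under stated assumptions: the data is synthetic pure noise, and `cv_score`, `k=2` and `n_splits=5` are illustrative choices, not values from the notes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
x = rng.randn(60, 200)  # synthetic: 60 samples, 200 noise features
y = rng.randn(60)       # target unrelated to x


def cv_score(select_inside_fold, k=2, n_splits=5):
    """Mean R^2 over folds, with feature selection inside or outside the folds."""
    scores = []
    if not select_inside_fold:
        # Leaky variant: features are chosen using all rows, test rows included.
        global_mask = SelectKBest(f_regression, k=k).fit(x, y).get_support()
    for train, test in KFold(n_splits=n_splits).split(x):
        if select_inside_fold:
            # Correct variant: selection sees the training rows only.
            mask = SelectKBest(f_regression, k=k).fit(x[train], y[train]).get_support()
        else:
            mask = global_mask
        clf = LinearRegression().fit(x[train][:, mask], y[train])
        scores.append(clf.score(x[test][:, mask], y[test]))
    return np.mean(scores)


leaky = cv_score(select_inside_fold=False)
honest = cv_score(select_inside_fold=True)
print("selection outside the folds:", leaky)
print("selection inside the folds: ", honest)
```

Because the target is unrelated to the features, any apparent predictive power comes from the selection step having seen the test rows. Selecting outside the folds will usually give the more optimistic score for that reason.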

Some very useful scipy submodules

scipy.stats
scipy.integrate
scipy.signal
scipy.optimize
scipy.special
scipy.linalg

mtcars.loc['Maserati Bora']

Gets one row of the data.
The difference between any and all

(mtcars.mpg >= ).any()  # True
(mtcars > ).all()  # True False True True ...
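A tiny sketch of the difference, using a made-up stand-in for the mtcars table and arbitrary thresholds: on a single column `any` gives one boolean, while `all` on a whole DataFrame gives one boolean per column.

```python
import pandas as pd

# Small stand-in for mtcars; the values are illustrative.
mtcars = pd.DataFrame({
    "mpg": [21.0, 15.0, 30.4],
    "hp": [110, 335, 52],
    "cyl": [6, 8, 4],
})

any_mpg = (mtcars.mpg >= 30).any()  # is at least one mpg value 30 or more?
all_cols = (mtcars > 5).all()       # per column: are all values above 5?

print(any_mpg)
print(all_cols)
```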

Plotting several pairwise relationships at once

from pandas.plotting import scatter_matrix  # older pandas versions: pandas.tools.plotting
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']],  figsize = (, ), alpha = , diagonal='kde')

Commonly used bs4 attributes

soup.head.contents
soup.head.children
soup.head.title
soup.head.title.string
for child in soup.head.descendants:
.stripped_strings
soup.find_all('a')
soup.find_all('a')[].get('href')

Using the JSON format

a = {'a': , 'b':}  # a dictionary
s = json.dumps(a)  # s is a string containing a in JSON encoding
a2 = json.loads(s)  # reading it back; the keys are now unicode
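A runnable round trip with made-up values. It also shows one thing to watch for: JSON object keys are always strings, so integer keys do not survive the round trip.

```python
import json

a = {"a": 1, "b": 2}   # illustrative values
s = json.dumps(a)      # serialize the dictionary to a JSON string
a2 = json.loads(s)     # parse the string back into a dictionary

print(s)
print(a2 == a)

# Integer keys are converted to strings by the encoder.
b = json.loads(json.dumps({1: "x"}))
print(b)
```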

Create a pandas DataFrame from JSON

Time formatting in pandas, although it is unclear which kinds of time values can be parsed

data['gameDate']=pd.DatetimeIndex(data.datetime).date
data['gameTime']=pd.DatetimeIndex(data.datetime).time
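One way to find out which values pandas can parse is `pd.to_datetime` with `errors="coerce"`, which turns anything unparseable into `NaT` instead of raising. This is a sketch with invented strings; the column names follow the lines above.

```python
import pandas as pd

# Illustrative input: two valid timestamps and one invalid string.
data = pd.DataFrame({
    "datetime": ["2014-04-01 19:05:00", "2014-04-02 13:10:00", "not a date"],
})

parsed = pd.to_datetime(data["datetime"], errors="coerce")  # invalid values become NaT
data["gameDate"] = parsed.dt.date
data["gameTime"] = parsed.dt.time

print(parsed.isna())
print(data)
```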

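Several distribution functions

data = stats.binom.rvs(n = , p = , size = )  # random draws from a binomial (repeated Bernoulli trials)
y = stats.poisson.pmf(n, lam)  # Poisson distribution
y = stats.norm.pdf(x, , )  # normal distribution
y = stats.beta.pdf(x, a, b)  # beta distribution
lam = 
x = np.arange(, , )
y = lam * np.exp(-lam * x)
stats.expon.rvs(scale = , size = )  # exponential distribution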
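A runnable sketch of the same calls. Every numeric argument here is an assumption chosen for illustration, since the values in the notes are not shown. It also checks that the hand-written exponential density matches scipy's, which is parameterized by `scale = 1 / lam`.

```python
import numpy as np
from scipy import stats

# Binomial draws: 1000 samples of 10 trials with success probability 0.5.
data = stats.binom.rvs(n=10, p=0.5, size=1000, random_state=0)

pois = stats.poisson.pmf(np.arange(0, 20), 3.0)  # Poisson pmf at 0..19, rate 3
norm_pdf = stats.norm.pdf(0.0, 0, 1)             # standard normal density at 0
beta_pdf = stats.beta.pdf(0.5, 2, 2)             # Beta(2, 2) density at 0.5

lam = 0.5
x = np.arange(0, 15, 0.1)
y = lam * np.exp(-lam * x)                 # exponential density written by hand
same = stats.expon.pdf(x, scale=1 / lam)   # the same density from scipy

print(np.allclose(y, same))
```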

Datasets bundled with sklearn

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
boston = load_boston()

The statsmodels module

statsmodels is a Python module specifically for estimating statistical models (less machine learning oriented than sklearn). It can estimate many types of statistical models, but today we will focus on linear regression.
e.g.:
import statsmodels.api as sm
X = sm.add_constant(X)  # add an intercept column before fitting
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
results.params.values

A plot call with a fairly complete set of parameters

residData.plot(title = 'Residuals from least squares estimates across years', figsize = (, ), color=['blue' if x == 'OAK' else 'gray' for x in df.teamID])

Linear regression via the matrix transpose (the normal equations)

np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)
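A small check of the formula on synthetic data. The design matrix, the true coefficients and the noise level are all made up for illustration. The explicit inverse is compared with two NumPy routines that solve the same least-squares problem without forming the inverse.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(50), rng.randn(50, 2)])  # intercept column plus two features
true_beta = np.array([1.0, 2.0, -3.0])                # assumed coefficients
y = X.dot(true_beta) + 0.01 * rng.randn(50)           # small noise

beta_inv = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)  # formula from the notes
beta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))      # same equations, no explicit inverse
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]         # general least-squares solver

print(beta_inv)
```

All three agree on well-conditioned data like this; `solve` or `lstsq` is generally the safer choice when columns of X are nearly collinear.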
