Composite Data Types and English Word Frequency Counting

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2696

I. How are elements added, deleted, modified, looked up, and traversed in lists, tuples, dictionaries, and sets?

1. Lists

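# Named `names` rather than `list` so the built-in list type is not shadowed
names = ['Jack', 'Lucy', 'Mary']
names.append('Pony')
print("Append an element at the end: {}".format(names))

names.insert(1, "Lili")
print("Insert an element at a given position: {}".format(names))

names = ['Jack', 'Lucy', 'Mary']
names.remove('Jack')  # remove by value
print("Remove an element by value: {}".format(names))

names = ['Jack', 'Lucy', 'Mary']
names.pop(0)  # remove by index
print("Remove an element by index: {}".format(names))

names = ['Jack', 'Lucy', 'Mary']
names[0] = 'a'
print("Modify an element: {}".format(names))

names = ['Jack', 'Lucy', 'Mary']
print("Look up an element: {}".format(names[0]))

print("Traverse the list")
for i, name in enumerate(names):  # enumerate gives the true position, even with duplicates
    print("{}: {}".format(i, name))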
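One caveat when traversing a list: `index()` returns the position of the first matching value, so it reports the wrong position once a list contains duplicates. A minimal sketch, using a hypothetical list `pets` that is not part of the assignment, compares it with `enumerate`:

```python
# `pets` is an illustrative list with a duplicate value (not from the assignment)
pets = ['cat', 'dog', 'cat']

# index() always finds the first match, so the second 'cat' is reported at position 0
by_index = [pets.index(p) for p in pets]

# enumerate() yields the actual position of every element
by_enumerate = [i for i, p in enumerate(pets)]

print(by_index)
print(by_enumerate)
```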

           

2. Tuples

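tup = ('Jack', 'Lucy', 0)
tup2 = ('Lili',)  # a one-element tuple needs the trailing comma
tup3 = tup + tup2  # tuples are immutable, so "adding" builds a new tuple
print("Concatenate / add an element: {}".format(tup3))

tup = ('Jack', 'Lucy', 0)
print("Access elements: tup[2]={}, tup[0:2]={}".format(tup[2], tup[0:2]))

tup = ('Jack', 'Lucy', 0)
print("Delete the tuple")
del tup  # removes the whole tuple; single elements cannot be deleted

tup = ('Jack', 'Lucy', 0)
print("Traverse the tuple:")
for t in tup:
    print(t)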
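The tuple examples contain no "modify" step because tuples do not support item assignment. A small sketch of what happens when it is attempted, and of the usual workaround of building a new tuple (the names `changed` and `new_tup` are only illustrative):

```python
tup = ('Jack', 'Lucy', 0)

# Item assignment on a tuple raises TypeError
try:
    tup[0] = 'a'
    changed = True
except TypeError:
    changed = False

# "Modifying" a tuple means building a new one from slices of the old one
new_tup = ('a',) + tup[1:]

print(changed)
print(new_tup)
```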
           

3. Dictionaries

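# Named `scores` rather than `dict` so the built-in dict type is not shadowed
scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}
scores["abc"] = 100
print("Add abc: {}".format(scores))

scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}
del scores['Jack']
print("Delete Jack: {}".format(scores))

scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}
scores.pop('Tony')
print("Delete Tony: {}".format(scores))

scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}
scores['Tony'] = 99
print("Modify Tony's value: {}".format(scores))
scores['a'] = scores.pop('Jack')  # a key cannot be edited in place: pop it and reinsert under the new key
print("Rename the key Jack to a: {}".format(scores))

scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}
print("Look up Mary's value: {}".format(scores.get('Mary')))

print("Traverse the dictionary:")
for name, score in scores.items():
    print("{} : {}".format(name, score))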
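The lookup above uses `get()`. As a brief sketch of why that choice matters, the following compares `get()`, square-bracket access, and the `in` operator for a key that is absent (the key 'Lili' is just an example of a missing key):

```python
scores = {'Jack': 90, 'Mary': 80, 'Tony': 70}

# get() returns None, or a supplied default, when the key is absent
missing = scores.get('Lili')
fallback = scores.get('Lili', 0)

# Square-bracket access raises KeyError for an absent key
try:
    scores['Lili']
    raised = False
except KeyError:
    raised = True

# `in` tests for a key without retrieving the value
has_mary = 'Mary' in scores

print(missing, fallback, raised, has_mary)
```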

           

4. Sets

a = set('abcd')
a.add('z')
print("Add 'z':", a)
a.update({1, 2})
print("Add 1 and 2:", a)

b = set(('a', 'b', 'c'))
b.remove('a')
print("Remove a:", b)

a = set('abcd')
a.discard('b')
print("Discard b:", a)

a = set('Jack')
a.pop()  # removes and returns an arbitrary element
print("Remove with pop:", a)

a = set('Jack')
a.clear()
print("Clear the set:", a)

a = set('Jack')
b = set(('1', '2'))
c = a.union(b)
print("Union of the sets:", c)

print("Traverse the set:")
for s in c:
    print(s)
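Both `remove()` and `discard()` appear above; they differ only when the element is absent. The sketch below shows that difference and, as an aside beyond the assignment's examples, the operator forms of union, intersection, and difference. The element 'x' and the set `b` are illustrative values.

```python
a = set('abcd')

# discard() silently ignores a missing element
a.discard('x')

# remove() raises KeyError for a missing element
try:
    a.remove('x')
    raised = False
except KeyError:
    raised = True

b = {'c', 'd', 'e'}
common = a & b   # intersection
only_a = a - b   # difference
either = a | b   # union, same result as a.union(b)

print(raised, sorted(common), sorted(only_a), sorted(either))
```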
           
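
II. Summarize the similarities and differences between lists, tuples, dictionaries, and sets. Consider the following aspects: brackets, ordered or unordered, mutable or immutable, duplicates allowed or not, and how items are stored and looked up.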

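List: [], ordered, mutable, duplicates allowed. Elements are stored in sequence and each one is assigned an index; lookup is by index. Elements can be of any type. Supports slicing.

Tuple: (), ordered, similar to a list but immutable. Elements are separated by commas (a one-element tuple needs a trailing comma); lookup is by index. Supports slicing.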

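Dictionary: {}, a mutable container that stores key:value pairs. Values can be objects of any type; keys must be hashable and cannot repeat. Lookup is by key, and slicing is not supported. Since Python 3.7 a dictionary preserves insertion order, but its entries still cannot be accessed by position.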

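Set: {}, unordered, mutable, no duplicates. Create one with set() or parame = {value01, value02, ...}; an empty set must be written set(), because {} is an empty dictionary. set() accepts any iterable such as a list, tuple, or dictionary, but every element must be hashable, so lists and dictionaries cannot themselves be elements. Indexing and slicing are not supported.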
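The properties summarized above can be checked directly. A minimal sketch, with variable names chosen only for illustration:

```python
lst = [1, 1, 2]          # list: duplicates kept, mutable
tup = (1, 1, 2)          # tuple: duplicates kept, immutable
dct = {'a': 1, 'a': 2}   # dict: a repeated key keeps only the last value
st = {1, 1, 2}           # set: duplicates collapse

lst[0] = 9               # lists allow item assignment

# Tuples reject item assignment
try:
    tup[0] = 9
    tuple_mutable = True
except TypeError:
    tuple_mutable = False

# A list is unhashable, so it cannot be an element of a set
try:
    bad = {[1, 2]}
    list_in_set = True
except TypeError:
    list_in_set = False

print(lst, len(tup), dct, st)
print(tuple_mutable, list_in_set)
print(type({}).__name__, type(set()).__name__)  # {} is a dict; set() is the empty set
```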

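III. Word frequency counting

Requirement: preprocess the text, remove stop words, and print the words sorted by frequency.

Source code: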

# Read the text and lowercase it; utf-8-sig strips a leading BOM if present
with open(r'G:\test\TheLittlePrince.txt', encoding='utf-8-sig') as fo:
    theLittlePrinceTxt = fo.read()
txt = theLittlePrinceTxt.lower()
sep = '''  ,./:?/! '\n  " [] ()  ~ '''
stops = {'by', 'his', 'their', 'again', 'off', 'where', 'now', 'up', 'this', 'before', 'which', 'after', 'a', 'then',
 "haven't", 'weren', 'll', 'down', 'or', 'no', "shan't", 'herself', 'in', 'some', 'such', "she's", 'does', 'nor', 
'just', "won't", 'them', 'further', 'how', 'am', 'mightn', 'it', 'too', 'ourselves', 'is', 'couldn', 'themselves', 
'should', 'ain', 'o', 'hadn', 'under', 'shan', 'him', "it's", 've', 'to', "don't", 'at', 'these', 'our', 'same', 'between', 
"you'd", 'isn', 'yourselves', 'until', 't', "mustn't", 'didn', 'few', 'each', 're', 'through', 'above', 'all', "you're", 
'been', 'hers', 'have', 'being', 'if', 'theirs', 'most', "doesn't", "hasn't", 'an', 'and', 'below', "couldn't", 'i', 'we', 
"hadn't", 'mustn', 'about', "shouldn't", 'there', 'her', 'y', 'here', 'was', "isn't", "needn't", 'were', 'haven', 'out', 
'ours', 'over', 'once', 'having', 'against', 'don', 'has', 'but', 'wouldn', 'with', 'other', 'doesn', 'itself', 'aren', 'when',
 'as', "wasn't", 'myself', "you'll", 'because', 'the', 'so', "didn't", 'are', 'for', "you've", 'd', 'hasn', 'wasn', 'on', 'he', 
's', 'of', 'they', 'needn', 'ma', 'while', 'than', 'from', "weren't", 'those', 'what', 'who', 'himself', "should've", 'will', 
'whom', 'more', "mightn't", 'do', 'its', 'why', 'only', 'that', 'during', 'not', 'had', 'own', 'm', 'me', 'very', 'doing', 'can',
 'be', 'my', 'both', 'into', "aren't", 'shouldn', 'won', 'yours', 'did', 'you', 'she', 'yourself', 'your', "wouldn't", 'any', "that'll"} 
# The set above defines the stop words

for s in sep:
    txt = txt.replace(s, " ")

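allWord = txt.split()
mset = set(allWord)  # convert the word list to a set to drop duplicate words
mset = mset - stops  # remove the stop words
mdict = {}  # dictionary mapping each word to its count

for m in mset:
    mdict[m] = allWord.count(m)  # count how often each remaining word occurs

mlist = list(mdict.items())  # convert the dictionary to a list of (word, count) pairs
mlist.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
print(mlist[0:10])
print(mlist[10:20])  # print the top 20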

import pandas as pd  # write a CSV file to the current working directory
pd.DataFrame(data=mlist).to_csv('Little.csv', encoding='utf-8')
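The counting loop above calls `allWord.count(m)` once per distinct word, which rescans the whole word list each time. As an alternative sketch (not the approach used in this assignment), `collections.Counter` from the standard library counts every word in a single pass. The sample sentence and the short `demo_stops` set below are illustrative stand-ins for the text file and the full `stops` set.

```python
import re
from collections import Counter

# Illustrative stand-ins for the real text file and stop-word set
sample = ("The little prince looked at the rose. The rose looked at "
          "the little prince, and the prince smiled.")
demo_stops = {'the', 'at', 'and'}

# Lowercase the text, then treat runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", sample.lower())

# Count in one pass, skipping stop words
counts = Counter(w for w in words if w not in demo_stops)

# most_common(n) returns the n highest counts, already sorted descending
top = counts.most_common(3)
print(top)
```

Using a regular expression for tokenizing also keeps contractions such as "haven't" intact, so they can be matched against stop-word entries that contain an apostrophe.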

           

Results:

[Figure: two result screenshots]

Online word cloud:

[Figure: word cloud image]