Week 1 Assignment (Continued)

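First, thanks to Teacher Zou for the attention and the reply. Based on the replies from Teacher Zou and Teacher Yang, I have updated Saturday's program.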

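I understand Teacher Zou's questions as covering two points:

1. How should the program change once stop words are introduced?

2. Which part of the program takes the most time, and how can that be explained?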

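I understand Teacher Yang's comments as covering the following points:

Compare the performance of a hash and an array.
List the features and requirements, and think about the core interfaces.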

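Following the teachers' suggestions, I have made the following changes to the program so far:

File reading and writing now live in a separate class, MyFile.java.
The source file is now read into a StringBuffer instead of a String. When I timed version 1 of the program, reading the file took a long time, and looking at the code showed that building the content with a String while reading was what hurt performance.
The program is now split up so that each piece of functionality is handled by its own function.
In response to Teacher Zou's question, I added a new function that loads stoplist.txt, and the word-frequency function now checks each word against it.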
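Version 1 is not shown in this post, so the following is only a sketch of the kind of difference involved, assuming the slow version appended each line to a String with +=. The class and method names (ConcatDemo, joinWithString, joinWithBuilder) are made up for illustration. It uses StringBuilder, the unsynchronized counterpart of StringBuffer; either one avoids copying the whole accumulated text on every append.

```java
import java.util.Arrays;

public class ConcatDemo {
    // Assumed shape of the slow version: every += builds a new String
    // and copies everything accumulated so far.
    static String joinWithString(String[] lines) {
        String result = "";
        for (String line : lines) {
            result += line + " ";
        }
        return result;
    }

    // Buffer-based version: appends into a growable internal array.
    static String joinWithBuilder(String[] lines) {
        StringBuilder result = new StringBuilder();
        for (String line : lines) {
            result.append(line).append(' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Synthetic input, not the real anna.txt.
        String[] lines = new String[20000];
        Arrays.fill(lines, "all happy families are alike");

        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();

        System.out.println("same result   : " + a.equals(b));
        System.out.println("String +=     : " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder : " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The exact numbers depend on the machine and the JVM, so the point is only the relative gap between the two approaches.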

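Note: I was not familiar with the term "stop word", so I looked it up online. As I understand it, an English text contains many words such as "am", "is", and "are" whose frequencies are not very meaningful to count. Such words can be put in the stoplist, and during counting they are either counted only once or simply ignored. I am guessing this is what Teacher Zou had in mind with the question. Is that right?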
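To make the "simply ignored" option concrete, here is a minimal sketch on a one-sentence input. The class name StopWordDemo, the method count, and the three-word stop set are assumptions for illustration only; the real list comes from stoplist.txt.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopWordDemo {
    // Counts each word, skipping empty tokens and anything in the stop set.
    static Map<String, Integer> count(String[] words, Set<String> stop) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty() || stop.contains(w)) {
                continue;
            }
            freq.merge(w, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return freq;
    }

    public static void main(String[] args) {
        // Illustrative stop set only.
        Set<String> stop = new HashSet<>(Arrays.asList("am", "is", "are"));
        String[] words = "i am happy you are happy she is happy".split(" ");
        // "happy" is counted 3 times; "am", "is", "are" do not appear in the result.
        System.out.println(count(words, stop));
    }
}
```

Because the stop list is held in a HashSet, each membership check costs about the same no matter how long the list is.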
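After the changes, the main function simply calls each function in the order of the algorithm's flowchart:
1. Load the stop list
2. Read the content of the source file to be counted
3. Preprocess the source content
4. Count word frequencies
5. Write the results to the result file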

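Thoughts:

As for Teacher Zou's second question, running the program shows that the preprocessing function takes the most time. On my machine, the running time of each function is as follows: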

getStopList	2ms
getSourceContent	28ms
pretreatmentContent	447ms
getFreq	53ms
writeFile	90ms
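Per-function numbers like these can be collected with a small helper around System.nanoTime. The sketch below is one possible way to do it, not necessarily how the figures above were measured; PhaseTimer and timeMs are hypothetical names, and the input is synthetic text rather than anna.txt. A plausible explanation for the preprocessing cost (an assumption, not something verified here) is that pretreatmentContent makes several full passes over the text: toLowerCase, two regex replaceAll calls, and a regex split, each producing a new copy.

```java
public class PhaseTimer {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMs(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Synthetic input standing in for the real source file.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("Happy families, are all alike; ");
        }
        final String content = sb.toString();

        final String[][] out = new String[1][];
        long ms = timeMs(() -> {
            // Same sequence of steps as pretreatmentContent.
            String c = content.toLowerCase();
            c = c.replaceAll("[^A-Za-z]", " ");
            c = c.replaceAll("\\s+", " ");
            out[0] = c.split("\\s+");
        });
        System.out.println("pretreatment: " + ms + " ms for " + out[0].length + " words");
    }
}
```

A single run like this includes JIT warm-up, so repeating the measurement a few times gives a steadier picture.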

As for Teacher Yang's suggestion to compare hash and array performance, I will continue with that when I have time.
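As a possible starting point for that comparison, here is a rough sketch that counts the same synthetic word list twice: once with a HashMap and once with plain lists searched linearly. All names here (HashVsArray, countWithHash, countWithArray) and the input sizes are assumptions for illustration; this is not a measurement of the actual program.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class HashVsArray {
    // Hash-based counting: each lookup takes roughly constant time.
    static Map<String, Integer> countWithHash(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    // Array-based counting: each lookup scans the words seen so far,
    // so the cost grows with the number of distinct words.
    static Map<String, Integer> countWithArray(String[] words) {
        List<String> keys = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (String w : words) {
            int idx = keys.indexOf(w); // linear search
            if (idx < 0) {
                keys.add(w);
                counts.add(1);
            } else {
                counts.set(idx, counts.get(idx) + 1);
            }
        }
        // Copy into a map only so the two results can be compared.
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            freq.put(keys.get(i), counts.get(i));
        }
        return freq;
    }

    public static void main(String[] args) {
        // Synthetic input: 100,000 words drawn from 2,000 distinct values.
        Random rnd = new Random(42);
        String[] words = new String[100_000];
        for (int i = 0; i < words.length; i++) {
            words[i] = "w" + rnd.nextInt(2_000);
        }

        long t0 = System.nanoTime();
        Map<String, Integer> a = countWithHash(words);
        long t1 = System.nanoTime();
        Map<String, Integer> b = countWithArray(words);
        long t2 = System.nanoTime();

        System.out.println("same counts : " + a.equals(b));
        System.out.println("hash        : " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("array scan  : " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The gap should widen as the number of distinct words grows, since only the array version pays for a scan on every lookup.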

Finally, here is the code:

MyFile.java

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;

public class MyFile {

    // Reads the whole file into one string, replacing each line break with a space.
    public static String readFile(String path) {
        StringBuffer result = new StringBuffer();
        try {
            BufferedReader br = new BufferedReader(new FileReader(
                    new File(path)));
            String tmp = null;
            while ((tmp = br.readLine()) != null) {
                result.append(tmp);
                result.append(" ");
            }
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    // Writes one "word:count" line per entry to result.txt in the working directory.
    public static void writeFile(List<Map.Entry<String, Integer>> lst) {
        try {
            BufferedWriter bw = new BufferedWriter(new FileWriter(new File(
                    System.getProperty("user.dir") + "//result.txt")));
            for (int i = 0; i < lst.size(); i++) {
                bw.append(lst.get(i).getKey() + ":" + lst.get(i).getValue() + "\n");
            }
            bw.flush();
            bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

WordFreqStatistics.java

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.Map.Entry;

public class WordFreqStatistics {

    public static String sourceFilePath = System.getProperty("user.dir")
            + "//anna.txt";
    public static String stopWordFilePath = System.getProperty("user.dir")
            + "//stoplist.txt";
    public static Map<String, Integer> mp = new HashMap<String, Integer>();
    public static Set<String> stop = new HashSet<String>();
    public static String words[] = null;
    public static String sourceContent = null;

    // Loads stoplist.txt into the stop set, one entry per whitespace-separated word.
    public static void getStopList() {
        String stopContent = MyFile.readFile(stopWordFilePath);
        String stopWords[] = stopContent.split("\\s+|\\r|\\n|\\t");
        for (String word : stopWords) {
            stop.add(word);
        }
    }

    public static String getSourceContent(String filepath) {
        return MyFile.readFile(filepath);
    }

    // Lowercases the text, turns every non-letter into a space, and splits it into words.
    public static String[] pretreatmentContent(String content) {
        content = content.toLowerCase();
        content = content.replaceAll("[^A-Za-z]", " ");
        content = content.replaceAll("\\s+", " ");
        return content.split("\\s+");
    }

    // Counts every word that is not in the stop set.
    public static void getFreq(String[] words) {
        for (int i = 0; i < words.length; i++) {
            // split() yields an empty first token when the text starts with a non-letter.
            if (!words[i].isEmpty() && !stop.contains(words[i])) {
                if ((mp.get(words[i])) != null) {
                    int value = mp.get(words[i]).intValue();
                    value++;
                    mp.put(words[i], Integer.valueOf(value));
                } else {
                    mp.put(words[i], Integer.valueOf(1));
                }
            }
        }
    }

    // Returns the map entries sorted by count, highest first.
    public static List<Map.Entry<String, Integer>> sort() {
        ArrayList<Entry<String, Integer>> lst = new ArrayList<Entry<String, Integer>>(
                mp.entrySet());
        Collections.sort(lst, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
                return e2.getValue().compareTo(e1.getValue());
            }
        });
        return lst;
    }

    public static void main(String[] args) {
        getStopList();
        sourceContent = getSourceContent(sourceFilePath);
        words = pretreatmentContent(sourceContent);
        getFreq(words);
        MyFile.writeFile(sort());
    }
}

To be continued
