Software Engineering Practice, Assignment 3 (Pair Assignment 2)

Assignment Description

	Link
Course this assignment belongs to	https://edu.cnblogs.com/campus/fzu/SoftwareEngineering1916W
Assignment requirements	https://edu.cnblogs.com/campus/fzu/SoftwareEngineering1916W/homework/2688
Pair student IDs	221600131, 221600439
Assignment goal	Implement a console program that counts word frequencies in a text file.
PDF

GitHub

Basic requirements: https://github.com/temporaryforfzuse/PairProject1-C

Advanced requirements: https://github.com/temporaryforfzuse/PairProject2-C

Division of Work

221600131: WordCount basics, test data construction, crawler, bonus task

221600439: main body of WordCount

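PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning
- Estimate	Estimate how long the task will take	5
Development	Development
- Analysis	Requirements analysis (including learning new technologies)	30 (time spent learning new technologies is counted under coding)	240
- Design Spec	Write design documents	10
- Design Review	Design review
- Coding Standard	Coding standard (define suitable conventions for the current development)
- Design	Detailed design
- Coding	Coding		660
- Code Review	Code review	Done continuously during development, not as a separate step
- Test	Testing (self-testing, fixing code, committing changes)
Reporting	Reporting	30
- Test Report	Test report
- Size Measurement	Size measurement
- Postmortem & Process Improvement Plan	Postmortem and process improvement plan
Total	370	1020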

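See the screenshot. For unknown reasons the teaching assistants only grade C++ on Windows, so development had to be done over remote desktop, and the time recorded there is the timing used here. Note that because the requirements were seriously unclear, a project that was already finished over the weekend had to be heavily modified during the working week.

“There is always time to squeeze in for a rewrite.”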

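Difficulties and Solutions

Pairing itself posed no difficulty; the collaboration was very pleasant.

221600131 was described as: quick-thinking, highly innovative, enthusiastic about learning, and very conscientious. Proficient in Python and good at data mining.

221600439 was described as: strong at coding and engineering, with a good ability to track down bugs.

Difficulty

The requirements were extremely unclear.

Solution

Just power through and write it.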

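Problem-Solving Approach

Common Part

The project is split into a DLL and a MainProject. Since the library may later be called from other languages, the exposed interface must be C-style, so C++ STL structures cannot be used as inputs or outputs and we must define our own structs. Memory reclamation also has to be considered: whoever allocates the memory is responsible for freeing it.

Given the assignment requirements, reading the file once is enough. The DLL handles the actual processing internally, and the caller handles the returned data. The DLL therefore exposes two APIs: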

extern "C" __declspec(dllexport) WordCountResult CalculateWordCount(const char * fileName);
extern "C" __declspec(dllexport) void ClearWordAppear(WordCountResult * resultStruct);
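
The ownership rule above (the library allocates, the library frees) can be sketched as follows. This is a minimal illustration, not the project's actual header: the struct layouts are inferred from the fields used later in this post, MakeDemoResult and duplicate are hypothetical stand-ins so the sketch needs no input file, and __declspec(dllexport) is omitted so that it compiles outside MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed layouts: only plain C types cross the DLL boundary.
struct WordCountWordAppear {
    char*  word;   // NUL-terminated, allocated by the library
    size_t count;
};

struct WordCountResult {
    int    errorCode;
    size_t uniqueWords;
    WordCountWordAppear* wordAppears; // array of uniqueWords entries
};

// Hypothetical helper: copies a C string into library-owned memory.
static char* duplicate(const char* s) {
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    return copy;
}

// Hypothetical stand-in for CalculateWordCount (no file needed).
extern "C" WordCountResult MakeDemoResult() {
    WordCountResult r{};
    r.uniqueWords = 2;
    r.wordAppears = new WordCountWordAppear[r.uniqueWords];
    r.wordAppears[0] = { duplicate("hello"), 3 };
    r.wordAppears[1] = { duplicate("world"), 1 };
    return r;
}

// The library that allocated the memory is also the one that frees it.
extern "C" void ClearWordAppear(WordCountResult* r) {
    if (r == nullptr || r->wordAppears == nullptr) return;
    for (size_t i = 0; i < r->uniqueWords; i++) {
        delete[] r->wordAppears[i].word;
    }
    delete[] r->wordAppears;
    r->wordAppears = nullptr; // makes a second call harmless
    r->uniqueWords = 0;
}
```

A caller in any language only ever sees char* and size_t, and hands the struct back to ClearWordAppear instead of freeing the pointers itself, which avoids mixing allocators across the DLL boundary.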

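Basic Requirements

Problem-Solving Approach / Design and Implementation

There is not much to think about: one big loop does the whole job. It takes a single function and an algorithm of under a hundred lines, so forcing it into three or four parts would be pure over-design.

Optimization Ideas

The scan itself is O(n), where n is the number of characters. The hash map used has O(1) average lookup time, and the sort is O(m log m), where m is the number of distinct words. Any slowness should therefore come from I/O and the STL.

The first implementation read the file byte by byte. Tested on a regularly patterned data set 44 million in size, it took 4.296 seconds.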

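This approach can be slow because of the large number of I/O calls. The optimization is to read the file into an in-memory buffer and then take bytes from the buffer one at a time. Switching to stringstream brought the time down to 3 seconds.

Profiling pinpointed the slowest code. Dropping stringstream and reading directly from an in-memory array should be faster still. In addition, std::map is typically implemented as a red-black tree rather than a hash map, so switching to a hash map should improve performance further.

After switching to C-style file reads and replacing std::map with std::unordered_map, the run takes only 0.3 seconds. The remaining cost lies in sorting words by frequency, hash map insertions and lookups, and memory comparisons, so there is little need to optimize further.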

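If further optimization is needed, the focus should be memory usage. Because the file is read in one go and every word must be kept in memory, the program may need up to three times the file size in memory, so large files can only be processed by a 64-bit build. This can be eased by trading memory for more I/O, for example reading only 100 MB of the file at a time. For the in-memory word counts there is no good solution yet.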
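
The idea of reading the file in bounded chunks can be sketched as below. It is an illustration under simplifying assumptions, not the project's code: a word here is just a maximal run of ASCII letters (the real rule also involves digits and a minimum length), and countWordsChunked and chunkSize are hypothetical names. The one subtle point is that a word may straddle a chunk boundary, so the partial word is carried over into the next chunk.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Counts lowercase-folded words while holding at most chunkSize bytes
// of the input in memory at a time.
std::unordered_map<std::string, size_t>
countWordsChunked(std::istream& in, size_t chunkSize) {
    if (chunkSize == 0) chunkSize = 1; // a zero-sized read would never advance
    std::unordered_map<std::string, size_t> counts;
    std::vector<char> buffer(chunkSize);
    std::string word; // survives across chunks, so split words are rejoined

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount(); // may be short on the last chunk
        for (std::streamsize i = 0; i < got; i++) {
            const unsigned char c = static_cast<unsigned char>(buffer[i]);
            if (std::isalpha(c)) {
                word += static_cast<char>(std::tolower(c));
            } else if (!word.empty()) {
                ++counts[word]; // operator[] inserts 0 on first sight
                word.clear();
            }
        }
    }
    if (!word.empty()) ++counts[word]; // flush the word at end of input
    return counts;
}
```

With a std::ifstream opened in binary mode and a chunk size of, say, 100 MB, the input buffer stays bounded regardless of file size. The map of distinct words still grows with the vocabulary, which is the part noted above as unsolved.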

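A test on a text file of about 760 MB, containing roughly 700,000 four-byte words, took 37 seconds.

Test Script

Test data and script: https://files.cnblogs.com/files/aaaaaaaaaaaaaa/結對2測試資料.rar

In my view, a reasonable testing setup is:

Add a tests folder to the GitHub repository to hold the test data.
Add a TestProject to the GitHub repository as the test project, kept in the repo.
Add a .travis.yml or appveyor.yml to the GitHub repository for automated builds and tests.

However, this assignment never mentions the importance of CI, and test data may not be submitted alongside the code. I use a script rather than a Visual Studio unit test project precisely because submitting a test project is not allowed. I also believe that a good test should not be coupled to the external environment unless necessary, and my tests fall short here. They are better described as regression tests than as unit tests.

Code coverage analysis is available only in Visual Studio Enterprise, not in the free Community edition. Since I do not normally develop on Windows, I did not spend time looking for other tools or cracked versions.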

const testCaseDir = 'cases-1/'
const testCaseCount = 11
const cp = require('child_process')
const fs = require('fs')
const path = require('path')

for (let i = 1; i <= testCaseCount; i++) {
  const randomFileName = (Math.random() * 10000000000 + new Date().getTime()).toString(16)
  const inFileName = path.resolve(__dirname, `${testCaseDir}input${i}.txt`)
  const outFileName = path.resolve(__dirname, `${testCaseDir}result${i}.txt`)
  const stdout = cp.execSync(`"D:\\Projects\\Homework\\fzuse-hw3\\221600131\&221600439\\src\\x64\\Release\\WordCount.exe" ${inFileName}`).toString('utf-8')
  if (fs.existsSync(outFileName))  {
    const fout = fs.readFileSync(outFileName, 'utf-8')
    if (stdout.trim() !== fout.trim()) {
      console.log(`Failed at ${i}`)
      console.log(`=== Expected ====`)
      console.log(fout)
      console.log(`=== Actual ====`)
      console.log(stdout)
    } else {
      console.log(`OK at ${i}`)
    }
  } else {
    fs.writeFileSync(outFileName, stdout, 'utf-8')
  }
}

Approach to Constructing Test Data

Special symbols
Boundary conditions
1. How is a word defined? For example, aaaa is a word and 0aaaa is not; of a0aa, a|aa and aaaa0, only aaaa0 is.
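2. How is a line defined?
3. How is a whitespace character defined?

The basic idea is to probe around every if statement and write edge cases that could make it go wrong. Once these are covered, the test data is complete. Part of the test data is shown in the figure.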
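
As an example of pinning one of these definitions down, the line rule can be written as a small function and probed with edge cases. This is a sketch, assuming that a line counts only if it contains at least one non-whitespace character (which matches the hasNotBlankCharacter flag in the key code) and that "whitespace" means whatever std::isspace accepts in the default locale. The project's own isEmptyChar is not shown in this post and may differ, and countNonBlankLines is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Counts lines that contain at least one non-whitespace character.
// A final line without a trailing '\n' still counts.
size_t countNonBlankLines(const std::string& text) {
    size_t lines = 0;
    bool hasNonBlank = false;
    for (unsigned char c : text) {
        if (c == '\n') {
            if (hasNonBlank) lines++;
            hasNonBlank = false;
        } else if (!std::isspace(c)) { // '\r', ' ' and '\t' do not make a line count
            hasNonBlank = true;
        }
    }
    if (hasNonBlank) lines++; // last line, no newline at the end
    return lines;
}
```

Edge cases worth feeding it include an empty file, a file with only newlines, CRLF line endings, whitespace-only lines, and a last line with no trailing newline.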

Key Code

Together with the comments, the code is largely self-explanatory.

EXTERN WordCountResult CalculateWordCount(const char * fileName)
{
	auto ret = WordCountResult();
	bool runStateMachine = true;
	char c = 0;
	std::string word = ""; // avoids manual dynamic allocation; std::string is simpler
	size_t wordLength = 0; // capped at wordAtLeastCharacterCount; not counted beyond that
	bool isValidWordStart = true;
	bool hasNotBlankCharacter = false;
	auto map = std::unordered_map<std::string, size_t>();
	FILE* f;
	if (fopen_s(&f, fileName, "rb") != 0) {
		ret.errorCode = WORDCOUNTRESULT_OPEN_FILE_FAILED;
		return ret;
	}
	fseek(f, 0, SEEK_END);
	long fileLength = ftell(f);
	fseek(f, 0, SEEK_SET);
	char * string = (char*)malloc(fileLength + 1);
	fread(string, fileLength, 1, f);
	fclose(f);
	string[fileLength] = 0;

	size_t currentPosition = 0;
	while (runStateMachine) {
		c = string[currentPosition];
		if (currentPosition == fileLength) {
			runStateMachine = false; // end of file reached; do not exit yet, flush the pending state first
			c = 0;
		}
		else {
			currentPosition++;
			if (c == '\r') continue; // Thanks God
			ret.characters++;
		}

		if (c >= 'A' && c <= 'Z') {
			c = c - 'A' + 'a';
		}

		if (!isEmptyChar(c)) {
			hasNotBlankCharacter = true;
		}

		if (isCharacter(c)) {
			if (isLetter(c)) {
				// the first few characters must all be alphabetic, otherwise this is not a word
				if ((wordLength > 0 && wordLength < wordAtLeastCharacterCount) || (wordLength == 0 && isValidWordStart)) {
					if (isAlphabet(c)) {
						word += c;
						wordLength++;
					}
					else {
						isValidWordStart = false;
						word = "";
						wordLength = 0;
					}
				}
				else {
					word += c;
				}
			}
		}

		if (!isLetter(c)) {
			if (wordLength >= wordAtLeastCharacterCount) { // no longer alphanumeric, so this may be the end of a word
				if (map.find(word) == map.end()) {
					map[word] = 0;
					ret.uniqueWords++;
				}
				map[word]++;
				ret.words++;
			}
			word = "";
			wordLength = 0;

			if (isSeparator(c)) { // a word can only start right after a separator
				isValidWordStart = true;
			}
		}

		if (isLf(c) || !runStateMachine) {
			if (hasNotBlankCharacter) { // any line containing a non-blank character is counted
				ret.lines++;
			}
			hasNotBlankCharacter = false;
		}
	}

	auto sortedMap = std::vector<WordCountPair>(map.begin(), map.end());
	std::sort(sortedMap.begin(), sortedMap.end(), [](const WordCountPair& lhs, const WordCountPair& rhs) noexcept {
		if (lhs.second == rhs.second) {
			return lhs.first < rhs.first;
		}
		return lhs.second > rhs.second;
	});
	ret.wordAppears = new WordCountWordAppear[ret.uniqueWords];
	size_t i = 0;
	for (auto &it : sortedMap) {
		ret.wordAppears[i].word = new char[it.first.length() + 1];
		strcpy_s(ret.wordAppears[i].word, it.first.length() + 1, it.first.c_str());
		ret.wordAppears[i].count = it.second;
		i++;
	}

	free(string);
	return ret;
}

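Advanced Requirements

Crawler

The crawler is written in Python, because it has many convenient request and parsing libraries. I mainly used the Requests library for HTTP and BeautifulSoup with lxml for parsing. Only the title and abstract need to be scraped, and inspecting the page HTML showed that the divs holding them have distinctive ids, so I located the two divs by XPath and extracted their text. A first straightforward run took more than ten minutes.

Performance Optimization

Crawling nearly a thousand papers in one pass took too long, so I switched to a multi-process crawler to improve performance. In CPython, multiple processes can run truly in parallel, because each process has its own GIL and they do not interfere with one another. With Python threads, each thread acquires the GIL, runs until it sleeps or the interpreter suspends it, and then releases the GIL. Every release triggers lock contention and a thread switch, which costs resources. I therefore chose a multi-process crawler.

After the change, with 32 processes, a full crawl takes under 20 seconds.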

WordCount

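First comes command-line parsing, which a library can handle directly. I chose CLI11 to avoid reinventing the wheel.

A better solution to this problem is regular expressions. The reasons to use them are:

The hand-written automaton is hard to maintain: every state must be known in advance, and adding a state means re-checking that none has been missed.
A state machine works against data encapsulation, because the states have to share data.
If I understand correctly, the advanced requirements involve no special characters.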
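
For comparison, a regular-expression version of the word rule might look like the sketch below. It is not what the project uses, and it rests on assumptions: the minimum word length is taken to be four letters, the pattern relies on the ECMAScript \b word boundary of std::regex, and extractWords is a hypothetical name. Note that \b treats the underscore as a word character, which may differ from how the state machine's isSeparator classifies it.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Returns the lowercase words in text: at least four letters first,
// then any run of letters or digits, with a word boundary on both sides.
std::vector<std::string> extractWords(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    static const std::regex wordPattern(R"(\b[a-z]{4}[a-z0-9]*\b)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), wordPattern), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

The whole rule sits in one pattern, so a change in the requirements means editing the pattern rather than re-checking every state. The trade-off is that std::regex is generally much slower than a hand-written scan, which matters for the large inputs discussed earlier.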

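My reason for not using regular expressions is:

The requirements are unclear and may change any number of times. A state machine is somewhat easier to adjust than a regular expression.

Since regular expressions are out, a single state machine does the job. Lexical, syntactic, and semantic analysis are not separated; the simplest state-transition algorithm handles lexing and semantics together.

As for looking up references... what is there to look up?

Diagrams

The core is a single function, so a class diagram would be rather forced. The state transition diagram is as follows:

It is similar to the basic version, so the details are not repeated.

Unit Tests

Testing has two parts: internal tests in C++ and external tests in Node.js. Node.js is used because it is inconvenient to include the test data in a PR, and even less convenient to embed it in the C++ code.

A sample of the C++ tests: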

TEST_METHOD(TestPharse)
{
	auto config = WordCountConfig();
	config.statByPharse = true;
	config.pharseSize = 3;
	config.useDifferentWeight = false;
	auto out = doTest("0\nTitle: Monday Tuesday Wednesday Thursday\nAbstract: Friday", &config);
	Assert::AreEqual(out.characters, (size_t)40);
	Assert::AreEqual(out.words, (size_t)5);
	Assert::AreEqual(out.lines, (size_t)2);
	Assert::AreEqual(out.uniqueWordsOrPharses, (size_t)2);
	Assert::AreEqual(out.wordAppears[0].word, "monday tuesday wednesday");
	Assert::AreEqual(out.wordAppears[0].count, (size_t)1);
	Assert::AreEqual(out.wordAppears[1].word, "tuesday wednesday thursday");
	Assert::AreEqual(out.wordAppears[1].count, (size_t)1);
	ClearWordAppear(&out);
}
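
The phrase counting exercised by this test is a sliding window over consecutive words. A minimal sketch of that idea is shown below, assuming words are joined by a single space. The project's own code keeps the original separators between words and also clears the window at line ends and after invalid words, which this sketch leaves out. buildPhrases is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Joins every run of phraseSize consecutive words with single spaces.
std::vector<std::string> buildPhrases(const std::vector<std::string>& words,
                                      size_t phraseSize) {
    std::vector<std::string> phrases;
    if (phraseSize == 0) return phrases;
    std::deque<std::string> window;
    for (const std::string& w : words) {
        window.push_back(w);
        if (window.size() == phraseSize) {
            std::string phrase;
            for (const std::string& part : window) {
                if (!phrase.empty()) phrase += ' ';
                phrase += part;
            }
            phrases.push_back(phrase);
            window.pop_front(); // slide the window forward by one word
        }
    }
    return phrases;
}
```

With the title words from the test above and a phrase size of 3, the window yields two phrases, and the single-word abstract yields none, which is consistent with the two unique phrases the test expects.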

The Node.js tests:

const testCaseDir = 'cases-2/'
const testCaseCount = 7
const cp = require('child_process')
const fs = require('fs')
const path = require('path')

for (let i = 1; i <= testCaseCount; i++) {
  const randomFileName = (Math.random() * 10000000000 + new Date().getTime()).toString(16) + '.txt'
  const inFileName = path.resolve(__dirname, `${testCaseDir}input${i}.txt`)
  const argFileName = path.resolve(__dirname, `${testCaseDir}arg${i}.txt`)
  const outFileName = path.resolve(__dirname, `${testCaseDir}result${i}.txt`)
  const arg = fs.readFileSync(argFileName, 'utf-8')
  cp.execSync(`"D:\\Projects\\Homework\\fzuse-hw3-2\\221600131\&221600439\\src\\Debug\\WordCount.exe" -i ${inFileName} -o ${randomFileName} ${arg}`)
  const stdout = fs.readFileSync(randomFileName, 'utf-8')
  if (fs.existsSync(outFileName))  {
    const fout = fs.readFileSync(outFileName, 'utf-8')
    if (stdout.trim() !== fout.trim()) {
      console.log(`Failed at ${i}`)
      console.log(`=== Expected ====`)
      console.log(fout)
      console.log(`=== Actual ====`)
      console.log(stdout)
    } else {
      console.log(`OK at ${i}`)
    }
  } else {
    fs.writeFileSync(outFileName, stdout, 'utf-8')
  }
  fs.unlinkSync(randomFileName)
}

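Some of the tests are shown in the figure:

Core Algorithm

Read this alongside the state transition diagram; the code is reasonably commented.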

EXTERN WordCountResult CalculateWordCount(struct WordCountConfig config)
{
	auto ret = WordCountResult();
	bool runStateMachine = true;
	char prev = 0, c = 0;
	std::string word = ""; // avoids manual dynamic allocation; std::string is simpler
	std::string separator = "";
	std::string token = "";
	size_t wordLength = 0; // capped at wordAtLeastCharacterCount; not counted beyond that
	auto map = std::unordered_map<std::string, size_t>();
	bool isValidWordStart = false;

	FILE* f;
	if (fopen_s(&f, config.in, "rb") != 0) {
		ret.errorCode = WORDCOUNTRESULT_OPEN_FILE_FAILED;
		return ret;
	}
	fseek(f, 0, SEEK_END);
	long fileLength = ftell(f);
	fseek(f, 0, SEEK_SET);
	char * string = (char*)malloc(fileLength + 1);
	fread(string, fileLength, 1, f);
	fclose(f);
	string[fileLength] = 0;

	ReadingStatus currentStatus = ALREADY;
	WordStatus wordStatus = NONE;
	std::list<WordInPharse> pharse;

	size_t currentPosition = 0;
	while (runStateMachine) {
		prev = c;
		c = string[currentPosition];
		if (currentPosition == fileLength) {
			runStateMachine = false; // end of file reached; do not exit yet, flush the pending state first
			c = 0;
		}
		else {
			currentPosition++;
		}

		if (c >= 'A' && c <= 'Z') {
			c = c - 'A' + 'a';
		}

		bool switchStatusInCurrentToken = true;
		// token reading and parsing are done together rather than split apart
		while (switchStatusInCurrentToken) {
			switchStatusInCurrentToken = false;
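			// the big switch could be avoided by turning the state transitions into a class,
			// but that is unnecessary since no further maintenance is planned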
			switch (currentStatus) {
			case ALREADY:
				if (isNumber(c)) {
					currentStatus = READING_PAPER_INDEX;
					switchStatusInCurrentToken = true;
					continue;
				}
				// else if (isEmptyChar(c)) { // normal case, do nothing
				// }
				else { // @TODO: report an error here
				}
				break;
			case READING_PAPER_INDEX:
				if (isNumber(c)) {
					token += c;
				}
				else if (isEmptyChar(c)) { // index fully read, switch state
					token = ""; // the index value is of no use; not sure why it is read at all
					currentStatus = WAITING_FOR_TITLE;
				}
				else { // @TODO: report an error here
				}
				break;
			case WAITING_FOR_TITLE:
				if (isEmptyChar(c) && c != ':') { // the Title tag may or may not be complete yet
					if (token == "title:") { // fully read
						isValidWordStart = true;
						currentStatus = FINDING_WORD_START;
						wordStatus = TITLE;
						token = "";
					}
					else {  // @TODO: report an error here
					}
				}
				else {
					token += c; // "title:" is not validated yet and is assumed well-formed; error reporting to be added later
				}
				break;
			case WAITING_FOR_ABSTRACT:
				if (isEmptyChar(c) && c != ':') { // same as for title
					if (token == "abstract:") { // fully read
						isValidWordStart = true;
						currentStatus = FINDING_WORD_START;
						wordStatus = ABSTRACT;
						token = "";
					}
					else {  // @TODO: report an error here
					}
				}
				else {
					token += c;
				}
				break;
			case FINDING_WORD_START:
				if (isLetter(c)) {
					if (wordLength == 0) {
						separator = token;
						token = "";
					}
					// the second half of the condition handles inputs such as 01abcdefg
					if ((wordLength > 0 && wordLength < wordAtLeastCharacterCount) || (wordLength == 0 && isValidWordStart)) {
						if (isAlphabet(c)) {
							wordLength++;
							if (wordLength == wordAtLeastCharacterCount) {
								currentStatus = READ_WORD;
								switchStatusInCurrentToken = true;
							}
							else {
								ret.characters++;
								word += c;
							}
							continue;
						}
					}
				}

				if (config.statByPharse) { // a word that is too short resets the phrase
					if (wordLength > 0) {
						pharse.clear();
					}
				}

				isValidWordStart = false;
				word = "";
				wordLength = 0;
				currentStatus = READ_WORD_END;
				switchStatusInCurrentToken = true;
				continue;
				break;
			case READ_WORD: // confirmed to be a word, keep going
				if (isLetter(c)) { // still alphanumeric, keep reading
					word += c;
					ret.characters++; // for non-word characters, counting is left to READ_WORD_END
				}
				else { // no longer alphanumeric, handle the rest
					currentStatus = READ_WORD_END;
					switchStatusInCurrentToken = true;
					continue;
				}
				break;
			case READ_WORD_END:

				if (word != "") {
					ret.words++; // a complete word has now been read
					if (config.statByPharse) {
						// positional initialization: word first, then separator
						pharse.push_back(WordInPharse{ word, separator });
						if (pharse.size() == config.pharseSize) {
							auto pharseString = getPharse(pharse);
							if (map.find(pharseString) == map.end()) {
								map[pharseString] = 0;
								ret.uniqueWordsOrPharses++;
							}
							if (config.useDifferentWeight) {
								if (wordStatus == TITLE) {
									map[pharseString] += titleWeight;
								}
								else {
									map[pharseString] += 1;
								}
							}
							else {
								map[pharseString]++;
							}

							pharse.pop_front();
						}
					}
					else {

						// slightly duplicated code; could be factored into a macro
						if (map.find(word) == map.end()) {
							map[word] = 0;
							ret.uniqueWordsOrPharses++;
						}

						if (config.useDifferentWeight) {
							if (wordStatus == TITLE) {
								map[word] += titleWeight;
							}
							else {
								map[word] += 1;
							}
						}
						else {
							map[word]++;
						}
					}

					isValidWordStart = false;
				}

				word = "";
				wordLength = 0;

				if (isLf(c) || !runStateMachine) { // on a line feed, switch to waiting for either TITLE or ABSTRACT
					ret.lines++;
					pharse.clear();
					token = "";
					if (wordStatus == TITLE) {
						currentStatus = WAITING_FOR_ABSTRACT;
					}
					else {
						currentStatus = ALREADY;
					}
					if (isLf(c)) {
						ret.characters++;
					}
				}
				else { // word handled, now wait for the next one
					if (!isValidWordStart) {
						if (isSeparator(c)) {
							isValidWordStart = true;
							token += c;
						}
					}
					if (isCharacter(c)) {
						ret.characters++;
					}
					currentStatus = FINDING_WORD_START;
				}

				break;
			}
		}
	}

	auto sortedMap = std::vector<WordCountPair>(map.begin(), map.end());
	std::sort(sortedMap.begin(), sortedMap.end(), [](const WordCountPair& lhs, const WordCountPair& rhs) {
		if (lhs.second == rhs.second) {
			return lhs.first < rhs.first;
		}
		return lhs.second > rhs.second;
	});
	ret.wordAppears = new WordCountWordAppear[ret.uniqueWordsOrPharses];
	size_t i = 0;
	for (auto &it : sortedMap) {
		ret.wordAppears[i].word = new char[it.first.length() + 1];
		strcpy_s(ret.wordAppears[i].word, it.first.length() + 1, it.first.c_str());
		ret.wordAppears[i].count = it.second;
		i++;
	}

	free(string); // release the file buffer, as the basic version does
	return ret;
}

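Bonus Task

Data analysis first needs a large enough data set, so I extended the crawler above to collect all the useful information on the CVPR website. While inspecting the front-end code fetched with Requests, I found a bibref class near the bottom of the page that holds a lot of information, including attributes that are not displayed, such as the month. On inspection, these entries all share the same structure.

So I wrote a regular expression to extract all the required information in one pass. The result is as follows:

But these kinds of data still leave little to play with. My initial ideas were to cluster papers by research area, to plot a trend of research progress based on the data sets the papers evaluate on (possibly with time-series forecasting), or to draw a regional heat map based on the authors' countries. Unfortunately none of this data is available, and the collections other people have compiled on GitHub add only one more paper attribute, which is not what I wanted. One option is to use the existing author names or paper titles to crawl related information elsewhere; I will try that when I have time.

So in the end I could only work with the author attribute. My goal was to draw an author relationship graph: each author is a circle, authors who have published together are connected by a line, and the more papers an author has, the larger the circle. The code extracts the author attribute with pandas, puts all authors into a list and iterates over it to count, first computing each author's paper count and then using a double loop to compute the links between authors. For visualization I used pyecharts, which is built on Baidu's ECharts. After processing the data in Jupyter it can be rendered directly, or exported as an ECharts-style web page, without writing any JavaScript.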
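
The counting step described above (paper counts per author, then a double loop over each paper's author list for the links) can be sketched as follows. The original was written in Python with pandas; this sketch is in C++ only to match the other listings in this post, and CoauthorGraph and buildCoauthorGraph are hypothetical names. It assumes each paper lists an author at most once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct CoauthorGraph {
    std::map<std::string, size_t> paperCount;                        // drives circle size
    std::map<std::pair<std::string, std::string>, size_t> linkCount; // drives the links
};

// papers[k] is the author list of the k-th paper.
CoauthorGraph buildCoauthorGraph(
        const std::vector<std::vector<std::string>>& papers) {
    CoauthorGraph g;
    for (const auto& authors : papers) {
        for (size_t i = 0; i < authors.size(); i++) {
            ++g.paperCount[authors[i]];
            for (size_t j = i + 1; j < authors.size(); j++) {
                // order each pair so that (A, B) and (B, A) share one link
                const auto key = std::minmax(authors[i], authors[j]);
                ++g.linkCount[std::make_pair(key.first, key.second)];
            }
        }
    }
    return g;
}
```

The two maps correspond to the nodes and links a graph chart needs: the paper count sets the circle size, and every key in linkCount becomes a line between two authors.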

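Visualization Result

Hovering over an author's circle dims the other circles and highlights the circles and links of that author's co-authors. A zoomed-in view:

If you are interested, click the link to download it and open the web page.

As before, a multi-process crawler is used. With 32 processes it takes about 20 seconds, about the same as before, so the details are not repeated here.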