.NET中删除空白字元串的10大方法

我們有無數方法可用于删除字元串中的所有空白，但是哪個更快呢？

介紹

我們有無數方法可用于删除字元串中的所有空白。大部分都能夠在絕大多數的用例中很好工作，但在某些對時間敏感的應用程式中，是否采用最快的方法可能就會造成天壤之别。

如果你問空白是什麼，那說起來還真是有些亂。許多人認為空白就是<code>space</code> 字元（unicodeu+0020，ascii 32，html<code>&#32;</code>），但它實際上還包括使得版式水準和垂直出現空格的所有字元。事實上，這是一整類定義為unicode字元資料庫的字元。

本文所說的空白，不但指的是它的正确定義，同時也包括string.replace(” “, “”)方法。

這裡的基準方法，将删除所有頭尾和中間的空白。這就是文章标題中“所有空白”的含義。

背景

這篇文章一開始是出于我的好奇心。事實上，我并不需要用最快的算法來删除字元串中的空白。

檢查空白字元

檢查空白字元很簡單。所有你需要的代碼就是：

char wp = ' ';

char a = 'a';

assert.true(char.iswhitespace(wp));

assert.false(char.iswhitespace(a));

但是，當我實作手動優化删除方法時，我意識到這并不像預期得那麼好。一些源代碼在微軟的參考源代碼庫的char.cs挖掘找到：

public static bool iswhitespace(char c) {

if (islatin1(c)) {

return (iswhitespacelatin1(c));

}

return charunicodeinfo.iswhitespace(c);

}

然後charunicodeinfo.iswhitespace成了：

internal static bool iswhitespace(char c)

{

unicodecategory uc = getunicodecategory(c);

// in unicode 3.0, u+2028 is the only character which is under the category "lineseparator".

// and u+2029 is th eonly character which is under the category "paragraphseparator".

switch (uc) {

case (unicodecategory.spaceseparator):

case (unicodecategory.lineseparator):

case (unicodecategory.paragraphseparator):

return (true);

return (false);

getunicodecategory()方法調用internalgetunicodecategory()方法，而且實際上相當快，但現在我們依次已經有了4個方法調用！以下這段代碼是由一位評論者提供的，可用于快速實作定制版本和jit預設内聯：

// whitespace detection method: very fast, a lot faster than char.iswhitespace

[methodimpl(methodimploptions.aggressiveinlining)] // if it's not inlined then it will be slow!!!

public static bool iswhitespace(char ch) {

// this is surprisingly faster than the equivalent if statement

switch (ch) {

case '\u0009': case '\u000a': case '\u000b': case '\u000c': case '\u000d':

case '\u0020': case '\u0085': case '\u00a0': case '\u1680': case '\u2000':

case '\u2001': case '\u2002': case '\u2003': case '\u2004': case '\u2005':

case '\u2006': case '\u2007': case '\u2008': case '\u2009': case '\u200a':

case '\u2028': case '\u2029': case '\u202f': case '\u205f': case '\u3000':

return true;

default:

return false;

删除字元串的不同方法

我用各種不同的方法來實作删除字元串中的所有空白。

分離合并法

這是我一直在用的一個非常簡單的方法。根據空格字元分離字元串，但不包括空項，然後将産生的碎片重新合并到一起。這方法聽上去有點傻乎乎的，而事實上，乍一看，很像是一個非常浪費的解決方式：

public static string trimallwithsplitandjoin(string str) {

return string.concat(str.split(default(string[]), stringsplitoptions.removeemptyentries));

linq

這是優雅地聲明式地實作這個過程的方法：

public static string trimallwithlinq(string str) {

return new string(str.where(c => !iswhitespace(c)).toarray());

正規表達式

static regex whitespace = new regex(@"\s+", regexoptions.compiled);

public static string trimallwithregex(string str) {

return whitespace.replace(str, "");

字元數組原地轉換法

該方法将輸入的字元串轉換成字元數組，然後原地掃描字元串去除空白字元（不建立中間緩沖區或字元串）。最後，經過“删減”的數組會産生新的字元串。

public static string trimallwithinplacechararray(string str) {

var len = str.length;

var src = str.tochararray();

int dstidx = 0;

for (int i = 0; i < len; i++) {

var ch = src[i];

if (!iswhitespace(ch))

src[dstidx++] = ch;

return new string(src, 0, dstidx);

字元數組複制法

這種方法類似于字元數組原地轉換法，但它使用array.copy複制連續非空白“字元串”的同時跳過空格。最後，它将建立一個适當尺寸的字元數組，并用相同的方式傳回一個新的字元串。

public static string trimallwithchararraycopy(string str) {

int srcidx = 0, dstidx = 0, count = 0;

if (iswhitespace(src[i])) {

count = i - srcidx;

array.copy(src, srcidx, src, dstidx, count);

srcidx += count + 1;

dstidx += count;

len--;

}

if (dstidx < len)

array.copy(src, srcidx, src, dstidx, len - dstidx);

return new string(src, 0, len);

循環交換法

用代碼實作循環，并使用<code>stringbuilder</code>類，通過依靠stringbuilder的内在優化來建立新的字元串。為了避免任何其他因素對本實施産生幹擾，不調用其他的方法，并且通過緩存到本地變量避免通路類成員。最後通過設定stringbuilder.length将緩沖區調整到合适大小。

// code suggested by http://www.codeproject.com/members/thebasketcasesoftware

public static string trimallwithlexerloop(string s) {

int length = s.length;

var buffer = new stringbuilder(s);

var dstidx = 0;

for (int index = 0; index < s.length; index++) {

char ch = s[index];

switch (ch) {

case '\u0020': case '\u00a0': case '\u1680': case '\u2000': case '\u2001':

case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':

case '\u2007': case '\u2008': case '\u2009': case '\u200a': case '\u202f':

case '\u205f': case '\u3000': case '\u2028': case '\u2029': case '\u0009':

case '\u000a': case '\u000b': case '\u000c': case '\u000d': case '\u0085':

length--;

continue;

default:

break;

buffer[dstidx++] = ch;

buffer.length = length;

return buffer.tostring();;

循環字元法

這種方法幾乎和前面的循環交換法相同，不過它采用if語句來調用iswhitespace()，而不是亂七八糟的<code>switch</code>伎倆 :)。

public static string trimallwithlexerloopchariswhitespce(string s) {

char currentchar = s[index];

if (iswhitespace(currentchar))

length--;

else

buffer[dstidx++] = currentchar;

原地改變字元串法（不安全）

這種方法使用不安全的字元指針和指針運算來原地改變字元串。我不推薦這個方法，因為它打破了.net架構在生産中的基本約定：字元串是不可變的。

public static unsafe string trimallwithstringinplace(string str) {

fixed (char* pfixed = str) {

char* dst = pfixed;

for (char* p = pfixed; *p != 0; p++)

if (!iswhitespace(*p))

*dst++ = *p;

/*// reset the string size

* only it didn't work! a garbage collection access violation occurred after using it

* so i had to resort to return a new string instead, with only the pertinent bytes

* it would be a lot faster if it did work though...

int32 len = (int32)(dst - pfixed);

int32* pi = (int32*)pfixed;

pi[-1] = len;

pfixed[len] = '\0';*/

return new string(pfixed, 0, (int)(dst - pfixed));

原地改變字元串法v2（不安全）

這種方法幾乎和前面那個相同，不過此處使用類似數組的指針通路。我很好奇，不知道這兩種哪種存儲通路會更快。

public static unsafe string trimallwithstringinplacev2(string str) {

fixed (char* pstr = str) {

int dstidx = 0;

for (int i = 0; i < len; i++)

if (!iswhitespace(pstr[i]))

pstr[dstidx++] = pstr[i];

// since the unsafe string length reset didn't work we need to resort to this slower compromise

return new string(pstr, 0, dstidx);

string.replace（“”，“”）

這種實作方法很天真，由于它隻替換空格字元，是以它不使用空白的正确定義，是以會遺漏很多其他的空格字元。雖然它應該算是本文中最快的方法，但功能不及其他。

但如果你隻需要去掉真正的空格字元，那就很難用純.net寫出勝過string.replace的代碼。大多數字元串方法将回退到手動優化本地c ++代碼。而string.replace本身将用comstring.cpp調用c ++方法：

fcimpl3(object*,

comstring::replacestring,

stringobject* thisrefunsafe,

stringobject* oldvalueunsafe,

stringobject* newvalueunsafe)

下面是基準測試套件方法：

public static string trimallwithstringreplace(string str) {

// this method is not functionaly equivalent to the others as it will only trim "spaces"

// whitespace comprises lots of other characters

return str.replace(" ", "");

許可證

這篇文章，以及任何相關的源代碼和檔案，依據the code project open license (cpol)的許可。

作者：小峰

來源：51cto

.NET中删除空白字元串的10大方法

繼續閱讀

艱難安裝LDAP,SSL認證

《Linux指令行與Shell腳本程式設計大全第2版.布盧姆》pdf

MySQL的4種隔離級别？出現問題

資料結構與算法（27）——排序（二）

XX系統實施過程問題總結

無元件上傳圖檔到資料庫中，最完整解決方案

【MySQL資料庫】資料庫索引事務1.索引2.事務

neo4j之cypher使用文檔

Dijkstra--簡易版（最短路徑）

GitHub連夜封殺！這份阿裡 10W 字内部 Java 字面試手冊到底有多強？

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

面試題解析：你接口測試是怎麼做的？

登入plsql 報錯 the account is locked --使用者被鎖

SequoiaDB巨杉資料庫C++驅動概述

hdu7108哈希