天天看點

SubString match with a big file

SubString match with a big file

Problem:

===================

Give you a big txt file, whose size is 5GBytes, and another substring which has a size about 1K bytes, how to find the position that this substring occured in the big file??

please take care of two different scenario:

1. if the main memory is enough to load 5G bytes

2. if the main memory is less than 5G bytes

e.g:

the big file:

"goodmorning............"

the matching string:

"morning"

return 4;

Discuss:

======================

This is an extension of the typical substring matching problem:

Given that you have one string of length N and a small string of length M. How do you efficiently is the small string a substring of the larger one? if it is, give the position.

Simple Solution: (Time Complexity = O(N*M) )

====================

int isAnagram(string strA, string strB)

{

    bool matched;

    for(int i=0; i< (strA.length()-strB.length()+1) ; i++)

   {

      for(int j=0; j< strB.length(); j++){

        matched = ture;

        if(strA[i+j] != strB[j]){

          matched = false;

          break;

        }

      }

      if (matched) return i;

    }

    return N+1;//not found

}

Knuth–Morris–Pratt (KMP)Solution: (Time Complexity = O(N) )

==========================

void failLink (int* s, int* next)

{

  int j,k;

  next[0] = -1;

  for (j=1; s[j] != '/0'; j++)

  {

    k = next[j-1];

    while (k != -1 && s[j] != s[k+1])

      k = next[k];

    if(s[j] == s[k+1]) next[j]=k+1;

    else{

      next[j]=-1;   

    }

  }

}

char* kmpmatch(char* s, char* t, int* next)

{

  int j=0;

  int k=0;

  while (s[j] != '/0' && t[k] != '/0')

  {

    if(s[j]==t[k]){j++; k++;}

    else if (j==0){k++;}

    else{

      j=next[j]+1;

    }

  }

  if (s[j]=='/0') return t-k+j;

  return NULL;

}

繼續閱讀