SubString match with a big file
Problem:
===================
Give you a big txt file, whose size is 5GBytes, and another substring which has a size about 1K bytes, how to find the position that this substring occured in the big file??
please take care of two different scenario:
1. if the main memory is enough to load 5G bytes
2. if the main memory is less than 5G bytes
e.g:
the big file:
"goodmorning............"
the matching string:
"morning"
return 4;
Discuss:
======================
This is an extension of the typical substring matching problem:
Given that you have one string of length N and a small string of length M. How do you efficiently is the small string a substring of the larger one? if it is, give the position.
Simple Solution: (Time Complexity = O(N*M) )
====================
int isAnagram(string strA, string strB)
{
bool matched;
for(int i=0; i< (strA.length()-strB.length()+1) ; i++)
{
for(int j=0; j< strB.length(); j++){
matched = ture;
if(strA[i+j] != strB[j]){
matched = false;
break;
}
}
if (matched) return i;
}
return N+1;//not found
}
Knuth–Morris–Pratt (KMP)Solution: (Time Complexity = O(N) )
==========================
void failLink (int* s, int* next)
{
int j,k;
next[0] = -1;
for (j=1; s[j] != '/0'; j++)
{
k = next[j-1];
while (k != -1 && s[j] != s[k+1])
k = next[k];
if(s[j] == s[k+1]) next[j]=k+1;
else{
next[j]=-1;
}
}
}
char* kmpmatch(char* s, char* t, int* next)
{
int j=0;
int k=0;
while (s[j] != '/0' && t[k] != '/0')
{
if(s[j]==t[k]){j++; k++;}
else if (j==0){k++;}
else{
j=next[j]+1;
}
}
if (s[j]=='/0') return t-k+j;
return NULL;
}