
A brief discussion of Liang Bin's book on search engines

Looking back now at the entry-level search engine book I read, many of the first-person remarks in it were written from my perspective at the time, so please don't mind them. Over the past few days I read Teacher Liang Bin's book "Into the Search Engine" for the first time. The book is very basic, the language is easy to follow and almost conversational, and the content is not hard to understand. It really is a good introductory book for people with no background, and I have read it several times by now.

The whole text has seven chapters. Chapter 1 introduces the basic background of search engines and reviews their development by walking through their history. (It is thundering outside as I write this, a small college memory worth recording.) Chapter 2 covers the main system divisions and architecture of a search engine; the language is concise and clear, explaining which parts a search engine consists of, what role each part plays, and what function it serves. Chapter 3 introduces the background, design principles, and techniques of the download system, as well as the design of the web page library. Chapter 4 introduces the principles and implementation details of the analysis system's submodules: information extraction, duplicate page detection, Chinese word segmentation, and PageRank. Chapter 5 introduces the basic concepts of full-text retrieval, document numbering, and forward and inverted tables, giving a comprehensive view of the indexing technology at the core of a search engine. Chapter 6 describes the two main functional modules of the query system. Chapter 7 systematically introduces how the search engine systems relate to one another by answering frequently asked questions, looks ahead to future development, and closes with the author's summary and some suggestions for readers.

What follows is a detailed summary of my own understanding of this book, chapter by chapter. After nearly a month of study, including looking up a lot of material on search engines and information retrieval online and reading some papers and books, my understanding of this area has grown considerably, and I now have a grasp of the basic principles of search engines. A search engine is a typical application of information retrieval: information retrieval is the field, and search engines are one major direction within it, involving a great deal of knowledge. Most of the information retrieval books I read early on describe this very well. Liang Bin's book can be called my primer, mainly because its language is concise and easy to understand; the book my mentor recommended was too difficult at the time, and many of its details were hard to grasp. Right now I am looking for my own ideas in Liu Ting's book on search engine practice. Time has passed quickly, I am about to start my thesis topic, and after that comes the graduation design and publishing a short paper. So far I still have no concrete idea, so I read papers every day, purposefully looking for methods or knowledge points that interest me.

Chapter 1 introduces the three kinds of search engines classified by service method: directory search engines, full-text search engines, and metasearch engines. My understanding of directory search engines is that they require a lot of manual intervention, so the cost is relatively high; the early Yahoo site used a directory search engine. This also has advantages: queries are fast and precision is high, since the site has already classified everything for you, and users can find the type of information they want through the site's navigation. The difficulty is that manual updates are quite slow. Typical full-text search engines are large engines like Google and Baidu; Google's PageRank and Baidu's hyperlink analysis method are both built on full-text retrieval. This kind of search covers a large amount of information, updates are timely, and no manual intervention is needed. Of course there are drawbacks too: users have to pick out what they need from a large amount of information, and problems such as topic drift can appear.

The last type is the metasearch engine, which is also a very important and rather special category. This kind of search engine has no data of its own: it sends the user's query to multiple search engines at the same time, processes the results they return with its own algorithm, and then passes them back to the user. The advantage is that the amount of information returned is large; the disadvantage is that the functionality of the underlying engines cannot be fully used and the user has to do more filtering. This involves a data fusion step: when results come back from many search engines, they must be re-ranked according to some algorithm, and many researchers study the efficiency, effectiveness, and robustness of such algorithms. On data fusion, my tutor gave me Teacher Wu's book last time, which I have not yet gotten around to translating. I should try to start translating it before graduation; translating the whole book may take a long time, and I will probably translate while learning, so there should not be too many difficulties, though some technical terms will require looking up specialized material online. I skimmed the book a few days ago and felt that it involves many formulas, that the practical part uses fairly large amounts of data, and that quite a few things need to be verified by experiments.
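To make the data fusion idea concrete for myself, here is a minimal Python sketch of one classic fusion scheme (CombSUM-style score summation). It is only an illustration, not the specific algorithms from Teacher Wu's book; the engine names, URLs, and scores are made up.

```python
# A minimal sketch of CombSUM-style data fusion: each engine returns
# normalized relevance scores, and documents returned by more engines with
# higher scores rise to the top of the merged list.
from collections import defaultdict

def combsum_fuse(result_lists):
    """result_lists: iterable of dicts mapping URL -> score in [0, 1]."""
    fused = defaultdict(float)
    for results in result_lists:
        for url, score in results.items():
            fused[url] += score                  # sum scores across engines
    # Sort by the summed score, highest first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    engine_a = {"http://example.com/a": 0.9, "http://example.com/b": 0.4}
    engine_b = {"http://example.com/b": 0.8, "http://example.com/c": 0.5}
    for url, score in combsum_fuse([engine_a, engine_b]):
        print(f"{score:.2f}  {url}")
```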

As for the history of search engines, I just familiarized myself with it, starting from the first search engine: essentially when each of the major search engine portals started and how they got going. The end of Chapter 1 introduces the better-known domestic search engines; several of them are quite good, and at least their results are satisfactory. Baidu's paid bidding ranking does break convention a bit, which is understandable. Since I started studying search engines, every time I look something up I unconsciously observe the quality and accuracy of the engine I am using; this habit feels useful and will help with search result evaluation later on.

Chapter 2 starts with a detailed explanation of search engine architecture: which parts make up a search engine, how each part works, and what role it plays. Simply put, a search engine consists of four systems: the download system, the analysis system, the index system, and the query system. My understanding at the time was that the download system uses web crawlers to fetch a large number of pages from the Internet. The crawling algorithm matters here: whatever algorithm is used must guarantee efficiency, crawl quickly, and fetch as many useful pages as possible; pages that are useless are not worth fetching at all. My current idea is to improve Lucene's crawling algorithm by modifying the source code directly; a senior student has written his own book on building web crawlers, which I should read when I have time, although the knowledge it covers is fairly complex. The papers and books I have read generally crawl pages following the depth-first or breadth-first principle on the web graph. In practice many crawlers run at the same time, and nowadays distributed crawlers work together, which is very efficient; choosing a good crawling algorithm lets the crawler fetch pages faster. After pages are fetched, the keywords in them are analyzed, which is the analysis system's job. The analysis step also involves many knowledge points: how to segment a page, and by what standard. The useful information on a page is generally the title, the headings, and the subheadings, and this information is used later when building the index. The key information is then classified and summarized into the search engine's database, that is, the index library. After analysis, the index is built so that later queries are convenient; finally, the user submits query keywords, and the search engine looks them up in the index library and returns results. This description is only a rough outline; the actual query process is more complex, and each system has its own algorithms and principles, which are introduced in the later chapters.
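To make the download step concrete, here is a toy breadth-first crawl loop in Python. It is only a sketch under simplifying assumptions: the seed URL is an example, link extraction uses a crude regular expression rather than a real HTML parser, and a real crawler would be distributed, respect robots.txt, throttle requests, and deduplicate far more carefully.

```python
# A toy breadth-first crawler: fetch a page, extract links, enqueue new ones.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def bfs_crawl(seed, max_pages=10):
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # skip pages that fail to download
        pages[url] = html
        # Very rough link extraction, just for illustration.
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

if __name__ == "__main__":
    crawled = bfs_crawl("https://example.com/")
    print(len(crawled), "pages fetched")
```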

Chapter 3 begins to explain the download system, starting from web crawlers. I have read several papers on crawlers; most of the code is not easy to follow, but the principles are understandable, and I feel this is something to learn slowly and accumulate bit by bit. A web crawler fetches information resources on the network according to certain algorithms and rules. My understanding of the principle is that the crawler, as a program we configure, should above all not disturb others when visiting a website. This is a bit anthropomorphic, but it is like being a guest: when you go to someone else's home for information, you try not to disturb the host. So when we configure a crawler, we should follow some principles and avoid fetching resources while the other party's server is busy. Resources should only be accessed with permission, and unauthorized web resources must not be accessed; these guidelines are a matter of basic respect. This is why the robots protocol exists for web crawlers: it states which content on a site you are allowed to access, which content must not be crawled, when you may come and collect information, and when you may not. It is essentially a politeness issue, which matters in real life and matters just as much on the Internet.
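As a small illustration of this politeness rule, here is a sketch of how a crawler can consult a site's robots.txt before fetching, using Python's standard urllib.robotparser. The URL and the user-agent name "MyCrawler" are only examples I made up.

```python
# Check robots.txt permission before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                          # download and parse the robots.txt file

url = "https://example.com/private/report.html"
if rp.can_fetch("MyCrawler", url): # is our user agent allowed to fetch this?
    print("allowed to fetch", url)
else:
    print("robots.txt forbids fetching", url)
```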

There is also the structure of the Web itself, which takes the shape of a bow tie. A page has many links: links pointing to the page itself are called backlinks, and links starting from the page and pointing to other pages are called forward links. We all know that a page with many links pointing to it is considered important, and a page with few links pointing to it is considered less important. The crawler naturally wants to fetch the important pages, so how to filter out the unimportant ones is the key problem for us as designers. Since the Web has this bow-tie structure, we try to start the crawler from the left part of the bow tie so that it can traverse the whole Web, collect more useful information, and make the collection as comprehensive as possible.

Web pages are generally divided into directory-type pages and authority pages. Directory-type pages are mainly there for users, helping them explore a site's information in depth; through directory pages we can reach authority pages, which generally sit in the middle or right part of the bow tie. Such pages have more backlinks, so their importance is generally considered high. Crawling strategies are roughly of two kinds: depth-first and breadth-first. From the papers I have read, the breadth-first strategy is relatively more efficient, although the choice also depends on the environment or domain, and different settings call for different strategies. When crawling, we must also avoid fetching the same page repeatedly, otherwise the crawler's efficiency drops very low. How do we make sure the crawler does not fetch the same page twice? Several methods exist; I think MD5 signatures are a good one: give each page a signature so we can tell which pages are the same. Here the signature is computed from the page's URL, since every page has a unique URL. The calculation methods involved were worked out by researchers through long practice; the result of someone's decades of research may be just one formula.
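Here is a minimal Python sketch of the MD5 idea described above: hash the (unique) URL and skip anything whose signature has been seen before. The function names and URLs are my own illustrative choices.

```python
# Use an MD5 signature of the URL to avoid crawling the same page twice.
import hashlib

def url_signature(url: str) -> str:
    """Return a fixed-length MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

seen_signatures = set()

def should_crawl(url: str) -> bool:
    sig = url_signature(url)
    if sig in seen_signatures:
        return False                # already crawled this URL
    seen_signatures.add(sig)
    return True

print(should_crawl("http://example.com/page1"))   # True  (first time)
print(should_crawl("http://example.com/page1"))   # False (duplicate)
```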

The signature function here also involves hashing, using the hash tables we are familiar with to do the relevant transformation. Next is the priority strategy for crawling: which pages should be fetched first? Naturally, pages that are important in the usual sense should be fetched promptly, and there are many measures of importance, such as link popularity, link importance, and average link depth. You can also define measures yourself; the more measures you use, the more authoritative and fair the result, but the corresponding amount of computation becomes very large. Link popularity is determined by the number and quality of backlinks: the more backlinks and the better their quality, the higher we consider the popularity. Link importance is much the same as popularity: the higher the quality or authority of the linking pages, the higher the importance. Average link depth is guaranteed by the breadth-first policy. There is also the page revisit problem: which pages need the crawler to come back and re-fetch? News pages, for example, need to be re-crawled at flexible intervals, because they update quickly, and only regular re-crawling keeps the information fresh and effective. The update frequency of ordinary pages roughly follows a Poisson distribution, which is knowledge from probability theory. When crawling, the crawler must also mind politeness: sites generally have a robots protocol that constrains crawling activity, stating which pages you may fetch and which you may not, and which parts of the site you may enter. Desktop search engines in particular must respect which folders on the user's computer can be accessed and which cannot; these rules are written in the robots protocol. There is also the question of when to crawl a site: out of respect for the site administrator, ask first, and never let your access paralyze the other party's server.
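To see what the Poisson assumption buys us, here is a small hedged sketch: if a page's changes follow a Poisson process with rate lam (changes per day, which a crawler would estimate from past observations), then the chance the page has changed t days after the last crawl is 1 - exp(-lam * t), and the crawler can revisit once that chance passes a threshold. The rates and threshold below are invented for illustration.

```python
# Poisson-based revisit scheduling sketch.
import math

def prob_changed(lam: float, days_since_crawl: float) -> float:
    """Probability the page has changed since the last crawl."""
    return 1.0 - math.exp(-lam * days_since_crawl)

def days_until_recrawl(lam: float, threshold: float = 0.5) -> float:
    """Days to wait until the page has changed with probability `threshold`."""
    return -math.log(1.0 - threshold) / lam

print(f"{prob_changed(0.2, 3):.2f}")        # news-like page, 3 days later
print(f"{days_until_recrawl(0.02):.1f}")    # slow page: wait about 35 days
```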

Next, the web page library: the database of pages built after the crawler has fetched them, on which the index library is later built. We all know that the pages a crawler fetches must be saved to disk promptly, and then the index library is built for users' future queries. Many papers I have read discuss crawling speed; in real life a search engine must return query results efficiently and quickly, because only then will users be willing to use it. Among the four major systems, reading and writing is also a key factor affecting speed: how to speed up reads and writes to improve efficiency, and how to handle updates, in what way, so that they are as fast as possible and meet users' needs. The book discusses three storage structures: the log structure, the hash-based structure, and the hash-log structure. Just from the names you can tell that the hash-log is the best: it combines the advantages of the hash structure and the log structure, which makes it easier to improve speed and more convenient for users. We covered hash tables in the data structures course, so the construction and computation of hash functions is familiar; what is needed here is to find a suitable hash function. This also involves file and disk reading and writing from the operating systems course, and many more detailed topics. I used to think what we learned was useless; the operating systems course, for example, felt like pure theory, and even though we implemented part of the algorithms in the final course design, it still seemed to have little use in real life. Now it looks different; back then we simply had not touched this field yet. For storing files there are also the B+ trees from the data structures course. I reviewed data structures carefully when preparing for the postgraduate entrance exam, and that turns out to be useful now: at least reading these articles and books does not feel so hard.
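Here is a tiny Python sketch of how I picture the hash-log combination: pages are appended to a log file (fast sequential writes) while an in-memory hash maps each URL to its offset and length (fast lookups). This is my own simplified reading of the idea, not the book's actual design; the file name is an example.

```python
# Hash + log storage sketch for a web page library.
class HashLogStore:
    def __init__(self, path="webpages.log"):
        self.path = path
        self.index = {}                       # url -> (offset, length)
        open(self.path, "ab").close()         # make sure the log file exists

    def put(self, url: str, html: str) -> None:
        data = html.encode("utf-8")
        with open(self.path, "ab") as log:
            offset = log.tell()
            log.write(data)                   # append-only sequential write
        self.index[url] = (offset, len(data))

    def get(self, url: str) -> str:
        offset, length = self.index[url]
        with open(self.path, "rb") as log:
            log.seek(offset)                  # random read via the hash index
            return log.read(length).decode("utf-8")

store = HashLogStore()
store.put("http://example.com/a", "<html>page a</html>")
print(store.get("http://example.com/a"))
```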

Finally, to review the download system: it can be summarized in three points, crawl comprehensively, crawl fast, and keep costs low. These are the general principles, and the large commercial search engines consider even more, such as support for dynamic pages, targeted crawling, and static page crawling. The third-generation search engines of the future will involve intelligent search, which is more user-oriented and more humane.

Next comes Chapter 4, a large chapter that involves a lot of knowledge. It is about the analysis system: the pages fetched by the download system are analyzed so that they can be indexed later.

The second of the four major systems is the analysis system. Its main work includes information extraction, duplicate page detection, Chinese word segmentation, and the computation of the PageRank algorithm. Below is a detailed summary of my own thinking on each part.

Before discussing information extraction, a word about HTML. HTML is the markup language used to create documents on WWW servers. On a web page there is text that helps users understand where a hyperlink points, which we call anchor text; a hyperlink usually appears as an image or as text, and when it is text, that text is exactly the anchor text. Its role is to make queries easier for users. The data on a web page is what we usually call semi-structured data, which is different from an ordinary text file: it carries structural information that plain text lacks. The knowledge about anchor text only needs to be understood, but its role in a search engine is not small; basically every book or paper on search engines mentions it. As for information extraction and the structuring of web pages: as the name suggests, information extraction means extracting the valuable information from the pages we crawl, so the key question is how to extract that valuable information efficiently, for our own use and for our users. The targets of web page structuring are five: anchor text, title, body title, body text, and forward links. These five properties of a page are essential for information retrieval; I will not explain each in detail, since they become clear at a glance.
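To make two of those five targets concrete, here is a small sketch that pulls the title and the anchor texts (with their forward links) out of raw HTML using Python's standard html.parser. Real extraction in the book is more involved; the class name and the sample HTML are my own illustration.

```python
# Extract the <title> text and (href, anchor text) pairs from HTML.
from html.parser import HTMLParser

class AnchorTitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title, self.anchors = "", []     # anchors: (href, anchor text)
        self._in_title, self._href = False, None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data                # text inside <title>
        elif self._href and data.strip():
            self.anchors.append((self._href, data.strip()))

extractor = AnchorTitleExtractor()
extractor.feed('<html><title>Demo</title>'
               '<a href="http://example.com">example site</a></html>')
print(extractor.title, extractor.anchors)
```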

In general, a raw web page is processed in two steps: first build an HTML tag tree, then identify the body text by a voting method, organizing the text in depth-first order. This part did not read very smoothly to me. My understanding is this: the title, anchor text, and body of the page, that is, the five attributes just mentioned, are extracted one by one, and together they act something like an abstract of the page, although it cannot strictly be called a summary; it is simply the information used to describe the page, what the page says, what it wants to tell our users. Building the tag tree uses the stack storage structure we learned in data structures, which is fairly easy to understand: the HTML tags on a page come in pairs, so it is convenient to hand them to a stack. A stack is first in, last out, and by exploiting exactly this property we can match the tags correctly and build the tree, which makes later processing convenient. The next step is to obtain the body text by voting.
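Here is a minimal sketch of the stack idea as I understand it: opening tags are pushed, each closing tag pops its matching opener, and every text block is recorded with its nesting path. It ignores unpaired and self-closing tags, so it is only a toy, not the book's actual tree-building procedure.

```python
# Record each text block together with the stack of tags that encloses it.
import re

def tag_depths(html):
    stack, blocks = [], []                    # stack of currently open tags
    for token in re.split(r"(<[^>]+>)", html):
        if token.startswith("</"):
            if stack:
                stack.pop()                   # closing tag matches top of stack
        elif token.startswith("<"):
            stack.append(token.strip("<>").split()[0])   # push opening tag
        elif token.strip():
            blocks.append((list(stack), token.strip()))  # text with its path
    return blocks

for path, text in tag_depths("<html><body><h1>Title</h1>"
                             "<p>Body text</p></body></html>"):
    print("/".join(path), "->", text)
```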

A web page generally has three kinds of text blocks: body text blocks, directory text blocks, and image-caption text blocks. The idea is the same as voting in real life: if most of the evidence says a block is body text, the probability of error is relatively low. The method, simply put, is this: we first set a scoring rule so that the longer a text block is, the higher its score, and then, according to the block's position on the page, left, right, middle, or elsewhere, different positions are given different scores. Finally the score of each block is computed, the blocks are ranked from highest to lowest, and the blocks with higher scores are selected and treated as the body text.
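A hedged sketch of that voting idea follows: each block gets points for its length and its position, and the highest-scoring block is taken as the body. The weights and position labels are invented for illustration, not the book's actual values.

```python
# Score text blocks by length and page position, keep the top one as the body.
def score_block(text: str, position: str) -> float:
    length_score = min(len(text) / 100.0, 10.0)   # longer blocks score more
    position_bonus = {"middle": 3.0, "left": 1.0, "right": 0.5}.get(position, 0.0)
    return length_score + position_bonus

blocks = [
    ("Copyright 2010 some site, all rights reserved", "right"),
    ("This is the long main article text of the page ..." * 5, "middle"),
    ("Home | News | Sports | Contact", "left"),
]

ranked = sorted(blocks, key=lambda b: score_block(*b), reverse=True)
print("chosen body text:", ranked[0][0][:40], "...")
```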

Next, duplicate page detection. The average Internet user does not care which page is the original; for similar pages we usually pay no attention, and the reason is simple: as long as our query needs are met, whether or not a page is the original does not matter much to us as users. For a search engine it is different: identical or similar pages mean the engine has to repeat the same processing. One extra page is fine, but with many duplicate pages the engine cannot keep up, time is wasted, and processing becomes very slow, which is a real problem. So when we exclude identical pages, it is best to keep the original one. As for the method, the judgment is divided into cases, four in fact: the two pages have the same content and the same format; the same content but different formats; some of the same important content and the same format; or the same important content but different formats. For now these four situations are the ones considered. One method here (called I-Match, if I remember the name correctly) extracts high-frequency feature words from a page and compares them across pages, that is, it tries to pick out the feature words that best represent the page's main content and compares those. There is also the shingling algorithm, which is similar except that it extracts multiple groups of feature words, which is its difference from the first method; these two methods are used for duplicate detection. Some calculation formulas are involved, which I will not list here. To briefly summarize: duplicate detection always takes three steps, feature extraction, similarity computation and evaluation, and finally the elimination of duplicate pages. Duplicate detection is an indispensable part of the analysis system, and the efficiency issues it raises are important: how to save time, save space, and reduce the cost of queries. These are all questions we graduate students should be thinking about.
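To picture the feature-word approach, here is a minimal sketch: take each page's most frequent words as its features and compare pages with a Jaccard-style overlap. It is a generic illustration of the idea, not the exact I-Match or shingling formulas; the threshold, feature count, and sample pages are made up.

```python
# Feature-word extraction plus Jaccard similarity for duplicate detection.
from collections import Counter

def feature_words(text: str, k: int = 5) -> set:
    words = text.lower().split()
    return {word for word, _ in Counter(words).most_common(k)}

def similarity(page_a: str, page_b: str) -> float:
    a, b = feature_words(page_a), feature_words(page_b)
    return len(a & b) / len(a | b) if a | b else 0.0

page1 = "search engine index crawler index search ranking search"
page2 = "search engine crawler index ranking search engine index search"
print(f"{similarity(page1, page2):.2f}")   # close to 1.0 -> likely duplicates
```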

After duplicate detection comes Chinese word segmentation, which is also very important and more complicated; researchers at home and abroad are still racking their brains over it. First, what is Chinese word segmentation? Chinese has no explicit word delimiters the way English does, words are not separated, and Chinese brings many problems of its own, above all a great deal of ambiguity, so the complexity of processing goes up accordingly. The common approach now is dictionary-based segmentation. Ambiguity comes in several kinds: intersection ambiguity, combination ambiguity, and mixed ambiguity. There are three basic matching methods: forward maximum matching, backward maximum matching, and bidirectional matching. Then there is segmentation based on statistical principles: from the large number of Chinese queries users type, we infer which word combinations people commonly use in daily life. The papers on segmentation I have read mostly use these two families of methods, so this is also a place worth researching: what better method could we use for efficient segmentation?
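Here is a small sketch of forward maximum matching with a toy dictionary: at each position, take the longest dictionary word that matches. The dictionary and sentence are tiny examples of my own, only to show the mechanics.

```python
# Forward maximum matching segmentation with a toy dictionary.
def forward_max_match(sentence, dictionary, max_len=4):
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)       # longest match, or single char
                i += length
                break
    return words

dictionary = {"搜索", "搜索引擎", "引擎", "非常", "重要"}
print(forward_max_match("搜索引擎非常重要", dictionary))
# -> ['搜索引擎', '非常', '重要']
```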

The last point is the PageRank algorithm, which I have studied relatively more, along with its many improved variants. The original algorithm in fact has quite a few shortcomings, and later researchers proposed many improvements on top of it with good results; there is a lot here worth our study and discussion.
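For reference, here is a hedged sketch of the original PageRank iteration on a tiny made-up link graph: each page spreads its rank evenly over its outgoing links, with a damping factor of 0.85. The improved variants I mention above change exactly these details.

```python
# Basic PageRank power iteration on a small toy graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```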

Next is the summary of Chapter 5. The index system stores a huge number of pages, and we know it must give users sub-second retrieval times, so fast retrieval and fast storage are the minimum requirements. Speaking of indexes: an index is itself information, information that describes other information, like the index at the back of a book, which helps us look things up faster and find what we want. Four terms are used here: inverted index, inverted list, temporary inverted file, and final inverted file. Briefly, the inverted index is an abstract concept, unlike the other three, which are three concrete manifestations of it. Those three are all about storage, with slight differences in scale between the temporary and final files; otherwise they are much the same. Full-text retrieval is now the main retrieval method of search engines. It was a revolution in the field of information retrieval: it refines the granularity of retrieval so that we can better find the information we want, and it provides a new, multi-angle, all-round retrieval experience, which is why mainstream search engines all use it.
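To keep the terminology straight for myself, here is a minimal sketch of building an in-memory inverted index: a forward pass over each numbered document produces, for every term, a posting list of the documents and positions where it occurs. The documents are toy examples, and this is not the book's exact file layout.

```python
# Build a positional inverted index from a few toy documents.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(list)                 # term -> [(doc_id, [positions])]
    for doc_id, text in documents.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = {1: "search engines build inverted indexes",
        2: "the inverted index maps terms to documents"}
index = build_inverted_index(docs)
print(index["inverted"])    # [(1, [3]), (2, [1])]
```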

Of course, full-text retrieval also had its problems, for example search results not being sorted reasonably, or early systems only being able to search titles; these problems arose because the content of the document was not taken into account. Full-text retrieval, as the name suggests, searches the whole content of a document or page rather than only part of the information, such as the title or the abstract. So I summarize its essence in two points: all the text of a document participates in the index, and the search results can give the position of the search term within the document. This takes into account both the completeness of the text content and the relevance carried by position information, which serves users' retrieval needs well. In the search process, the user enters a few keywords, the engine searches the full text for them, and finally the ranked results are returned to the user for reference. On document numbering, my own understanding is this: each document on the network should be unique and have its own unique number, so we give every document a number, just like the student numbers we use; a page is crawled and then assigned a corresponding number. The difference from numbers in daily life is that a document number needs no meaning, that is, we never ask why this document got this number and that document got that one; the number exists only to make later operations convenient. Of course the numbers are not handed out arbitrarily either; they must satisfy certain conditions: within its lifetime a document can have only one number, and two different documents can never share a number. To make computation easier, the shorter the numbers the better, which is convenient for computer storage and wastes less space. I will not summarize the specific storage and calculation schemes for the inverted index; each method is different, each has its convenient side, and each rests on basic data structures. Building an index is similar to the linked lists in C++, and, like mathematical function references, the disk scheduling and storage mentioned in the operating systems course also rests on this. In general, Chapter 5 covers a lot, and there are still many difficult places I have not fully digested; I need to read a few more articles, digest slowly, and then write up the parts involving the methods in more depth. I am still reading a few more books, mostly introductory ones, and each reading brings a different harvest.
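The numbering rules above can be pictured with a tiny sketch of my own: each distinct URL gets a short, meaning-free, incrementing integer ID, assigned exactly once for its lifetime.

```python
# Assign short, meaning-free, stable document numbers to URLs.
class DocIdAllocator:
    def __init__(self):
        self._next_id = 0
        self._ids = {}                       # url -> doc id

    def get_id(self, url: str) -> int:
        if url not in self._ids:             # assign only once per document
            self._ids[url] = self._next_id
            self._next_id += 1
        return self._ids[url]

alloc = DocIdAllocator()
print(alloc.get_id("http://example.com/a"))   # 0
print(alloc.get_id("http://example.com/b"))   # 1
print(alloc.get_id("http://example.com/a"))   # 0 again, the number never changes
```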

Chapter 6 is the final step: the ultimate purpose of a search engine is for the user to enter query keywords and for the engine to search on them. Among the four major systems, the fourth is the query system. It faces the user directly: after accepting a query request, it performs retrieval, ranking, summary computation, and so on, organizes the results into a search results page, and returns it to the user. The whole query process must be fast, and the returned results must satisfy the user; returning results quickly without guaranteeing their quality is certainly not enough. The query system involves the concept of information entropy, which quantifies information. The undergraduate data structures course introduced Huffman coding: by counting word frequencies a Huffman tree is built and codes are assigned, so that high-frequency words usually get short codes and low-frequency words get long ones. But intuition alone cannot explain everything, so here is an example behind the idea of information entropy. There are many situations in life like agreeing on where to meet: without considering information security, how do you make sure the other party has received your message? Generally you send a text message to a friend, and only when he replies can you be sure he received it; but then how does he know that you received his reply? This becomes an infinite loop, and however we encode the messages, the question has no fully satisfying answer.
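A hedged numeric sketch of entropy, H = -sum(p * log2(p)), makes the Huffman intuition concrete: a fair coin carries one full bit of uncertainty, while a heavily biased coin carries much less, which is why frequent (likely) symbols can get shorter codes. The probabilities below are illustrative.

```python
# Compute the information entropy of a discrete distribution.
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(f"{entropy([0.5, 0.5]):.3f} bits")      # fair coin   -> 1.000 bits
print(f"{entropy([0.9, 0.1]):.3f} bits")      # biased coin -> 0.469 bits
```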

In mathematics, and in the communication part of the computer networks course, the less likely an outcome is, the more information it carries; probability theory divides this into many cases, and of course the cost of communicating it is correspondingly larger. Information entropy captures exactly this relationship between probability and information: the greater the uncertainty of a variable, the greater its entropy, and the more information is needed to pin it down. From this we can see that information entropy is a very important concept. Next, the difference between retrieval and query: the convention in this chapter is that a query corresponds to a search, the query happening on the user's side and the search on the search engine's side. In other words, the result of a query is the search engine's results page. Query terms and search terms also differ: the words an ordinary user submits to the query system are called query terms, and once they are handed to the retrieval system they become search terms. The last concept is the automatic text summary, which, as the name suggests, is a summary extracted from the text that can represent the meaning of the full text. The user only needs to skim the summary to roughly understand the main content, and only needs to look at how their query terms relate to the document's summary to know whether the document is the one they want.
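Here is a small sketch of a query-biased extractive summary, which is one simple way such a snippet can be produced: score each sentence by how many query terms it contains and return the best ones. Real summarizers also weigh position, length, and redundancy; the document and query below are made up.

```python
# Query-biased extractive snippet: keep the sentences with most query terms.
def snippet(text, query, max_sentences=2):
    query_terms = set(query.lower().split())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(query_terms & set(s.lower().split())),
                    reverse=True)
    return ". ".join(scored[:max_sentences]) + "."

document = ("Search engines crawl the web. The index system stores postings. "
            "The query system returns ranked results to the user.")
print(snippet(document, "query ranked results"))
```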

The second section of this chapter deals with several retrieval models, which many books introduce right at the outset. This part involves a lot of formulas, and I can understand only a small part of them, since much has already taken shape. Starting from the simplest, the Boolean model: it only distinguishes two cases, a document matches or it does not, so it is relatively simple and does not need much explanation. Many search engines first use the Boolean model to filter and then use the vector space model to further compute relevance, that is, the similarity between the document and the user's query (see the small sketch at the end of this report). As for how the results page is generated, I will not go into detail here; there are things in it I do not yet understand well, and I want to understand this part thoroughly, sort it out properly, and then write it up.

To be honest, teacher, about this bi-weekly report you asked me to write: I know that what I write is not really for you to read but for myself, and the big thesis in the future will involve a huge amount of writing that I will also have to complete on my own. The usual accumulation will show its value when the time comes. My tone in these write-ups is a little colloquial; I will try to change that, write them more like proper summaries, and follow the writing conventions of a paper, and I will deliberately practice this. The summaries of each part above are still superficial, and I will keep writing a little deeper, because every concept has to be understood; only that way can I lay a good foundation for what follows. I have read this summary no fewer than five times, and each time it feels acceptable. The writing is not great, but it is the result of several months, and it covers my summary of and thinking about search engines. Next I will continue to consolidate, keep reading more papers and good books, and keep improving and expanding my knowledge so as to face bigger challenges in the future. I believe in my own ability: what others can do, I can do too! As you said, Teacher Shi, it is fine not to understand something now, but in the end it must be understood; fooling yourself is pointless. I struggled for four years in university to get where I am today, and I will keep working hard, lay a good foundation, prepare for the PhD, and then do solid academic research, which is the kind of work I like.
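As promised above, here is a closing hedged sketch of the Boolean-then-vector-space idea: documents and the query become term-frequency vectors and are ranked by cosine similarity. Real systems use tf-idf weights and smarter tokenizers; the documents and query are invented examples.

```python
# Rank documents by cosine similarity of term-frequency vectors.
import math
from collections import Counter

def cosine(query: str, document: str) -> float:
    q, d = Counter(query.lower().split()), Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values())) *
            math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = ["the boolean model only says yes or no",
        "the vector space model ranks documents by similarity to the query"]
query = "vector space model similarity"
for doc in sorted(docs, key=lambda d: cosine(query, d), reverse=True):
    print(f"{cosine(query, doc):.2f}  {doc}")
```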
