Understanding Search Engines
The informative little book by Michael Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, provides an excellent explanation of vector space models, especially LSI, and contains several examples and sample code. Our mathematical readers will enjoy this book and its application of linear algebra algorithms in the context of traditional information retrieval. The types of search engines are described below.
There's actually a fourth model for traditional search engines, meta-search engines, which combine the three classic models. Meta-search engines are based on the principle that while one search engine is good, two (or more) are better, since one engine may excel at one task while a second is better at another task. Thus, meta-search engines such as Copernic (www.copernic.com) and SurfWax (www.surfwax.com) were created to simultaneously exploit the best features of many individual search engines. Meta-search engines send the query to several search engines at once and return the results from all of the search engines in one long unified list. Some meta-search engines also include subject-specific search engines, which can be helpful when searching within one particular discipline. For example, Monster (www.monster.com) is an employment search engine.
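The unified list described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how real meta-search engines order or deduplicate results, so the round-robin interleaving, the duplicate removal, the function name merge_results, and the example URLs are all hypothetical.

```python
from itertools import zip_longest

def merge_results(result_lists):
    """Interleave ranked lists round-robin, keeping the first occurrence of each URL.

    Assumption: round-robin merging with deduplication is one simple policy,
    not necessarily what any particular meta-search engine does.
    """
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so every rank position is visited
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical ranked results from three engines for the same query
engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "d.example/4"]
engine_c = ["e.example/5", "a.example/1", "f.example/6"]

unified = merge_results([engine_a, engine_b, engine_c])
print(unified)
```

Each engine's top result appears before any engine's second result, and a document returned by several engines is listed only once.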
Boolean Model Search Engines
The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses the notion of exact matching to match documents to a user query. Its more refined descendants are still used by most libraries. The adjective Boolean refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two statements x and y means that both x and y must be satisfied, while the Boolean OR of these two statements means that at least one of them must be satisfied. Any number of logical statements can be combined using the three Boolean operators.
The Boolean model of information retrieval operates by considering which keywords are present or absent in a document. Thus, a document is judged as relevant or irrelevant; there is no concept of a partial match between documents and queries. This can lead to poor performance. More advanced techniques try to remedy this black-and-white logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine causes the virtual machine to return all documents that use both words in the title. A relevant document that uses automobile rather than car in its title will not be returned. More advanced engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
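A minimal sketch of exact-match Boolean retrieval over document titles follows. It assumes a tiny in-memory keyword index; the titles, the helper names build_index and docs, and the use of Python sets for AND, OR, and NOT are illustrative choices, not a description of any production engine.

```python
def build_index(titles):
    """Map each lowercase keyword to the set of document ids whose title contains it."""
    index = {}
    for doc_id, title in titles.items():
        for word in title.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical document titles
titles = {
    1: "Car Maintenance Basics",
    2: "Automobile Maintenance",
    3: "Car Racing History",
    4: "Garden Maintenance",
}
index = build_index(titles)
all_docs = set(titles)

def docs(word):
    """Documents whose title contains the keyword (empty set if none do)."""
    return index.get(word.lower(), set())

car_and_maintenance = docs("car") & docs("maintenance")             # AND
car_or_automobile = docs("car") | docs("automobile")                # OR
maintenance_not_car = docs("maintenance") & (all_docs - docs("car"))  # AND NOT

print(car_and_maintenance, car_or_automobile, maintenance_not_car)
```

Document 2 is about the same subject but is absent from the car AND maintenance result, because a keyword is either present or absent and nothing in between is recorded.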
The car maintenance query example introduces the main drawback of Boolean search engines: they fall prey to two of the most common information retrieval problems, synonymy and polysemy. Synonymy refers to multiple words having the same meaning, such as car and automobile. A standard Boolean engine cannot return semantically related documents whose keywords were not included in the original query. Polysemy refers to words with multiple meanings. For example, when a user types bank as a query, does he or she mean a financial center, a slope on a hill, a shot in pool, or a collection of objects?
The problem of polysemy can cause many documents that are irrelevant to the user's intended meaning to be retrieved. Many Boolean search engines also require that the user be familiar with Boolean operators and the engine's specialized syntax. For example, to find information about the phrase iron curtain, many engines require quotation marks around the phrase, which tell the search engine that the entire phrase should be searched as if it were just one keyword. A user who forgets this syntax requirement would be surprised to find retrieved documents about interior decorating and mining for iron ore.
Nevertheless, variants of the Boolean model do form the basis for many search engines. There are several reasons for their prevalence. First, creating and programming a Boolean engine is straightforward. Second, queries can be processed quickly; a quick scan through the keyword files for the documents can be executed in parallel. Third, Boolean models scale well to very large document collections. Accommodating a growing collection is easy. The programming remains simple; merely the storage and parallel processing capabilities need to grow. References all contain chapters with excellent introductions to the Boolean model and its extensions.
Vector Space Model Search Engines
Another information retrieval technique uses the vector space model, developed by Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems mentioned above. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis techniques to discover key features and connections in the document collection. Some advanced vector space models, such as LSI (Latent Semantic Indexing), can access the hidden semantic structure in a document collection. For example, an LSI engine processing the query car will return documents whose keywords are semantically related, such as automobile. This ability to reveal hidden semantic meanings makes vector space models, such as LSI, very powerful information retrieval tools.
Two additional advantages of the vector space model are relevance scoring and relevance feedback. The vector space model allows documents to partially match a query by assigning each document a number between 0 and 1, which can be interpreted as the likelihood of relevance to the query. The group of retrieved documents can then be sorted by degree of relevancy, a luxury not possible with the simple Boolean model. Thus, vector space models return documents in an ordered list, sorted according to a relevance score. The first document returned is judged to be most relevant to the user's query. Some vector space search engines report the relevance score as a relevancy percentage. For example, a 97% next to a document means that the document is judged as 97% relevant to the user's query. (See the Federal Communications Commission's search engine, http://www.fcc.gov/searchtools.html, which is powered by Inktomi, once known to use the vector space model. Enter a query such as taxes and notice the relevancy score reported on the right side.)
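Relevance scoring of this kind can be sketched with raw term-count vectors and cosine similarity. Cosine similarity is assumed here as one common similarity measure; the text does not say which measure a given engine uses. The documents, the query, and the helper names to_vector, cosine, and rank are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two count vectors; lies in [0, 1] for counts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return (score, doc_id) pairs, highest score first, ties broken by doc id."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical document collection
documents = {
    "d1": "car maintenance and car repair",
    "d2": "car racing history",
    "d3": "garden maintenance tips",
    "d4": "cooking recipes",
}

for score, doc_id in rank("car maintenance", documents):
    print(f"{doc_id}: {score:.0%}")
```

Unlike the Boolean model, documents that match only one query term still receive a nonzero score and appear lower in the ordered list instead of being dropped.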
Relevance feedback, the other advantage of the vector space model, is an information retrieval tuning technique that is a natural addition to the vector space model. Relevance feedback allows the user to select a subset of the retrieved documents that are useful. The query is then resubmitted with this additional relevance feedback information, and a revised set of generally more useful documents is retrieved.
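One classic way to fold the user's selections back into the query is a Rocchio-style update, which the text above does not name and which is used here only as an illustration. The sketch below moves the query vector toward the average of the documents marked useful; the weights alpha and beta, the function name feedback_query, and the sample documents are assumptions.

```python
from collections import Counter

def feedback_query(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * centroid of the documents marked useful.

    Simplified Rocchio-style update (no term for non-relevant documents);
    alpha and beta are illustrative defaults, not values from the text.
    """
    revised = Counter()
    for term, weight in query_vec.items():
        revised[term] += alpha * weight
    for vec in relevant_vecs:
        for term, weight in vec.items():
            revised[term] += beta * weight / len(relevant_vecs)
    return dict(revised)

query = Counter("car maintenance".split())
# Hypothetical documents the user selected as useful
marked_useful = [
    Counter("automobile maintenance schedule".split()),
    Counter("car oil change schedule".split()),
]

revised = feedback_query(query, marked_useful)
print(revised)
```

The revised query now carries weight on terms such as automobile and schedule that the user never typed, so resubmitting it can surface documents the original query missed.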
A drawback of the vector space model is its computational expense. At query time, distance measures (also known as similarity measures) must be computed between each document and the query. Advanced models, such as LSI, require an expensive singular value decomposition of a large matrix that numerically represents the entire document collection. As the collection grows, the expense of this matrix decomposition becomes prohibitive. This computational expense also exposes another drawback: vector space models do not scale well. Their success is limited to small document collections.
Probabilistic Model Search Engines
Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance (the probability that the document is relevant to the query divided by the probability that the document is not relevant to the query). The probabilistic model operates recursively and requires that the underlying algorithm guess at initial parameters and then iteratively try to improve this initial guess to obtain a final ranking of relevancy probabilities.
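The ratio at the heart of this ranking, the probability that a document is relevant divided by the probability that it is not, can be illustrated with a short sketch. It assumes the probability estimates are already available; the iterative estimation step is not shown, and the numbers and the function name odds_of_relevance are hypothetical.

```python
def odds_of_relevance(p_relevant):
    """Odds = P(relevant) / P(not relevant), defined for 0 <= p < 1."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must satisfy 0 <= p < 1")
    return p_relevant / (1.0 - p_relevant)

# Hypothetical estimated probabilities of relevance for one query
estimates = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

ranking = sorted(estimates, key=lambda d: odds_of_relevance(estimates[d]), reverse=True)
print(ranking)
```

Because the odds grow as the probability grows, ranking by odds gives the same order as ranking by probability; the odds form is convenient when the model multiplies per-term factors together.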
Unfortunately, probabilistic models can be very hard to build and program. Their complexity grows quickly, deterring many researchers and limiting their scalability. Probabilistic models also require several unrealistic simplifying assumptions, such as independence between terms as well as documents. Of course, the independence assumption is restrictive in most cases. For instance, in this document the most likely word to follow information is the word retrieval, but the independence assumption judges each word as equally likely to follow the word information. On the other hand, the probabilistic framework can naturally accommodate a priori preferences, and thus these models do offer the promise of tailoring search results to the preferences of individual users. For example, a user's query history can be incorporated into the probabilistic model's initial guess, which generates better query results than a democratic guess.
COMPARING SEARCH ENGINES
Annual information retrieval conferences, such as TREC, SIGIR, CIR (for traditional information retrieval), and WWW (for web information retrieval), are used to compare the various information retrieval models underlying search engines and help the field progress toward better, more efficient search engines. The two most common ratings used to differentiate the various search techniques are precision and recall. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection. The higher the precision and recall, the better the search engine is. Of course, search engines are tested on document collections with known parameters. For example, the commonly used test collection Medlars, containing 5,831 keywords and 1,033 documents, has been examined so often that its properties are well known. For instance, there are exactly 24 documents relevant to the phrase neoplasm immunology.
Thus, the denominator of the recall ratio for a user query on neoplasm immunology is 24. If only 10 of these relevant documents were retrieved by a search engine for this query, then a recall of 10/24 ≈ 0.417 is reported. Recall and precision are information retrieval-specific performance measures, but, of course, when evaluating any computer system, time and space are always performance issues. All else held constant, quick, memory-efficient search engines are preferred to slower, memory-inefficient engines. A search engine with fabulous recall and precision is useless if it requires 30 minutes to perform one query or 75 supercomputers to store the data. Some other performance measures take a user-centered viewpoint and are aimed at assessing user satisfaction and frustration with the information system. A book by Robert Korfhage, Information Storage and Retrieval, discusses these and several other measures for comparing search engines.
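The two ratios can be computed directly from the set of retrieved documents and the set of known relevant documents. The sketch below reuses the figure of 24 relevant documents from the Medlars example; the document ids and the function name precision_recall are hypothetical, and the second case with five irrelevant results is added only to show how precision reacts.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 24 relevant documents, as in the neoplasm immunology example (ids are made up)
relevant = set(range(1, 25))

# Case 1: 10 documents retrieved, all of them relevant
p1, r1 = precision_recall(range(1, 11), relevant)

# Case 2: the same 10 relevant documents plus 5 irrelevant ones
p2, r2 = precision_recall(list(range(1, 11)) + [101, 102, 103, 104, 105], relevant)

print(f"case 1: precision={p1:.3f} recall={r1:.3f}")
print(f"case 2: precision={p2:.3f} recall={r2:.3f}")
```

Adding irrelevant results leaves recall unchanged at 10/24 but lowers precision from 1 to 10/15, which is why the two measures are reported together.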