In our modern in­form­a­tion society, data, facts, and knowledge have a much higher priority than they were half a century ago. Thanks to the internet, in­form­a­tion is more and more ac­cess­ible. When we access in­form­a­tion, it has been retrieved from online sources, which is where search engines come in. How do they find the in­form­a­tion they provide us? The answer is called “In­form­a­tion Retrieval”. Gathering in­form­a­tion – more precisely, in­form­a­tion recovery – is a dis­cip­line in computer science and in­form­a­tion science and, above all, of great im­port­ance for search engines. Using complex in­form­a­tion retrieval systems, they recognise the in­ten­tions that are behind specific search terms and find relevant data on search queries.

The History of Gathering In­form­a­tion

In­form­a­tion retrieval is about making existing knowledge ac­cess­ible. This has been the case since long before the dawn of the digital age. Vannevar Bush, one of the first people who seriously con­sidered how humanity can make their con­cen­trated knowledge more ac­cess­ible in the face of an ever-changing world, published a ground­break­ing article at the dawn if the in­form­a­tion age (1945) titled: “As We May Think”. The article presented a vision for the future of in­form­a­tion gathering and or­gan­isa­tion.

Bush saw the following problem in the sciences: experts are spe­cial­ising more and more, and need more in­form­a­tion – but because of the dif­fer­en­ti­ation caused by this extreme spe­cial­isa­tion, the in­form­a­tion is in­creas­ingly hard to find. Of course, this was at a time when libraries were still organised with analog paper boxes and large catalogs. A keyword search was only possible if a diligent librarian had already bothered to manually index all works. Bush saw a way to make his own in­form­a­tion more ac­cess­ible using the technical de­vel­op­ments available at the time, such as microfilm. His vision was to create a Memex, which was a machine as big as a desk that would serve as a vault for knowledge, and become a serious piece of research equipment. The Memex was never built but the tech­no­logy (users jumping from one article to the next) can be seen as a fore­run­ner of hypertext.

In the 1950’s, the computer scientist Hans Peter Luhn dealt spe­cific­ally with the task of obtaining in­form­a­tion and developed tech­niques that are still relevant today: full-text pro­cessing, auto-indexing and selective in­form­a­tion pro­cessing (SDI) have their roots in his research. These methods were very important for the de­vel­op­ment of the Internet as in­form­a­tion retrieval systems are essential for nav­ig­at­ing the oceans of available in­form­a­tion on the internet. Without them, you would never be able to find the answers you are looking for.

In­form­a­tion Retrieval – a Defin­i­tion

The aim of in­form­a­tion retrieval (IR) is to make machine-stored data dis­cov­er­able: unlike data mining which extracts struc­tures from online records, IR is concerned with filtering specific in­form­a­tion from a set of data. The typical ap­plic­a­tion is an Internet search engine. In­form­a­tion retrieval systems solve two central issues:

  1. Vagueness: User inquiries are often in­ac­cur­ate. The search terms entered by a user often leave a lot of room for in­ter­pret­a­tion. For example, those who search for the term “bank” may be looking for general banking in­form­a­tion or may require dir­ec­tions to the nearest financial in­sti­tu­tion. The problem is com­poun­ded when users them­selves are not sure what in­form­a­tion they are looking for.

  2. Un­cer­tainty: Contents of stored in­form­a­tion are sometimes unknown to the system, which leads to users being presented with the wrong results. This happens, for example, with homonyms –words that have multiple meanings. The user might not be looking for a financial in­sti­tu­tion but in­form­a­tion on a geo­graph­ic­al feature relating to rivers.

In addition, the in­form­a­tion retrieval system should also evaluate in­form­a­tion to provide users with a data sequence. The first result should ideally provide the best answer to the users’ question.

Different Models

There are various in­form­a­tion retrieval models that are not ne­ces­sar­ily mutually exclusive and can be combined with each other. Some of these models only differ slightly in details. However, they can all still be roughly cat­egor­ised into three groups:

  • Set Theory Models: Sim­il­ar­ity re­la­tion­ships are de­term­ined by set op­er­a­tions (Boolean model).
  • Algebraic Models: Sim­il­ar­it­ies are de­term­ined in pairs: documents and search queries can be rep­res­en­ted as vectors, matrices or tuples (vector space model).
  • Prob­ab­il­ist­ic Models: These models establish sim­il­ar­ity by viewing the data sets as multi-stage random ex­per­i­ments.

Below, we will introduce the three ar­chetyp­al models using these cat­egor­ies. The existing models above are all hybrids of the three types. Therefore, the extended Boolean model has the prop­er­ties of set theory, as well as algebraic models.

Boolean Model

The most popular search engines on the web are based on the Boolean principle. These are logical links that help users refine and pinpoint a search. With AND, OR or NOT (AND, OR, NOT) or the cor­res­pond­ing symbols ∧, ∨ or ¬ a request can be specified when, for example, both terms should appear in the result, or content with a certain term should be hidden. These keys also work on Google, according to the same principle. The dis­ad­vant­age of this system is that it doesn’t contain any ranking system for its results.

Vector Space Model

In a math­em­at­ic­al approach, content can also be rep­res­en­ted as vectors. In the vector space model, terms are mapped as co­ordin­ate axes. Both documents and search queries receive specific values related to the term and can be rep­res­en­ted as points or vectors within a vector space. Sub­sequently, both vectors are compared to each other. The vector (or content) closest to that of the query should appear first in the result rankings. The dis­ad­vant­age here is that without Boolean operators, no terms can be excluded.

Prob­ab­il­ist­ic Model

The prob­ab­il­ist­ic model makes use of prob­ab­il­ity theory. Each document is assigned a prob­ab­il­ity value. The results are then sorted according to the prob­ab­il­ity with which they fit each search. How high the chances are that a certain content cor­res­ponds to a user’s wishes is de­term­ined by so-called “relevance feedback”. For example, users may be asked to rate the results manually. At the next identical search, the model will show a different (perhaps better) result list. A dis­ad­vant­age of this procedure is that it starts with two re­quire­ments, neither of which are guar­an­teed. On one hand, the model pre­sup­poses that the users are willing to par­ti­cip­ate in the system by giving feedback. On the other hand, the theory also assumes that users view the results in­de­pend­ently of one another, eval­u­at­ing the content of each source as though it were the first they had read in the search. In practice, users always value in­form­a­tion based on pre­vi­ously viewed content or held knowledge.

Operating Prin­ciples or In­form­a­tion Gathering

With in­form­a­tion retrieval, different methods and tech­niques are used, in­de­pend­ently of the models. The goal is always to simplify in­form­a­tion searches for the user and to deliver more relevant results.

Term Frequency, Inverse Document Frequency

The im­port­ance of a term for a search query is cal­cu­lated by combining how often a term occurs and the inverse document frequency. The value is ab­bre­vi­ated as tf-idf.

  • Term Frequency: Search word density indicates how often a term appears in a document. However, how often a term appears cannot be a sole indicator of how relevant a text is, as some texts may contain the word multiple times due to length, rather than relevancy of content. Therefore, the frequency should be cal­cu­lated in relation to how big the document is. To do this, how often the search term appears is divided by how often the highest-frequency word appears (eg. “and”):
  • Inverse Document Frequency: In IDF, the complete body of text is con­sidered, rather than just a single document. Words that are only found in few documents will have a higher relevance than terms which appear in almost all texts. For example, the term “inverse document frequency” has a high higher value than “and”.

By combining the two tests, in­form­a­tion retrieval systems can provide better results than if they were used alone: if it were just the term frequency that was important, then the search query “The TV show with the mouse” would pri­or­it­ise content in which the words “the” and “with” appear. That would obviously be unhelpful. In contrast, if the inverse document frequency is used, “TV show” and “mouse” are much more important for the search and are re­cog­nised as the actual search terms.  

Query Modi­fic­a­tion

A major problem in in­form­a­tion gathering is the behaviour of users them­selves: wildly in­ac­cur­ate requests bring up incorrect or in­ad­equate in­form­a­tion. To avoid this, in­form­a­tion sci­ent­ists have in­tro­duced query modi­fic­a­tion, a system that auto­mat­ic­ally changes the entered search query. That means, for example, that synonyms are used which provide better results. The system makes use of thesauri and user-feedback to find those synonyms. To avoid depending on user-co­oper­a­tion, they can use so-called “pseudo feedback”. With this method, the system reads related terms from the best search results and rates them as relevant to the search. Inquiries can be extended or improved by using the following tech­niques:

  • Stop Word Elim­in­a­tion: Stop words are those ex­pres­sions that only con­trib­ute in­sig­ni­fic­antly to the content of a text. It makes sense not to treat words like “and” or articles such as “the” as rep­res­ent­at­ive of the document’s content.  
  • Multi-word group iden­ti­fic­a­tion: Groupings of words must be re­cog­nised as such. This iden­ti­fic­a­tion ensures that the search engine also considers parts of compound words to be relevant. 
  • Root and Root Shape Reduction: To search more ef­fect­ively, words must be reduced to their root words. Otherwise, in­flec­tion­al forms of a word would not appear correctly in the search results. 
  • Thesaurus: In addition to the terms used in the relevant document, an in­form­a­tion retrieval system should also consider synonyms of the word as relevant. This is the only way to ensure that users find what they are looking for. 

Recall & Precision

The ef­fect­ive­ness of an in­form­a­tion retrieval system is commonly cal­cu­lated using the factors recall rate and precision. Both are rep­res­en­ted as quotients.

  • Recall: How complete are the search results? To do this, the number of “found, relevant” is compared with the number of “not found, relevant documents”. The quotient, in other words, indicates how probable it is that a relevant document will be found: 
  • Precision: What exactly is the search result? To ascertain this, the number of found, relevant to the number of found, ir­rel­ev­ant documents is provided. The quotient indicates how probable it is that a found document is relevant:

Both values are basically between 0 and 1, where 1 would be a perfect value. In addition, perfect results are excluded for both quotients in practice. Those who increase the com­plete­ness of the search result do so at the expense of accuracy and vice versa. Ad­di­tion­ally, the fallout (i.e., the default rate) can be cal­cu­lated as an ad­di­tion­al value: this quotient reflects the false positive rate; it is de­term­ined by the ratio of found, ir­rel­ev­ant documents to ir­rel­ev­ant contents that were not found. Recall and precision can be rep­res­en­ted as an axis diagram in which each of the two values occupies one axis each.

In­form­a­tion Retrieval: Example of a Search

Every internet search engine is based on in­form­a­tion retrieval. Google, Bing, and Yahoo are examples of prominent computer-aided in­form­a­tion gathering. However, to show how IR works in practice, it makes sense to take a simpler example of our own. Take a search matrix in a (very small) children’s library. There are animals in all the books, but we only want to find books which feature elephants and giraffes, and no cro­codiles. A search with the Boolean method would look like this: elephant AND giraffe NOT crocodile. The result of the search can only ever be 1 or 0: Does the term occur, or not?

The result of the search would be “Tim & Olli at the Zoo” and “Michael and the Crazy Circus”. However, it does not weigh the results. Which book is more about elephants than giraffes? To ascertain this, the system can determine the Term Frequency and the Inverse Document Frequency. 

“Tim and Olli at the Zoo” is probably the right answer when looking for a text with giraffes and elephants over “Michael and the Crazy Circus”, and should come first in the search results. The method we used here only works if the search terms are fixed (con­trolled indexing). This could happen, for example, in spe­cial­ised databases where the users are trained in how to use a search mask. In our example, a query modi­fic­a­tion would make sense: aside from “elephant”, a search for “pa­chy­derms” as well as gram­mat­ic­al variants of these words would yield positive results.

Tip

There are a lot more search engines on the World Wide Web than just Google. For example, al­tern­at­ives to Google often pay more attention to user privacy

Go to Main Menu