A brief overview of classic Information Retrieval
An Information Retrieval (IR) system’s general purpose is to help users find relevant information. According to Ferneda (2003), an IR system must present views of the documents that allow users to quickly select the items that fully or partially satisfy their information need, specified through a query. In other words, every IR system’s primary goal is high effectiveness: maximizing user satisfaction relative to the effort users spend on their queries.
Significant advances in electronic and computer technology have led to the advent of what has been termed the information age. These days, according to Mitra and Chaudhuri (2000), information is readily available through electronic and networked media, in colossal quantities and on an enormous variety of subjects. This information explosion has created a great demand for efficient and effective means of organizing and indexing data, so that useful information can be retrieved through queries whenever it is needed.
Given the need to perform effective and efficient digital searches over any type of document in different collections, an IR system is responsible for processing and indexing the collection, applying different strategies to assign meaningful weights to each document. In this respect, an IR system aims to provide relevant information to its users for arbitrary queries over a given collection.
However, ensuring effective results requires appropriate modeling of the collection’s documents, so that the system can derive a similarity function that assigns meaningful similarity scores to documents with respect to the submitted query.
To define such similarity functions, three IR models, now known as the classic models, were initially proposed: the Boolean model, the probabilistic model, and the vector space model.
Many other IR models have since been proposed in the literature, as shown in Figure 2, which presents just a few methods related to textual databases:
The classic Boolean Model (BM) is based on set theory and Boolean algebra. Documents are represented as sets of index terms, while queries are represented as Boolean expressions over terms, using the AND, OR, and NOT operators. BM relies on exact matching between the user’s query and the collection documents; there is no partial satisfaction in this model, and the response generated by BM is always binary: the document either is (1) or is not (0) relevant. The similarity between a document dj ∈ D and a query q ∈ Q can therefore be formulated, in general terms, as:

sim(dj, q) = 1, if ∃ c(q) such that c(q) = c(dj); 0, otherwise

where c(q) is any conjunctive component of query q and c(dj) is the conjunctive component of document dj.
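To make the all-or-nothing behavior concrete, here is a minimal sketch of Boolean retrieval in Python over a toy collection (the document names and contents below are invented for illustration):

```python
# Minimal sketch of Boolean retrieval: documents as sets of index terms,
# exact matching, binary (0/1) relevance. Toy collection is hypothetical.
docs = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"boolean", "algebra", "retrieval"},
    "d3": {"vector", "space", "model"},
}

def boolean_and(query_terms, docs):
    """Documents containing EVERY query term (one conjunctive component).
    The answer is a set, with no ranking among its members."""
    return {d for d, terms in docs.items() if query_terms <= terms}

def boolean_or(query_terms, docs):
    """Documents containing AT LEAST ONE query term."""
    return {d for d, terms in docs.items() if query_terms & terms}
```

A query such as `retrieval AND boolean` matches only d2, while `retrieval OR model` matches all three documents; a document missing a single AND term is simply excluded, which is exactly the lack of partial satisfaction described above.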
The Vector Space Model (VSM) addresses BM’s limitations by proposing an algebraic solution capable of partial matching. In VSM, index terms are assumed to be mutually independent and are represented as vectors in a t-dimensional space, where t is the number of index terms. The logical view of documents and queries in VSM is thus a t-dimensional vector, built through a weighting scheme called TF-IDF, which assigns meaningful weights to index terms. The similarity degree between a document dj and a query q is then calculated as the correlation between their vectors, that is, the cosine of the angle between them:

sim(dj, q) = (dj • q) / (|dj| × |q|) = Σ (wij × wiq) / (√Σ wij² × √Σ wiq²)

where wij is the weight of term ki in document dj and wiq is the weight of term ki in query q.
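The scheme above can be sketched in a few lines of Python. The collection is a hypothetical toy example, TF is the raw term count, and IDF is log(N / ni); other TF-IDF variants exist, so treat this as one common choice rather than the definitive formula:

```python
import math
from collections import Counter

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "information retrieval system retrieval",
    "d2": "boolean algebra model",
    "d3": "vector space model for information retrieval",
}

def tf_idf_vectors(docs):
    """One sparse TF-IDF vector per document: w_ij = tf_ij * log(N / df_i)."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter(t for words in tokenized.values() for t in set(words))
    return {
        d: {t: tf * math.log(n / df[t]) for t, tf in Counter(words).items()}
        for d, words in tokenized.items()
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

vectors = tf_idf_vectors(docs)
query = {"retrieval": 1.0, "model": 1.0}  # query weights w_iq (raw tf here)
ranking = sorted(docs, key=lambda d: cosine(vectors[d], query), reverse=True)
```

Unlike BM, every document sharing at least one weighted term with the query receives a nonzero score, so the result is a ranked list rather than a flat set.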
Finally, the classic Probabilistic Model (PM) was the last of the three to appear, proposed by Robertson and Spärck Jones (1976). PM’s main idea is, given a user query q and a document dj, to estimate the probability that the user considers dj relevant; that is, the probability that dj ∈ R, where R is the set of documents relevant to the query. The similarity function in PM is thus calculated as the odds of relevance:

sim(dj, q) = P(R | dj) / P(R̄ | dj)

where dj is a vector representation with binary weights indicating the absence or presence of each index term, and R̄ is the complement of R. Note that this estimate is problematic in practice, since little is known about the ideal set R. Because of that, Croft and Harper (1979) proposed a simple method that produces a ranking function without any prior relevance information (see References).
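As an illustration of the Croft–Harper idea, with no relevance information P(ki | R) is assumed constant, and each matched query term ends up contributing an IDF-like weight driven only by its document frequency ni. The sketch below uses a hypothetical toy collection, and the 0.5 terms are a common smoothing refinement rather than part of the original formulation:

```python
import math

# Hypothetical toy collection: each document as a set of index terms
# (binary weights, as in the classic probabilistic model).
docs = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"boolean", "algebra", "retrieval"},
    "d3": {"vector", "space", "model"},
}

def ch_score(doc_id, query_terms, docs):
    """Rank score without relevance information: each query term present
    in the document adds log((N - n_i + 0.5) / (n_i + 0.5)), where n_i is
    the term's document frequency and N the collection size. The 0.5
    terms smooth against zero counts."""
    n = len(docs)
    score = 0.0
    for t in query_terms & docs[doc_id]:
        df = sum(1 for terms in docs.values() if t in terms)
        score += math.log((n - df + 0.5) / (df + 0.5))
    return score
```

For the single-term query {"system"}, only d1 scores above zero, and rarer terms (lower ni) weigh more than common ones, which recovers the IDF intuition on probabilistic grounds.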
Finally, it is important to note that IR modeling is context-dependent and directly influenced by factors such as the type of documents, whether the collection is homogeneous (e.g., news articles) or heterogeneous (e.g., Web pages), the size of the documents, and even practical aspects of the IR system’s design. So which of these models is the best? It depends!
That’s all for now, folks! I hope this overview helps some of you take your first steps in the IR area.
Also, if you get interested in this theme, don’t forget to study it more deeply. The references below are a good starting point. Enjoy! :)
REFERENCES
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). Addison-Wesley Professional.
Berry, M. W., Drmac, Z., & Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41(2), 335–362.
Mitra, M., & Chaudhuri, B. B. (2000). Information retrieval from documents: A survey. Information Retrieval, 2(2–3), 141–163.
Croft, W. B., & Harper, D. J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation.
Robertson, S. E., & Spärck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.