SMART Information Retrieval System

The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s.[1] Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model, relevance feedback, and Rocchio classification.

Gerard Salton led the group that developed SMART. Other contributors included Mike Lesk.

The SMART system also provides a set of corpora, queries and reference rankings, taken from different subjects, notably

The legacy of the SMART system includes the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd.qqq, where the first three letters represent the term weighting of the collection document vector and the second three letters represent the term weighting of the query document vector. For example, ltc.lnn denotes the ltc weighting applied to a collection document and the lnn weighting applied to a query document.
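As a rough illustration of the ddd.qqq form, the following Python sketch splits a notation string into its document and query triples. The names `Weighting` and `parse_smart` are hypothetical, and the field names (term frequency, document frequency, normalization) are an assumption about what the three letter positions stand for; the text above only states that each triple describes a term weighting.

```python
from typing import NamedTuple


class Weighting(NamedTuple):
    """One three-letter SMART triple, split into its component letters.

    The field names are an assumed reading of the three positions.
    """
    term_frequency: str
    document_frequency: str
    normalization: str


def parse_smart(notation: str) -> tuple[Weighting, Weighting]:
    """Split a ddd.qqq string into (document, query) weighting triples."""
    doc, sep, query = notation.partition(".")
    if not sep or len(doc) != 3 or len(query) != 3:
        raise ValueError(f"expected the form ddd.qqq, got {notation!r}")
    # Unpacking a three-character string yields one letter per field.
    return Weighting(*doc), Weighting(*query)


document, query = parse_smart("ltc.lnn")
print(document)
print(query)
```

Keeping the two triples as separate values mirrors the notation itself, in which the collection documents and the query may be weighted differently.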

The following tables establish the SMART notation:[2]

Symbols and notation
{\textstyle D_{i}=\{w_{i_{1}},w_{i_{2}},\ldots ,w_{i_{t}}\}} represents a document vector, where {\textstyle w_{i_{k}}} is the weight of the term {\textstyle T_{k}} in {\textstyle D_{i}} and {\displaystyle t} is the number of unique terms in {\textstyle D_{i}}. Positive features characterize terms that are present in a document, and the weight of zero is used for terms that are absent from a document.
{\textstyle f_{i_{k}}}: Occurrence frequency of term {\textstyle T_{k}} in document {\textstyle D_{i}}
{\textstyle u_{i}}: Number of unique terms in document {\textstyle D_{i}}
{\displaystyle N}: Number of collection documents
{\displaystyle \operatorname {avg} (u)}: Average number of unique terms in a document
{\textstyle n_{k}}: Number of documents with term {\textstyle T_{k}} present
{\displaystyle b_{i}}: Number of characters in document {\displaystyle D_{i}}
{\displaystyle \max(f_{i_{k}})}: Occurrence frequency of the most common term in document {\displaystyle D_{i}}
{\textstyle \operatorname {avg} (b)}: Average number of characters in a document
{\displaystyle \operatorname {avg} (f_{i_{k}})}: Average occurrence frequency of a term in document {\displaystyle D_{i}}
{\textstyle G}: Global collection statistics
{\displaystyle s}: The slope in the context of pivoted document length normalization[3]
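To show how these symbols combine into a weight, here is a minimal Python sketch of one weighting triple, ltc. It assumes the commonly cited reading of the letters, which is not spelled out in the text here: l is a logarithmic term frequency 1 + log f, t is an inverse document frequency log(N / n_k), and c is cosine normalization. The function name `ltc_weights`, its parameters, and the choice of the natural logarithm are all assumptions made for illustration.

```python
import math


def ltc_weights(
    term_freqs: dict[str, int],
    doc_freqs: dict[str, int],
    n_docs: int,
) -> dict[str, float]:
    """Weight the terms of one document under an assumed ltc scheme.

    term_freqs maps each term T_k to f_ik, its frequency in the document.
    doc_freqs maps each term T_k to n_k, the number of documents containing it.
    n_docs is N, the number of documents in the collection.
    """
    # l and t components: (1 + log f_ik) * log(N / n_k), natural log assumed.
    raw = {
        term: (1.0 + math.log(f)) * math.log(n_docs / doc_freqs[term])
        for term, f in term_freqs.items()
        if f > 0  # absent terms keep an implicit weight of zero
    }
    # c component: divide by the Euclidean length of the raw weight vector.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


weights = ltc_weights(
    term_freqs={"retrieval": 3, "system": 1, "the": 5},
    doc_freqs={"retrieval": 10, "system": 50, "the": 100},
    n_docs=100,
)
print(weights)
```

In this made-up example the term "the" occurs in every document, so its inverse document frequency is log(1) = 0 and its weight vanishes, while the remaining weights are scaled so that the vector has unit length.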

The gray letters in the first, fifth, and ninth columns are the scheme used by Salton and Buckley in their 1988 paper.[4] The bold letters in the second, sixth, and tenth columns are the scheme used in experiments reported thereafter.