|
Our approach to Web search transaction analysis rests on the belief that "real world" queries collected without experimental interference can provide a realistic picture of Web searching in a natural setting. Although Web corpora are messy and do not meet the "prescribed standards" of information retrieval systems, these transactions not only document how Web searchers communicate their information needs to the system but can also reveal how they conceptualize the system. |
| Exhibit A – Query Characteristics and Information Needs |
| Queries represent information needs. In general, the number of queries submitted daily increases over time, while query length shows little variation. Here is the comparison between 2003 and 2004. Popular queries are identified by frequency (number of occurrences). Certain queries are seasonal, corresponding, for example, to academic cycles or events. |
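The two measures mentioned above, query frequency and query length, can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the normalization step (lowercasing and collapsing whitespace), the function names, and the sample queries are all assumptions made for the example.

```python
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(query.lower().split())


def query_stats(queries):
    """Return (frequency counter, mean query length in terms)."""
    normalized = [normalize(q) for q in queries]
    normalized = [q for q in normalized if q]  # drop empty submissions
    freq = Counter(normalized)
    if not normalized:
        return freq, 0.0
    mean_len = sum(len(q.split()) for q in normalized) / len(normalized)
    return freq, mean_len


# Made-up sample log, for illustration only.
sample = ["Library hours", "library  hours", "course catalog", "football schedule"]
freq, mean_len = query_stats(sample)
print(freq.most_common(1))  # the most popular query and its count
print(mean_len)             # average number of terms per query
```

Running the same function on each year's log separately would give the kind of year-to-year comparison described above.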
| Exhibit B – Corpus-Linguistic Characteristics and Information Needs |
| The corpus-linguistic analysis is based on unique queries (identical queries counted once), viewed as representations of information needs. A query, regardless of how many times it was submitted by different searchers or by the same searcher, is treated technically as a representation of one information need. We identified the user vocabulary and the co-occurrences of words. The degree of conceptual association between co-occurring words is measured by MI (mutual information), on the basis of which concept maps can be drawn. |
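The text does not give the MI formula used. A common choice in corpus linguistics is the pointwise form, MI(x, y) = log2( P(x, y) / (P(x) P(y)) ), and the sketch below assumes that form. It also assumes that probabilities are estimated as the share of unique queries containing a word or word pair. The function name and sample queries are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mi_scores(queries):
    """Pointwise MI for word pairs co-occurring within unique queries."""
    # Each distinct query counts once, however often it was submitted.
    unique_queries = {" ".join(q.lower().split()) for q in queries}
    unique_queries.discard("")
    n = len(unique_queries)

    word_count = Counter()
    pair_count = Counter()
    for q in unique_queries:
        words = sorted(set(q.split()))
        word_count.update(words)
        pair_count.update(combinations(words, 2))  # pairs in sorted order

    scores = {}
    for (a, b), c in pair_count.items():
        # log2( P(a,b) / (P(a) * P(b)) ) with P estimated as count / n
        scores[(a, b)] = math.log2(c * n / (word_count[a] * word_count[b]))
    return scores


# Made-up queries, for illustration only.
sample = ["financial aid", "financial aid office", "library hours",
          "library catalog", "Financial Aid"]
scores = mi_scores(sample)
print(scores[("aid", "financial")])
```

Pairs with high scores could serve as the edges of a concept map. With very small counts the scores are unstable, so a minimum co-occurrence threshold is often applied in practice.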
| A lexicon tool such as WordNet, with expansion, can be used to cluster similar queries at different levels: word, conceptual, and semantic. |
| Exhibit C – Interaction Behaviors within a Session |
| For an explanation of session identification and the technique used, see Overview. We also developed an interactive tool to explore different cutoff values for determining session boundaries. |
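The effect of different cutoff values can be illustrated with a short sketch. It assumes that a session boundary is placed wherever the gap between two consecutive queries from the same searcher exceeds the cutoff. The actual technique is described in the Overview and may differ. The function name, the 30-minute default, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, cutoff_minutes=30):
    """Group one searcher's query timestamps into sessions by time gap."""
    cutoff = timedelta(minutes=cutoff_minutes)
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the gap is within the cutoff.
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Made-up timestamps: minutes after 09:00 for one searcher.
base = datetime(2004, 3, 1, 9, 0)
times = [base + timedelta(minutes=m) for m in (0, 2, 5, 50, 53, 200)]

for cutoff in (10, 60):
    sizes = [len(s) for s in split_sessions(times, cutoff)]
    print(cutoff, sizes)
```

A shorter cutoff splits the same activity into more, smaller sessions, which is why exploring several values before fixing one is useful.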
| The analysis at the session level reveals how a searcher initiates a search with an original query and subsequently reiterates that query. Quantitative analysis clusters interaction behaviors based on variables such as Session Size (the number of submitted queries), Query Length (the average number of terms per query), Term Popularity (the average frequency of terms based on corpus frequency), Term Use (the average frequency of usage of each query term within the session), Query Interval (the average time between consecutive queries), and Pages Viewed (the average number of pages requested per query within a session). |
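The session variables listed above can be computed along the following lines. This is a sketch under stated assumptions: the `QueryEvent` record and `session_features` function are hypothetical, Term Popularity is averaged over the distinct terms in the session, and Term Use is taken as total term occurrences divided by distinct terms. The definitions in the text leave these details open.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueryEvent:
    """One submitted query (hypothetical record layout)."""
    text: str
    time: datetime
    pages_viewed: int


def session_features(session, corpus_freq):
    """Compute the six session-level variables for one session."""
    if not session:
        raise ValueError("session must contain at least one query")
    n = len(session)
    all_terms = [t for q in session for t in q.text.lower().split()]
    distinct = Counter(all_terms)
    gaps = [(b.time - a.time).total_seconds()
            for a, b in zip(session, session[1:])]
    return {
        "session_size": n,
        "query_length": len(all_terms) / n,
        # Assumed: mean corpus frequency over distinct session terms.
        "term_popularity": (sum(corpus_freq.get(t, 0) for t in distinct)
                            / len(distinct)) if distinct else 0.0,
        # Assumed: mean number of times each distinct term is used.
        "term_use": len(all_terms) / len(distinct) if distinct else 0.0,
        "query_interval": sum(gaps) / len(gaps) if gaps else 0.0,  # seconds
        "pages_viewed": sum(q.pages_viewed for q in session) / n,
    }


# Made-up session and corpus frequencies, for illustration only.
session = [
    QueryEvent("library hours", datetime(2004, 3, 1, 9, 0), 1),
    QueryEvent("library hours sunday", datetime(2004, 3, 1, 9, 1), 2),
    QueryEvent("library sunday hours", datetime(2004, 3, 1, 9, 3), 0),
]
corpus_freq = {"library": 120, "hours": 45, "sunday": 15}
feats = session_features(session, corpus_freq)
print(feats)
```

Each session then becomes one feature vector, which could be passed to any standard clustering method. The text does not say which method was used.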
| Long search sessions are likely to be problematic in that most of the reiterations/moves are either unsuccessful or unsystematic. Quantitative analysis provides a basis for identifying problematic searches, and subsequent qualitative analysis can reveal the underlying cognitive factors (knowledge structure and mental models of Web search systems). |
