Modeling Web Searching Behaviors and Designing New Effective Interactions for Digital Libraries

PIs: Dr. Peiling Wang    Dr. Dietmar Wolfram    Dr. Jin Zhang

Welcome to the Website for IMLS National Leadership Grant LG-06-05-0100-05

This project adopted a data mining approach to analyzing three Web query corpora in order to model Web searching behaviors. The three query corpora were collected from the server-side transaction logs of an academic Website, a general search engine, and a health information site. The transaction logs were processed, transformed, and loaded into a SQL 2005 data warehouse. Follow the link to view the relational data model for the query warehouse.

Our quantitative analysis focuses on the characteristics of queries, query words, word co-occurrences, and interactions during a search session, to provide a general understanding of how Web users search. It also lets us group interaction behaviors into clusters based on search session characteristics. Our qualitative analysis focuses on the linguistic and cognitive behaviors underlying the queries that represent searchers' information needs. Here we are interested in how Web searchers enter and reiterate queries during a search session and how the queries in a session cluster.
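As a rough illustration of the word and co-occurrence counts mentioned above, the following Python sketch tallies word frequencies and word pairs that appear together in the same query. It is not the project's actual procedure: the lowercasing, the whitespace tokenization, the function names, and the sample queries are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def tokenize(query):
    """Split a query into lowercase words (assumed whitespace tokenization)."""
    return query.lower().split()


def word_frequencies(queries):
    """Count how many times each word occurs across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(tokenize(query))
    return counts


def cooccurrence_counts(queries):
    """Count unordered word pairs that occur together within a single query."""
    pairs = Counter()
    for query in queries:
        # Deduplicate and sort so each pair is counted once per query
        # and always appears in the same (alphabetical) order.
        words = sorted(set(tokenize(query)))
        pairs.update(combinations(words, 2))
    return pairs


# Hypothetical sample queries, for illustration only.
SAMPLE_QUERIES = ["diabetes diet", "Diabetes symptoms", "diet plan diabetes"]
```

Calling `cooccurrence_counts(SAMPLE_QUERIES)` on these made-up queries counts the pair `("diabetes", "diet")` twice, since the two words share a query twice.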

One critical issue we encountered is the identification of search session boundaries. A search session is a set of consecutive queries from a single user for a specific information need. With server-side transaction logs, session boundaries are unclear, but establishing them is essential for analyzing interactions. Our approach was to define a session as a set of queries submitted from the same IP address within reasonable query intervals, where a query interval is the time lag between two consecutive queries. But what is the best threshold (cutoff value) for query intervals? We explored various cutoff values for defining session boundaries and found that the distributions of query intervals vary across query corpora. Based on this observation, we determined that the query interval value at the 80th percentile of the distribution of query intervals is a reasonable session boundary threshold. For a detailed discussion of session boundaries, please see papers, reports & presentations.

On this project Website, we present our preliminary results and interactive analytical tools. Please send your comments and suggestions to the project director at peilingw@utk.edu.
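The session boundary rule described above (same IP address, with a cutoff at the 80th percentile of query intervals) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the log format of `(ip, timestamp_in_seconds, query)` tuples, the nearest-rank percentile method, the choice to start a new session only when an interval is strictly greater than the threshold, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def group_by_ip(log):
    """Map each IP address to its (timestamp, query) pairs in time order."""
    by_ip = defaultdict(list)
    for ip, timestamp, query in log:
        by_ip[ip].append((timestamp, query))
    for entries in by_ip.values():
        entries.sort(key=lambda entry: entry[0])
    return by_ip


def query_intervals(log):
    """Time lags between consecutive queries from the same IP address."""
    intervals = []
    for entries in group_by_ip(log).values():
        for (earlier, _), (later, _) in zip(entries, entries[1:]):
            intervals.append(later - earlier)
    return intervals


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def split_sessions(log, threshold):
    """Split each IP's queries into sessions wherever an interval exceeds threshold."""
    sessions = []
    for ip, entries in group_by_ip(log).items():
        current = [entries[0][1]]
        for (earlier, _), (later, query) in zip(entries, entries[1:]):
            if later - earlier > threshold:  # assumed: strictly greater starts a new session
                sessions.append((ip, current))
                current = []
            current.append(query)
        sessions.append((ip, current))
    return sessions


# Hypothetical log records: (ip, seconds since start of log, query text).
SAMPLE_LOG = [
    ("10.0.0.1", 0, "library hours"),
    ("10.0.0.1", 10, "library hours sunday"),
    ("10.0.0.1", 25, "library map"),
    ("10.0.0.1", 1000, "flu symptoms"),
    ("10.0.0.1", 1012, "flu treatment"),
    ("10.0.0.2", 5, "course catalog"),
    ("10.0.0.2", 35, "course catalog 2006"),
]

threshold = percentile(query_intervals(SAMPLE_LOG), 80)
sessions = split_sessions(SAMPLE_LOG, threshold)
```

Because the distributions of query intervals vary across corpora, a sketch like this would presumably be run separately on each corpus, so that each one gets its own threshold. Note also that an IP address is only a proxy for a single user, which is part of why the boundaries in server-side logs are unclear.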