Jayanthkumar Kannan (a)
kjk@cs.berkeley.edu
Beverly Yang (b)
byang@stanford.edu
Scott Shenker (c)
shenker@icsi.berkeley.edu
Puneet Sharma (d)
puneet@hpl.hp.com
Sujata Banerjee (d)
sujata@hpl.hp.com
Sujoy Basu (d)
basus@hpl.hp.com
Sung-Ju Lee (d)
sjlee@hpl.hp.com
Abstract
As the academic world moves away from physical journals and proceedings towards online document repositories, the ability to efficiently locate work of interest among the torrent of newly generated papers will become increasingly important. To aid in this endeavor, we designed SmartSeer, a system that allows users to register personalized continuous queries over the CiteSeer database of technical documents. Users are then alerted whenever papers that match their queries are put online. SmartSeer has two main design requirements. First, to allow effective information retrieval, it should support rich continuous queries (as opposed to simple keyword searches). Second, to make effective use of donated infrastructure, it should be capable of running on a loosely maintained group of unreliable machines spread across multiple organizations (as opposed to assuming a reliable and tightly coupled distributed system). Existing work on distributed continuous query systems fails to meet at least one of these requirements. Our design for SmartSeer is based on Distributed Hash Tables (DHTs), and thereby leverages previous work on DHT-based query systems. A prototype of SmartSeer has been implemented and evaluated on PlanetLab. Though we evaluate our design only for the SmartSeer application, we believe it also provides useful insights for other distributed systems that support rich continuous queries, such as web alerts and news alerts.