RightNow Logo

RightNow provides a strategic solution to drive superior customer experiences
...while dramatically reducing costs.

Customer Experience Report

Colloquium Papers

Neal Richter - May 1, 2008 - Web Search Ranking

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity
Mikhail Bilenko and Ryen W. White
Microsoft Research

Abstract:
The paper proposes identifying relevant information sources from the history of combined searching and browsing behavior of many Web users. While it has been previously shown that user interactions with search engines can be employed to improve document ranking, browsing behavior that occurs beyond search result pages has been largely overlooked in prior work. The paper demonstrates that users’ post-search browsing activity strongly reflects implicit endorsement of visited pages, which allows estimating topical relevance of Web resources by mining large-scale datasets of search trails. We present heuristic and probabilistic algorithms that rely on such datasets for suggesting authoritative websites for search queries. Experimental evaluation shows that exploiting complete post-search browsing trails outperforms alternatives in isolation (e.g., clickthrough logs), and yields accuracy improvements when employed as a feature in learning to rank for Web search.

Mike Harrelson - April 24, 2008 - Neural networks

Confabulation theory
Dr. Robert Hecht-Nielsen
University of California, San Diego

Abstract:
Confabulation theory offers a comprehensive detailed explanation of the mechanism of thought (i.e., “cognition”: vision, hearing, reasoning, language, planning, origination of movement and thought processes, etc.) in humans and other vertebrates (and possibly in invertebrates, such as octopi and bees). For expositional simplicity, only the human case is considered here.

Steve Durbin - April 17, 2008 - Prediction Markets

Using Prediction Markets to Track Information Flows: Evidence from Google
Bo Cowgill*, Justin Wolfers**, and Eric Zitzewitz***
*Google; **Wharton School, Univ. of Pennsylvania; ***Dartmouth College

Abstract:
In the last 2.5 years, Google has conducted the largest corporate experiment with prediction markets we are aware of. In this paper, we illustrate how markets can be used to study how an organization processes information. We document a number of biases in Google’s markets, most notably an optimistic bias. Newly hired employees are on the optimistic side of these markets, and optimistic biases are significantly more pronounced on days when Google stock is appreciating. We find strong correlations in trading for those who sit within a few feet of one another; social networks and work relationships also play a secondary explanatory role. The results are interesting in light of recent research on the role of optimism in entrepreneurial firms, as well as recent work on the importance of geographical and social proximity in explaining information flows in firms and markets.

Neal Richter - April 10, 2008 - Artificial Intelligence

AI Chasers
Patrick Tucker
World Future Society

Abstract:
The “holy grail” of computer science may be within reach. A futurist looks toward tomorrow’s Artificial Intelligence Revolution.

The article was based on these interviews:
MIT roboticist Rodney Brooks, Adaptive A.I. Inc. founder Peter Voss, Self-Aware Systems founder Steve Omohundro, Powerset CEO Barney Pell, and Google research director Peter Norvig discuss how they see AI developing in the years ahead, when a human-level AI might emerge, and how worried we should be about that whole killer-robot-goes-on-rampage scenario.


Neal Richter - April 3, 2008 - Pattern Mining

Clustering Co-occurrences of Maximal Frequent Patterns in Streams
Edgar H. de Graaf, Joost N. Kok, Walter A. Kosters
Leiden Institute of Advanced Computer Science, Leiden University

Abstract:
One way of getting a better view of data is by using frequent patterns. In this paper frequent patterns are (sub)sets that occur a minimal number of times in a stream of itemsets. However, the discovery of frequent patterns in streams has always been problematic. Because streams are potentially endless it is in principle impossible to say if a pattern is often occurring or not. Furthermore, the number of patterns can be huge and a good overview of the structure of the stream is lost quickly. The proposed approach will use clustering to facilitate the “online” analysis of the structure of the stream.

A clustering on the co-occurrence of patterns will give the user an improved view on the structure of the stream. Some patterns might occur so often together that they should form a combined pattern. In this way the patterns in the clustering will approximate the largest frequent patterns: maximal frequent patterns. The number of (approximated) maximal frequent patterns is much smaller and combined with clustering methods these patterns provide a good view on the structure of the stream.

Our approach to decide if patterns occur often together is based on a method of clustering where only the distance between pairs of patterns is known. This distance is the Euclidean distance between points in a 2-dimensional space, where the points represent the frequent patterns, or rather the most important ones. The coordinates are adapted when the records from the stream pass by, and reflect the real support of the corresponding pattern. In this setup the support is viewed as the number of occurrences in a time window. The main algorithm tries to maintain a dynamic model of the data stream by merging and splitting these patterns. Experiments show the versatility of the method.


Richard McAllister - March 27, 2008 - Pattern Mining

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
Jiawei Han*, Jian Pei**, Yiwei Yin***, Runying Mao****
*Univ. of Illinois, **SUNY, ***Simon Fraser Univ., ****Microsoft

Abstract:
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns.

In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-treebased mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.


Steve Durbin - March 6, 2008 - Data commons

Urban Sensing: Out of the Woods
Dana Cuff, Mark Hansen, and Jerry Kang
UCLA

Abstract:
Embedded networked sensing, having successfully shifted from the lab to the environment, is primed for a more contentious move to the city to where citizens will likely be the target of data collection. This transition will warrant careful study and touch on issues that go far beyond the scientific realm.


Neal Richter - February 28, 2008 - Contextual Advertising

Finding Advertising Keywords on Web Pages
Wentau Yih,* Joshua Goodman,* Vitor R. Carvalho**
*Microsoft Research, **Carnegie Mellon University

Abstract:
A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with “relevant” keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.


Richard McAllister - February 21, 2008 - Clustering

Hierarchical Document Clustering Using Frequent Itemsets
Benjamin C.M. Fung, Ke Wangy, and Martin Esterz
Simon Fraser University

Abstract:
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handlings. Another requirement is hierarchical clustering where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms best existing methods in terms of both clustering accuracy and scalability.


Neal Richter - February 14, 2008 - Search Advertising

Optimal Delivery of Sponsored Search Advertisements Subject to Budget Constraints
Zoe Abrams, Ofer Mendelevitch, John A. Tomlin
Yahoo!

Abstract:
We discuss an auction framework in which sponsored search advertisements are delivered in response to queries. In practice, the presence of bidder budgets can have a significant impact on the ad delivery process. We propose an approach based on linear programming which takes bidder budgets into account, and uses them in conjunction with forecasting of query frequencies, and pricing and ranking schemes, to optimize ad delivery. Simulations show significant improvements in revenue and efficiency.


Steve Durbin - February 7, 2008 - Web economics

Three topics from The Technium, a book in progress
Kevin Kelly

How much does one search cost?

The Value of Search

Better Than Free


Steve Durbin - January 31, 2008 - Search

Trusted Search Communities
Peter Briggs and Barry Smyth
Adaptive Information Cluster, School of Computer Science and Informatics, University College Dublin

Abstract:
We describe a social search technique that harnesses the search experiences of a community of searchers to generate result recommendations, in a collaborative fashion, to complement results returned from some underlying search engine. We describe a dynamic model of trust as a way to coordinate collaboration, and provide experimental results to show that search performance improves as the network evolves.


Steve Durbin - December 6, 2007 - Design

Creativity Support Tools: Accelerating Discovery and Innovation
Ben Shneiderman
Dept. of Computer Science, Univ. of Maryland, College Park

Abstract:
How can designers of programming interfaces, interactive tools, and rich social environments enable more people to be more creative more often?


Anthony Arnone - November 29, 2007 - Search

SOLR and Lucene

Abstract:
This week we’ll be covering the search engine Lucene and one of its main sub-projects, Solr. I’ll be giving an overview of the concepts behind these two packages, and how you can use them to easily add expansive search capability to just about any Java environment you want.

Lucene is a search engine written in Java that provides high scalability, fast searching, robust handling of generic document structures, and a highly pluggable architecture that makes adding features easy. Lucene has a very active development community and is consistently on the ‘cutting edge’ of search engine technologies.

Solr is a servlet-based front end for the Lucene search engine. It provides XML-based configuration and index schema management, APIs in XML/HTTP, master-slave style replication, caching, and a web based admin interface. It also includes a bunch of trendy features for modern search engines such as faceting, more-like-this, and JSON responses, along with a host of other cool features under development.

Anthony's Powerpoint presentation


Steve Durbin - October 25, 2007 - Ontologies

Deriving a Large Scale Taxonomy from Wikipedia
Simone Paolo Ponzetto and Michael Strube
EML Research, Germany

Abstract:
We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.


Steve Durbin - October 18, 2007 - Text similarity

Improving Similarity Measures for Short Segments of Text
Wen-tau Yih and Christopher Meek
Microsoft Research

Abstract:
In this paper we improve previous work on measuring the similarity of short segments of text in two ways. First, we introduce a Web-relevance similarity measure and demonstrate its effectiveness. This measure extends the Web-kernel similarity function introduced by Sahami and Heilman (2006) by using relevance weighted inner-product of term occurrences rather than TF×IDF. Second, we show that one can further improve the accuracy of similarity measures by using a machine learning approach. Our methods outperform other state-of-the-art methods in a general query suggestion task for multiple evaluation metrics.


Neal Richter and Steve Durbin - October 4, 2007 - Information Retrieval

Search Engines that Learn from Implicit Feedback
Thorsten Joachims and Filip Radlinski
Cornell University

Abstract:
Search-engine logs provide a wealth of information that machine-learning techniques can harness to improve search quality. With proper interpretations that avoid inherent biases, a search engine can use training data extracted from the logs to automatically tailor ranking functions to a particular user group or collection.


Neal Richter - September 27, 2007 - Information Retrieval

Learning to Rank for Information Retrieval Using Genetic Programming
Jen-Yuan Yeh*, Jung-Yi Lin*, Hao-Ren Ke**, Wei-Pang Yang***
*Dept. of Computer Science, National Chiao Tung University
**Inst. of Information Management, National Chiao Tung University
***Dept. of Information Management, National Dong Hwa University

Abstract:
One central problem of information retrieval (IR) is to determine which documents are relevant and which are not to the user information need. This problem is practically handled by a ranking function which defines an ordering among documents according to their degree of relevance to the user query. This paper discusses work on using machine learning to automatically generate an effective ranking function for IR. This task is referred to as "learning to rank for IR" in the field. In this paper, a learning method, RankGP, is presented to address this task. RankGP employs genetic programming to learn a ranking function by combining various types of evidences in IR, including content features, structure features, and query-independent features. The proposed method is evaluated using the LETOR benchmark datasets and found to be competitive with Ranking SVM and RankBoost.


Steve Durbin - September 20, 2007 - Knowledge base organization

Collaborative Structuring: Organizing Document Repositories Effectively and Efficiently
Harris Wu* and Michael Gordon**
*Old Dominion University; **University of Michigan

Abstract:
Improving how documents are managed and data is stored and exchanged in systems.


Steve Durbin - September 13, 2007 - Chatbots

A Terrorism Question/Answer System
R. P. Schumaker,* Y. Liu,** M. Ginsburg,*** and H. Chen***
*IS Dept, Iona College; **IS Dept., Cal State Long Beach; ***Dept. of Managmement Information Systems, U. of Arizona

Abstract:
The TARA Project examined how a trio of modified chatterbots could be used to disseminate terrorism-related information to the general public.


Ben Werner - August 23, 2007 - HCI

Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms
Hiroshi Ishii and Brygg Ullmer, MIT Media Lab

Abstract:
This paper presents our vision of Human Computer Interaction (HCI): "Tangible Bits." Tangible Bits allows users to "grasp & manipulate" bits in the center of users’ attention by coupling the bits with everyday physical objects and architectural surfaces. Tangible Bits also enables users to be aware of background bits at the periphery of human perception using ambient display media such as light, sound, airflow, and water movement in an augmented space. The goal of Tangible Bits is to bridge the gaps between both cyberspace and the physical environment, as well as the foreground and background of human activities.
This paper describes three key concepts of Tangible Bits: interactive surfaces; the coupling of bits with graspable physical objects; and ambient media for background awareness. We illustrate these concepts with three prototype systems – the metaDESK, transBOARD and ambientROOM – to identify underlying research issues.


Steve Durbin - August 2, 2007 - Search

How Well does Result Relevance Predict Session Satisfaction?
Scott B. Huffman, Michael Hochster, Google, Inc.

Abstract:
Per-query relevance measures provide standardized, repeatable measurements of search result quality, but they ignore much of what users actually experience in a full search session. This paper examines how well we can approximate a user’s ultimate session-level satisfaction using a simple relevance metric. We find that this relationship is surprisingly strong. By incorporating additional properties of the query itself, we construct a model which predicts user satisfaction even more accurately than relevance alone.


Ben Werner - July 19, 2007 - Usability

Examining the Robustness of Sensor-Based Statistical Models of Human Interruptibility
James Fogarty, Scott E. Hudson, HCI Institute, Carnegie-Mellon
Jennifer Lai, IBM T.J. Watson Research Center

Abstract:
Current systems often create socially awkward interruptions or unduly demand attention because they have no way of knowing if a person is busy and should not be interrupted. Previous work has examined the feasibility of using sensors and statistical models to estimate human interruptibility in an office environment, but left open some questions about the robustness of such an approach. This paper examines several dimensions of robustness in sensor-based statistical models of human interruptibility. We show that real sensors can be constructed with sufficient accuracy to drive the predictive models. We also create statistical models for a much broader group of people than was studied in prior work. Finally, we examine the effects of training data quantity on the accuracy of these models and consider tradeoffs associated with different combinations of sensors. As a whole, our analyses demonstrate that sensor-based statistical models of human interruptibility can provide robust estimates for a variety of office workers in a range of circumstances, and can do so with accuracy as good as or better than people. Integrating these models into systems could support a variety of advances in human computer interaction and computer-mediated communication.


Neal Richter - June 21, 2007 - Search Ranking

Google Keeps Tweaking Its Search Engine
Saul Hansell
New York Times

Abstract:
Article on the Search Quality group at Google

See also the keynote presentation at SIGIR 2005: Challenges in Running a Commercial Search Engine by Amit Singhal of Google


Neal Richter - June 7, 2007 - Search Engines

Why writing your own search engine is hard
Anna Patterson
Stanford University

Abstract:
There must be 4,000 programmers typing away in their basements trying to build the next “world’s most scalable” search engine. It has been done only a few times. It has never been done by a big group; always one to four people did the core work, and the big team came on to build the elaborations and the production infrastructure. Why is it so hard? We are going to delve a bit into the various issues to consider when writing a search engine. This article is aimed at those individuals or small groups that are considering this endeavor for their Web site or intranet. It is fun, but a word of caution: not only is it difficult, but you need two commodities in short supply—time and patience.


Zuzana Gedeon - May 17, 2007 - Search

Multi-factor Clustering for a Marketplace Search Interface
Neel Sundaresan, Kavita Ganesan and Roopnath Grandhi
eBay Research Labs

Abstract:
Search engines provide a small window to the vast repository of data they index and against which they search. They try their best to return the documents that are of relevance to the user but often a large number of results may be returned. Users struggle to manage this vast result set looking for the items of interest. Clustering search results is one way of alleviating this navigational pain. In this paper we describe a clustering system that enables clustering search results in an online marketplace search system.

Adaptive Faceted Browser for Navigation in Open Information Spaces
Michal Tvarožek and Mária Bieliková
Faculty of Informatics and Information Technologies, Slovak University of Technology

Abstract:
Open information spaces have several unique characteristics such as their changeability, large size, complexity and diverse user base. These result in novel challenges during user navigation, information retrieval and data visualization in open information spaces. We propose a method of navigation in open information spaces based on an enhanced faceted browser with support for dynamic facet generation and adaptation based on user characteristics.


Zuzana Gedeon, Hasari Tosun, Steve Durbin - April 12, 2007 - WWW 2007

Navigating the Intranet with High Precision
Huaiyu Zhu, Alexander Loser, Sriram Raghavan, and Shivakumar Vaithyanathan
IBM Almaden Research Center and SAP Research CEC

Abstract:
Despite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work that compared intranets and the Internet from the view point of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of termvariants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general purpose search algorithm.

The Complex Dynamics of Collaborative Tagging
Harry Halpin, Valentin Robu, and Hana Shepherd
University of Edinburgh; Center for Math and CS (Amsterdam); Princeton University

Abstract:
The debate within the Web community over the optimal means by which to organize information often pits formalized classications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems including whether coherent categorization schemes can emerge from unsupervised tagging by users. This paper uses data from the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for "popular" sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and to determine the patterns prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags.

Identifying Ambiguous Queries in Web Search
Ruihua Song, Zhenxiao Luo, Ji-Rong Wen, Yong Yu, and Hsiao-Wuen Hon
Shanghai Jiao Tong University, Microsoft Research Asia, Fudan University

Abstract:
It is widely believed that some queries submitted to search engines are by nature ambiguous (e.g., java, apple). However, few studies have investigated the questions of "how many queries are ambiguous?" and "how can we automatically identify an ambiguous query?" This paper deals with these issues. First, we construct the taxonomy of query ambiguity, and ask human annotators to manually classify queries based upon it. From manually labeled results, we find that query ambiguity is to some extent predictable. We then use a supervised learning approach to automatically classify queries as being ambiguous or not. Experimental results show that we can correctly identify 87% of labeled queries. Finally, we estimate that about 16% of queries in a real search log are ambiguous.


Steve Durbin - April 5, 2007 - Personalization

Open User Profiles for Adaptive News Systems: Help or Harm?
Jae-wook Ahn, Peter Brusilovsky, Jonathan Grady, Daqing He and Sue Yeon Syn
School of Information Sciences, University of Pittsburgh

Abstract:
Over the last five years, a range of projects have focused on progressively more elaborated techniques for adaptive news delivery. However, the adaptation process in these systems has become more complicated and thus less transparent to the users. In this paper, we concentrate on the application of open user models in adding transparency and controllability to adaptive news systems. We present a personalized news system, YourNews, which allows users to view and edit their interest profiles, and report a user study on the system. Our results confirm that users prefer transparency and control in their systems, and generate more trust to such systems. However, similar to previous studies, our study demonstrate that this ability to edit user profiles may also harm the system’s performance and has to be used with caution.


Steve Durbin - March 29, 2007 - Reputation Systems

A Content-Driven Reputation System for the Wikipedia
B. Thomas Adler and Luca de Alfaro
University of California, Santa Cruz

Abstract:
We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions.
We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms


Steve Durbin - March 22, 2007 - Collective Intelligence

Web pundits trumpet social software as a means of generating knowledge by tapping into the wisdom of crowds. The following two articles offer contrarian perspectives.

The "Dumbness of Crowds"
Kathy Sierra

DIGITAL MAOISM: The Hazards of the New Online Collectivism
Jaron Lanier


Neal Richter - March 15, 2007 - Semantic Web

AI Meets Web 2.0: Building the Web of Tomorrow, Today
Jay M. Tenenbaum

Abstract:
Imagine an Internet-scale knowledge system where people and intelligent agents can collaborate on solving complex problems in business, engineering, science, medicine, and other endeavors. Its resources include semantically tagged websites, wikis, and blogs, as well as social networks, vertical search engines, and a vast array of web services from business processes to AI planners and domain models. Research prototypes of decentralized knowledge systems have been demonstrated for years, but now, thanks to the web and Moore’s law, they appear ready for prime time. This article introduces the architectural concepts for incrementally growing an Internet-scale knowledge system and illustrates them with scenarios drawn from e-commerce, e-science, and e-life.


Steve Durbin - March 8, 2007 - Tag Mining

Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development
Benjamin Szekely and Elias Torres
Harvard University

Abstract:
Online communities are an important forum for people to share information on the internet. Historically, online communities have been centered around free-text sharing and simple user rating systems. With the shadow of the Semantic Web looming behind many new innovations on the internet today, online communities are beginning to adopt semantic techniques such as tagging and folksonomies as a vehicle for the sharing and classification of information. Gourmetvillage.org is a prototype online community that uses tagging to provide an innovative restaurant rating mechanism. Users may provide free-text reviews and tags for any aspect of a restaurant. UserRank is an algorithm based on Google’s PageRank that provides a ranking of users based on whose taggings are most often followed. TagRank provides a ranking of tags based on the ranking of users. UserRank may in fact be used to rank any entity in a community based on user association. del.icio.us contains a significant dataset of users and tagged entities with which to test these algorithms. UserRank and TagRank yield different rankings than their count-based counterparts. When drilling down into the results, the top user according to UserRank only made a few taggings, yet hundreds of users in the system agreed with his taggings, showing that these algorithms capture community consensus fairly well.


Steve Durbin - March 1, 2007 - Tag Mining

Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Paul Heyman and Hector Garcia-Molina
Computer Science Department, Stanford University

Abstract:
Collaborative tagging systems -- systems where many casual users annotate objects with free-form strings (tags) of their choosing -- have recently emerged as a powerful way to label and organize large collections of data. During our recent investigation into these types of systems, we discovered a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm and then present a preliminary model to explain why it is so effective in these types of systems.


Steve Durbin - February 15, 2007 - User interface

Clustering versus faceted categories for information exploration
Marti Hearst
School of Information Management and Systems
University of California, Berkeley

Abstract:
Information seekers often express a desire for a user interface that organizes search results into meaningful groups, in order to help make sense of the results, and to help decide what to do next. A longitudinal study in which participants were provided with the ability to group search results found they changed their search habits in response to having the grouping mechanism available [2].

There are many open research questions about how to generate useful groupings and how to design interfaces to support exploration using grouping. Currently two methods are quite popular: clustering and faceted categorization. Here, I describe both approaches and summarize their advantages and disadvantages based on the results of usability studies.


Steve Durbin - February 8, 2007 - User interface

Simplicity is Highly Overrated is a short essay on the notion of simplicity in user interfaces.

Cautious Cars and Cantankerous Kitchens: How Machines Take Control
Don Norman
Professor of Computer Science, Psychology, and Cognitive science
Northwestern University

Abstract:
The first chapter of The Design of Future Things, to be published in 2008. It‘s about the ever-increasing role of automation in our homes and automobiles, why it is being done so badly, with suggestions for doing it right through what I call “natural interaction.”


Steve Durbin - February 1, 2007 - Ambient AI

Machine Learning, Reasoning, and Intelligence in Daily Life: Directions and Challenges
Eric Horvitz
Microsoft Research

Abstract:
Technical developments and trends are providing a fertile substrate for creating and integrating machine learning and reasoning into multiple applications and services. I will review several illustrative research efforts on our team, and focus on challenges, opportunities, and directions with the streaming of machine intelligence into daily life.


Steve Durbin - November 30, 2006 - Natural Language Understanding

Machine Reading
Oren Etzioni, Michele Banko, Michael J. Cafarella
Computer Science & Engineering, University of Washington

Abstract:
The time is ripe for the AI community to set its sights on Machine Reading---the automatic, unsupervised understanding of text. In this paper, we place the notion of “Machine Reading” in context, describe progress towards this goal by the KnowItAll research group at the University of Washington, and highlight several central research questions.


Steve Durbin - November 2, 2006 - Social Networks

The Dynamics of Viral Marketing
Jure Leskovec, Carnegie Mellon
Lada A. Adamic, University of Michigan
Bernardo A. Huberman, HP Labs

Abstract:
We present an analysis of a person-to-person recommendation network, consisting of 4 million people who made 16 million recommendations on half a million products. We observe the propagation of recommendations and the cascade sizes, which we explain by a simple stochastic model. We analyze how user behavior varies within user communities defined by a recommendation network. Product purchases follow a 'long tail' where a significant share of purchases belongs to rarely sold items. We establish how the recommendation network grows over time and how effective it is from the viewpoint of the sender and receiver of the recommendations. While on average recommendations are not very effective at inducing purchases and do not spread very far, we present a model that successfully identfies communities, product and pricing categories for which viral marketing seems to be very effective.


Steve Durbin - October 26, 2006 - User Behavior

The origin of bursts and heavy tails in human dynamics
Albert-Laszlo Barabasi
Center for Complex Networks Research and Dept. of Physics, University of Notre Dame

Abstract:
The dynamics of many social, technological and economic phenomena are driven by individual human actions, turning the quantitative understanding of human behaviour into a central question of modern science. Current models of human dynamics, used from risk assessment to communications, assume that human actions are randomly distributed in time and thus letters to nature well approximated by Poisson processes. In contrast, there is increasing evidence that the timing of many human activities, ranging from communication to entertainment and work patterns, follow non-Poisson statistics, characterized by bursts of rapidly occurring events separated by long periods of inactivity. Here I show that the bursty nature of human behaviour is a consequence of a decision-based queuing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, with most tasks being rapidly executed, whereas a few experience very long waiting times. In contrast, random or priority blind execution is well approximated by uniform inter-event statistics. These finding have important implications, ranging from resource management to service allocation, in both communications and retail.


Hasari Tosun - October 19, 2006 - Recommender systems

Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion
Jun Wang, Arjen P. de Vries, and Marcel J. T. Reinders
Information and Communication Theory Group, Delft University of Technology

Abstract:
Memory-based methods for collaborative filtering predict new ratings by averaging (weighted) ratings between, respectively, pairs of similar users or items. In practice, a large number of ratings from similar users or similar items are not available, due to the sparsity inherent to rating data. Consequently, prediction quality can be poor. This paper reformulates the memory-based collaborative filtering problem in a generative probabilistic framework, treating individual user-item ratings as predictors of missing ratings. The final rating is estimated by fusing predictions from three sources: predictions based on ratings of the same item by other users, predictions based on different item ratings made by the same user, and, third, ratings predicted based on data from other but similar users rating other but similar items. Existing user-based and item-based approaches correspond to the two simple cases of our framework. The complete model is however more robust to data sparsity, because the different types of ratings are used in concert, while additional ratings from similar users towards similar items are employed as a background model to smooth the predictions. Experiments demonstrate that the proposed methods are indeed more robust against data sparsity and give better recommendations.


David Tolliver - October 12, 2006 - Clustering

Note special time (Thursday 3:10) and place (1-139 Wilson)

Spectral Approximation Algorithms: Clustering and Graph Partitioning
David Tolliver
Google Fellow and Department of Computer Science, Carnegie Mellon University

Abstract:
I'll provide an overview of recent advances in spectral clustering and partitioning technology. I'll briefly cover a fundamental result in clustering, and then delve into the basic spectral clustering algorithm. I'll walk through a few variations on the basic algorithm, highlighting the specific applications domains and drawbacks. Finally I'll cover our proof of an approximation bound for the Normalized Cut (an NP-hard objection function) for a simple spectral algorithm.


Steve Durbin - October 5, 2006 - Search

A Live-User Evaluation of Collaborative Web Search
Barry Smyth, Evelyn Balfe, Oisin Boydell, Keith Bradley, Peter Briggs, Maurice Coyle, Jill Freyne
Smart Media Institute, Department of Computer Science, University College Dublin

Abstract:
Collaborative Web search exploits repetition and regularity within the query-space of a community of like-minded individuals in order to improve the quality of search results. In short, search results that have been judged to be relevant for past queries are promoted in response to similar queries that occur in the future. In this paper we present the results of a large-scale evaluation of this approach, in a corporate Web search scenario, which shows that significant benefits are available to its users.


Steve Durbin - September 28, 2006 - Tagging

Visualizing Tags over Time
Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins
Yahoo! Research

Abstract:
We consider the problem of visualizing the evolution of tags within the Flickr (flickr.com) online image sharing community. Any user of the Flickr service may append a tag to any photo in the system. Over the past year, users have on average added over a million tags each week. Understanding the evolution of these tags over time is therefore a challenging task. We present a new approach based on a characterization of the most interesting tags associated with a sliding interval of time. An animation provided via Flash in a web browser allows the user to observe and interact with the interesting tags as they evolve over time.
New algorithms and data structures are required to support the efficient generation of this visualization. We combine a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time.


Neal Richter - September 21, 2006 - Index Mining

Random Sampling from a Search Engine's Index Best Paper Winner at WWW-2006
Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel Maxim Gurevich, Department of Electrical Engineering, Technion, Israel

Abstract:
We revisit a problem introduced by Bharat and Broder al- most a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sam- pling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample docu- ments, but each sample is accompanied by a corresponding ¿weight¿, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to pro- duce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substanti- ate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked doc- uments. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.


Neal Richter - September 14, 2006 - Click-stream Data and Queries

Time-Dependent Semantic Similarity Measure of Queries Using Historical Click-Through Data
Qiankun Zhao, Steven C. H. Hoi, Tie-Yan Liu, Sourav S Bhowmick, Michael R. Lyu, Wei-Ying Ma
Nanyang Technological University, The Chinese University of HK, Microsoft Research Asia

Abstract:
It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of clickthrough data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework of time-dependent query semantic similarity model on exploiting the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.

Steve Durbin - September 7, 2006 - Ranking the web

Beyond PageRank: machine learning for static ranking
Matthew Richardson, Amit Prakash, and Eric Brill
Microsoft Research, Inc.

Abstract:
Since the publication of Brin and Page's paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).


Steve Durbin - August 31, 2006 - Better answers

Retroactive Answering of Search Queries
Beverly Yang and Glen Jeh
Google, Inc.

Abstract:
Major search engines currently use the history of a user's actions (e.g., queries, clicks) to personalize search results. In this paper, we present a new personalized service, query-specific web recommendations (QSRs), that retroactively answers queries from a user's history as new results arise. The QSR system addresses two important subproblems with applications beyond the system itself: (1) Automatic identification of queries in a user's history that represent standing interests and unfulfilled needs. (2) Effective detection of interesting new results to these queries. We develop a variety of heuristics and algorithms to address these problems, and evaluate them through a study of Google history users. Our results strongly motivate the need for automatic detection of standing interests from a user's history, and identifies the algorithms that are most useful in doing so. Our results also identify the algorithms, some which are counter-intuitive, that are most useful in identifying interesting new results for past queries, allowing us to achieve very high precision over our data set.


Sam Gardner - April 27, 2006 - Neural networks

Self-Organization of Hierarchical Visual Maps with Feedback Connections
Yiu Fai Sit and Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin

Abstract:
Visual areas in primates are known to have reciprocal connections. While the feedforward bottom-up processing of visual information has been studied extensively for decades, little is known about the role of the feedback connections. Existing feedback models usually employ hand-coded connections, and do not address how these connections develop. The model described in this paper shows how feedforward and feedback connections between cortical areas V1 and V2 can be learned through self-organization simultaneously. Computational experiments show that both areas can form hierarchical representations of the input with reciprocal connections that link relevant cells in the two areas.


Mike Emery - April 20, 2006 - Self-organization

Decentralised Autonomic Computing: Analysing Self-Organising Emergent Behaviour using Advanced Numerical Methods
Tom De Wolf, Giovanni Samaey, Tom Holvoet and Dirk Roose
Department of Computer Science, KU Leuven, Belgium

Abstract:
When designing decentralised autonomic computing systems, a fundamental engineering issue is to assess systemwide behaviour. Such decentralised systems are characterised by the lack of global control, typically consist of autonomous cooperating entities, and often rely on self-organised emergent behaviour to achieve the requirements.
A well-founded and practically feasible approach to study overall system behaviour is a prerequisite for successful deployment. On one hand, formal proofs of correct behaviour and even predictions of the exact systemwide behaviour are practically infeasible due to the complex, dynamic, and often non-deterministic nature of self-organising emergent systems. On the other hand, simple simulations give no convincing arguments for guaranteeing system-wide properties. We describe an alternative approach that allows to analyse and assess trends in systemwide behaviour, based on so-called equation-free macroscopic analysis. This technique yields more reliable results about the system-wide behaviour, compared to mere observation of simulation results, at an affordable computational cost. Numerical algorithms act at the system-wide level and steer the simulations. This allows to limit the amount of simulations considerably.
We illustrate the approach by studying a particular system-wide property of a decentralised control system for Automated Guided Vehicles and we outline a road map towards a general methodology for studying decentralised autonomic computing systems.


Leif Wickland - April 13, 2006 - Web text mining

Deriving Marketing Intelligence from Online Discussion
Glance et al.
Intelliseek Applied Research Center

Abstract:
Weblogs and message boards provide online forums for discussion that record the voice of the public. Woven into this mass of discussion is a wide range of opinion and commentary about consumer products. This presents an opportunity for companies to understand and respond to the consumer by analyzing this unsolicited feedback. Given the volume, format and content of the data, the appropriate approach to understand this data is to use large-scale web and text data mining technologies. This paper argues that applications for mining large volumes of textual data for marketing intelligence should provide two key elements: a suite of powerful mining and visualization technologies and an interactive analysis environment which allows for rapid generation and testing of hypotheses. This paper presents such a system that gathers and annotates online discussion relating to consumer products using a wide variety of state-of-the-art techniques, including crawling, wrapping, search, text classification and computational linguistics. Marketing intelligence is derived through an interactive analysis framework uniquely configured to leverage the connectivity and content of annotated online discussion.


Neal Richter - April 6, 2006 - Social Networking

System of Mobile Agents to Model Social Networks
Gonzalez, Marta C., Lind, Pedro G. and Herrmann, Hans J.
Univ Stuttgart, Departamento de Fisica, Universidade Federal do Ceara

Abstract:
We propose a model of mobile agents to construct social networks, based on a system of moving particles by keeping track of the collisions during their permanence in the system. We reproduce not only the degree distribution, clustering coefficient and shortest path length of a large data base of empirical friendship networks recently collected, but also some features related with their community structure. The model is completely characterized by the collision rate and above a critical collision rate we find the emergence of a giant cluster in the universality class of two-dimensional percolation. Moreover, we propose possible schemes to reproduce other networks of particular social contacts, namely sexual contacts.


Bob Wall - March 30, 2006 - Models of Neural Processing

Computation in the Higher Visual Cortices: Map-Seeking Circuit Theory and Application to Machine Vision
David Arathorn
Center for Computational Biology, MSU, and General Intelligence Corporation

Abstract:
Map-Seeking Circuit theory is a biologically based computational theory of vision applicable to difficult machine vision problems such as recognition of 3D objects in arbitrary poses amid distractors and clutter, as well as to non-recognition problems such as terrain interpretation. It provides a general computational mechanism for tractable discovery of correspondences in massive transformation spaces by exploiting an ordering property of superpositions. The latter allows a set of transformations of an input image to be formed into a sequence of superpositions which are then culled to a composition of single mappings by a competitive process which matches each superposition against a superposition of inverse transformations of memory patterns. The architecture that performs this is based on a number of neuroanatomical features of the visual cortices, including reciprocal dataflows and inverse mappings.


Steve Durbin - March 9, 2006 - Spelling correction

Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak
University of Alberta

Abstract:
Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.


Dave Kortenkamp - March 3, 2006 - Robotics

Note special time (Friday 3:10) and place, 101 Roberts

Worlds to Explore: Autonomy Challenges for Human Space Flight

Abstract:
This talk will look at two distinct artificial intelligence challenges in human space flight. The first challenge can be compared to building HAL 9000 from the movie 2001: A Space Odyssey, that is, autonomous monitoring and control of long duration mission functions. The second challenge can be compared to building Lt. Cmdr. Data from Star Trek: The Next Generation, that is, autonomous control of highly dexterous robots. Both challenges will be highlighted by showing research being done at NASA Johnson Space Center. This research includes autonomous control of advanced life support system tests and control of a humanoid robot called Robonaut. The talk will include many videos of NASA's robotics research.

Dave Kortenkamp's Home Page


Steve Durbin - February 23, 2005 - Language evolution

The "Lexical Contract": Modeling the emergence of word pronunciations
???

Abstract:
Even in the absence of writing, human languages appear to develop and maintain vocabularies of roughly 10,000 morphemes, and at least 100,000 words or phrases whose meaning is not predictable from their constituent parts. This presents a formidable task for the individual language learner, who must learn words from a small number of uses in context, and must manage to remember such a large inventory of arbitrary pronunciations. On a different scale of space and time, this also presents an extraordinary challenge for the speech community, which must must somehow form and maintain a consensus about these tens of thousands of complex and arbitrary conventions. This is accomplished despite the fact that the conventional knowledge at stake is tacit: the individuals involved typically don't know what they know about word pronunciation, and don't try to discuss the details of this knowledge. ... We will try to find simple programs for interacting agents that lead to successful emergence of patterns of word pronunciation. As in the case of "ant colony optimization", we'll look for algorithms that can accomplish this without requiring any designation of authorities or any explicit negotiation to reach consensus.


Zuzana Gedeon - February 16, 2005 - Text clustering

Topic-Driven Clustering for Document Datasets
Ying Zhao and George Karypis
Dept. of Computer Science and Engineering, Univ. of Minnesota

Abstract:
In this paper, we define the problem of topic-drive clustering, which organizes a document collection according to a set of topics. We propose three topic-driven schemes that consider the similarity between documents and topics and the relationship among documents themselves simultaneously. We present a comprehensive experimental evaluation of the proposed topic-driven schemes on five datasets. Our experimental results show that the proposed topic-driven schemes are efficient and effective with topic prototypes of different levels of specificity.


Steve Durbin - February 9, 2005 - Language Evolution

Language evolution: consensus and controversies
Morten H. Christiansen* and Simon Kirby**
*Cornell University and **Edinburgh University

Abstract:
Why is language the way it is? How did language come to be this way? And why is our species alone in having complex language? These are old unsolved questions that have seen a renaissance in the dramatic recent growth in research being published on the origins and evolution of human language. This review provides a broad overview of some of the important current work in this area. We highlight new methodologies (such as computational modeling), emerging points of consensus (such as the importance of pre-adaptation), and the major remaining controversies (such as gestural origins of language). We also discuss why language evolution is such a difficult problem, and suggest probable directions research may take in the near future.


Neal Richter - February 2, 2005 - Text Mining with Genetic Algorithms

A Semantically Guided and Domain-Independent Evolutionary Model for Knowledge Discovery From Texts
John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken

Abstract:
We present a novel evolutionary model for knowledge discovery from texts (KDTs), which deals with issues concerning shallow text representation and processing for mining purposes in an integrated way. Its aims is to look for novel and interesting explanatory knowledge across text documents. The approach uses natural language technology and genetic algorithms to produce explanatory novel hypotheses. The proposed approach is interdisciplinary, involving concepts not only from evolutionary algorithms but also from many kinds of text mining methods. Accordingly, new kinds of genetic operations suitable for text mining are proposed. The principles behind the representation and a new proposal for using multiobjective evaluation at the semantic level are described. Some promising results and their assessment by human experts are also discussed which indicate the plausibility of the model for effective KDT.


Steve Durbin - January 26, 2005 - Bayesian Modelling

Bayesian models of human action understanding
Chris L. Baker, Joshua B. Tenenbaum and Rebecca R. Saxe
Dept. of Brain and Cognitive Sciences, MIT

Abstract:
We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.


Reggie Mead - January 19, 2005 - Modelling

A Review of AI Techniques for Ecological Modelling
Reginald Mead
Dept. of Computer Science, MSU and RightNow Technologies

Abstract:
Environmental models are used to both further scientific understanding of environmental processes and to aid in environmental decision making. Artificial intelligence techniques can enhance the approaches that have traditionally been followed for modelling. Generally, ecological modelling approaches can be classified as either process-based, empirical-based, or a combination of the two. Process-based models use an understanding of underlying sub-processes to model a system. Expert systems can be constructed from expert knowledge and can be used for prediction, planning, interpretation, and more. Process-based models have high explanatory depth, but also tend to have low predictive power. Empirical-based models explicitly exclude any a priori knowledge of the system, and instead model the system based on observation data alone. ...


Rick Sojda - December 8, 2005 - Modelling

An Artificial Intelligence Modelling Approach to Simulating Animal/Habitat Interactions
H. Saarenmaa l, N.D. Stone 2,4, L.J. Folse 3, J.M. Packard 3, W.E. Grant 3, M.E. Makela 2 and R.N. Coulson 2
1 The Finnish Forest Research Institute
Departments of 2 Entomology, and 3 Wildlife and Fisheries Science, Texas A&M University

Abstract:
Ecological modellers have begun to recognize the potential of object-oriented programming techniques in structuring models. However, little has been done to take advantage of artificial intelligence's (AI) symbolic representations to model the decision-making processes of animals. Here, a generic model of animal-habitat interaction and a specific model of moose-, Alces alces L., forest interactions in Finland are described that are event-driven and behavior-based. Individual level simulation is accomplished through an object-oriented knowledge representation scheme and AI techniques to implement a hierarchical decision-making model of behavior. The habitat is likewise represented in an object-oriented scheme, allowing the simulation of a heterogeneous environment. Other AI techniques for modelling behavior, memory, and actions are discussed including LISP methods, rule-based reasoning, and several search algorithms. Simulations of the moose-forest system show the power of this approach but are not intended to advance the theory of large-herbivore behavior and foraging. AI techniques are found to be most beneficial in (a) studying population processes based on individual level models of behavior, (b) modelling spatial heterogeneity, (c) building event-driven models, (d) providing a conceptual clarity to model construction, and (e) providing a structure equally well suited to simulating resource management.


Neal Richter - December 1, 2005 - Implicit Feedback

A Study of Factors Affecting the Utility of Implicit Relevance Feedback
Ryen W. White,* Ian Ruthven,** and Joemon M. Jose^
*Human-Computer Interaction Laboratory, University of Maryland
**Dept. of Computer and Information Sciences, University of Strathclyde
^Dept of Computing Science, University of Glasgow

Abstract:
Implicit relevance feedback (IRF) is the process by which a search system unobtrusively gathers evidence on searcher interests from their interaction with the system. IRF is a new method of gathering information on user interest and, if IRF is to be used in operational IR systems, it is important to establish when it performs well and when it performs poorly. In this paper we investigate how the use and effectiveness of IRF is affected by three factors: search task complexity, the search experience of the user and the stage in the search. Our findings suggest that all three of these factors contribute to the utility of IRF.


Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims,* Laura Granka,** Bing Pan,^ Helene Hembrooke,^ and Geri Gay^
*Dept. of Computer Science, Cornell University
**Dept. of Communication, Stanford University
^Information Science, Cornell University

Abstract:
This paper examines the reliability of implicit feedback generated from clickthrough data in WWW search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average.


Bob Wall - November 17, 2005 - Clustering for Information Retrieval

Creating Concept Hierarchies in an Information Retrieval System
Bob Wall,*^ Neal Richter,*^ and Rafal A. Angryk^
*RightNow Technologies
^Dept. of Computer Science, Montana State University

Abstract:
Most information retrieval systems are comprised of a focused set of domain-specific documents located within a single logical repository. A mechanism is developed by which user queries against such a system are used to generate a concept hierarchy pertinent to the domain. First, an algorithm is described which extracts terms from documents matching user queries, and then reduces this set of terms to a manageable length. The resulting terms are used to generate a feature vector for each query, and the queries are clustered using a Hierarchical Agglomerative Clustering (HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is not particularly amenable to use by humans and which is slow to search due to its depth, so a subsequent processing step applies min-max partitioning to form a shallower, bushier tree that is a more natural representation of the concept hierarchy inherent in the system.


Steve Durbin - November 10, 2005 - Web computing

Learning from THE WEB (Word doc)
Learning from THE WEB (ACM Queue online)
Adam Bosworth
Google

Abstract:
The Web has much to teach us about managing and modeling distributed data. It's time we began listening.


Steve Durbin - November 3, 2005 - Web search

Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, and Junghoo Cho
Dept. of Computer Science, University of California, Los Angeles

Abstract:
There have been recent interests in studying the "goal" behind a user's Web query, so that this goal can be used to improve the quality of a search engine's results. Previous studies have mainly focused on using manual query-log investigation to identify Web query goals. In this paper we study whether and how we can automate this goal-identification process. We first present our results from a human subject study that strongly indicate the feasibility of automatic query-goal identification. We then propose two types of features for the goal-identification task: user-click behavior and anchor-link distribution. Our experimental evaluation shows that by combining these features we can correctly identify the goals for 90% of the queries studied.


Steve Durbin - October 27, 2005 - Visualization

Arc Diagrams: Visualizing Structure in Strings
Martin Wattenberg
IBM Research

Abstract:
This paper introduces a new visualization method, the arc diagram , which is capable of representing complex patterns of repetition in string data. Arc diagrams improve over previous methods such as dotplots because they scale efficiently for strings that contain many instances of the same subsequence. This paper describes design and implementation issues related to arc diagrams and shows how they may be applied to visualize such diverse data as music, text, and compiled code.


Leif Wickland - October 20, 2005 - Google File System

Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google

Abstract:
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
 
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
 
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
 
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.


Zuzana Gedeon - October 13, 2005 - Machine learning and speech recognition for evil purposes

Keyboard Acoustic Emanations Revisited
Li Zhuang, Feng Zhou, and J. D. Tygar
University of California, Berkeley

Abstract:
We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard, and then recovering up to 96% of typed characters. There is no need for a labeled training recording. Moreover the recognizer bootstrapped this way can even recognize random text such as passwords: In our experiments, 90% of 5-character random passwords using only letters can be generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts. Our attack uses the statistical constraints of the underlying content, English language, to reconstruct text from sound recordings without any labeled training data. The attack uses a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.


Anthony Arnone - September 29, 2005 - Web Crawling

User-Centric Web Crawling
Sandeep Pandey and Christopher Olston
Carnegie Mellon University

Abstract:
Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web.
 
In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a usercentric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.


Neal Richter - September 22, 2005 - Language learning

Unsupervised Learning of Natural Languages
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman,
Tel Aviv University, and Cornell University

Abstract:
We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.


Steve Durbin - September 15, 2005 - Web text mining

The Predictive Power of Online Chatter
Daniel Gruhl, R. Guha*, Ravi Kumar, Jasmine Novak and Andrew Tomkins
IBM Almaden Research, *Google

Abstract:
An increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. The migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. Based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, we are able to draw several interesting conclusions. First, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. Second, these queries can be automatically generated in many cases. And third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank.


Steve Durbin - September 8, 2005 - Robot swarms

Dynamic Task Assignment in Robot Swarms
James McLurkin and Daniel Yamins*
Computer Science and Artificial Intelligence Lab, MIT, *Dept. of Mathematics, Harvard University

Abstract:
A large group of robots will often be partitioned into subgroups, each subgroup performing a different task. This paper presents four distributed algorithms for assigning swarms of homogenous robots to subgroups to meet a specified global task distribution. Algorithm Random-Choice selects tasks randomly, but runs in constant time. Algorithm Extreme-Comm compiles a complete inventory of all the robots on every robot, runs quickly, but uses a great deal of communication. The CardDealer's algorithm assigns tasks to individual robots sequentially, using minimal communications but a great deal of time. The Tree-Recolor algorithm is a compromise between Extreme-Comm and Card-Dealer's, balancing communications use and running time. The three deterministic algorithms drive the system towards the desired assignment of subtasks with high accuracy. We implement the algorithms on a group of 25 iRobot SwarmBots, and collect and analyze performance data.


Photos and video of robot swarm behaviors

Neal Richter - August 18, 2005 - Michael Collins's Talk on NLP at the University of Washington.

Abstract:
Natural Language Processing Natural language processing offers a rich problem domain for machine learning approaches. Many NLP problems require the induction of a mapping that involves complex, discrete structures such as strings, labeled sequences, or trees. In this distinguished lecture, Michael Collins describes how 'large margin' methods in machine learning can be generalized to 'structured' problems found in NLP.

Speaker(s): Michael Collins, Massachusetts Institute of Technology
Subject: Natural Language processing
Series: CSE Colloquia - 2005
Produced by: University of Washington, December 7, 2005
Runtime: 00:58:16

Streaming Video at UWTV.org

Steve Durbin - August 4, 2005 - Personal information management

The Haystack Project is investigating approaches designed to let people manage their information in ways that make the most sense to them. Learn more at Haystack Project

The Re:Search Engine: Helping People Return to Information on the Web
Jaime Teevan
MIT

Abstract:
Re-finding information is commonly cited as a problem on the Web. One reason re-finding on the Web is difficult is that while people rely on a considerable amount of context to return to information (e.g., the original path taken to it), the Web makes no guarantee that the context will remain static. The Re:Search Engine is designed to help people return to information in the dynamic environment of the Web by maintaining consistency in the search results it returns across time. For example, if Connie, while looking to purchase a Global Positioning System, found several systems she liked via a search for GPS , she would expect to be able to use the same query to locate the exact same systems again. However, simply returning the original result list when she re-issues the query might omit newly available GPS systems that she would like to see. The ideal result list would contain both the systems Connie remembers having seen and high quality new systems. Because people tend to remember little of what is presented in a result list, when a person repeats a query, the Re:Search Engine can preserve what is remembered about the original result set while still presenting new information.


Neal Richter - May 19th and 26th, 2005 - Steven Wolfram's Talk at University of Michigan.

On October 8, 2003 Stephen Wolfram gave a talk sponsored by University of Michigan Center for Study of Complex Systems titled and about his (then new) book A New Kind of Science.
Wolfram.wmv - Windows Media Player Movie File

Abstract:
The foundational observation of "A New Kind of Science" is that very simple systems are very frequently capable of great complexity. While isolated examples of this have been known for millenia, no fundamental significance was attached to them.

The surprising fact is that if one just starts enumerating simple computational systems, one very quickly encounters great complexity. This is robust fact that does not depend on the details of the system in question, but rather seems to be a generic law of our universe.

The fact that complexity is in fact rather easy to generate suggests an entirely new approach to science. The traditional path of science is to seek out examples of behavior in the natural world and try to break them down into understandable pieces. But an alternative path is to just systematically enumerate simple abstract computational systems, starting with the very simplest, and try to understand what they do.

Links about the Book (and controversy surrounding it):

Steve Durbin - April 7, 2005 - Recommender systems

TiVo: Making Show Recommendations Using a Distributed Collaborative Filtering Architecture
Kamal Ali,* Wijnand van Stam**
*TiVo, Yahoo!; **TiVo

Abstract:
We describe the TiVo television show collaborative recommendation system which has been fielded in over one million TiVo clients for four years. Over this install base, TiVo currently has approximately 100 million ratings by users over approximately 30,000 distinct TV shows and movies. TiVo uses an item-item (show to show) form of collaborative filtering which obviates the need to keep any persistent memory of each user viewing preferences at the TiVo server. Taking advantage of TiVo client-server architecture has produced a novel collaborative filtering system in which the server does a minimum of work and most work is delegated to the numerous clients. Nevertheless, the server-side processing is also highly scalable and parallelizable. Although we have not performed formal empirical evaluations of its accuracy, internal studies have shown its recommendations to be useful even for multiple user households. TiVo architecture also allows for throttling of the server so if more server-side resources become available, more correlations can be computed on the server allowing TiVo to make recommendations for niche audiences.


Steve Durbin - March 31, 2005 - Subjectivity analysis

Learning Subjective Nouns using Extraction Pattern Bootstrapping
Ellen Riloff,* Janyce Wiebe,** Theresa Wilson**
*University of Utah, **University of Pittsburgh

Abstract:
We explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms. The goal of our research is to develop a system that can distinguish subjective sentences from objective sentences. First, we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns. Then we train a Naive Bayes classifier using the subjective nouns, discourse features, and subjectivity clues identified in prior research. The boot-strapping algorithms learned over 1000 subjective nouns, and the subjectivity classifier performed well, achieving 77% recall with 81% precision.


Steve Durbin - March 24, 2005 - Social Information Foraging

Footprints: History-Rich Tools for Information Foraging
Alan Wexelblat and Pattie Maes
MIT Media Lab

Abstract:
Inspired by Hill and Hollan's original work, we have been developing a theory of interaction history and building tools to apply this theory to navigation in a complex information space. We have built a series of tools--map, paths, annotations and signposts--based on a physical-world navigation metaphor. These tools have been in use for over a year. Our user study involved a controlled browse task and showed that users were able to get the same amount of work done with significantly less effort.


Ken Dyke - March 10, 2005 - Computer Security & AI

Anomaly Detection of Web-based Attacks
By Christopher Kruegel & Giovanni Vigna
Reliable Software Group University of California, Santa Barbara

Abstract:
Web-based vulnerabilities represent a substantial portion of the security exposures of computer networks. In order to detect known web-based attacks, misuse detection systems are equipped with a large number of signatures. Unfortunately, it is difficult to keep up with the daily disclosure of web-related vulnerabilities, and, in addition, vulnerabilities may be introduced by installation-specific web-based applications. Therefore, misuse detection systems should be complemented with anomaly detection systems. This paper presents an intrusion detection system that uses a number of different anomaly detection techniques to detect attacks against web servers and web-based applications. The system correlates the serverside programs referenced by client queries with the parameters contained in these queries. The application-specific characteristics of the parameters allow the system to perform focused analysis and produce a reduced number of false positives. The system derives automatically the parameter profiles associated with web applications (e.g., length and structure of parameters) from the analyzed data. Therefore, it can be deployed in very different application environments without having to perform time-consuming tuning and configuration.


Neal Richter - March 3, 2005 - Search Engine Security

The Insecure Indexing Vulnerability Attacks Against Local Search Engines
By Amit Klein
Security Researcher

Abstract:
This paper describes several techniques (many of them new) for exposing file contents using the site search functionality. It is assumed that a site contains documents which are not visible/accessible to external users. Such documents are typically future PR items, or future security advisories, uploaded to the website beforehand. However, the site is also searchable via an internal search facility, which does have access to those documents, and as such, they are indexed by it not via web crawling, but rather, via direct access to the files (and therein lies the security breach).

Several attack techniques are described, some very simple and quick, while other require an enormous amount of traffic; not all attacks are relevant to a particular site, as they depend on the richness of syntax supported by the site's search engine.

The paper concludes with methods to detect insecure indexing vulnerability and suggested solutions.

Note that this attack is fundamentally different than exploitation of external (remote) search engines.

Desktop search engines threaten SSL VPN security
By Author Tim Greene

Abstract:
Discusses how new Desktop Search tools from Google and others are a security threat.

Matrix of Desktop Search Engine Indexing Capabilities

Steve Durbin - February 24, 2005 - Knowledge-based search

Exploiting a Thesaurus-Based Semantic Net for Knowledge-Based Search
Peter Clark, John Thompson, Heather Holmback, Lisbeth Duncan
Mathematics and Computing Technology, The Boeing Company

Abstract:
With the growth of on-line information, the need for better resource location services is growing rapidly. A popular goal is to conduct search in terms of concepts, rather than words; however, this approach is frequently thwarted by the high up-front cost of building an adequate ontology (conceptual vocabulary) in the first place. In this paper we describe a knowledge-based Expert Locator application (for identifying human experts relevant to a particular problem or interest), which addresses this issue by using a large, pre-built, technical thesaurus as an initial ontology, combined with simple AI techniques of search, subsumption computation, and language processing. The application has been deployed and in use in our local organization since June, 1999, and a second, larger application was deployed in March 2000. We present the Expert Locator and the AI techniques it uses, and then we evaluate and discuss the application. The significance of this work is that it demonstrates how years of work by library science in thesaurus-building can be leveraged using AI methods, to construct a practical resource location service in a short period of time.

Neal Richter and Steve Durbin - February 17, 2005 - Knowledge discovery from the Web

Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitanyi
National Research Institute for Mathematics and Computer Science, Netherlands

Abstract:
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, emergency incidents, and we conduct a massive experiment in understanding WordNet categories; the ability to do a simple automatic English-Spanish translation.

Doug Warner and Steve Durbin - February 10, 2005 - Search systems with swarm optimization

User Behavior in an Adaptive Search System
Doug Warner, Stephen D. Durbin, J. Neal Richter, Z. Gedeon
RightNow Technologies

Abstract:
We analyze usage of a search system employing both standard search techniques and Ant Colony Optimization (ACO) approaches. Searching behavior on the system is similar to that on Internet search engines, though low searching rates suggest facilitation due to the ACO methods. Simulations based on site and user models give insight into the adaptive behavior and roughly match observations.

Zuzana Gedeon - February 3, 2005 - Learning ordering

Learning to Order Things
William W. Cohen, Robert E. Schapire, and Yoram Singer
AT&T Labs

Abstract:
There are many applications in which it is desirable to order rather than classify instances. Here we consider the problem of learning how to order, given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. We outline a two-stage approach in which one first learns by conventional means a preference function, of the form PREF(u,v), which indicates whether it is advisable to rank u before v. New instances are then ordered so as to maximize agreements with the learned preference function. We show that the problem of finding the ordering that agrees best with a preference function is NP-complete, even under very restrictive assumptions. Nevertheless, we describe a simple greedy algorithm that is guaranteed to find a good approximation. We then discuss an on-line learning algorithm, based on the "Hedge" algorithm, for finding a good linear combination of ranking "experts." We use the ordering algorithm combined with the on-line learning algorithm to find a combination of "search experts," each of which is a domain-specific query expansion strategy for a WWW search engine, and present experimental results that demonstrate the merits of our approach.

Jeff Elser and John Paxton - January 27, 2005 - Optimizing a search engine

Abstract:
ht://dig performs three major tasks that should be performed in the following order:

  1. Digging
    Before you can search, a database of all the documents that need to be searched has to be created.
  2. Merging
    Merging consists of two processes:
    A. Converting the databases of all documents to specialized databases for simple, fast searching.
    B. Merging changed information into previously existing databases.
    Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging.
  3. Searching
    Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches will be invoked by a CGI (Common Gateway Interface; a program running on the webserver) which gets input from the user through an HTML form.

Discussion Topics:
(1) An overview of ht://dig
(2) an exploratory discussion of how machine learning might be used to configure ht://dig
Reading: Browse the ht://dig website, http://www.htdig.org/

Steve Durbin - January 20, 2005 - Bayesian bots

Teaching Bayesian Behaviours to Video Game Characters
Ronan Le Hy, Anthony Arrigonia, Pierre Bessiere, Olivier Lebeltel
INRIA, France

Abstract:
This article explores an application of Bayesian Programming to behaviours for synthetic video games characters. We address the problem of real-time reactive selection of elementary behaviours for an agent playing a first person shooter game. We show how Bayesian Programming can lead to condensed and easier formalisation of finite state machine-like behaviour selection, and lend itself to learning by imitation, in a fully transparent way for the player.

Steve Durbin - December 9, 2004 - Perceptions of randomness

Randomness and Coincidences: Reconciling Intuition and Probability Theory
Thomas L. Griffiths & Joshua B. Tenenbaum
Department of Psychology, Stanford University

Abstract:
We argue that the apparent inconsistency between people's intuitions about chance and the normative predictions of probability theory, as expressed in judgments about randomness and coincidences, can be resolved by focussing on the evidence observations provide about the processes that generated them, rather than their likelihood. This argument is supported by probabilistic modeling of sequence and number production, together with two experiments that examine people's judgments about coincidences.

Zuzana Gedeon - December 2, 2004 - Clustering impossibility theorem

An Impossibility Theorem for Clustering
Jon Kleinberg
Department of Computer Science, Cornell University

Abstract:
Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.

Mike Emery and John Paxton - November 18, 2004 - Structure in multi-agent systems

Cougarr Agent Communities
Ronald D. Snyder and Douglas C. MacKenzie
Mobile Intelligence Corporation

Abstract:
The ability to organize agents into abstract groups provides a powerful tool for agent organization and communication in large multi-agent systems. In this paper we outline the fundamental characteristics required of a robust and scalable agent grouping mechanism. These characteristics are then discussed in the context of the Cougarr community infrastructure. This implementation of agent communities illustrates the use of distributed state management, distributed event propagation and abstract messaging in a high-performance agent architecture designed for robustness and scalability.

Steve Durbin - November 4, 2004 - Question-answering with soft text patterns

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan, and Tat-Seng Chua
Department of Computer Science, School of Computing, National University of Singapore

Abstract:
Breaking news often contains timely definitions and descriptions of current terms, organizations and personalities. We utilize such web sources to construct definitions for such terms. Previous work has identified definitions using hand-crafted rules or supervised learning that constructs rigid, hard text patterns. In contrast, we demonstrate a new approach that uses flexible, soft matching patterns to characterize definition sentences. Our soft patterns are able to effectively accommodate the diversity of definition sentence structure exhibited in news. We use pseudo-relevance feedback to automatically label sentences for use in soft pattern generation. The application of our unsupervised method significantly improves baseline systems on both the standardized TREC corpus as well as crawled online news articles by 27% and 30%, respectively, in terms of F measure. When applied to a state-of-art definition generation system recently fielded in the TREC 2003 definitional question answering task, it improves the performance by 14%.

Binhai Zhu - October 28, 2004 - Datamining in sparse databases

Efficient discovery of error-tolerant frequent itemsets in high dimensions
Cheng Yang
Dept. of Computer Science, Stanford University
Usama Fayyad and Paul S. Bradley
digiMine, Inc.

Abstract:
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.

Rafal Angryk, Steve Durbin - October 21, 2004 - Uncertain knowledge in databases and multi-agent communication

Current approaches to handling imperfect information in data and knowledge bases
Simon Parsons
Advanced Computation Laboratory, Imperial Cancer Research Fund

Abstract:
This paper surveys methods for representing and reasoning with imperfect information. It opens with an attempt to classify the different types of imperfection that may pervade data, and a discussion of the sources of such imperfections. The classification is then used as a framework for considering work that explicitly concerns the representation of imperfect information, and related work on how imperfect information may be used as a basis for reasoning. The work that is surveyed is drawn from both the field of databases and the field of artificial intelligence. Both of these areas have long been concerned with the problems caused by imperfect information, and this paper stresses the relationships between the approaches developed in each.

The Byzantine generals problem
Leslie Lamport, Robert Shostak, and Marshall Pease
SRI International

Abstract:
Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more that two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

Doug Warner - October 7, 2004 - Processing uncertain knowledge

How Humans Process Uncertain Knowledge: An Introduction for Knowledge Engineers
Robert F. Hink and David L. Woods

Abstract:
The question of how humans process uncertain information is important to the development of knowledge-based systems in terms of both knowledge acquisition and knowledge representation. This article reviews three bodies of psychological research that address this question: human perception, human probabilistic and statistical judgement, and human choice behavior. The general conclusion is that human behavior under uncertainty is often suboptimal and sometimes even fallacious. Suggestions for knowledge engineers in detecting and obviating such errors are discussed. The requirements for a system designed to reduce the effects of human factors in the processing of uncertain knowledge are introduced.

Steve Durbin - September 30, 2004 - Cognitive biases and bounded rationality

Chapter from the very readable CIA publication Psychology of Intelligence Analysis
Biases in Estimating Probabilities

Daniel Kahneman Nobel Prize lecture
Maps of Bounded Rationality: A Perspective on Intuitive Judgement and Choice

Multiple -