Colloquium Papers
Neal Richter - May 1, 2008 - Web Search Ranking
Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity
Mikhail Bilenko and Ryen W. White
Microsoft Research
Abstract:
The paper proposes identifying relevant information sources from
the history of combined searching and browsing behavior of many
Web users. While it has been previously shown that user interactions
with search engines can be employed to improve document
ranking, browsing behavior that occurs beyond search result
pages has been largely overlooked in prior work. The paper
demonstrates that users’ post-search browsing activity strongly
reflects implicit endorsement of visited pages, which allows estimating
topical relevance of Web resources by mining large-scale
datasets of search trails. We present heuristic and probabilistic
algorithms that rely on such datasets for suggesting authoritative
websites for search queries. Experimental evaluation shows that
exploiting complete post-search browsing trails outperforms alternatives
in isolation (e.g., clickthrough logs), and yields accuracy
improvements when employed as a feature in learning to rank for
Web search.
Mike Harrelson - April 24, 2008 - Neural networks
Confabulation theory
Dr. Robert Hecht-Nielsen
University of California, San Diego
Abstract:
Confabulation theory offers a comprehensive detailed explanation of the mechanism of
thought (i.e., “cognition”: vision, hearing, reasoning, language, planning, origination
of movement and thought processes, etc.) in humans and other vertebrates (and possibly
in invertebrates, such as octopi and bees). For expositional simplicity, only the human
case is considered here.
Steve Durbin - April 17, 2008 - Prediction Markets
Using Prediction Markets to Track Information Flows: Evidence from Google
Bo Cowgill*, Justin Wolfers**, and Eric Zitzewitz***
*Google; **Wharton School, Univ. of Pennsylvania; ***Dartmouth College
Abstract:
In the last 2.5 years, Google has conducted the largest corporate experiment with prediction
markets we are aware of. In this paper, we illustrate how markets can be used to study how an
organization processes information. We document a number of biases in Google’s markets,
most notably an optimistic bias. Newly hired employees are on the optimistic side of these
markets, and optimistic biases are significantly more pronounced on days when Google stock is
appreciating. We find strong correlations in trading for those who sit within a few feet of one
another; social networks and work relationships also play a secondary explanatory role. The
results are interesting in light of recent research on the role of optimism in entrepreneurial
firms, as well as recent work on the importance of geographical and social proximity in
explaining information flows in firms and markets.
Neal Richter - April 10, 2008 - Artificial Intelligence
AI Chasers
Patrick Tucker
World Future Society
Abstract:
The “holy grail” of computer science may be
within reach. A futurist looks toward tomorrow’s
Artificial Intelligence Revolution.
The article was based on
these interviews:
MIT roboticist Rodney Brooks, Adaptive A.I. Inc. founder Peter Voss,
Self-Aware Systems founder Steve Omohundro, Powerset CEO Barney Pell,
and Google research director Peter Norvig discuss how they see AI
developing in the years ahead, when a human-level AI might emerge,
and how worried we should be about that whole killer-robot-goes-on-rampage scenario.
Neal Richter - April 3, 2008 - Pattern Mining
Clustering Co-occurrences of Maximal Frequent Patterns in Streams
Edgar H. de Graaf, Joost N. Kok, Walter A. Kosters
Leiden Institute of Advanced Computer Science, Leiden University
Abstract:
One way of getting a better view of data is by using frequent
patterns. In this paper frequent patterns are (sub)sets that
occur a minimal number of times in a stream of itemsets.
However, the discovery of frequent patterns in streams has
always been problematic. Because streams are potentially
endless it is in principle impossible to say if a pattern is often
occurring or not. Furthermore, the number of patterns can
be huge and a good overview of the structure of the stream
is lost quickly. The proposed approach will use clustering to
facilitate the “online” analysis of the structure of the stream.
A clustering on the co-occurrence of patterns will give the user an improved view on the structure of the stream. Some patterns might occur so often together that they should form a combined pattern. In this way the patterns in the clustering will approximate the largest frequent patterns: maximal frequent patterns. The number of (approximated) maximal frequent patterns is much smaller and combined with clustering methods these patterns provide a good view on the structure of the stream.
Our approach to decide if patterns occur often together is based on a method of clustering where only the distance between pairs of patterns is known. This distance is the Euclidean distance between points in a 2-dimensional space, where the points represent the frequent patterns, or rather the most important ones. The coordinates are adapted when the records from the stream pass by, and reflect the real support of the corresponding pattern. In this setup the support is viewed as the number of occurrences in a time window. The main algorithm tries to maintain a dynamic model of the data stream by merging and splitting these patterns. Experiments show the versatility of the method.
Richard McAllister - March 27, 2008 - Pattern Mining
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
Jiawei Han*, Jian Pei**, Yiwei Yin***, Runying Mao****
*Univ. of Illinois, **SUNY, ***Simon Fraser Univ., ****Microsoft
Abstract:
Mining frequent patterns in transaction databases, time-series databases,
and many other kinds of databases has been studied popularly in data mining
research. Most of the previous studies adopt an Apriori-like candidate set
generation-and-test approach. However, candidate set generation is still costly,
especially when there exist a large number of patterns and/or long patterns.
In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-treebased mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
Steve Durbin - March 6, 2008 - Data commons
Urban Sensing: Out of the Woods
Dana Cuff, Mark Hansen, and Jerry Kang
UCLA
Abstract:
Embedded networked sensing,
having successfully shifted from
the lab to the environment, is
primed for a more contentious
move to the city to where citizens
will likely be the target of data
collection. This transition will
warrant careful study and touch
on issues that go far beyond
the scientific realm.
Neal Richter - February 28, 2008 - Contextual Advertising
Finding Advertising Keywords on Web Pages
Wentau Yih,* Joshua Goodman,* Vitor R. Carvalho**
*Microsoft Research, **Carnegie Mellon University
Abstract:
A large and growing number of web pages display contextual
advertising based on keywords automatically extracted
from the text of the page, and this is a substantial source
of revenue supporting the web today. Despite the importance
of this area, little formal, published research exists.
We describe a system that learns how to extract keywords
from web pages for advertisement targeting. The system
uses a number of features, such as term frequency of each
potential keyword, inverse document frequency, presence in
meta-data, and how often the term occurs in search query
logs. The system is trained with a set of example pages that
have been hand-labeled with “relevant” keywords. Based on
this training, it can then extract new keywords from previously
unseen pages. Accuracy is substantially better than
several baseline systems.
Richard McAllister - February 21, 2008 - Clustering
Hierarchical Document Clustering Using Frequent Itemsets
Benjamin C.M. Fung, Ke Wangy, and Martin Esterz
Simon Fraser University
Abstract:
A major challenge in document clustering is the extremely
high dimensionality. For example, the vocabulary for a
document set can easily be thousands of words. On the
other hand, each document often contains a small fraction
of words in the vocabulary. These features require special
handlings. Another requirement is hierarchical clustering
where clustered documents can be browsed according to the
increasing specificity of topics. In this paper, we propose
to use the notion of frequent itemsets, which comes from
association rule mining, for document clustering. The
intuition of our clustering criterion is that each cluster is
identified by some common words, called frequent itemsets,
for the documents in the cluster. Frequent itemsets are
also used to produce a hierarchical topic tree for clusters.
By focusing on frequent items, the dimensionality of the
document set is drastically reduced. We show that this
method outperforms best existing methods in terms of both
clustering accuracy and scalability.
Neal Richter - February 14, 2008 - Search Advertising
Optimal Delivery of Sponsored Search Advertisements Subject to Budget Constraints
Zoe Abrams, Ofer Mendelevitch, John A. Tomlin
Yahoo!
Abstract:
We discuss an auction framework in which sponsored search
advertisements are delivered in response to queries. In practice,
the presence of bidder budgets can have a significant
impact on the ad delivery process. We propose an approach
based on linear programming which takes bidder budgets
into account, and uses them in conjunction with forecasting
of query frequencies, and pricing and ranking schemes, to
optimize ad delivery. Simulations show significant improvements
in revenue and efficiency.
Steve Durbin - February 7, 2008 - Web economics
Three topics from The Technium, a book in progress
Kevin Kelly
How much does one search cost?
Steve Durbin - January 31, 2008 - Search
Trusted Search Communities
Peter Briggs and Barry Smyth
Adaptive Information Cluster, School of Computer Science and Informatics, University College Dublin
Abstract:
We describe a social search technique that harnesses the search
experiences of a community of searchers to generate result
recommendations, in a collaborative fashion, to complement
results returned from some underlying search engine. We
describe a dynamic model of trust as a way to coordinate
collaboration, and provide experimental results to show that
search performance improves as the network evolves.
Steve Durbin - December 6, 2007 - Design
Creativity Support Tools: Accelerating Discovery and Innovation
Ben Shneiderman
Dept. of Computer Science, Univ. of Maryland, College Park
Abstract:
How can designers of programming interfaces, interactive tools, and rich social
environments enable more people to be more creative more often?
Anthony Arnone - November 29, 2007 - Search
SOLR and Lucene
Abstract:
This week we’ll be covering the search engine Lucene and one of its main sub-projects, Solr. I’ll be giving an overview of the concepts behind these two packages, and how you can use them to easily add expansive search capability to just about any Java environment you want.
Lucene is a search engine written in Java that provides high scalability, fast searching, robust handling of generic document structures, and a highly pluggable architecture that makes adding features easy. Lucene has a very active development community and is consistently on the ‘cutting edge’ of search engine technologies.
Solr is a servlet-based front end for the Lucene search engine. It provides XML-based configuration and index schema management, APIs in XML/HTTP, master-slave style replication, caching, and a web based admin interface. It also includes a bunch of trendy features for modern search engines such as faceting, more-like-this, and JSON responses, along with a host of other cool features under development.
Anthony's Powerpoint presentation
Steve Durbin - October 25, 2007 - Ontologies
Deriving a Large Scale Taxonomy from Wikipedia
Simone Paolo Ponzetto and Michael Strube
EML Research, Germany
Abstract:
We take the category system in Wikipedia as a conceptual network.
We label the semantic relations between categories using
methods based on connectivity in the network and lexico-syntactic
matching. As a result we are able to derive a large
scale taxonomy containing a large amount of subsumption,
i.e. isa, relations. We evaluate the quality of the created resource
by comparing it with ResearchCyc, one of the largest
manually annotated ontologies, as well as computing semantic
similarity between words in benchmarking datasets.
Steve Durbin - October 18, 2007 - Text similarity
Improving Similarity Measures for Short Segments of Text
Wen-tau Yih and Christopher Meek
Microsoft Research
Abstract:
In this paper we improve previous work on measuring the
similarity of short segments of text in two ways. First, we introduce
a Web-relevance similarity measure and demonstrate
its effectiveness. This measure extends the Web-kernel similarity
function introduced by Sahami and Heilman (2006)
by using relevance weighted inner-product of term occurrences
rather than TF×IDF. Second, we show that one can
further improve the accuracy of similarity measures by using
a machine learning approach. Our methods outperform other
state-of-the-art methods in a general query suggestion task
for multiple evaluation metrics.
Neal Richter and Steve Durbin - October 4, 2007 - Information Retrieval
Search Engines that Learn from Implicit Feedback
Thorsten Joachims and Filip Radlinski
Cornell University
Abstract:
Search-engine logs provide a wealth of information that machine-learning techniques can
harness to improve search quality. With proper interpretations that avoid inherent biases,
a search engine can use training data extracted from the logs to automatically tailor
ranking functions to a particular user group or collection.
Neal Richter - September 27, 2007 - Information Retrieval
Learning to Rank for Information Retrieval Using Genetic Programming
Jen-Yuan Yeh*, Jung-Yi Lin*, Hao-Ren Ke**, Wei-Pang Yang***
*Dept. of Computer Science,
National Chiao Tung University
**Inst. of Information Management,
National Chiao Tung University
***Dept. of Information Management,
National Dong Hwa University
Abstract:
One central problem of information retrieval (IR) is to determine
which documents are relevant and which are not to the user
information need. This problem is practically handled by a
ranking function which defines an ordering among documents
according to their degree of relevance to the user query. This
paper discusses work on using machine learning to automatically
generate an effective ranking function for IR. This task is referred
to as "learning to rank for IR" in the field. In this paper, a learning
method, RankGP, is presented to address this task. RankGP
employs genetic programming to learn a ranking function by
combining various types of evidences in IR, including content
features, structure features, and query-independent features. The
proposed method is evaluated using the LETOR benchmark
datasets and found to be competitive with Ranking SVM and
RankBoost.
Steve Durbin - September 20, 2007 - Knowledge base organization
Collaborative Structuring: Organizing Document Repositories Effectively and
Efficiently
Harris Wu* and Michael Gordon**
*Old Dominion University; **University of Michigan
Abstract:
Improving how documents are managed and data is stored and
exchanged in systems.
Steve Durbin - September 13, 2007 - Chatbots
A Terrorism Question/Answer System
R. P. Schumaker,* Y. Liu,** M. Ginsburg,*** and H. Chen***
*IS Dept, Iona College; **IS Dept., Cal State Long Beach; ***Dept. of Managmement
Information Systems, U. of Arizona
Abstract:
The TARA Project examined how a trio of modified chatterbots
could be used to disseminate terrorism-related information
to the general public.
Ben Werner - August 23, 2007 - HCI
Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms
Hiroshi Ishii and Brygg Ullmer, MIT Media Lab
Abstract:
This paper presents our vision of Human Computer
Interaction (HCI): "Tangible Bits." Tangible Bits allows
users to "grasp & manipulate" bits in the center of users’
attention by coupling the bits with everyday physical
objects and architectural surfaces. Tangible Bits also
enables users to be aware of background bits at the
periphery of human perception using ambient display media
such as light, sound, airflow, and water movement in an
augmented space. The goal of Tangible Bits is to bridge
the gaps between both cyberspace and the physical
environment, as well as the foreground and background of
human activities.
This paper describes three key concepts of Tangible Bits:
interactive surfaces; the coupling of bits with graspable
physical objects; and ambient media for background
awareness. We illustrate these concepts with three
prototype systems – the metaDESK, transBOARD and
ambientROOM – to identify underlying research issues.
Steve Durbin - August 2, 2007 - Search
How Well does Result Relevance Predict Session Satisfaction?
Scott B. Huffman, Michael Hochster, Google, Inc.
Abstract:
Per-query relevance measures provide standardized, repeatable
measurements of search result quality, but they ignore
much of what users actually experience in a full search session.
This paper examines how well we can approximate a
user’s ultimate session-level satisfaction using a simple relevance
metric. We find that this relationship is surprisingly
strong. By incorporating additional properties of the query
itself, we construct a model which predicts user satisfaction
even more accurately than relevance alone.
Ben Werner - July 19, 2007 - Usability
Examining the Robustness of Sensor-Based Statistical Models of Human Interruptibility
James Fogarty, Scott E. Hudson, HCI Institute, Carnegie-Mellon
Jennifer Lai, IBM T.J. Watson Research Center
Abstract:
Current systems often create socially awkward interruptions
or unduly demand attention because they have no way of
knowing if a person is busy and should not be interrupted.
Previous work has examined the feasibility of using sensors
and statistical models to estimate human interruptibility in
an office environment, but left open some questions about
the robustness of such an approach. This paper examines
several dimensions of robustness in sensor-based statistical
models of human interruptibility. We show that real
sensors can be constructed with sufficient accuracy to drive
the predictive models. We also create statistical models for
a much broader group of people than was studied in prior
work. Finally, we examine the effects of training data
quantity on the accuracy of these models and consider
tradeoffs associated with different combinations of sensors.
As a whole, our analyses demonstrate that sensor-based
statistical models of human interruptibility can provide
robust estimates for a variety of office workers in a range of
circumstances, and can do so with accuracy as good as or
better than people. Integrating these models into systems
could support a variety of advances in human computer
interaction and computer-mediated communication.
Neal Richter - June 21, 2007 - Search Ranking
Google Keeps Tweaking Its Search Engine
Saul Hansell
New York Times
Abstract:
Article on the Search Quality group at Google
See also the keynote presentation at SIGIR 2005: Challenges in Running a Commercial Search Engine by Amit Singhal of Google
Neal Richter - June 7, 2007 - Search Engines
Why writing your own search engine is hard
Anna Patterson
Stanford University
Abstract:
There must be 4,000 programmers typing away in their basements
trying to build the next “world’s most scalable”
search engine. It has been done only a few times. It has
never been done by a big group; always one to four people
did the core work, and the big team came on to build the
elaborations and the production infrastructure. Why is it
so hard? We are going to delve a bit into the various issues
to consider when writing a search engine. This article is
aimed at those individuals or small groups that are considering
this endeavor for their Web site or intranet. It is fun, but
a word of caution: not only is it difficult, but you need two
commodities in short supply—time and patience.
Zuzana Gedeon - May 17, 2007 - Search
Multi-factor Clustering for a Marketplace Search Interface
Neel Sundaresan, Kavita Ganesan and Roopnath Grandhi
eBay Research Labs
Abstract:
Search engines provide a small window to the vast repository of
data they index and against which they search. They try their best
to return the documents that are of relevance to the user but often
a large number of results may be returned. Users struggle to
manage this vast result set looking for the items of interest.
Clustering search results is one way of alleviating this
navigational pain. In this paper we describe a clustering system
that enables clustering search results in an online marketplace
search system.
Adaptive Faceted Browser for Navigation in Open Information Spaces
Michal Tvarožek and Mária Bieliková
Faculty of Informatics and Information Technologies, Slovak University of Technology
Abstract:
Open information spaces have several unique characteristics
such as their changeability, large size, complexity and diverse
user base. These result in novel challenges during user
navigation, information retrieval and data visualization in
open information spaces. We propose a method of navigation
in open information spaces based on an enhanced
faceted browser with support for dynamic facet generation
and adaptation based on user characteristics.
Zuzana Gedeon, Hasari Tosun, Steve Durbin - April 12, 2007 - WWW 2007
Navigating the Intranet with High Precision
Huaiyu Zhu, Alexander Loser, Sriram Raghavan, and Shivakumar Vaithyanathan
IBM Almaden Research Center and SAP Research CEC
Abstract:
Despite the success of web search engines, search over large enterprise
intranets still suffers from poor result quality. Earlier work
that compared intranets and the Internet from the view point of keyword
search has pointed to several reasons why the search problem
is quite different in these two domains. In this paper, we address the
problem of providing high quality answers to navigational queries
in the intranet (e.g., queries intended to find product or personal
home pages, service pages, etc.). Our approach is based on offline
identification of navigational pages, intelligent generation of termvariants
to associate with each page, and the construction of separate
indices exclusively devoted to answering navigational queries.
Using a testbed of 5.5M pages from the IBM intranet, we present
evaluation results that demonstrate that for navigational queries,
our approach of using custom indices produces results of significantly
higher precision than those produced by a general purpose
search algorithm.
The Complex Dynamics of Collaborative Tagging
Harry Halpin, Valentin Robu, and Hana Shepherd
University of Edinburgh; Center for Math and CS (Amsterdam); Princeton University
Abstract:
The debate within the Web community over the optimal means
by which to organize information often pits formalized classications
against distributed collaborative tagging systems. A number
of questions remain unanswered, however, regarding the nature of
collaborative tagging systems including whether coherent categorization
schemes can emerge from unsupervised tagging by users.
This paper uses data from the social bookmarking site del.icio.us to
examine the dynamics of collaborative tagging systems. In particular,
we examine whether the distribution of the frequency of use
of tags for "popular" sites with a long history (many tags and many
users) can be described by a power law distribution, often characteristic
of what are considered complex systems. We produce a
generative model of collaborative tagging in order to understand
the basic dynamics behind tagging, including how a power law distribution
of tags could arise. We empirically examine the tagging
history of sites in order to determine how this distribution arises
over time and to determine the patterns prior to a stable distribution.
Lastly, by focusing on the high-frequency tags of a site where
the distribution of tags is a stabilized power law, we show how tag
co-occurrence networks for a sample domain of tags can be used
to analyze the meaning of particular tags given their relationship to
other tags.
Identifying Ambiguous Queries in Web Search
Ruihua Song, Zhenxiao Luo, Ji-Rong Wen, Yong Yu, and Hsiao-Wuen Hon
Shanghai Jiao Tong University, Microsoft Research Asia, Fudan University
Abstract:
It is widely believed that some queries submitted to search
engines are by nature ambiguous (e.g., java, apple). However, few
studies have investigated the questions of "how many queries are
ambiguous?" and "how can we automatically identify an
ambiguous query?" This paper deals with these issues. First, we
construct the taxonomy of query ambiguity, and ask human
annotators to manually classify queries based upon it. From
manually labeled results, we find that query ambiguity is to some
extent predictable. We then use a supervised learning approach to
automatically classify queries as being ambiguous or not.
Experimental results show that we can correctly identify 87% of
labeled queries. Finally, we estimate that about 16% of queries in
a real search log are ambiguous.
Steve Durbin - April 5, 2007 - Personalization
Open User Profiles for Adaptive News Systems: Help or Harm?
Jae-wook Ahn, Peter Brusilovsky, Jonathan Grady, Daqing He and Sue Yeon Syn
School of Information Sciences, University of Pittsburgh
Abstract:
Over the last five years, a range of projects have focused on
progressively more elaborated techniques for adaptive news
delivery. However, the adaptation process in these systems has
become more complicated and thus less transparent to the users.
In this paper, we concentrate on the application of open user
models in adding transparency and controllability to adaptive
news systems. We present a personalized news system,
YourNews, which allows users to view and edit their interest
profiles, and report a user study on the system. Our results
confirm that users prefer transparency and control in their systems,
and generate more trust to such systems. However, similar to
previous studies, our study demonstrate that this ability to edit
user profiles may also harm the system’s performance and has to
be used with caution.
Steve Durbin - March 29, 2007 - Reputation Systems
A Content-Driven Reputation System for the Wikipedia
B. Thomas Adler and Luca de Alfaro
University of California, Santa Cruz
Abstract:
We present a content-driven reputation system for
Wikipedia authors. In our system, authors gain reputation
when the edits they perform to Wikipedia articles are
preserved by subsequent authors, and they lose reputation
when their edits are rolled back or undone in short order.
Thus, author reputation is computed solely on the basis of
content evolution; user-to-user comments or ratings are not
used. The author reputation we compute could be used to
flag new contributions from low-reputation authors, or it
could be used to allow only authors with high reputation to
contribute to controversial or critical pages. A reputation
system for the Wikipedia could also provide an incentive for
high-quality contributions.
We have implemented the proposed system, and we have
used it to analyze the entire Italian and French Wikipedias,
consisting of a total of 691,551 pages and 5,587,523 revisions.
Our results show that our notion of reputation has
good predictive value: changes performed by low-reputation
authors have a significantly larger than average probability
of having poor quality, as judged by human observers, and
of being later undone, as measured by our algorithms
Steve Durbin - March 22, 2007 - Collective Intelligence
Web pundits trumpet social software as a means of generating knowledge by tapping into the wisdom of crowds. The following two articles offer contrarian perspectives.
The "Dumbness of Crowds"
Kathy Sierra
DIGITAL MAOISM: The Hazards of the New Online Collectivism
Jaron Lanier
Neal Richter - March 15, 2007 - Semantic Web
AI Meets Web 2.0: Building the Web of Tomorrow, Today
Jay M. Tenenbaum
Abstract:
Imagine an Internet-scale knowledge system where people and intelligent agents can
collaborate on solving complex problems in business, engineering, science, medicine,
and other endeavors. Its resources include semantically tagged websites, wikis, and
blogs, as well as social networks, vertical search engines, and a vast array of web
services from business processes to AI planners and domain models. Research prototypes
of decentralized knowledge systems have been demonstrated for years, but now, thanks
to the web and Moore’s law, they appear ready for prime time. This article introduces
the architectural concepts for incrementally growing an Internet-scale knowledge
system and illustrates them with scenarios drawn from e-commerce, e-science, and e-life.
Steve Durbin - March 8, 2007 - Tag Mining
Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development
Benjamin Szekely and Elias Torres
Harvard University
Abstract:
Online communities are an important forum for people to share information
on the internet. Historically, online communities have been centered
around free-text sharing and simple user rating systems. With the
shadow of the Semantic Web looming behind many new innovations
on the internet today, online communities are beginning to adopt semantic
techniques such as tagging and folksonomies as a vehicle for
the sharing and classification of information. Gourmetvillage.org is a
prototype online community that uses tagging to provide an innovative
restaurant rating mechanism. Users may provide free-text reviews and
tags for any aspect of a restaurant. UserRank is an algorithm based
on Google’s PageRank that provides a ranking of users based on
whose taggings are most often followed. TagRank provides a ranking
of tags based on the ranking of users. UserRank may in fact be used
to rank any entity in a community based on user association. del.icio.us
contains a significant dataset of users and tagged entities with which to
test these algorithms. UserRank and TagRank yield different rankings
than their count-based counterparts. When drilling down into the
results, the top user according to UserRank only made a few taggings,
yet hundreds of users in the system agreed with his taggings, showing
that these algorithms capture community consensus fairly well.
Steve Durbin - March 1, 2007 - Tag Mining
Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Paul Heyman and Hector Garcia-Molina
Computer Science Department, Stanford University
Abstract:
Collaborative tagging systems -- systems where many casual users annotate objects with free-form
strings (tags) of their choosing -- have recently emerged as a powerful way to label and organize
large collections of data. During our recent investigation into these types of systems, we discovered
a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects
in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm
and then present a preliminary model to explain why it is so effective in these types of systems.
Steve Durbin - February 15, 2007 - User interface
Clustering versus faceted categories for information exploration
Marti Hearst
School of Information Management and Systems
University of California, Berkeley
Abstract:
Information seekers often express a
desire for a user interface that organizes search
results into meaningful groups, in order to
help make sense of the results, and to help
decide what to do next. A longitudinal study
in which participants were provided with the
ability to group search results found they
changed their search habits in response to
having the grouping mechanism available [2].
There are many open research questions about how to generate useful groupings and how to design interfaces to support exploration using grouping. Currently two methods are quite popular: clustering and faceted categorization. Here, I describe both approaches and summarize their advantages and disadvantages based on the results of usability studies.
Steve Durbin - February 8, 2007 - User interface
Simplicity is Highly Overrated is a short essay on the notion of simplicity in user interfaces.
Cautious Cars and Cantankerous Kitchens: How Machines Take Control
Don Norman
Professor of Computer Science, Psychology, and Cognitive science
Northwestern University
Abstract:
The first chapter of The Design of Future Things, to be published in 2008.
It‘s about the ever-increasing role of automation in our homes and automobiles,
why it is being done so badly, with suggestions for doing it right through
what I call “natural interaction.”
Steve Durbin - February 1, 2007 - Ambient AI
Machine Learning, Reasoning, and Intelligence in Daily Life: Directions and Challenges
Eric Horvitz
Microsoft Research
Abstract:
Technical developments and trends are providing a
fertile substrate for creating and integrating machine
learning and reasoning into multiple applications
and services. I will review several illustrative
research efforts on our team, and focus on challenges,
opportunities, and directions with the
streaming of machine intelligence into daily life.
Steve Durbin - November 30, 2006 - Natural Language Understanding
Machine Reading
Oren Etzioni, Michele Banko, Michael J. Cafarella
Computer Science & Engineering, University of Washington
Abstract:
The time is ripe for the AI community to set its sights on
Machine Reading---the automatic, unsupervised
understanding of text. In this paper, we place the notion of
“Machine Reading” in context, describe progress towards
this goal by the KnowItAll research group at the University
of Washington, and highlight several central research
questions.
Steve Durbin - November 2, 2006 - Social Networks
The Dynamics of Viral Marketing
Jure Leskovec, Carnegie Mellon
Lada A. Adamic, University of Michigan
Bernardo A. Huberman, HP Labs
Abstract:
We present an analysis of a person-to-person recommendation network,
consisting of 4 million people who made 16 million recommendations on
half a million products. We observe the propagation of recommendations
and the cascade sizes, which we explain by a simple stochastic model.
We analyze how user behavior varies within user communities defined by a
recommendation network. Product purchases follow a 'long tail' where a
significant share of purchases belongs to rarely sold items. We establish
how the recommendation network grows over time and how effective it is
from the viewpoint of the sender and receiver of the recommendations.
While on average recommendations are not very effective at inducing
purchases and do not spread very far, we present a model that successfully
identfies communities, product and pricing categories for which viral
marketing seems to be very effective.
Steve Durbin - October 26, 2006 - User Behavior
The origin of bursts and heavy tails in human dynamics
Albert-Laszlo Barabasi
Center for Complex Networks Research and Dept. of Physics, University of Notre Dame
Abstract:
The dynamics of many social, technological and economic
phenomena are driven by individual human actions, turning
the quantitative understanding of human behaviour into a
central question of modern science. Current models of human
dynamics, used from risk assessment to communications, assume
that human actions are randomly distributed in time and thus
letters to nature well approximated by Poisson processes. In contrast, there
is increasing evidence that the timing of many human
activities, ranging from communication to entertainment and
work patterns, follow non-Poisson statistics, characterized by
bursts of rapidly occurring events separated by long periods of
inactivity. Here I show that the bursty nature of human
behaviour is a consequence of a decision-based queuing
process: when individuals execute tasks based on some perceived
priority, the timing of the tasks will be heavy tailed, with
most tasks being rapidly executed, whereas a few experience very
long waiting times. In contrast, random or priority blind
execution is well approximated by uniform inter-event statistics.
These finding have important implications, ranging from
resource management to service allocation, in both communications
and retail.
Hasari Tosun - October 19, 2006 - Recommender systems
Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion
Jun Wang, Arjen P. de Vries, and Marcel J. T. Reinders
Information and Communication Theory Group, Delft University of Technology
Abstract:
Memory-based methods for collaborative filtering predict new ratings by averaging
(weighted) ratings between, respectively, pairs of similar users or items.
In practice, a large number of ratings from similar users or similar items
are not available, due to the sparsity inherent to rating data. Consequently,
prediction quality can be poor. This paper reformulates the memory-based collaborative
filtering problem in a generative probabilistic framework, treating individual
user-item ratings as predictors of missing ratings. The final rating is estimated by
fusing predictions from three sources: predictions based on ratings of the same item
by other users, predictions based on different item ratings made by the same user, and,
third, ratings predicted based on data from other but similar users rating other but
similar items. Existing user-based and item-based approaches correspond to the two simple
cases of our framework. The complete model is however more robust to data sparsity,
because the different types of ratings are used in concert, while additional ratings
from similar users towards similar items are employed as a background model to smooth
the predictions. Experiments demonstrate that the proposed methods are indeed more
robust against data sparsity and give better recommendations.
David Tolliver - October 12, 2006 - Clustering
Note special time (Thursday 3:10) and place (1-139 Wilson)
Spectral Approximation Algorithms: Clustering and Graph Partitioning
David Tolliver
Google Fellow and Department of Computer Science, Carnegie Mellon University
Abstract:
I'll provide an overview of recent advances in spectral clustering and partitioning
technology. I'll briefly cover a fundamental result in clustering, and then delve
into the basic spectral clustering algorithm. I'll walk through a few variations
on the basic algorithm, highlighting the specific applications domains and drawbacks.
Finally I'll cover our proof of an approximation bound for the Normalized Cut (an
NP-hard objection function) for a simple spectral algorithm.
Steve Durbin - October 5, 2006 - Search
A Live-User Evaluation of Collaborative Web Search
Barry Smyth, Evelyn Balfe, Oisin Boydell, Keith Bradley, Peter Briggs, Maurice Coyle, Jill Freyne
Smart Media Institute, Department of Computer Science, University College Dublin
Abstract:
Collaborative Web search exploits repetition and regularity within the query-space
of a community of like-minded individuals in order to improve the quality of
search results. In short, search results that have been judged to be relevant
for past queries are promoted in response to similar queries that occur in the
future. In this paper we present the results of a large-scale evaluation of
this approach, in a corporate Web search scenario, which shows that significant
benefits are available to its users.
Steve Durbin - September 28, 2006 - Tagging
Visualizing Tags over Time
Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins
Yahoo! Research
Abstract:
We consider the problem of visualizing the evolution of tags within the Flickr
(flickr.com) online image sharing community. Any user of the Flickr service
may append a tag to any photo in the system. Over the past year, users have
on average added over a million tags each week. Understanding the evolution
of these tags over time is therefore a challenging task. We present a new
approach based on a characterization of the most interesting tags associated
with a sliding interval of time. An animation provided via Flash in a web
browser allows the user to observe and interact with the interesting tags as
they evolve over time.
New algorithms and data structures are required to support the efficient
generation of this visualization. We combine a novel solution to an interval
covering problem with extensions to previous work on score aggregation in order
to create an efficient backend system capable of producing visualizations at
arbitrary scales on this large dataset in real time.
Neal Richter - September 21, 2006 - Index Mining
Random Sampling from a Search Engine's Index
Best Paper Winner at WWW-2006
Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel
Maxim Gurevich, Department of Electrical Engineering, Technion, Israel
Abstract:
We revisit a problem introduced by Bharat and Broder al-
most a decade ago: how to sample random pages from a
search engine's index using only the search engine's public
interface? Such a primitive is particularly useful in creating
objective benchmarks for search engines.
The technique of Bharat and Broder suffers from two well
recorded biases: it favors long documents and highly ranked
documents. In this paper we introduce two novel sam-
pling techniques: a lexicon-based technique and a random
walk technique. Our methods produce biased sample docu-
ments, but each sample is accompanied by a corresponding
¿weight¿, which represents the probability of this document
to be selected in the sample. The samples, in conjunction
with the weights, are then used to simulate near-uniform
samples. To this end, we resort to three well known Monte
Carlo simulation methods: rejection sampling, importance
sampling and the Metropolis-Hastings algorithm.
We analyze our methods rigorously and prove that under
plausible assumptions, our techniques are guaranteed to pro-
duce near-uniform samples from the search engine's index.
Experiments on a corpus of 2.4 million documents substanti-
ate our analytical findings and show that our algorithms do
not have significant bias towards long or highly ranked doc-
uments. We use our algorithms to collect fresh data about
the relative sizes of Google, MSN Search, and Yahoo!.
Neal Richter - September 14, 2006 - Click-stream Data and Queries
Time-Dependent Semantic Similarity Measure of Queries Using Historical Click-Through Data
Qiankun Zhao, Steven C. H. Hoi, Tie-Yan Liu, Sourav S Bhowmick, Michael R. Lyu, Wei-Ying Ma
Nanyang Technological University, The Chinese University of HK, Microsoft
Research Asia
Abstract:
It has become a promising direction to measure similarity of
Web search queries by mining the increasing amount of clickthrough
data logged by Web search engines, which record
the interactions between users and the search engines. Most
existing approaches employ the click-through data for similarity
measure of queries with little consideration of the temporal
factor, while the click-through data is often dynamic
and contains rich temporal information. In this paper we
present a new framework of time-dependent query semantic
similarity model on exploiting the temporal characteristics
of historical click-through data. The intuition is that more
accurate semantic similarity values between queries can be
obtained by taking into account the timestamps of the log
data. With a set of user-defined calendar schema and calendar
patterns, our time-dependent query similarity model is
constructed using the marginalized kernel technique, which
can exploit both explicit similarity and implicit semantics
from the click-through data effectively. Experimental results
on a large set of click-through data acquired from a commercial
search engine show that our time-dependent query similarity
model is more accurate than the existing approaches.
Moreover, we observe that our time-dependent query similarity
model can, to some extent, reflect real-world semantics
such as real-world events that are happening over time.
Steve Durbin - September 7, 2006 - Ranking the web
Beyond PageRank: machine learning for static ranking
Matthew Richardson, Amit Prakash, and Eric Brill
Microsoft Research, Inc.
Abstract:
Since the publication of Brin and Page's paper on PageRank, many in the Web community
have depended on PageRank for the static (query-independent) ordering of Web pages.
We show that we can significantly outperform PageRank using features that are independent
of the link structure of the Web. We gain a further boost in accuracy by using data on
the frequency at which users visit Web pages. We use RankNet, a ranking machine learning
algorithm, to combine these and other static features based on anchor text and domain
characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3%
(vs. 56.7% for PageRank or 50% for random).
Steve Durbin - August 31, 2006 - Better answers
Retroactive Answering of Search Queries
Beverly Yang and Glen Jeh
Google, Inc.
Abstract:
Major search engines currently use the history of a user's actions (e.g., queries,
clicks) to personalize search results. In this paper, we present a new personalized
service, query-specific web recommendations (QSRs), that retroactively answers
queries from a user's history as new results arise. The QSR system addresses two
important subproblems with applications beyond the system itself: (1) Automatic
identification of queries in a user's history that represent standing interests
and unfulfilled needs. (2) Effective detection of interesting new results to these
queries. We develop a variety of heuristics and algorithms to address these problems,
and evaluate them through a study of Google history users. Our results strongly
motivate the need for automatic detection of standing interests from a user's history,
and identifies the algorithms that are most useful in doing so. Our results also
identify the algorithms, some which are counter-intuitive, that are most useful in
identifying interesting new results for past queries, allowing us to achieve very
high precision over our data set.
Sam Gardner - April 27, 2006 - Neural networks
Self-Organization of Hierarchical Visual Maps with Feedback Connections
Yiu Fai Sit and Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin
Abstract:
Visual areas in primates are known to have reciprocal connections. While
the feedforward bottom-up processing of visual information has been
studied extensively for decades, little is known about the role of the
feedback connections. Existing feedback models usually employ hand-coded
connections, and do not address how these connections develop. The model
described in this paper shows how feedforward and feedback connections
between cortical areas V1 and V2 can be learned through self-organization
simultaneously. Computational experiments show that both areas can form
hierarchical representations of the input with reciprocal connections that
link relevant cells in the two areas.
Mike Emery - April 20, 2006 - Self-organization
Decentralised Autonomic Computing: Analysing Self-Organising Emergent Behaviour
using Advanced Numerical Methods
Tom De Wolf, Giovanni Samaey, Tom Holvoet and Dirk Roose
Department of Computer Science, KU Leuven, Belgium
Abstract:
When designing decentralised autonomic computing systems, a fundamental engineering
issue is to assess systemwide behaviour. Such decentralised systems are characterised
by the lack of global control, typically consist of autonomous cooperating entities,
and often rely on self-organised emergent behaviour to achieve the requirements.
A well-founded and practically feasible approach to study overall system behaviour is
a prerequisite for successful deployment. On one hand, formal proofs of correct
behaviour and even predictions of the exact systemwide behaviour are practically
infeasible due to the complex, dynamic, and often non-deterministic nature of
self-organising emergent systems. On the other hand, simple simulations give no
convincing arguments for guaranteeing system-wide properties. We describe an
alternative approach that allows to analyse and assess trends in systemwide
behaviour, based on so-called equation-free macroscopic analysis. This
technique yields more reliable results about the system-wide behaviour, compared
to mere observation of simulation results, at an affordable computational cost.
Numerical algorithms act at the system-wide level and steer the simulations.
This allows to limit the amount of simulations considerably.
We illustrate the
approach by studying a particular system-wide property of a decentralised control
system for Automated Guided Vehicles and we outline a road map towards a general
methodology for studying decentralised autonomic computing systems.
Leif Wickland - April 13, 2006 - Web text mining
Deriving Marketing Intelligence from Online Discussion
Glance et al.
Intelliseek Applied Research Center
Abstract:
Weblogs and message boards provide online forums for discussion that record
the voice of the public. Woven into this mass of discussion is a wide range
of opinion and commentary about consumer products. This presents an opportunity
for companies to understand and respond to the consumer by analyzing this
unsolicited feedback. Given the volume, format and content of the data, the
appropriate approach to understand this data is to use large-scale web and
text data mining technologies. This paper argues that applications for mining
large volumes of textual data for marketing intelligence should provide two
key elements: a suite of powerful mining and visualization technologies and
an interactive analysis environment which allows for rapid generation and
testing of hypotheses. This paper presents such a system that gathers and
annotates online discussion relating to consumer products using a wide variety
of state-of-the-art techniques, including crawling, wrapping, search, text
classification and computational linguistics. Marketing intelligence is
derived through an interactive analysis framework uniquely configured to
leverage the connectivity and content of annotated online discussion.
Neal Richter - April 6, 2006 - Social Networking
System of Mobile Agents to Model Social Networks
Gonzalez, Marta C., Lind, Pedro G. and Herrmann, Hans J.
Univ Stuttgart, Departamento de Fisica, Universidade Federal do Ceara
Abstract:
We propose a model of mobile agents to construct social networks, based on a
system of moving particles by keeping track of the collisions during their
permanence in the system. We reproduce not only the degree distribution,
clustering coefficient and shortest path length of a large data base of
empirical friendship networks recently collected, but also some features
related with their community structure. The model is completely characterized
by the collision rate and above a critical collision rate we find the emergence
of a giant cluster in the universality class of two-dimensional percolation.
Moreover, we propose possible schemes to reproduce other networks of particular
social contacts, namely sexual contacts.
Bob Wall - March 30, 2006 - Models of Neural Processing
Computation in the Higher Visual Cortices: Map-Seeking Circuit Theory and
Application to Machine Vision
David Arathorn
Center for Computational Biology, MSU, and General Intelligence Corporation
Abstract:
Map-Seeking Circuit theory is a biologically based computational theory of vision
applicable to difficult machine vision problems such as recognition of 3D
objects in arbitrary poses amid distractors and clutter, as well as to
non-recognition problems such as terrain interpretation. It provides a general
computational mechanism for tractable discovery of correspondences in massive
transformation spaces by exploiting an ordering property of superpositions.
The latter allows a set of transformations of an input image to be formed into
a sequence of superpositions which are then culled to a composition of single
mappings by a competitive process which matches each superposition against a
superposition of inverse transformations of memory patterns. The architecture
that performs this is based on a number of neuroanatomical features of the visual
cortices, including reciprocal dataflows and inverse mappings.
Steve Durbin - March 9, 2006 - Spelling correction
Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak
University of Alberta
Abstract:
Applying the noisy channel model to search query spelling correction requires an
error model and a language model. Typically, the error model relies on a weighted
string edit distance measure. The weights can be learned from pairs of misspelled
words and their corrections. This paper investigates using the Expectation
Maximization algorithm to learn edit distance weights directly from search query
logs, without relying on a corpus of paired words.
Dave Kortenkamp - March 3, 2006 - Robotics
Note special time (Friday 3:10) and place, 101 Roberts
Worlds to Explore: Autonomy Challenges for Human Space Flight
Abstract:
This talk will look at two distinct artificial intelligence challenges in human
space flight. The first challenge can be compared to building HAL 9000 from
the movie 2001: A Space Odyssey, that is, autonomous monitoring and control of
long duration mission functions. The second challenge can be compared to building
Lt. Cmdr. Data from Star Trek: The Next Generation, that is, autonomous control
of highly dexterous robots. Both challenges will be highlighted by showing research
being done at NASA Johnson Space Center. This research includes autonomous control
of advanced life support system tests and control of a humanoid robot called Robonaut.
The talk will include many videos of NASA's robotics research.
Steve Durbin - February 23, 2005 - Language evolution
The "Lexical Contract": Modeling the emergence of word pronunciations
???
Abstract:
Even in the absence of writing, human languages appear to develop and maintain
vocabularies of roughly 10,000 morphemes, and at least 100,000 words or phrases
whose meaning is not predictable from their constituent parts. This presents
a formidable task for the individual language learner, who must learn words
from a small number of uses in context, and must manage to remember such a large
inventory of arbitrary pronunciations.
On a different scale of space and time, this also presents an extraordinary
challenge for the speech community, which must must somehow form and maintain
a consensus about these tens of thousands of complex and arbitrary conventions.
This is accomplished despite the fact that the conventional knowledge at stake
is tacit: the individuals involved typically don't know what they know about
word pronunciation, and don't try to discuss the details of this knowledge.
...
We will try to find simple programs for interacting agents that lead to successful
emergence of patterns of word pronunciation. As in the case of "ant colony
optimization", we'll look for algorithms that can accomplish this without
requiring any designation of authorities or any explicit negotiation to reach
consensus.
Zuzana Gedeon - February 16, 2005 - Text clustering
Topic-Driven Clustering for Document Datasets
Ying Zhao and George Karypis
Dept. of Computer Science and Engineering, Univ. of Minnesota
Abstract:
In this paper, we define the problem of topic-drive clustering, which organizes
a document collection according to a set of topics. We propose three topic-driven
schemes that consider the similarity between documents and topics and the
relationship among documents themselves simultaneously. We present a comprehensive
experimental evaluation of the proposed topic-driven schemes on five datasets.
Our experimental results show that the proposed topic-driven schemes are efficient
and effective with topic prototypes of different levels of specificity.
Steve Durbin - February 9, 2005 - Language Evolution
Language evolution: consensus and controversies
Morten H. Christiansen* and Simon Kirby**
*Cornell University and **Edinburgh University
Abstract:
Why is language the way it is? How did language come to be this way? And why
is our species alone in having complex language? These are old unsolved questions
that have seen a renaissance in the dramatic recent growth in research being
published on the origins and evolution of human language. This review provides
a broad overview of some of the important current work in this area. We highlight
new methodologies (such as computational modeling), emerging points of consensus
(such as the importance of pre-adaptation), and the major remaining controversies
(such as gestural origins of language). We also discuss why language evolution
is such a difficult problem, and suggest probable directions research may take
in the near future.
Neal Richter - February 2, 2005 - Text Mining with Genetic Algorithms
A Semantically Guided and Domain-Independent
Evolutionary Model for Knowledge Discovery
From Texts
John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken
Abstract:
We present a novel evolutionary model for knowledge
discovery from texts (KDTs), which deals with issues concerning
shallow text representation and processing for mining purposes in
an integrated way. Its aims is to look for novel and interesting explanatory
knowledge across text documents. The approach uses
natural language technology and genetic algorithms to produce explanatory
novel hypotheses. The proposed approach is interdisciplinary,
involving concepts not only from evolutionary algorithms
but also from many kinds of text mining methods. Accordingly,
new kinds of genetic operations suitable for text mining are proposed.
The principles behind the representation and a new proposal
for using multiobjective evaluation at the semantic level are
described. Some promising results and their assessment by human
experts are also discussed which indicate the plausibility of the
model for effective KDT.
Steve Durbin - January 26, 2005 - Bayesian Modelling
Bayesian models of human action understanding
Chris L. Baker, Joshua B. Tenenbaum and Rebecca R. Saxe
Dept. of Brain and Cognitive Sciences, MIT
Abstract:
We present a Bayesian framework for explaining how people reason about and
predict the actions of an intentional agent, based on observing its behavior.
Action-understanding is cast as a problem of inverting a probabilistic generative
model, which assumes that agents tend to act rationally in order to achieve their
goals given the constraints of their environment. Working in a simple sprite-world
domain, we show how this model can be used to infer the goal of an agent and
predict how the agent will act in novel situations or when environmental constraints
change. The model provides a qualitative account of several kinds of inferences
that preverbal infants have been shown to perform, and also fits quantitative
predictions that adult observers make in a new experiment.
Reggie Mead - January 19, 2005 - Modelling
A Review of AI Techniques for Ecological Modelling
Reginald Mead
Dept. of Computer Science, MSU and RightNow Technologies
Abstract:
Environmental models are used to both further scientific understanding of environmental
processes and to aid in environmental decision making. Artificial intelligence
techniques can enhance the approaches that have traditionally been followed for
modelling. Generally, ecological modelling approaches can be classified as either
process-based, empirical-based, or a combination of the two. Process-based models
use an understanding of underlying sub-processes to model a system. Expert systems
can be constructed from expert knowledge and can be used for prediction, planning,
interpretation, and more. Process-based models have high explanatory depth, but also
tend to have low predictive power. Empirical-based models explicitly exclude any a priori
knowledge of the system, and instead model the system based on observation data alone.
...
Rick Sojda - December 8, 2005 - Modelling
An Artificial Intelligence Modelling Approach to Simulating Animal/Habitat Interactions
H. Saarenmaa l, N.D. Stone 2,4, L.J. Folse 3, J.M. Packard 3, W.E. Grant 3, M.E. Makela 2 and R.N. Coulson 2
1 The Finnish Forest Research Institute
Departments of 2 Entomology, and 3 Wildlife and Fisheries Science, Texas A&M University
Abstract:
Ecological modellers have begun to recognize the potential of object-oriented
programming techniques in structuring models. However, little has been done to
take advantage of artificial intelligence's (AI) symbolic representations to
model the decision-making processes of animals. Here, a generic model of
animal-habitat interaction and a specific model of moose-, Alces alces L., forest
interactions in Finland are described that are event-driven and behavior-based.
Individual level simulation is accomplished through an object-oriented knowledge
representation scheme and AI techniques to implement a hierarchical decision-making
model of behavior. The habitat is likewise represented in an object-oriented
scheme, allowing the simulation of a heterogeneous environment. Other AI techniques
for modelling behavior, memory, and actions are discussed including LISP methods,
rule-based reasoning, and several search algorithms. Simulations of the moose-forest
system show the power of this approach but are not intended to advance the theory
of large-herbivore behavior and foraging. AI techniques are found to be most
beneficial in (a) studying population processes based on individual level models
of behavior, (b) modelling spatial heterogeneity, (c) building event-driven models,
(d) providing a conceptual clarity to model construction, and (e) providing a
structure equally well suited to simulating resource management.
Neal Richter - December 1, 2005 - Implicit Feedback
A Study of Factors Affecting the Utility of Implicit Relevance Feedback
Ryen W. White,* Ian Ruthven,** and Joemon M. Jose^
*Human-Computer Interaction Laboratory, University of Maryland
**Dept. of Computer and Information Sciences, University of Strathclyde
^Dept of Computing Science, University of Glasgow
Abstract:
Implicit relevance feedback (IRF) is the process by which a search system
unobtrusively gathers evidence on searcher interests from their interaction
with the system. IRF is a new method of gathering information on user interest
and, if IRF is to be used in operational IR systems, it is important to establish
when it performs well and when it performs poorly. In this paper we investigate
how the use and effectiveness of IRF is affected by three factors: search
task complexity, the search experience of the user and the stage in the search.
Our findings suggest that all three of these factors contribute to the utility of IRF.
Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims,* Laura Granka,** Bing Pan,^ Helene Hembrooke,^ and Geri Gay^
*Dept. of Computer Science, Cornell University
**Dept. of Communication, Stanford University
^Information Science, Cornell University
Abstract:
This paper examines the reliability of implicit feedback generated from
clickthrough data in WWW search. Analyzing the users' decision process
using eyetracking and comparing implicit feedback against manual relevance
judgments, we conclude that clicks are informative but biased. While this
makes the interpretation of clicks as absolute relevance judgments difficult,
we show that relative preferences derived from clicks are reasonably accurate
on average.
Bob Wall - November 17, 2005 - Clustering for Information Retrieval
Creating Concept Hierarchies in an Information Retrieval System
Bob Wall,*^ Neal Richter,*^ and Rafal A. Angryk^
*RightNow Technologies
^Dept. of Computer Science, Montana State University
Abstract:
Most information retrieval systems are comprised of a focused set of domain-specific
documents located within a single logical repository. A mechanism is developed by
which user queries against such a system are used to generate a concept hierarchy
pertinent to the domain. First, an algorithm is described which extracts terms
from documents matching user queries, and then reduces this set of terms to a
manageable length. The resulting terms are used to generate a feature vector for
each query, and the queries are clustered using a Hierarchical Agglomerative Clustering
(HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is not
particularly amenable to use by humans and which is slow to search due to its depth,
so a subsequent processing step applies min-max partitioning to form a shallower,
bushier tree that is a more natural representation of the concept hierarchy inherent in the system.
Steve Durbin - November 10, 2005 - Web computing
Learning from THE WEB (Word doc)
Learning from THE WEB (ACM Queue online)
Adam Bosworth
Google
Abstract:
The Web has much to teach us about managing and modeling distributed data.
It's time we began listening.
Steve Durbin - November 3, 2005 - Web search
Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, and Junghoo Cho
Dept. of Computer Science, University of California, Los Angeles
Abstract:
There have been recent interests in studying the "goal" behind
a user's Web query, so that this goal can be used to improve
the quality of a search engine's results. Previous studies
have mainly focused on using manual query-log investigation
to identify Web query goals. In this paper we study
whether and how we can automate this goal-identification
process. We first present our results from a human subject
study that strongly indicate the feasibility of automatic
query-goal identification. We then propose two types of features
for the goal-identification task: user-click behavior and
anchor-link distribution. Our experimental evaluation shows
that by combining these features we can correctly identify
the goals for 90% of the queries studied.
Steve Durbin - October 27, 2005 - Visualization
Arc Diagrams: Visualizing Structure in Strings
Martin Wattenberg
IBM Research
Abstract:
This paper introduces a new visualization method, the arc diagram , which is
capable of representing complex patterns of repetition in string data. Arc diagrams
improve over previous methods such as dotplots because they scale efficiently
for strings that contain many instances of the same subsequence. This paper
describes design and implementation issues related to arc diagrams and shows
how they may be applied to visualize such diverse data as music, text, and
compiled code.
Leif Wickland - October 20, 2005 - Google File System
Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
Abstract:
We have designed and implemented the Google File System, a scalable distributed
file system for large distributed data-intensive applications. It provides fault
tolerance while running on inexpensive commodity hardware, and it delivers high
aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design
has been driven by observations of our application workloads and technological
environment, both current and anticipated, that reflect a marked departure from
some earlier file system assumptions. This has led us to reexamine traditional
choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed
within Google as the storage platform for the generation and processing of data
used by our service as well as research and development efforts that require large
data sets. The largest cluster to date provides hundreds of terabytes of storage
across thousands of disks on over a thousand machines, and it is concurrently
accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support
distributed applications, discuss many aspects of our design, and report measurements
from both micro-benchmarks and real world use.
Zuzana Gedeon - October 13, 2005 - Machine learning and speech recognition for evil purposes
Keyboard Acoustic Emanations Revisited
Li Zhuang, Feng Zhou, and J. D. Tygar
University of California, Berkeley
Abstract:
We examine the problem of keyboard acoustic emanations. We present a novel
attack taking as input a 10-minute sound recording of a user typing English
text using a keyboard, and then recovering up to 96% of typed characters.
There is no need for a labeled training recording. Moreover the recognizer
bootstrapped this way can even recognize random text such as passwords:
In our experiments, 90% of 5-character random passwords using only letters can
be generated in fewer than 20 attempts by an adversary; 80% of 10-character
passwords can be generated in fewer than 75 attempts. Our attack uses the
statistical constraints of the underlying content, English language, to
reconstruct text from sound recordings without any labeled training data.
The attack uses a combination of standard machine learning and speech
recognition techniques, including cepstrum features, Hidden Markov Models,
linear classification, and feedback-based incremental learning.
Anthony Arnone - September 29, 2005 - Web Crawling
User-Centric Web Crawling
Sandeep Pandey and Christopher Olston
Carnegie Mellon University
Abstract:
Search engines are the primary gateways of information access on the Web today.
Behind the scenes, search engines crawl the Web to populate a local indexed
repository of Web pages, used to answer user search queries. In an aggregate sense,
the Web is very dynamic, causing any repository of Web pages to become out of
date over time, which in turn causes query answer quality to degrade. Given the
considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not
feasible for a search engine to maintain its repository exactly synchronized with the Web.
In this paper we study how to schedule Web pages for selective (re)downloading into
a search engine repository. The scheduling objective is to maximize the quality of
the user experience for those who query the search engine. We begin with a
quantitative characterization of the way in which the discrepancy between the
content of the repository and the current content of the live Web impacts the
quality of the user experience. This characterization leads to a usercentric
metric of the quality of a search engine's local repository. We use this metric
to derive a policy for scheduling Web page (re)downloading that is driven by
search engine usage and free of exterior tuning parameters. We then focus on the
important subproblem of scheduling refreshing of Web pages already present in the
repository, and show how to compute the priorities efficiently. We provide extensive
empirical comparisons of our user-centric method against prior Web page refresh
strategies, using real Web data. Our results demonstrate that our method requires
far fewer resources to maintain same search engine quality level for users, leaving
substantially more resources available for incorporating new Web pages into the search repository.
Neal Richter - September 22, 2005 - Language learning
Unsupervised Learning of Natural Languages
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman,
Tel Aviv University, and Cornell University
Abstract:
We address the problem, fundamental to linguistics, bioinformatics, and certain
other disciplines, of using corpora of raw symbolic sequential data to infer
underlying rules that govern their production. Given a corpus of strings (such
as text, transcribed speech, chromosome or protein sequence data, sheet music,
etc.), our unsupervised algorithm recursively distills from it hierarchically
structured patterns. The ADIOS (automatic distillation of structure) algorithm
relies on a statistical method for pattern extraction and on structured
generalization, two processes that have been implicated in language
acquisition. It has been evaluated on artificial context-free grammars with
thousands of rules, on natural languages as diverse as English and Chinese, and
on protein data correlating sequence with function. This unsupervised algorithm
is capable of learning complex syntax, generating grammatical novel sentences,
and proving useful in other fields that call for structure discovery from raw
data, such as bioinformatics.
Steve Durbin - September 15, 2005 - Web text mining
The Predictive Power of Online Chatter
Daniel Gruhl, R. Guha*, Ravi Kumar, Jasmine Novak and Andrew Tomkins
IBM Almaden Research, *Google
Abstract:
An increasing fraction of the global discourse is migrating
online in the form of blogs, bulletin boards, web pages, wikis,
editorials, and a dizzying array of new collaborative technologies.
The migration has now proceeded to the point
that topics reflecting certain individual products are sufficiently
popular to allow targeted online tracking of the ebb
and flow of chatter around these topics. Based on an analysis
of around half a million sales rank values for 2,340 books
over a period of four months, and correlating postings in
blogs, media, and web pages, we are able to draw several
interesting conclusions.
First, carefully hand-crafted queries produce matching postings
whose volume predicts sales ranks. Second, these queries
can be automatically generated in many cases. And third,
even though sales rank motion might be difficult to predict
in general, algorithmic predictors can use online postings to
successfully predict spikes in sales rank.
Steve Durbin - September 8, 2005 - Robot swarms
Dynamic Task Assignment in Robot Swarms
James McLurkin and Daniel Yamins*
Computer Science and Artificial Intelligence Lab, MIT, *Dept. of Mathematics, Harvard University
Abstract:
A large group of robots will often be partitioned into subgroups, each subgroup
performing a different task. This paper presents four distributed algorithms
for assigning swarms of homogenous robots to subgroups to meet a specified
global task distribution. Algorithm Random-Choice selects tasks randomly,
but runs in constant time. Algorithm Extreme-Comm compiles a complete inventory
of all the robots on every robot, runs quickly, but uses a great deal of
communication. The CardDealer's algorithm assigns tasks to individual robots
sequentially, using minimal communications but a great deal of time. The
Tree-Recolor algorithm is a compromise between Extreme-Comm and Card-Dealer's,
balancing communications use and running time. The three deterministic
algorithms drive the system towards the desired assignment of subtasks with
high accuracy. We implement the algorithms on a group of 25 iRobot SwarmBots,
and collect and analyze performance data.
Photos and video of robot swarm behaviors
Neal Richter - August 18, 2005 - Michael Collins's Talk on NLP at the University of Washington.
Abstract:
Natural Language Processing
Natural language processing offers a rich problem domain for machine learning
approaches. Many NLP problems require the induction of a mapping that involves
complex, discrete structures such as strings, labeled sequences, or trees. In
this distinguished lecture, Michael Collins describes how 'large margin'
methods in machine learning can be generalized to 'structured' problems found
in NLP.
Speaker(s): Michael Collins, Massachusetts Institute of Technology
Subject: Natural Language processing
Series: CSE Colloquia - 2005
Produced by: University of Washington, December 7, 2005
Runtime: 00:58:16
Streaming
Video at UWTV.org
Steve Durbin - August 4, 2005 - Personal information management
The Haystack Project is investigating approaches designed to let people manage their information in ways that make the most sense to them. Learn more at Haystack Project
The Re:Search Engine: Helping People Return to Information on the Web
Jaime Teevan
MIT
Abstract:
Re-finding information is commonly cited as a problem on the Web. One reason
re-finding on the Web is difficult is that while people rely on a considerable
amount of context to return to information (e.g., the original path taken to
it), the Web makes no guarantee that the context will remain static. The Re:Search
Engine is designed to help people return to information in the dynamic environment
of the Web by maintaining consistency in the search results it returns across
time. For example, if Connie, while looking to purchase a Global Positioning System,
found several systems she liked via a search for GPS , she would expect to be able
to use the same query to locate the exact same systems again. However, simply
returning the original result list when she re-issues the query might omit newly
available GPS systems that she would like to see. The ideal result list would contain
both the systems Connie remembers having seen and high quality new systems. Because
people tend to remember little of what is presented in a result list, when a person
repeats a query, the Re:Search Engine can preserve what is remembered about the
original result set while still presenting new information.
Neal Richter - May 19th and 26th, 2005 - Steven Wolfram's Talk at University of Michigan.
On October 8, 2003 Stephen Wolfram gave a talk sponsored by
University of Michigan Center for Study of Complex Systems titled and
about his (then new) book A New Kind of Science.
Wolfram.wmv -
Windows Media Player Movie File
Abstract:
The foundational observation of "A New Kind of Science" is that very simple
systems are very frequently capable of great complexity. While isolated
examples of this have been known for millenia, no fundamental significance was
attached to them.
The surprising fact is that if one just starts enumerating simple computational systems, one very quickly encounters great complexity. This is robust fact that does not depend on the details of the system in question, but rather seems to be a generic law of our universe.
The fact that complexity is in fact rather easy to generate suggests an entirely new approach to science. The traditional path of science is to seek out examples of behavior in the natural world and try to break them down into understandable pieces. But an alternative path is to just systematically enumerate simple abstract computational systems, starting with the very simplest, and try to understand what they do.
Links about the Book (and controversy surrounding it):
- http://www.wolframscience.com/ - The official webiste of the book.
- Wikipedia entry on A New Kind of Science
- http://gnosticalturpitude.org/archives/000026.html
Steve Durbin - April 7, 2005 - Recommender systems
TiVo: Making Show Recommendations Using a Distributed Collaborative Filtering Architecture
Kamal Ali,* Wijnand van Stam**
*TiVo, Yahoo!; **TiVo
Abstract:
We describe the TiVo television show collaborative recommendation system which has been
fielded in over one million TiVo clients for four years. Over this install base,
TiVo currently has approximately 100 million ratings by users over approximately
30,000 distinct TV shows and movies. TiVo uses an item-item (show to show) form of
collaborative filtering which obviates the need to keep any persistent memory of each
user viewing preferences at the TiVo server. Taking advantage of TiVo client-server
architecture has produced a novel collaborative filtering system in which the server
does a minimum of work and most work is delegated to the numerous clients. Nevertheless,
the server-side processing is also highly scalable and parallelizable. Although we have
not performed formal empirical evaluations of its accuracy, internal studies have shown
its recommendations to be useful even for multiple user households. TiVo architecture
also allows for throttling of the server so if more server-side resources become
available, more correlations can be computed on the server allowing TiVo to make
recommendations for niche audiences.
Steve Durbin - March 31, 2005 - Subjectivity analysis
Learning Subjective Nouns using Extraction Pattern Bootstrapping
Ellen Riloff,* Janyce Wiebe,** Theresa Wilson**
*University of Utah, **University of Pittsburgh
Abstract:
We explore the idea of creating a subjectivity classifier that uses lists of
subjective nouns learned by bootstrapping algorithms. The goal of our
research is to develop a system that can distinguish subjective sentences
from objective sentences. First, we use two bootstrapping algorithms
that exploit extraction patterns to learn sets of subjective nouns.
Then we train a Naive Bayes classifier using the subjective nouns,
discourse features, and subjectivity clues identified in prior research.
The boot-strapping algorithms learned over 1000 subjective nouns, and the
subjectivity classifier performed well, achieving 77% recall with 81% precision.
Steve Durbin - March 24, 2005 - Social Information Foraging
Footprints: History-Rich Tools for Information Foraging
Alan Wexelblat and Pattie Maes
MIT Media Lab
Abstract:
Inspired by Hill and Hollan's original work, we have been developing a theory
of interaction history and building tools to apply this theory to navigation
in a complex information space. We have built a series of tools--map, paths,
annotations and signposts--based on a physical-world navigation metaphor.
These tools have been in use for over a year. Our user study involved a
controlled browse task and showed that users were able to get the same amount
of work done with significantly less effort.
Ken Dyke - March 10, 2005 - Computer Security & AI
Anomaly Detection of Web-based Attacks
By Christopher Kruegel & Giovanni Vigna
Reliable Software Group
University of California, Santa Barbara
Abstract:
Web-based vulnerabilities represent a substantial portion of the
security exposures of computer networks. In order to detect known web-based
attacks, misuse detection systems are equipped with a large number of
signatures. Unfortunately, it is difficult to keep up with the daily disclosure
of web-related vulnerabilities, and, in addition, vulnerabilities may be
introduced by installation-specific web-based applications. Therefore, misuse
detection systems should be complemented with anomaly detection systems. This
paper presents an intrusion detection system that uses a number of different
anomaly detection techniques to detect attacks against web servers and
web-based applications. The system correlates the serverside programs
referenced by client queries with the parameters contained in these queries.
The application-specific characteristics of the parameters allow the system to
perform focused analysis and produce a reduced number of false positives. The
system derives automatically the parameter profiles associated with web
applications (e.g., length and structure of parameters) from the analyzed data.
Therefore, it can be deployed in very different application environments
without having to perform time-consuming tuning and configuration.
Neal Richter - March 3, 2005 - Search Engine Security
The Insecure Indexing Vulnerability
Attacks Against Local Search Engines
By Amit Klein
Security Researcher
Abstract:
This paper describes several techniques (many of them new) for exposing file
contents using the site search functionality. It is assumed that a site
contains documents which are not visible/accessible to external users. Such
documents are typically future PR items, or future security advisories,
uploaded to the website beforehand. However, the site is also searchable via an
internal search facility, which does have access to those documents, and as
such, they are indexed by it not via web crawling, but rather, via direct
access to the files (and therein lies the security breach).
Several attack techniques are described, some very simple and quick, while other require an enormous amount of traffic; not all attacks are relevant to a particular site, as they depend on the richness of syntax supported by the site's search engine.
The paper concludes with methods to detect insecure indexing vulnerability and suggested solutions.
Note that this attack is fundamentally different than exploitation of external (remote) search engines.
Desktop
search engines threaten SSL VPN security
By Author Tim Greene
Abstract:
Discusses how new Desktop Search tools from Google and others are a
security threat.
Matrix of Desktop Search Engine Indexing Capabilities
Steve Durbin - February 24, 2005 - Knowledge-based search
Exploiting a Thesaurus-Based Semantic Net for Knowledge-Based Search
Peter Clark, John Thompson, Heather Holmback, Lisbeth Duncan
Mathematics and Computing Technology, The Boeing Company
Abstract:
With the growth of on-line information, the need for better resource location
services is growing rapidly. A popular goal is to conduct search in terms of
concepts, rather than words; however, this approach is frequently thwarted by
the high up-front cost of building an adequate ontology (conceptual vocabulary)
in the first place. In this paper we describe a knowledge-based Expert Locator
application (for identifying human experts relevant to a particular problem or
interest), which addresses this issue by using a large, pre-built, technical
thesaurus as an initial ontology, combined with simple AI techniques of search,
subsumption computation, and language processing. The application has been
deployed and in use in our local organization since June, 1999, and a second,
larger application was deployed in March 2000. We present the Expert Locator
and the AI techniques it uses, and then we evaluate and discuss the
application. The significance of this work is that it demonstrates how years of
work by library science in thesaurus-building can be leveraged using AI
methods, to construct a practical resource location service in a short period
of time.
Neal Richter and Steve Durbin - February 17, 2005 - Knowledge discovery from the Web
Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitanyi
National Research Institute for Mathematics and Computer Science, Netherlands
Abstract:
We have found a method to automatically extract the meaning of words and
phrases from the world-wide-web using Google page counts. The approach is
novel in its unrestricted problem domain, simplicity of implementation,
and manifestly ontological underpinnings. The world-wide-web is the largest
database on earth, and the latent semantic context information entered by
millions of independent users averages out to provide automatic meaning of
useful quality. We demonstrate positive correlations, evidencing an underlying
semantic structure, in both numerical symbol notations and number-name words
in a variety of natural languages and contexts. Next, we demonstrate the
ability to distinguish between colors and numbers, and to distinguish between
17th century Dutch painters; the ability to understand electrical terms,
religious terms, emergency incidents, and we conduct a massive experiment in
understanding WordNet categories; the ability to do a simple automatic
English-Spanish translation.
Doug Warner and Steve Durbin - February 10, 2005 - Search systems with swarm optimization
User Behavior in an Adaptive Search System
Doug Warner, Stephen D. Durbin, J. Neal Richter, Z. Gedeon
RightNow Technologies
Abstract:
We analyze usage of a search system employing both standard search techniques
and Ant Colony Optimization (ACO) approaches. Searching behavior on the
system is similar to that on Internet search engines, though low searching
rates suggest facilitation due to the ACO methods. Simulations based on site
and user models give insight into the adaptive behavior and roughly match
observations.
Zuzana Gedeon - February 3, 2005 - Learning ordering
Learning to Order Things
William W. Cohen, Robert E. Schapire, and Yoram Singer
AT&T Labs
Abstract:
There are many applications in which it is desirable to order rather than
classify instances. Here we consider the problem of learning how to order,
given feedback in the form of preference judgments, i.e., statements to
the effect that one instance should be ranked ahead of another. We outline
a two-stage approach in which one first learns by conventional means a
preference function, of the form PREF(u,v), which indicates whether it is
advisable to rank u before v. New instances are then ordered so as to maximize
agreements with the learned preference function. We show that the problem of
finding the ordering that agrees best with a preference function is
NP-complete, even under very restrictive assumptions. Nevertheless, we
describe a simple greedy algorithm that is guaranteed to find a good
approximation. We then discuss an on-line learning algorithm, based on the
"Hedge" algorithm, for finding a good linear combination of ranking "experts."
We use the ordering algorithm combined with the on-line learning algorithm to
find a combination of "search experts," each of which is a domain-specific
query expansion strategy for a WWW search engine, and present experimental
results that demonstrate the merits of our approach.
Jeff Elser and John Paxton - January 27, 2005 - Optimizing a search engine
Abstract:
ht://dig performs three major tasks that should be performed in the following order:
-
Digging
Before you can search, a database of all the documents that need to be searched has to be created. -
Merging
Merging consists of two processes:
A. Converting the databases of all documents to specialized databases for simple, fast searching.
B. Merging changed information into previously existing databases.
Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging. -
Searching
Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches will be invoked by a CGI (Common Gateway Interface; a program running on the webserver) which gets input from the user through an HTML form.
Discussion Topics:
(1) An overview of ht://dig
(2) an exploratory discussion of how machine learning might be used to configure ht://dig
Reading: Browse the ht://dig website, http://www.htdig.org/
Steve Durbin - January 20, 2005 - Bayesian bots
Teaching Bayesian Behaviours to Video Game Characters
Ronan Le Hy, Anthony Arrigonia, Pierre Bessiere, Olivier Lebeltel
INRIA, France
Abstract:
This article explores an application of Bayesian Programming to behaviours for
synthetic video games characters. We address the problem of real-time reactive
selection of elementary behaviours for an agent playing a first person shooter
game. We show how Bayesian Programming can lead to condensed and easier
formalisation of finite state machine-like behaviour selection, and lend itself
to learning by imitation, in a fully transparent way for the player.
Steve Durbin - December 9, 2004 - Perceptions of randomness
Randomness and Coincidences: Reconciling Intuition and Probability Theory
Thomas L. Griffiths & Joshua B. Tenenbaum
Department of Psychology, Stanford University
Abstract:
We argue that the apparent inconsistency between people's intuitions about chance
and the normative predictions of probability theory, as expressed in judgments
about randomness and coincidences, can be resolved by focussing on the evidence
observations provide about the processes that generated them, rather than their
likelihood. This argument is supported by probabilistic modeling of sequence
and number production, together with two experiments that examine people's
judgments about coincidences.
Zuzana Gedeon - December 2, 2004 - Clustering impossibility theorem
An Impossibility Theorem for Clustering
Jon Kleinberg
Department of Computer Science, Cornell University
Abstract:
Although the study of clustering is centered around an intuitively compelling goal,
it has been very difficult to develop a unified framework for reasoning about it
at a technical level, and profoundly diverse approaches to clustering abound in the
research community. Here we suggest a formal perspective on the difficulty in
finding such a unification, in the form of an impossibility theorem: for a set of
three simple properties, we show that there is no clustering function satisfying all
three. Relaxations of these properties expose some of the interesting (and unavoidable)
trade-offs at work in well-studied clustering techniques such as single-linkage,
sum-of-pairs, k-means, and k-median.
Mike Emery and John Paxton - November 18, 2004 - Structure in multi-agent systems
Cougarr Agent Communities
Ronald D. Snyder and Douglas C. MacKenzie
Mobile Intelligence Corporation
Abstract:
The ability to organize agents into abstract groups provides a powerful tool for
agent organization and communication in large multi-agent systems. In this paper
we outline the fundamental characteristics required of a robust and scalable agent
grouping mechanism. These characteristics are then discussed in the context of the
Cougarr community infrastructure. This implementation of agent communities illustrates
the use of distributed state management, distributed event propagation and abstract
messaging in a high-performance agent architecture designed for robustness and
scalability.
Steve Durbin - November 4, 2004 - Question-answering with soft text patterns
Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan, and Tat-Seng Chua
Department of Computer Science, School of Computing, National University of Singapore
Abstract:
Breaking news often contains timely definitions and descriptions of current terms,
organizations and personalities. We utilize such web sources to construct definitions
for such terms. Previous work has identified definitions using hand-crafted rules or
supervised learning that constructs rigid, hard text patterns. In contrast, we
demonstrate a new approach that uses flexible, soft matching patterns to characterize
definition sentences. Our soft patterns are able to effectively accommodate the
diversity of definition sentence structure exhibited in news. We use pseudo-relevance
feedback to automatically label sentences for use in soft pattern generation.
The application of our unsupervised method significantly improves baseline systems
on both the standardized TREC corpus as well as crawled online news articles by 27%
and 30%, respectively, in terms of F measure. When applied to a state-of-art definition
generation system recently fielded in the TREC 2003 definitional question answering task,
it improves the performance by 14%.
Binhai Zhu - October 28, 2004 - Datamining in sparse databases
Efficient discovery of error-tolerant frequent itemsets in high dimensions
Cheng Yang
Dept. of Computer Science, Stanford University
Usama Fayyad and Paul S. Bradley
digiMine, Inc.
Abstract:
We present a generalization of frequent itemsets allowing for the notion of
errors in the itemset definition. We motivate the problem and present an
efficient algorithm that identifies error-tolerant frequent clusters of items
in transactional data (customer purchase data, web browsing data, text, etc.).
The algorithm exploits sparseness of the underlying data to find large groups
of items that are correlated over database records (rows). The notion of transaction
coverage allows us to extend the algorithm and view it as a fast clustering algorithm
for discovering segments of similar transactions in binary sparse data. We evaluate
the new algorithm on three real-world applications: clustering high-dimensional data,
query selectivity estimation and collaborative filtering. Results show that the
algorithm consistently uncovers structure in large sparse databases that other
traditional clustering algorithms fail to find.
Rafal Angryk, Steve Durbin - October 21, 2004 - Uncertain knowledge in databases and multi-agent communication
Current approaches to handling imperfect information in data and knowledge bases
Simon Parsons
Advanced Computation Laboratory, Imperial Cancer Research Fund
Abstract:
This paper surveys methods for representing and reasoning with imperfect information.
It opens with an attempt to classify the different types of imperfection that may
pervade data, and a discussion of the sources of such imperfections. The classification
is then used as a framework for considering work that explicitly concerns the
representation of imperfect information, and related work on how imperfect information
may be used as a basis for reasoning. The work that is surveyed is drawn from both the
field of databases and the field of artificial intelligence. Both of these areas have
long been concerned with the problems caused by imperfect information, and this paper
stresses the relationships between the approaches developed in each.
The Byzantine generals problem
Leslie Lamport, Robert Shostak, and Marshall Pease
SRI International
Abstract:
Reliable computer systems must handle malfunctioning components that give conflicting
information to different parts of the system. This situation can be expressed
abstractly in terms of a group of generals of the Byzantine army camped with their
troops around an enemy city. Communicating only by messenger, the generals must agree
upon a common battle plan. However, one or more of them may be traitors who will try
to confuse the others. The problem is to find an algorithm to ensure that the loyal
generals will reach agreement. It is shown that, using only oral messages, this
problem is solvable if and only if more that two-thirds of the generals are loyal; so
a single traitor can confound two loyal generals. With unforgeable written messages, the
problem is solvable for any number of generals and possible traitors. Applications of
the solutions to reliable computer systems are then discussed.
Doug Warner - October 7, 2004 - Processing uncertain knowledge
How Humans Process Uncertain Knowledge: An Introduction for Knowledge Engineers
Robert F. Hink and David L. Woods
Abstract:
The question of how humans process uncertain information is important to the
development of knowledge-based systems in terms of both knowledge acquisition
and knowledge representation. This article reviews three bodies of psychological
research that address this question: human perception, human probabilistic and
statistical judgement, and human choice behavior. The general conclusion is that
human behavior under uncertainty is often suboptimal and sometimes even fallacious.
Suggestions for knowledge engineers in detecting and obviating such errors are
discussed. The requirements for a system designed to reduce the effects of human
factors in the processing of uncertain knowledge are introduced.
Steve Durbin - September 30, 2004 - Cognitive biases and bounded rationality
Chapter from the very readable CIA publication
Psychology of Intelligence Analysis
Biases in Estimating Probabilities
Daniel Kahneman Nobel Prize lecture
Maps of Bounded Rationality: A Perspective on Intuitive Judgement and Choice
