Colloquium Papers
Neal Richter - May 1, 2008 - Web Search Ranking
Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity
Mikhail Bilenko and Ryen W. White
Microsoft Research
Abstract:
The paper proposes identifying relevant information sources from
the history of combined searching and browsing behavior of many
Web users. While it has been previously shown that user interactions
with search engines can be employed to improve document
ranking, browsing behavior that occurs beyond search result
pages has been largely overlooked in prior work. The paper
demonstrates that users’ post-search browsing activity strongly
reflects implicit endorsement of visited pages, which allows estimating
topical relevance of Web resources by mining large-scale
datasets of search trails. We present heuristic and probabilistic
algorithms that rely on such datasets for suggesting authoritative
websites for search queries. Experimental evaluation shows that
exploiting complete post-search browsing trails outperforms alternatives
in isolation (e.g., clickthrough logs), and yields accuracy
improvements when employed as a feature in learning to rank for
Web search.
Mike Harrelson - April 24, 2008 - Neural networks
Confabulation theory
Dr. Robert Hecht-Nielsen
University of California, San Diego
Abstract:
Confabulation theory offers a comprehensive detailed explanation of the mechanism of
thought (i.e., “cognition”: vision, hearing, reasoning, language, planning, origination
of movement and thought processes, etc.) in humans and other vertebrates (and possibly
in invertebrates, such as octopi and bees). For expositional simplicity, only the human
case is considered here.
Steve Durbin - April 17, 2008 - Prediction Markets
Using Prediction Markets to Track Information Flows: Evidence from Google
Bo Cowgill*, Justin Wolfers**, and Eric Zitzewitz***
*Google; **Wharton School, Univ. of Pennsylvania; ***Dartmouth College
Abstract:
In the last 2.5 years, Google has conducted the largest corporate experiment with prediction
markets we are aware of. In this paper, we illustrate how markets can be used to study how an
organization processes information. We document a number of biases in Google’s markets,
most notably an optimistic bias. Newly hired employees are on the optimistic side of these
markets, and optimistic biases are significantly more pronounced on days when Google stock is
appreciating. We find strong correlations in trading for those who sit within a few feet of one
another; social networks and work relationships also play a secondary explanatory role. The
results are interesting in light of recent research on the role of optimism in entrepreneurial
firms, as well as recent work on the importance of geographical and social proximity in
explaining information flows in firms and markets.
Neal Richter - April 10, 2008 - Artificial Intelligence
AI Chasers
Patrick Tucker
World Future Society
Abstract:
The “holy grail” of computer science may be
within reach. A futurist looks toward tomorrow’s
Artificial Intelligence Revolution.
The article was based on
these interviews:
MIT roboticist Rodney Brooks, Adaptive A.I. Inc. founder Peter Voss,
Self-Aware Systems founder Steve Omohundro, Powerset CEO Barney Pell,
and Google research director Peter Norvig discuss how they see AI
developing in the years ahead, when a human-level AI might emerge,
and how worried we should be about that whole killer-robot-goes-on-rampage scenario.
Neal Richter - April 3, 2008 - Pattern Mining
Clustering Co-occurrences of Maximal Frequent Patterns in Streams
Edgar H. de Graaf, Joost N. Kok, Walter A. Kosters
Leiden Institute of Advanced Computer Science, Leiden University
Abstract:
One way of getting a better view of data is by using frequent
patterns. In this paper frequent patterns are (sub)sets that
occur a minimal number of times in a stream of itemsets.
However, the discovery of frequent patterns in streams has
always been problematic. Because streams are potentially
endless it is in principle impossible to say if a pattern is often
occurring or not. Furthermore, the number of patterns can
be huge and a good overview of the structure of the stream
is lost quickly. The proposed approach will use clustering to
facilitate the “online” analysis of the structure of the stream.
A clustering on the co-occurrence of patterns will give the user an improved view on the structure of the stream. Some patterns might occur so often together that they should form a combined pattern. In this way the patterns in the clustering will approximate the largest frequent patterns: maximal frequent patterns. The number of (approximated) maximal frequent patterns is much smaller and combined with clustering methods these patterns provide a good view on the structure of the stream.
Our approach to decide if patterns occur often together is based on a method of clustering where only the distance between pairs of patterns is known. This distance is the Euclidean distance between points in a 2-dimensional space, where the points represent the frequent patterns, or rather the most important ones. The coordinates are adapted when the records from the stream pass by, and reflect the real support of the corresponding pattern. In this setup the support is viewed as the number of occurrences in a time window. The main algorithm tries to maintain a dynamic model of the data stream by merging and splitting these patterns. Experiments show the versatility of the method.
Richard McAllister - March 27, 2008 - Pattern Mining
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
Jiawei Han*, Jian Pei**, Yiwei Yin***, Runying Mao****
*Univ. of Illinois, **SUNY, ***Simon Fraser Univ., ****Microsoft
Abstract:
Mining frequent patterns in transaction databases, time-series databases,
and many other kinds of databases has been studied popularly in data mining
research. Most of the previous studies adopt an Apriori-like candidate set
generation-and-test approach. However, candidate set generation is still costly,
especially when there exist a large number of patterns and/or long patterns.
In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-treebased mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
Steve Durbin - March 6, 2008 - Data commons
Urban Sensing: Out of the Woods
Dana Cuff, Mark Hansen, and Jerry Kang
UCLA
Abstract:
Embedded networked sensing,
having successfully shifted from
the lab to the environment, is
primed for a more contentious
move to the city to where citizens
will likely be the target of data
collection. This transition will
warrant careful study and touch
on issues that go far beyond
the scientific realm.
Neal Richter - February 28, 2008 - Contextual Advertising
Finding Advertising Keywords on Web Pages
Wentau Yih,* Joshua Goodman,* Vitor R. Carvalho**
*Microsoft Research, **Carnegie Mellon University
Abstract:
A large and growing number of web pages display contextual
advertising based on keywords automatically extracted
from the text of the page, and this is a substantial source
of revenue supporting the web today. Despite the importance
of this area, little formal, published research exists.
We describe a system that learns how to extract keywords
from web pages for advertisement targeting. The system
uses a number of features, such as term frequency of each
potential keyword, inverse document frequency, presence in
meta-data, and how often the term occurs in search query
logs. The system is trained with a set of example pages that
have been hand-labeled with “relevant” keywords. Based on
this training, it can then extract new keywords from previously
unseen pages. Accuracy is substantially better than
several baseline systems.
Richard McAllister - February 21, 2008 - Clustering
Hierarchical Document Clustering Using Frequent Itemsets
Benjamin C.M. Fung, Ke Wangy, and Martin Esterz
Simon Fraser University
Abstract:
A major challenge in document clustering is the extremely
high dimensionality. For example, the vocabulary for a
document set can easily be thousands of words. On the
other hand, each document often contains a small fraction
of words in the vocabulary. These features require special
handlings. Another requirement is hierarchical clustering
where clustered documents can be browsed according to the
increasing specificity of topics. In this paper, we propose
to use the notion of frequent itemsets, which comes from
association rule mining, for document clustering. The
intuition of our clustering criterion is that each cluster is
identified by some common words, called frequent itemsets,
for the documents in the cluster. Frequent itemsets are
also used to produce a hierarchical topic tree for clusters.
By focusing on frequent items, the dimensionality of the
document set is drastically reduced. We show that this
method outperforms best existing methods in terms of both
clustering accuracy and scalability.
Neal Richter - February 14, 2008 - Search Advertising
Optimal Delivery of Sponsored Search Advertisements Subject to Budget Constraints
Zoe Abrams, Ofer Mendelevitch, John A. Tomlin
Yahoo!
Abstract:
We discuss an auction framework in which sponsored search
advertisements are delivered in response to queries. In practice,
the presence of bidder budgets can have a significant
impact on the ad delivery process. We propose an approach
based on linear programming which takes bidder budgets
into account, and uses them in conjunction with forecasting
of query frequencies, and pricing and ranking schemes, to
optimize ad delivery. Simulations show significant improvements
in revenue and efficiency.
Steve Durbin - February 7, 2008 - Web economics
Three topics from The Technium, a book in progress
Kevin Kelly
How much does one search cost?
Steve Durbin - January 31, 2008 - Search
Trusted Search Communities
Peter Briggs and Barry Smyth
Adaptive Information Cluster, School of Computer Science and Informatics, University College Dublin
Abstract:
We describe a social search technique that harnesses the search
experiences of a community of searchers to generate result
recommendations, in a collaborative fashion, to complement
results returned from some underlying search engine. We
describe a dynamic model of trust as a way to coordinate
collaboration, and provide experimental results to show that
search performance improves as the network evolves.
Steve Durbin - December 6, 2007 - Design
Creativity Support Tools: Accelerating Discovery and Innovation
Ben Shneiderman
Dept. of Computer Science, Univ. of Maryland, College Park
Abstract:
How can designers of programming interfaces, interactive tools, and rich social
environments enable more people to be more creative more often?
Anthony Arnone - November 29, 2007 - Search
SOLR and Lucene
Abstract:
This week we’ll be covering the search engine Lucene and one of its main sub-projects, Solr. I’ll be giving an overview of the concepts behind these two packages, and how you can use them to easily add expansive search capability to just about any Java environment you want.
Lucene is a search engine written in Java that provides high scalability, fast searching, robust handling of generic document structures, and a highly pluggable architecture that makes adding features easy. Lucene has a very active development community and is consistently on the ‘cutting edge’ of search engine technologies.
Solr is a servlet-based front end for the Lucene search engine. It provides XML-based configuration and index schema management, APIs in XML/HTTP, master-slave style replication, caching, and a web based admin interface. It also includes a bunch of trendy features for modern search engines such as faceting, more-like-this, and JSON responses, along with a host of other cool features under development.
Anthony's Powerpoint presentation
Steve Durbin - October 25, 2007 - Ontologies
Deriving a Large Scale Taxonomy from Wikipedia
Simone Paolo Ponzetto and Michael Strube
EML Research, Germany
Abstract:
We take the category system in Wikipedia as a conceptual network.
We label the semantic relations between categories using
methods based on connectivity in the network and lexico-syntactic
matching. As a result we are able to derive a large
scale taxonomy containing a large amount of subsumption,
i.e. isa, relations. We evaluate the quality of the created resource
by comparing it with ResearchCyc, one of the largest
manually annotated ontologies, as well as computing semantic
similarity between words in benchmarking datasets.
Steve Durbin - October 18, 2007 - Text similarity
Improving Similarity Measures for Short Segments of Text
Wen-tau Yih and Christopher Meek
Microsoft Research
Abstract:
In this paper we improve previous work on measuring the
similarity of short segments of text in two ways. First, we introduce
a Web-relevance similarity measure and demonstrate
its effectiveness. This measure extends the Web-kernel similarity
function introduced by Sahami and Heilman (2006)
by using relevance weighted inner-product of term occurrences
rather than TF×IDF. Second, we show that one can
further improve the accuracy of similarity measures by using
a machine learning approach. Our methods outperform other
state-of-the-art methods in a general query suggestion task
for multiple evaluation metrics.
Neal Richter and Steve Durbin - October 4, 2007 - Information Retrieval
Search Engines that Learn from Implicit Feedback
Thorsten Joachims and Filip Radlinski
Cornell University
Abstract:
Search-engine logs provide a wealth of information that machine-learning techniques can
harness to improve search quality. With proper interpretations that avoid inherent biases,
a search engine can use training data extracted from the logs to automatically tailor
ranking functions to a particular user group or collection.
Neal Richter - September 27, 2007 - Information Retrieval
Learning to Rank for Information Retrieval Using Genetic Programming
Jen-Yuan Yeh*, Jung-Yi Lin*, Hao-Ren Ke**, Wei-Pang Yang***
*Dept. of Computer Science,
National Chiao Tung University
**Inst. of Information Management,
National Chiao Tung University
***Dept. of Information Management,
National Dong Hwa University
Abstract:
One central problem of information retrieval (IR) is to determine
which documents are relevant and which are not to the user
information need. This problem is practically handled by a
ranking function which defines an ordering among documents
according to their degree of relevance to the user query. This
paper discusses work on using machine learning to automatically
generate an effective ranking function for IR. This task is referred
to as "learning to rank for IR" in the field. In this paper, a learning
method, RankGP, is presented to address this task. RankGP
employs genetic programming to learn a ranking function by
combining various types of evidences in IR, including content
features, structure features, and query-independent features. The
proposed method is evaluated using the LETOR benchmark
datasets and found to be competitive with Ranking SVM and
RankBoost.
Steve Durbin - September 20, 2007 - Knowledge base organization
Collaborative Structuring: Organizing Document Repositories Effectively and
Efficiently
Harris Wu* and Michael Gordon**
*Old Dominion University; **University of Michigan
Abstract:
Improving how documents are managed and data is stored and
exchanged in systems.
Steve Durbin - September 13, 2007 - Chatbots
A Terrorism Question/Answer System
R. P. Schumaker,* Y. Liu,** M. Ginsburg,*** and H. Chen***
*IS Dept, Iona College; **IS Dept., Cal State Long Beach; ***Dept. of Managmement
Information Systems, U. of Arizona
Abstract:
The TARA Project examined how a trio of modified chatterbots
could be used to disseminate terrorism-related information
to the general public.
Ben Werner - August 23, 2007 - HCI
Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms
Hiroshi Ishii and Brygg Ullmer, MIT Media Lab
Abstract:
This paper presents our vision of Human Computer
Interaction (HCI): "Tangible Bits." Tangible Bits allows
users to "grasp & manipulate" bits in the center of users’
attention by coupling the bits with everyday physical
objects and architectural surfaces. Tangible Bits also
enables users to be aware of background bits at the
periphery of human perception using ambient display media
such as light, sound, airflow, and water movement in an
augmented space. The goal of Tangible Bits is to bridge
the gaps between both cyberspace and the physical
environment, as well as the foreground and background of
human activities.
This paper describes three key concepts of Tangible Bits:
interactive surfaces; the coupling of bits with graspable
physical objects; and ambient media for background
awareness. We illustrate these concepts with three
prototype systems – the metaDESK, transBOARD and
ambientROOM – to identify underlying research issues.
Steve Durbin - August 2, 2007 - Search
How Well does Result Relevance Predict Session Satisfaction?
Scott B. Huffman, Michael Hochster, Google, Inc.
Abstract:
Per-query relevance measures provide standardized, repeatable
measurements of search result quality, but they ignore
much of what users actually experience in a full search session.
This paper examines how well we can approximate a
user’s ultimate session-level satisfaction using a simple relevance
metric. We find that this relationship is surprisingly
strong. By incorporating additional properties of the query
itself, we construct a model which predicts user satisfaction
even more accurately than relevance alone.
Ben Werner - July 19, 2007 - Usability
Examining the Robustness of Sensor-Based Statistical Models of Human Interruptibility
James Fogarty, Scott E. Hudson, HCI Institute, Carnegie-Mellon
Jennifer Lai, IBM T.J. Watson Research Center
Abstract:
Current systems often create socially awkward interruptions
or unduly demand attention because they have no way of
knowing if a person is busy and should not be interrupted.
Previous work has examined the feasibility of using sensors
and statistical models to estimate human interruptibility in
an office environment, but left open some questions about
the robustness of such an approach. This paper examines
several dimensions of robustness in sensor-based statistical
models of human interruptibility. We show that real
sensors can be constructed with sufficient accuracy to drive
the predictive models. We also create statistical models for
a much broader group of people than was studied in prior
work. Finally, we examine the effects of training data
quantity on the accuracy of these models and consider
tradeoffs associated with different combinations of sensors.
As a whole, our analyses demonstrate that sensor-based
statistical models of human interruptibility can provide
robust estimates for a variety of office workers in a range of
circumstances, and can do so with accuracy as good as or
better than people. Integrating these models into systems
could support a variety of advances in human computer
interaction and computer-mediated communication.
Neal Richter - June 21, 2007 - Search Ranking
Google Keeps Tweaking Its Search Engine
Saul Hansell
New York Times
Abstract:
Article on the Search Quality group at Google
See also the keynote presentation at SIGIR 2005: Challenges in Running a Commercial Search Engine by Amit Singhal of Google
Neal Richter - June 7, 2007 - Search Engines
Why writing your own search engine is hard
Anna Patterson
Stanford University
Abstract:
There must be 4,000 programmers typing away in their basements
trying to build the next “world’s most scalable”
search engine. It has been done only a few times. It has
never been done by a big group; always one to four people
did the core work, and the big team came on to build the
elaborations and the production infrastructure. Why is it
so hard? We are going to delve a bit into the various issues
to consider when writing a search engine. This article is
aimed at those individuals or small groups that are considering
this endeavor for their Web site or intranet. It is fun, but
a word of caution: not only is it difficult, but you need two
commodities in short supply—time and patience.
Zuzana Gedeon - May 17, 2007 - Search
Multi-factor Clustering for a Marketplace Search Interface
Neel Sundaresan, Kavita Ganesan and Roopnath Grandhi
eBay Research Labs
Abstract:
Search engines provide a small window to the vast repository of
data they index and against which they search. They try their best
to return the documents that are of relevance to the user but often
a large number of results may be returned. Users struggle to
manage this vast result set looking for the items of interest.
Clustering search results is one way of alleviating this
navigational pain. In this paper we describe a clustering system
that enables clustering search results in an online marketplace
search system.
Adaptive Faceted Browser for Navigation in Open Information Spaces
Michal Tvarožek and Mária Bieliková
Faculty of Informatics and Information Technologies, Slovak University of Technology
Abstract:
Open information spaces have several unique characteristics
such as their changeability, large size, complexity and diverse
user base. These result in novel challenges during user
navigation, information retrieval and data visualization in
open information spaces. We propose a method of navigation
in open information spaces based on an enhanced
faceted browser with support for dynamic facet generation
and adaptation based on user characteristics.
Zuzana Gedeon, Hasari Tosun, Steve Durbin - April 12, 2007 - WWW 2007
Navigating the Intranet with High Precision
Huaiyu Zhu, Alexander Loser, Sriram Raghavan, and Shivakumar Vaithyanathan
IBM Almaden Research Center and SAP Research CEC
Abstract:
Despite the success of web search engines, search over large enterprise
intranets still suffers from poor result quality. Earlier work
that compared intranets and the Internet from the view point of keyword
search has pointed to several reasons why the search problem
is quite different in these two domains. In this paper, we address the
problem of providing high quality answers to navigational queries
in the intranet (e.g., queries intended to find product or personal
home pages, service pages, etc.). Our approach is based on offline
identification of navigational pages, intelligent generation of termvariants
to associate with each page, and the construction of separate
indices exclusively devoted to answering navigational queries.
Using a testbed of 5.5M pages from the IBM intranet, we present
evaluation results that demonstrate that for navigational queries,
our approach of using custom indices produces results of significantly
higher precision than those produced by a general purpose
search algorithm.
The Complex Dynamics of Collaborative Tagging
Harry Halpin, Valentin Robu, and Hana Shepherd
University of Edinburgh; Center for Math and CS (Amsterdam); Princeton University
Abstract:
The debate within the Web community over the optimal means
by which to organize information often pits formalized classications
against distributed collaborative tagging systems. A number
of questions remain unanswered, however, regarding the nature of
collaborative tagging systems including whether coherent categorization
schemes can emerge from unsupervised tagging by users.
This paper uses data from the social bookmarking site del.icio.us to
examine the dynamics of collaborative tagging systems. In particular,
we examine whether the distribution of the frequency of use
of tags for "popular" sites with a long history (many tags and many
users) can be described by a power law distribution, often characteristic
of what are considered complex systems. We produce a
generative model of collaborative tagging in order to understand
the basic dynamics behind tagging, including how a power law distribution
of tags could arise. We empirically examine the tagging
history of sites in order to determine how this distribution arises
over time and to determine the patterns prior to a stable distribution.
Lastly, by focusing on the high-frequency tags of a site where
the distribution of tags is a stabilized power law, we show how tag
co-occurrence networks for a sample domain of tags can be used
to analyze the meaning of particular tags given their relationship to
other tags.
Identifying Ambiguous Queries in Web Search
Ruihua Song, Zhenxiao Luo, Ji-Rong Wen, Yong Yu, and Hsiao-Wuen Hon
Shanghai Jiao Tong University, Microsoft Research Asia, Fudan University
Abstract:
It is widely believed that some queries submitted to search
engines are by nature ambiguous (e.g., java, apple). However, few
studies have investigated the questions of "how many queries are
ambiguous?" and "how can we automatically identify an
ambiguous query?" This paper deals with these issues. First, we
construct the taxonomy of query ambiguity, and ask human
annotators to manually classify queries based upon it. From
manually labeled results, we find that query ambiguity is to some
extent predictable. We then use a supervised learning approach to
automatically classify queries as being ambiguous or not.
Experimental results show that we can correctly identify 87% of
labeled queries. Finally, we estimate that about 16% of queries in
a real search log are ambiguous.
Steve Durbin - April 5, 2007 - Personalization
Open User Profiles for Adaptive News Systems: Help or Harm?
Jae-wook Ahn, Peter Brusilovsky, Jonathan Grady, Daqing He and Sue Yeon Syn
School of Information Sciences, University of Pittsburgh
Abstract:
Over the last five years, a range of projects have focused on
progressively more elaborated techniques for adaptive news
delivery. However, the adaptation process in these systems has
become more complicated and thus less transparent to the users.
In this paper, we concentrate on the application of open user
models in adding transparency and controllability to adaptive
news systems. We present a personalized news system,
YourNews, which allows users to view and edit their interest
profiles, and report a user study on the system. Our results
confirm that users prefer transparency and control in their systems,
and generate more trust to such systems. However, similar to
previous studies, our study demonstrate that this ability to edit
user profiles may also harm the system’s performance and has to
be used with caution.
Steve Durbin - March 29, 2007 - Reputation Systems
A Content-Driven Reputation System for the Wikipedia
B. Thomas Adler and Luca de Alfaro
University of California, Santa Cruz
Abstract:
We present a content-driven reputation system for
Wikipedia authors. In our system, authors gain reputation
when the edits they perform to Wikipedia articles are
preserved by subsequent authors, and they lose reputation
when their edits are rolled back or undone in short order.
Thus, author reputation is computed solely on the basis of
content evolution; user-to-user comments or ratings are not
used. The author reputation we compute could be used to
flag new contributions from low-reputation authors, or it
could be used to allow only authors with high reputation to
contribute to controversial or critical pages. A reputation
system for the Wikipedia could also provide an incentive for
high-quality contributions.
We have implemented the proposed system, and we have
used it to analyze the entire Italian and French Wikipedias,
consisting of a total of 691,551 pages and 5,587,523 revisions.
Our results show that our notion of reputation has
good predictive value: changes performed by low-reputation
authors have a significantly larger than average probability
of having poor quality, as judged by human observers, and
of being later undone, as measured by our algorithms
Steve Durbin - March 22, 2007 - Collective Intelligence
Web pundits trumpet social software as a means of generating knowledge by tapping into the wisdom of crowds. The following two articles offer contrarian perspectives.
The "Dumbness of Crowds"
Kathy Sierra
DIGITAL MAOISM: The Hazards of the New Online Collectivism
Jaron Lanier
Neal Richter - March 15, 2007 - Semantic Web
AI Meets Web 2.0: Building the Web of Tomorrow, Today
Jay M. Tenenbaum
Abstract:
Imagine an Internet-scale knowledge system where people and intelligent agents can
collaborate on solving complex problems in business, engineering, science, medicine,
and other endeavors. Its resources include semantically tagged websites, wikis, and
blogs, as well as social networks, vertical search engines, and a vast array of web
services from business processes to AI planners and domain models. Research prototypes
of decentralized knowledge systems have been demonstrated for years, but now, thanks
to the web and Moore’s law, they appear ready for prime time. This article introduces
the architectural concepts for incrementally growing an Internet-scale knowledge
system and illustrates them with scenarios drawn from e-commerce, e-science, and e-life.
Steve Durbin - March 8, 2007 - Tag Mining
Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development
Benjamin Szekely and Elias Torres
Harvard University
Abstract:
Online communities are an important forum for people to share information
on the internet. Historically, online communities have been centered
around free-text sharing and simple user rating systems. With the
shadow of the Semantic Web looming behind many new innovations
on the internet today, online communities are beginning to adopt semantic
techniques such as tagging and folksonomies as a vehicle for
the sharing and classification of information. Gourmetvillage.org is a
prototype online community that uses tagging to provide an innovative
restaurant rating mechanism. Users may provide free-text reviews and
tags for any aspect of a restaurant. UserRank is an algorithm based
on Google’s PageRank that provides a ranking of users based on
whose taggings are most often followed. TagRank provides a ranking
of tags based on the ranking of users. UserRank may in fact be used
to rank any entity in a community based on user association. del.icio.us
contains a significant dataset of users and tagged entities with which to
test these algorithms. UserRank and TagRank yield different rankings
than their count-based counterparts. When drilling down into the
results, the top user according to UserRank only made a few taggings,
yet hundreds of users in the system agreed with his taggings, showing
that these algorithms capture community consensus fairly well.
Steve Durbin - March 1, 2007 - Tag Mining
Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Paul Heyman and Hector Garcia-Molina
Computer Science Department, Stanford University
Abstract:
Collaborative tagging systems -- systems where many casual users annotate objects with free-form
strings (tags) of their choosing -- have recently emerged as a powerful way to label and organize
large collections of data. During our recent investigation into these types of systems, we discovered
a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects
in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm
and then present a preliminary model to explain why it is so effective in these types of systems.
Steve Durbin - February 15, 2007 - User interface
Clustering versus faceted categories for information exploration
Marti Hearst
School of Information Management and Systems
University of California, Berkeley
Abstract:
Information seekers often express a
desire for a user interface that organizes search
results into meaningful groups, in order to
help make sense of the results, and to help
decide what to do next. A longitudinal study
in which participants were provided with the
ability to group search results found they
changed their search habits in response to
having the grouping mechanism available [2].
There are many open research questions about how to generate useful groupings and how to design interfaces to support exploration using grouping. Currently two methods are quite popular: clustering and faceted categorization. Here, I describe both approaches and summarize their advantages and disadvantages based on the results of usability studies.
Steve Durbin - February 8, 2007 - User interface
Simplicity is Highly Overrated is a short essay on the notion of simplicity in user interfaces.
Cautious Cars and Cantankerous Kitchens: How Machines Take Control
Don Norman
Professor of Computer Science, Psychology, and Cognitive science
Northwestern University
Abstract:
The first chapter of The Design of Future Things, to be published in 2008.
It‘s about the ever-increasing role of automation in our homes and automobiles,
why it is being done so badly, with suggestions for doing it right through
what I call “natural interaction.”
Steve Durbin - February 1, 2007 - Ambient AI
Machine Learning, Reasoning, and Intelligence in Daily Life: Directions and Challenges
Eric Horvitz
Microsoft Research
Abstract:
Technical developments and trends are providing a
fertile substrate for creating and integrating machine
learning and reasoning into multiple applications
and services. I will review several illustrative
research efforts on our team, and focus on challenges,
opportunities, and directions with the
streaming of machine intelligence into daily life.
Steve Durbin - November 30, 2006 - Natural Language Understanding
Machine Reading
Oren Etzioni, Michele Banko, Michael J. Cafarella
Computer Science & Engineering, University of Washington
Abstract:
The time is ripe for the AI community to set its sights on
Machine Reading---the automatic, unsupervised
understanding of text. In this paper, we place the notion of
“Machine Reading” in context, describe progress towards
this goal by the KnowItAll research group at the University
of Washington, and highlight several central research
questions.
Steve Durbin - November 2, 2006 - Social Networks
The Dynamics of Viral Marketing
Jure Leskovec, Carnegie Mellon
Lada A. Adamic, University of Michigan
Bernardo A. Huberman, HP Labs
Abstract:
We present an analysis of a person-to-person recommendation network,
consisting of 4 million people who made 16 million recommendations on
half a million products. We observe the propagation of recommendations
and the cascade sizes, which we explain by a simple stochastic model.
We analyze how user behavior varies within user communities defined by a
recommendation network. Product purchases follow a 'long tail' where a
significant share of purchases belongs to rarely sold items. We establish
how the recommendation network grows over time and how effective it is
from the viewpoint of the sender and receiver of the recommendations.
While on average recommendations are not very effective at inducing
purchases and do not spread very far, we present a model that successfully
identfies communities, product and pricing categories for which viral
marketing seems to be very effective.
Steve Durbin - October 26, 2006 - User Behavior
The origin of bursts and heavy tails in human dynamics
Albert-Laszlo Barabasi
Center for Complex Networks Research and Dept. of Physics, University of Notre Dame
Abstract:
The dynamics of many social, technological and economic
phenomena are driven by individual human actions, turning
the quantitative understanding of human behaviour into a
central question of modern science. Current models of human
dynamics, used from risk assessment to communications, assume
that human actions are randomly distributed in time and thus
letters to nature well approximated by Poisson processes. In contrast, there
is increasing evidence that the timing of many human
activities, ranging from communication to entertainment and
work patterns, follow non-Poisson statistics, characterized by
bursts of rapidly occurring events separated by long periods of
inactivity. Here I show that the bursty nature of human
behaviour is a consequence of a decision-based queuing
process: when individuals execute tasks based on some perceived
priority, the timing of the tasks will be heavy tailed, with
most tasks being rapidly executed, whereas a few experience very
long waiting times. In contrast, random or priority blind
execution is well approximated by uniform inter-event statistics.
These finding have important implications, ranging from
resource management to service allocation, in both communications
and retail.
Hasari Tosun - October 19, 2006 - Recommender systems
Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion
Jun Wang, Arjen P. de Vries, and Marcel J. T. Reinders
Information and Communication Theory Group, Delft University of Technology
Abstract:
Memory-based methods for collaborative filtering predict new ratings by averaging
(weighted) ratings between, respectively, pairs of similar users or items.
In practice, a large number of ratings from similar users or similar items
are not available, due to the sparsity inherent to rating data. Consequently,
prediction quality can be poor. This paper reformulates the memory-based collaborative
filtering problem in a generative probabilistic framework, treating individual
user-item ratings as predictors of missing ratings. The final rating is estimated by
fusing predictions from three sources: predictions based on ratings of the same item
by other users, predictions based on different item ratings made by the same user, and,
third, ratings predicted based on data from other but similar users rating other but
similar items. Existing user-based and item-based approaches correspond to the two simple
cases of our framework. The complete model is however more robust to data sparsity,
because the different types of ratings are used in concert, while additional ratings
from similar users towards similar items are employed as a background model to smooth
the predictions. Experiments demonstrate that the proposed methods are indeed more
robust against data sparsity and give better recommendations.
David Tolliver - October 12, 2006 - Clustering
Note special time (Thursday 3:10) and place (1-139 Wilson)
Spectral Approximation Algorithms: Clustering and Graph Partitioning
David Tolliver
Google Fellow and Department of Computer Science, Carnegie Mellon University
Abstract:
I'll provide an overview of recent advances in spectral clustering and partitioning
technology. I'll briefly cover a fundamental result in clustering, and then delve
into the basic spectral clustering algorithm. I'll walk through a few variations
on the basic algorithm, highlighting the specific applications domains and drawbacks.
Finally I'll cover our proof of an approximation bound for the Normalized Cut (an
NP-hard objection function) for a simple spectral algorithm.
Steve Durbin - October 5, 2006 - Search
A Live-User Evaluation of Collaborative Web Search
Barry Smyth, Evelyn Balfe, Oisin Boydell, Keith Bradley, Peter Briggs, Maurice Coyle, Jill Freyne
Smart Media Institute, Department of Computer Science, University College Dublin
Abstract:
Collaborative Web search exploits repetition and regularity within the query-space
of a community of like-minded individuals in order to improve the quality of
search results. In short, search results that have been judged to be relevant
for past queries are promoted in response to similar queries that occur in the
future. In this paper we present the results of a large-scale evaluation of
this approach, in a corporate Web search scenario, which shows that significant
benefits are available to its users.
Steve Durbin - September 28, 2006 - Tagging
Visualizing Tags over Time
Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins
Yahoo! Research
Abstract:
We consider the problem of visualizing the evolution of tags within the Flickr
(flickr.com) online image sharing community. Any user of the Flickr service
may append a tag to any photo in the system. Over the past year, users have
on average added over a million tags each week. Understanding the evolution
of these tags over time is therefore a challenging task. We present a new
approach based on a characterization of the most interesting tags associated
with a sliding interval of time. An animation provided via Flash in a web
browser allows the user to observe and interact with the interesting tags as
they evolve over time.
New algorithms and data structures are required to support the efficient
generation of this visualization. We combine a novel solution to an interval
covering problem with extensions to previous work on score aggregation in order
to create an efficient backend system capable of producing visualizations at
arbitrary scales on this large dataset in real time.
Neal Richter - September 21, 2006 - Index Mining
Random Sampling from a Search Engine's Index
Best Paper Winner at WWW-2006
Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel
Maxim Gurevich, Department of Electrical Engineering, Technion, Israel
Abstract:
We revisit a problem introduced by Bharat and Broder al-
most a decade ago: how to sample random pages from a
search engine's index using only the search engine's public
interface? Such a primitive is particularly useful in creating
objective benchmarks for search engines.
The technique of Bharat and Broder suffers from two well
recorded biases: it favors long documents and highly ranked
documents. In this paper we introduce two novel sam-
pling techniques: a lexicon-based technique and a random
walk technique. Our methods produce biased sample docu-
ments, but each sample is accompanied by a corresponding
¿weight¿, which represents the probability of this document
to be selected in the sample. The samples, in conjunction
with the weights, are then used to simulate near-uniform
samples. To this end, we resort to three well known Monte
Carlo simulation methods: rejection sampling, importance
sampling and the Metropolis-Hastings algorithm.
We analyze our methods rigorously and prove that under
plausible assumptions, our techniques are guaranteed to pro-
duce near-uniform samples from the search engine's index.
Experiments on a corpus of 2.4 million documents substanti-
ate our analytical findings and show that our algorithms do
not have significant bias towards long or highly ranked doc-
uments. We use our algorithms to collect fresh data about
the relative sizes of Google, MSN Search, and Yahoo!.
Neal Richter - September 14, 2006 - Click-stream Data and Queries
Time-Dependent Semantic Similarity Measure of Queries Using Historical Click-Through Data
Qiankun Zhao, Steven C. H. Hoi, Tie-Yan Liu, Sourav S Bhowmick, Michael R. Lyu, Wei-Ying Ma
Nanyang Technological University, The Chinese University of HK, Microsoft
Research Asia
Abstract:
It has become a promising direction to measure similarity of
Web search queries by mining the increasing amount of clickthrough
data logged by Web search engines, which record
the interactions between users and the search engines. Most
existing approaches employ the click-through data for similarity
measure of queries with little consideration of the temporal
factor, while the click-through data is often dynamic
and contains rich temporal information. In this paper we
present a new framework of time-dependent query semantic
similarity model on exploiting the temporal characteristics
of historical click-through data. The intuition is that more
accurate semantic similarity values between queries can be
obtained by taking into account the timestamps of the log
data. With a set of user-defined calendar schema and calendar
patterns, our time-dependent query similarity model is
constructed using the marginalized kernel technique, which
can exploit both explicit similarity and implicit semantics
from the click-through data effectively. Experimental results
on a large set of click-through data acquired from a commercial
search engine show that our time-dependent query similarity
model is more accurate than the existing approaches.
Moreover, we observe that our time-dependent query similarity
model can, to some extent, reflect real-world semantics
such as real-world events that are happening over time.
Steve Durbin - September 7, 2006 - Ranking the web
Beyond PageRank: machine learning for static ranking
Matthew Richardson, Amit Prakash, and Eric Brill
Microsoft Research, Inc.
Abstract:
Since the publication of Brin and Page's paper on PageRank, many in the Web community
have depended on PageRank for the static (query-independent) ordering of Web pages.
We show that we can significantly outperform PageRank using features that are independent
of the link structure of the Web. We gain a further boost in accuracy by using data on
the frequency at which users visit Web pages. We use RankNet, a ranking machine learning
algorithm, to combine these and other static features based on anchor text and domain
characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3%
(vs. 56.7% for PageRank or 50% for random).
Steve Durbin - August 31, 2006 - Better answers
Retroactive Answering of Search Queries
Beverly Yang and Glen Jeh
Google, Inc.
Abstract:
Major search engines currently use the history of a user's actions (e.g., queries,
clicks) to personalize search results. In this paper, we present a new personalized
service, query-specific web recommendations (QSRs), that retroactively answers
queries from a user's history as new results arise. The QSR system addresses two
important subproblems with applications beyond the system itself: (1) Automatic
identification of queries in a user's history that represent standing interests
and unfulfilled needs. (2) Effective detection of interesting new results to these
queries. We develop a variety of heuristics and algorithms to address these problems,
and evaluate them through a study of Google history users. Our results strongly
motivate the need for automatic detection of standing interests from a user's history,
and identifies the algorithms that are most useful in doing so. Our results also
identify the algorithms, some which are counter-intuitive, that are most useful in
identifying interesting new results for past queries, allowing us to achieve very
high precision over our data set.
Sam Gardner - April 27, 2006 - Neural networks
Self-Organization of Hierarchical Visual Maps with Feedback Connections
Yiu Fai Sit and Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin
Abstract:
Visual areas in primates are known to have reciprocal connections. While
the feedforward bottom-up processing of visual information has been
studied extensively for decades, little is known about the role of the
feedback connections. Existing feedback models usually employ hand-coded
connections, and do not address how these connections develop. The model
described in this paper shows how feedforward and feedback connections
between cortical areas V1 and V2 can be learned through self-organization
simultaneously. Computational experiments show that both areas can form
hierarchical representations of the input with reciprocal connections that
link relevant cells in the two areas.
Mike Emery - April 20, 2006 - Self-organization
Decentralised Autonomic Computing: Analysing Self-Organising Emergent Behaviour
using Advanced Numerical Methods
Tom De Wolf, Giovanni Samaey, Tom Holvoet and Dirk Roose
Department of Computer Science, KU Leuven, Belgium
Abstract:
When designing decentralised autonomic computing systems, a fundamental engineering
issue is to assess systemwide behaviour. Such decentralised systems are characterised
by the lack of global control, typically consist of autonomous cooperating entities,
and often rely on self-organised emergent behaviour to achieve the requirements.
A well-founded and practically feasible approach to study overall system behaviour is
a prerequisite for successful deployment. On one hand, formal proofs of correct
behaviour and even predictions of the exact systemwide behaviour are practically
infeasible due to the complex, dynamic, and often non-deterministic nature of
self-organising emergent systems. On the other hand, simple simulations give no
convincing arguments for guaranteeing system-wide properties. We describe an
alternative approach that allows to analyse and assess trends in systemwide
behaviour, based on so-called equation-free macroscopic analysis. This
technique yields more reliable results about the system-wide behaviour, compared
to mere observation of simulation results, at an affordable computational cost.
Numerical algorithms act at the system-wide level and steer the simulations.
This allows to limit the amount of simulations considerably.
We illustrate the
approach by studying a particular system-wide property of a decentralised control
system for Automated Guided Vehicles and we outline a road map towards a general
methodology for studying decentralised autonomic computing systems.
Leif Wickland - April 13, 2006 - Web text mining
Deriving Marketing Intelligence from Online Discussion
Glance et al.
Intelliseek Applied Research Center
Abstract:
Weblogs and message boards provide online forums for discussion that record
the voice of the public. Woven into this mass of discussion is a wide range
of opinion and commentary about consumer products. This presents an opportunity
for companies to understand and respond to the consumer by analyzing this
unsolicited feedback. Given the volume, format and content of the data, the
appropriate approach to understand this data is to use large-scale web and
text data mining technologies. This paper argues that applications for mining
large volumes of textual data for marketing intelligence should provide two
key elements: a suite of powerful mining and visualization technologies and
an interactive analysis environment which allows for rapid generation and
testing of hypotheses. This paper presents such a system that gathers and
annotates online discussion relating to consumer products using a wide variety
of state-of-the-art techniques, including crawling, wrapping, search, text
classification and computational linguistics. Marketing intelligence is
derived through an interactive analysis framework uniquely configured to
leverage the connectivity and content of annotated online discussion.
Neal Richter - April 6, 2006 - Social Networking
System of Mobile Agents to Model Social Networks
Gonzalez, Marta C., Lind, Pedro G. and Herrmann, Hans J.
Univ Stuttgart, Departamento de Fisica, Universidade Federal do Ceara
Abstract:
We propose a model of mobile agents to construct social networks, based on a
system of moving particles by keeping track of the collisions during their
permanence in the system. We reproduce not only the degree distribution,
clustering coefficient and shortest path length of a large data base of
empirical friendship networks recently collected, but also some features
related with their community structure. The model is completely characterized
by the collision rate and above a critical collision rate we find the emergence
of a giant cluster in the universality class of two-dimensional percolation.
Moreover, we propose possible schemes to reproduce other networks of particular
social contacts, namely sexual contacts.
Bob Wall - March 30, 2006 - Models of Neural Processing
Computation in the Higher Visual Cortices: Map-Seeking Circuit Theory and
Application to Machine Vision
David Arathorn
Center for Computational Biology, MSU, and General Intelligence Corporation
Abstract:
Map-Seeking Circuit theory is a biologically based computational theory of vision
applicable to difficult machine vision problems such as recognition of 3D
objects in arbitrary poses amid distractors and clutter, as well as to
non-recognition problems such as terrain interpretation. It provides a general
computational mechanism for tractable discovery of correspondences in massive
transformation spaces by exploiting an ordering property of superpositions.
The latter allows a set of transformations of an input image to be formed into
a sequence of superpositions which are then culled to a composition of single
mappings by a competitive process which matches each superposition against a
superposition of inverse transformations of memory patterns. The architecture
that performs this is based on a number of neuroanatomical features of the visual
cortices, including reciprocal dataflows and inverse mappings.
Steve Durbin - March 9, 2006 - Spelling correction
Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak
University of Alberta
Abstract:
Applying the noisy channel model to search query spelling correction requires an
error model and a language model. Typically, the error model relies on a weighted
string edit distance measure. The weights can be learned from pairs of misspelled
words and their corrections. This paper investigates using the Expectation
Maximization algorithm to learn edit distance weights directly from search query
logs, without relying on a corpus of paired words.
Dave Kortenkamp - March 3, 2006 - Robotics
Note special time (Friday 3:10) and place, 101 Roberts
Worlds to Explore: Autonomy Challenges for Human Space Flight
Abstract:
This talk will look at two distinct artificial intelligence challenges in human
space flight. The first challenge can be compared to building HAL 9000 from
the movie 2001: A Space Odyssey, that is, autonomous monitoring and control of
long duration mission functions. The second challenge can be compared to building
Lt. Cmdr. Data from Star Trek: The Next Generation, that is, autonomous control
of highly dexterous robots. Both challenges will be highlighted by showing research
being done at NASA Johnson Space Center. This research includes autonomous control
of advanced life support system tests and control of a humanoid robot called Robonaut.
The talk will include many videos of NASA's robotics research.
Steve Durbin - February 23, 2005 - Language evolution
The "Lexical Contract": Modeling the emergence of word pronunciations
???
Abstract:
Even in the absence of writing, human languages appear to develop and maintain
vocabularies of roughly 10,000 morphemes, and at least 100,000 words or phrases
whose meaning is not predictable from their constituent parts. This presents
a formidable task for the individual language learner, who must learn words
from a small number of uses in context, and must manage to remember such a large
inventory of arbitrary pronunciations.
On a different scale of space and time, this also presents an extraordinary
challenge for the speech community, which must must somehow form and maintain
a consensus about these tens of thousands of complex and arbitrary conventions.
This is accomplished despite the fact that the conventional knowledge at stake
is tacit: the individuals involved typically don't know what they know about
word pronunciation, and don't try to discuss the details of this knowledge.
...
We will try to find simple programs for interacting agents that lead to successful
emergence of patterns of word pronunciation. As in the case of "ant colony
optimization", we'll look for algorithms that can accomplish this without
requiring any designation of authorities or any explicit negotiation to reach
consensus.
Zuzana Gedeon - February 16, 2005 - Text clustering
Topic-Driven Clustering for Document Datasets
Ying Zhao and George Karypis
Dept. of Computer Science and Engineering, Univ. of Minnesota
Abstract:
In this paper, we define the problem of topic-drive clustering, which organizes
a document collection according to a set of topics. We propose three topic-driven
schemes that consider the similarity between documents and topics and the
relationship among documents themselves simultaneously. We present a comprehensive
experimental evaluation of the proposed topic-driven schemes on five datasets.
Our experimental results show that the proposed topic-driven schemes are efficient
and effective with topic prototypes of different levels of specificity.
Steve Durbin - February 9, 2005 - Language Evolution
Language evolution: consensus and controversies
Morten H. Christiansen* and Simon Kirby**
*Cornell University and **Edinburgh University
Abstract:
Why is language the way it is? How did language come to be this way? And why
is our species alone in having complex language? These are old unsolved questions
that have seen a renaissance in the dramatic recent growth in research being
published on the origins and evolution of human language. This review provides
a broad overview of some of the important current work in this area. We highlight
new methodologies (such as computational modeling), emerging points of consensus
(such as the importance of pre-adaptation), and the major remaining controversies
(such as gestural origins of language). We also discuss why language evolution
is such a difficult problem, and suggest probable directions research may take
in the near future.
Neal Richter - February 2, 2005 - Text Mining with Genetic Algorithms
A Semantically Guided and Domain-Independent
Evolutionary Model for Knowledge Discovery
From Texts
John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken
Abstract:
We present a novel evolutionary model for knowledge
discovery from texts (KDTs), which deals with issues concerning
shallow text representation and processing for mining purposes in
an integrated way. Its aims is to look for novel and interesting explanatory
knowledge across text documents. The approach uses
natural language technology and genetic algorithms to produce explanatory
novel hypotheses. The proposed approach is interdisciplinary,
involving concepts not only from evolutionary algorithms
but also from many kinds of text mining methods. Accordingly,
new kinds of genetic operations suitable for text mining are proposed.
The principles behind the representation and a new proposal
for using multiobjective evaluation at the semantic level are
described. Some promising results and their assessment by human
experts are also discussed which indicate the plausibility of the
model for effective KDT.
Steve Durbin - January 26, 2005 - Bayesian Modelling
Bayesian models of human action understanding
Chris L. Baker, Joshua B. Tenenbaum and Rebecca R. Saxe
Dept. of Brain and Cognitive Sciences, MIT
Abstract:
We present a Bayesian framework for explaining how people reason about and
predict the actions of an intentional agent, based on observing its behavior.
Action-understanding is cast as a problem of inverting a probabilistic generative
model, which assumes that agents tend to act rationally in order to achieve their
goals given the constraints of their environment. Working in a simple sprite-world
domain, we show how this model can be used to infer the goal of an agent and
predict how the agent will act in novel situations or when environmental constraints
change. The model provides a qualitative account of several kinds of inferences
that preverbal infants have been shown to perform, and also fits quantitative
predictions that adult observers make in a new experiment.
Reggie Mead - January 19, 2005 - Modelling
A Review of AI Techniques for Ecological Modelling
Reginald Mead
Dept. of Computer Science, MSU and RightNow Technologies
Abstract:
Environmental models are used to both further scientific understanding of environmental
processes and to aid in environmental decision making. Artificial intelligence
techniques can enhance the approaches that have traditionally been followed for
modelling. Generally, ecological modelling approaches can be classified as either
process-based, empirical-based, or a combination of the two. Process-based models
use an understanding of underlying sub-processes to model a system. Expert systems
can be constructed from expert knowledge and can be used for prediction, planning,
interpretation, and more. Process-based models have high explanatory depth, but also
tend to have low predictive power. Empirical-based models explicitly exclude any a priori
knowledge of the system, and instead model the system based on observation data alone.
...
Rick Sojda - December 8, 2005 - Modelling
An Artificial Intelligence Modelling Approach to Simulating Animal/Habitat Interactions
H. Saarenmaa l, N.D. Stone 2,4, L.J. Folse 3, J.M. Packard 3, W.E. Grant 3, M.E. Makela 2 and R.N. Coulson 2
1 The Finnish Forest Research Institute
Departments of 2 Entomology, and 3 Wildlife and Fisheries Science, Texas A&M University
Abstract:
Ecological modellers have begun to recognize the potential of object-oriented
programming techniques in structuring models. However, little has been done to
take advantage of artificial intelligence's (AI) symbolic representations to
model the decision-making processes of animals. Here, a generic model of
animal-habitat interaction and a specific model of moose-, Alces alces L., forest
interactions in Finland are described that are event-driven and behavior-based.
Individual level simulation is accomplished through an object-oriented knowledge
representation scheme and AI techniques to implement a hierarchical decision-making
model of behavior. The habitat is likewise represented in an object-oriented
scheme, allowing the simulation of a heterogeneous environment. Other AI techniques
for modelling behavior, memory, and actions are discussed including LISP methods,
rule-based reasoning, and several search algorithms. Simulations of the moose-forest
system show the power of this approach but are not intended to advance the theory
of large-herbivore behavior and foraging. AI techniques are found to be most
beneficial in (a) studying population processes based on individual level models
of behavior, (b) modelling spatial heterogeneity, (c) building event-driven models,
(d) providing a conceptual clarity to model construction, and (e) providing a
structure equally well suited to simulating resource management.
Neal Richter - December 1, 2005 - Implicit Feedback
A Study of Factors Affecting the Utility of Implicit Relevance Feedback
Ryen W. White,* Ian Ruthven,** and Joemon M. Jose^
*Human-Computer Interaction Laboratory, University of Maryland
**Dept. of Computer and Information Sciences, University of Strathclyde
^Dept of Computing Science, University of Glasgow
Abstract:
Implicit relevance feedback (IRF) is the process by which a search system
unobtrusively gathers evidence on searcher interests from their interaction
with the system. IRF is a new method of gathering information on user interest
and, if IRF is to be used in operational IR systems, it is important to establish
when it performs well and when it performs poorly. In this paper we investigate
how the use and effectiveness of IRF is affected by three factors: search
task complexity, the search experience of the user and the stage in the search.
Our findings suggest that all three of these factors contribute to the utility of IRF.
Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims,* Laura Granka,** Bing Pan,^ Helene Hembrooke,^ and Geri Gay^
*Dept. of Computer Science, Cornell University
**Dept. of Communication, Stanford University
^Information Science, Cornell University
Abstract:
This paper examines the reliability of implicit feedback generated from
clickthrough data in WWW search. Analyzing the users' decision process
using eyetracking and comparing implicit feedback against manual relevance
judgments, we conclude that clicks are informative but biased. While this
makes the interpretation of clicks as absolute relevance judgments difficult,
we show that relative preferences derived from clicks are reasonably accurate
on average.
Bob Wall - November 17, 2005 - Clustering for Information Retrieval
Creating Concept Hierarchies in an Information Retrieval System
Bob Wall,*^ Neal Richter,*^ and Rafal A. Angryk^
*RightNow Technologies
^Dept. of Computer Science, Montana State University
Abstract:
Most information retrieval systems are comprised of a focused set of domain-specific
documents located within a single logical repository. A mechanism is developed by
which user queries against such a system are used to generate a concept hierarchy
pertinent to the domain. First, an algorithm is described which extracts terms
from documents matching user queries, and then reduces this set of terms to a
manageable length. The resulting terms are used to generate a feature vector for
each query, and the queries are clustered using a Hierarchical Agglomerative Clustering
(HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is not
particularly amenable to use by humans and which is slow to search due to its depth,
so a subsequent processing step applies min-max partitioning to form a shallower,
bushier tree that is a more natural representation of the concept hierarchy inherent in the system.
Steve Durbin - November 10, 2005 - Web computing
Learning from THE WEB (Word doc)
Learning from THE WEB (ACM Queue online)
Adam Bosworth
Google
Abstract:
The Web has much to teach us about managing and modeling distributed data.
It's time we began listening.
Steve Durbin - November 3, 2005 - Web search
Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, and Junghoo Cho
Dept. of Computer Science, University of California, Los Angeles
Abstract:
There have been recent interests in studying the "goal" behind
a user's Web query, so that this goal can be used to improve
the quality of a search engine's results. Previous studies
have mainly focused on using manual query-log investigation
to identify Web query goals. In this paper we study
whether and how we can automate this goal-identification
process. We first present our results from a human subject
study that strongly indicate the feasibility of automatic
query-goal identification. We then propose two types of features
for the goal-identification task: user-click behavior and
anchor-link distribution. Our experimental evaluation shows
that by combining these features we can correctly identify
the goals for 90% of the queries studied.
Steve Durbin - October 27, 2005 - Visualization
Arc Diagrams: Visualizing Structure in Strings
Martin Wattenberg
IBM Research
Abstract:
This paper introduces a new visualization method, the arc diagram , which is
capable of representing complex patterns of repetition in string data. Arc diagrams
improve over previous methods such as dotplots because they scale efficiently
for strings that contain many instances of the same subsequence. This paper
describes design and implementation issues related to arc diagrams and shows
how they may be applied to visualize such diverse data as music, text, and
compiled code.
Leif Wickland - October 20, 2005 - Google File System
Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
Abstract:
We have designed and implemented the Google File System, a scalable distributed
file system for large distributed data-intensive applications. It provides fault
tolerance while running on inexpensive commodity hardware, and it delivers high
aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design
has been driven by observations of our application workloads and technological
environment, both current and anticipated, that reflect a marked departure from
some earlier file system assumptions. This has led us to reexamine traditional
choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed
within Google as the storage platform for the generation and processing of data
used by our service as well as research and development efforts that require large
data sets. The largest cluster to date provides hundreds of terabytes of storage
across thousands of disks on over a thousand machines, and it is concurrently
accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support
distributed applications, discuss many aspects of our design, and report measurements
from both micro-benchmarks and real world use.
Zuzana Gedeon - October 13, 2005 - Machine learning and speech recognition for evil purposes
Keyboard Acoustic Emanations Revisited
Li Zhuang, Feng Zhou, and J. D. Tygar
University of California, Berkeley
Abstract:
We examine the problem of keyboard acoustic emanations. We present a novel
attack taking as input a 10-minute sound recording of a user typing English
text using a keyboard, and then recovering up to 96% of typed characters.
There is no need for a labeled training recording. Moreover the recognizer
bootstrapped this way can even recognize random text such as passwords:
In our experiments, 90% of 5-character random passwords using only letters can
be generated in fewer than 20 attempts by an adversary; 80% of 10-character
passwords can be generated in fewer than 75 attempts. Our attack uses the
statistical constraints of the underlying content, English language, to
reconstruct text from sound recordings without any labeled training data.
The attack uses a combination of standard machine learning and speech
recognition techniques, including cepstrum features, Hidden Markov Models,
linear classification, and feedback-based incremental learning.
Anthony Arnone - September 29, 2005 - Web Crawling
User-Centric Web Crawling
Sandeep Pandey and Christopher Olston
Carnegie Mellon University
Abstract:
Search engines are the primary gateways of information access on the Web today.
Behind the scenes, search engines crawl the Web to populate a local indexed
repository of Web pages, used to answer user search queries. In an aggregate sense,
the Web is very dynamic, causing any repository of Web pages to become out of
date over time, which in turn causes query answer quality to degrade. Given the
considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not
feasible for a search engine to maintain its repository exactly synchronized with the Web.
In this paper we study how to schedule Web pages for selective (re)downloading into
a search engine repository. The scheduling objective is to maximize the quality of
the user experience for those who query the search engine. We begin with a
quantitative characterization of the way in which the discrepancy between the
content of the repository and the current content of the live Web impacts the
quality of the user experience. This characterization leads to a usercentric
metric of the quality of a search engine's local repository. We use this metric
to derive a policy for scheduling Web page (re)downloading that is driven by
search engine usage and free of exterior tuning parameters. We then focus on the
important subproblem of scheduling refreshing of Web pages already present in the
repository, and show how to compute the priorities efficiently. We provide extensive
empirical comparisons of our user-centric method against prior Web page refresh
strategies, using real Web data. Our results demonstrate that our method requires
far fewer resources to maintain same search engine quality level for users, leaving
substantially more resources available for incorporating new Web pages into the search repository.
Neal Richter - September 22, 2005 - Language learning
Unsupervised Learning of Natural Languages
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman,
Tel Aviv University, and Cornell University
Abstract:
We address the problem, fundamental to linguistics, bioinformatics, and certain
other disciplines, of using corpora of raw symbolic sequential data to infer
underlying rules that govern their production. Given a corpus of strings (such
as text, transcribed speech, chromosome or protein sequence data, sheet music,
etc.), our unsupervised algorithm recursively distills from it hierarchically
structured patterns. The ADIOS (automatic distillation of structure) algorithm
relies on a statistical method for pattern extraction and on structured
generalization, two processes that have been implicated in language
acquisition. It has been evaluated on artificial context-free grammars with
thousands of rules, on natural languages as diverse as English and Chinese, and
on protein data correlating sequence with function. This unsupervised algorithm
is capable of learning complex syntax, generating grammatical novel sentences,
and proving useful in other fields that call for structure discovery from raw
data, such as bioinformatics.
Steve Durbin - September 15, 2005 - Web text mining
The Predictive Power of Online Chatter
Daniel Gruhl, R. Guha*, Ravi Kumar, Jasmine Novak and Andrew Tomkins
IBM Almaden Research, *Google
Abstract:
An increasing fraction of the global discourse is migrating
online in the form of blogs, bulletin boards, web pages, wikis,
editorials, and a dizzying array of new collaborative technologies.
The migration has now proceeded to the point
that topics reflecting certain individual products are sufficiently
popular to allow targeted online tracking of the ebb
and flow of chatter around these topics. Based on an analysis
of around half a million sales rank values for 2,340 books
over a period of four months, and correlating postings in
blogs, media, and web pages, we are able to draw several
interesting conclusions.
First, carefully hand-crafted queries produce matching postings
whose volume predicts sales ranks. Second, these queries
can be automatically generated in many cases. And third,
even though sales rank motion might be difficult to predict
in general, algorithmic predictors can use online postings to
successfully predict spikes in sales rank.
Steve Durbin - September 8, 2005 - Robot swarms
Dynamic Task Assignment in Robot Swarms
James McLurkin and Daniel Yamins*
Computer Science and Artificial Intelligence Lab, MIT, *Dept. of Mathematics, Harvard University
Abstract:
A large group of robots will often be partitioned into subgroups, each subgroup
performing a different task. This paper presents four distributed algorithms
for assigning swarms of homogenous robots to subgroups to meet a specified
global task distribution. Algorithm Random-Choice selects tasks randomly,
but runs in constant time. Algorithm Extreme-Comm compiles a complete inventory
of all the robots on every robot, runs quickly, but uses a great deal of
communication. The CardDealer's algorithm assigns tasks to individual robots
sequentially, using minimal communications but a great deal of time. The
Tree-Recolor algorithm is a compromise between Extreme-Comm and Card-Dealer's,
balancing communications use and running time. The three deterministic
algorithms drive the system towards the desired assignment of subtasks with
high accuracy. We implement the algorithms on a group of 25 iRobot SwarmBots,
and collect and analyze performance data.
Photos and video of robot swarm behaviors
Neal Richter - August 18, 2005 - Michael Collins's Talk on NLP at the University of Washington.
Abstract:
Natural Language Processing
Natural language processing offers a rich problem domain for machine learning
approaches. Many NLP problems require the induction of a mapping that involves
complex, discrete structures such as strings, labeled sequences, or trees. In
this distinguished lecture, Michael Collins describes how 'large margin'
methods in machine learning can be generalized to 'structured' problems found
in NLP.
Speaker(s): Michael Collins, Massachusetts Institute of Technology
Subject: Natural Language processing
Series: CSE Colloquia - 2005
Produced by: University of Washington, December 7, 2005
Runtime: 00:58:16
Streaming
Video at UWTV.org
Steve Durbin - August 4, 2005 - Personal information management
The Haystack Project is investigating approaches designed to let people manage their information in ways that make the most sense to them. Learn more at Haystack Project
The Re:Search Engine: Helping People Return to Information on the Web
Jaime Teevan
MIT
Abstract:
Re-finding information is commonly cited as a problem on the Web. One reason
re-finding on the Web is difficult is that while people rely on a considerable
amount of context to return to information (e.g., the original path taken to
it), the Web makes no guarantee that the context will remain static. The Re:Search
Engine is designed to help people return to information in the dynamic environment
of the Web by maintaining consistency in the search results it returns across
time. For example, if Connie, while looking to purchase a Global Positioning System,
found several systems she liked via a search for GPS , she would expect to be able
to use the same query to locate the exact same systems again. However, simply
returning the original result list when she re-issues the query might omit newly
available GPS systems that she would like to see. The ideal result list would contain
both the systems Connie remembers having seen and high quality new systems. Because
people tend to remember little of what is presented in a result list, when a person
repeats a query, the Re:Search Engine can preserve what is remembered about the
original result set while still presenting new information.
Neal Richter - May 19th and 26th, 2005 - Steven Wolfram's Talk at University of Michigan.
On October 8, 2003 Stephen Wolfram gave a talk sponsored by
University of Michigan Center for Study of Complex Systems titled and
about his (then new) book A New Kind of Science.
Wolfram.wmv -
Windows Media Player Movie File
Abstract:
The foundational observation of "A New Kind of Science" is that very simple
systems are very frequently capable of great complexity. While isolated
examples of this have been known for millenia, no fundamental significance was
attached to them.
The surprising fact is that if one just starts enumerating simple computational systems, one very quickly encounters great complexity. This is robust fact that does not depend on the details of the system in question, but rather seems to be a generic law of our universe.
The fact that complexity is in fact rather easy to generate suggests an entirely new approach to science. The traditional path of science is to seek out examples of behavior in the natural world and try to break them down into understandable pieces. But an alternative path is to just systematically enumerate simple abstract computational systems, starting with the very simplest, and try to understand what they do.
Links about the Book (and controversy surrounding it):
- http://www.wolframscience.com/ - The official webiste of the book.
- Wikipedia entry on A New Kind of Science
- http://gnosticalturpitude.org/archives/000026.html
Steve Durbin - April 7, 2005 - Recommender systems
TiVo: Making Show Recommendations Using a Distributed Collaborative Filtering Architecture
Kamal Ali,* Wijnand van Stam**
*TiVo, Yahoo!; **TiVo
Abstract:
We describe the TiVo television show collaborative recommendation system which has been
fielded in over one million TiVo clients for four years. Over this install base,
TiVo currently has approximately 100 million ratings by users over approximately
30,000 distinct TV shows and movies. TiVo uses an item-item (show to show) form of
collaborative filtering which obviates the need to keep any persistent memory of each
user viewing preferences at the TiVo server. Taking advantage of TiVo client-server
architecture has produced a novel collaborative filtering system in which the server
does a minimum of work and most work is delegated to the numerous clients. Nevertheless,
the server-side processing is also highly scalable and parallelizable. Although we have
not performed formal empirical evaluations of its accuracy, internal studies have shown
its recommendations to be useful even for multiple user households. TiVo architecture
also allows for throttling of the server so if more server-side resources become
available, more correlations can be computed on the server allowing TiVo to make
recommendations for niche audiences.
Steve Durbin - March 31, 2005 - Subjectivity analysis
Learning Subjective Nouns using Extraction Pattern Bootstrapping
Ellen Riloff,* Janyce Wiebe,** Theresa Wilson**
*University of Utah, **University of Pittsburgh
Abstract:
We explore the idea of creating a subjectivity classifier that uses lists of
subjective nouns learned by bootstrapping algorithms. The goal of our
research is to develop a system that can distinguish subjective sentences
from objective sentences. First, we use two bootstrapping algorithms
that exploit extraction patterns to learn sets of subjective nouns.
Then we train a Naive Bayes classifier using the subjective nouns,
discourse features, and subjectivity clues identified in prior research.
The boot-strapping algorithms learned over 1000 subjective nouns, and the
subjectivity classifier performed well, achieving 77% recall with 81% precision.
Steve Durbin - March 24, 2005 - Social Information Foraging
Footprints: History-Rich Tools for Information Foraging
Alan Wexelblat and Pattie Maes
MIT Media Lab
Abstract:
Inspired by Hill and Hollan's original work, we have been developing a theory
of interaction history and building tools to apply this theory to navigation
in a complex information space. We have built a series of tools--map, paths,
annotations and signposts--based on a physical-world navigation metaphor.
These tools have been in use for over a year. Our user study involved a
controlled browse task and showed that users were able to get the same amount
of work done with significantly less effort.
Ken Dyke - March 10, 2005 - Computer Security & AI
Anomaly Detection of Web-based Attacks
By Christopher Kruegel & Giovanni Vigna
Reliable Software Group
University of California, Santa Barbara
Abstract:
Web-based vulnerabilities represent a substantial portion of the
security exposures of computer networks. In order to detect known web-based
attacks, misuse detection systems are equipped with a large number of
signatures. Unfortunately, it is difficult to keep up with the daily disclosure
of web-related vulnerabilities, and, in addition, vulnerabilities may be
introduced by installation-specific web-based applications. Therefore, misuse
detection systems should be complemented with anomaly detection systems. This
paper presents an intrusion detection system that uses a number of different
anomaly detection techniques to detect attacks against web servers and
web-based applications. The system correlates the serverside programs
referenced by client queries with the parameters contained in these queries.
The application-specific characteristics of the parameters allow the system to
perform focused analysis and produce a reduced number of false positives. The
system derives automatically the parameter profiles associated with web
applications (e.g., length and structure of parameters) from the analyzed data.
Therefore, it can be deployed in very different application environments
without having to perform time-consuming tuning and configuration.
Neal Richter - March 3, 2005 - Search Engine Security
The Insecure Indexing Vulnerability
Attacks Against Local Search Engines
By Amit Klein
Security Researcher
Abstract:
This paper describes several techniques (many of them new) for exposing file
contents using the site search functionality. It is assumed that a site
contains documents which are not visible/accessible to external users. Such
documents are typically future PR items, or future security advisories,
uploaded to the website beforehand. However, the site is also searchable via an
internal search facility, which does have access to those documents, and as
such, they are indexed by it not via web crawling, but rather, via direct
access to the files (and therein lies the security breach).
Several attack techniques are described, some very simple and quick, while other require an enormous amount of traffic; not all attacks are relevant to a particular site, as they depend on the richness of syntax supported by the site's search engine.
The paper concludes with methods to detect insecure indexing vulnerability and suggested solutions.
Note that this attack is fundamentally different than exploitation of external (remote) search engines.
Desktop
search engines threaten SSL VPN security
By Author Tim Greene
Abstract:
Discusses how new Desktop Search tools from Google and others are a
security threat.
Matrix of Desktop Search Engine Indexing Capabilities
Steve Durbin - February 24, 2005 - Knowledge-based search
Exploiting a Thesaurus-Based Semantic Net for Knowledge-Based Search
Peter Clark, John Thompson, Heather Holmback, Lisbeth Duncan
Mathematics and Computing Technology, The Boeing Company
Abstract:
With the growth of on-line information, the need for better resource location
services is growing rapidly. A popular goal is to conduct search in terms of
concepts, rather than words; however, this approach is frequently thwarted by
the high up-front cost of building an adequate ontology (conceptual vocabulary)
in the first place. In this paper we describe a knowledge-based Expert Locator
application (for identifying human experts relevant to a particular problem or
interest), which addresses this issue by using a large, pre-built, technical
thesaurus as an initial ontology, combined with simple AI techniques of search,
subsumption computation, and language processing. The application has been
deployed and in use in our local organization since June, 1999, and a second,
larger application was deployed in March 2000. We present the Expert Locator
and the AI techniques it uses, and then we evaluate and discuss the
application. The significance of this work is that it demonstrates how years of
work by library science in thesaurus-building can be leveraged using AI
methods, to construct a practical resource location service in a short period
of time.
Neal Richter and Steve Durbin - February 17, 2005 - Knowledge discovery from the Web
Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitanyi
National Research Institute for Mathematics and Computer Science, Netherlands
Abstract:
We have found a method to automatically extract the meaning of words and
phrases from the world-wide-web using Google page counts. The approach is
novel in its unrestricted problem domain, simplicity of implementation,
and manifestly ontological underpinnings. The world-wide-web is the largest
database on earth, and the latent semantic context information entered by
millions of independent users averages out to provide automatic meaning of
useful quality. We demonstrate positive correlations, evidencing an underlying
semantic structure, in both numerical symbol notations and number-name words
in a variety of natural languages and contexts. Next, we demonstrate the
ability to distinguish between colors and numbers, and to distinguish between
17th century Dutch painters; the ability to understand electrical terms,
religious terms, emergency incidents, and we conduct a massive experiment in
understanding WordNet categories; the ability to do a simple automatic
English-Spanish translation.
Doug Warner and Steve Durbin - February 10, 2005 - Search systems with swarm optimization
User Behavior in an Adaptive Search System
Doug Warner, Stephen D. Durbin, J. Neal Richter, Z. Gedeon
RightNow Technologies
Abstract:
We analyze usage of a search system employing both standard search techniques
and Ant Colony Optimization (ACO) approaches. Searching behavior on the
system is similar to that on Internet search engines, though low searching
rates suggest facilitation due to the ACO methods. Simulations based on site
and user models give insight into the adaptive behavior and roughly match
observations.
Zuzana Gedeon - February 3, 2005 - Learning ordering
Learning to Order Things
William W. Cohen, Robert E. Schapire, and Yoram Singer
AT&T Labs
Abstract:
There are many applications in which it is desirable to order rather than
classify instances. Here we consider the problem of learning how to order,
given feedback in the form of preference judgments, i.e., statements to
the effect that one instance should be ranked ahead of another. We outline
a two-stage approach in which one first learns by conventional means a
preference function, of the form PREF(u,v), which indicates whether it is
advisable to rank u before v. New instances are then ordered so as to maximize
agreements with the learned preference function. We show that the problem of
finding the ordering that agrees best with a preference function is
NP-complete, even under very restrictive assumptions. Nevertheless, we
describe a simple greedy algorithm that is guaranteed to find a good
approximation. We then discuss an on-line learning algorithm, based on the
"Hedge" algorithm, for finding a good linear combination of ranking "experts."
We use the ordering algorithm combined with the on-line learning algorithm to
find a combination of "search experts," each of which is a domain-specific
query expansion strategy for a WWW search engine, and present experimental
results that demonstrate the merits of our approach.
Jeff Elser and John Paxton - January 27, 2005 - Optimizing a search engine
Abstract:
ht://dig performs three major tasks that should be performed in the following order:
-
Digging
Before you can search, a database of all the documents that need to be searched has to be created. -
Merging
Merging consists of two processes:
A. Converting the databases of all documents to specialized databases for simple, fast searching.
B. Merging changed information into previously existing databases.
Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging. -
Searching
Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches will be invoked by a CGI (Common Gateway Interface; a program running on the webserver) which gets input from the user through an HTML form.
Discussion Topics:
(1) An overview of ht://dig
(2) an exploratory discussion of how machine learning might be used to configure ht://dig
Reading: Browse the ht://dig website, http://www.htdig.org/
Steve Durbin - January 20, 2005 - Bayesian bots
Teaching Bayesian Behaviours to Video Game Characters
Ronan Le Hy, Anthony Arrigonia, Pierre Bessiere, Olivier Lebeltel
INRIA, France
Abstract:
This article explores an application of Bayesian Programming to behaviours for
synthetic video games characters. We address the problem of real-time reactive
selection of elementary behaviours for an agent playing a first person shooter
game. We show how Bayesian Programming can lead to condensed and easier
formalisation of finite state machine-like behaviour selection, and lend itself
to learning by imitation, in a fully transparent way for the player.
Steve Durbin - December 9, 2004 - Perceptions of randomness
Randomness and Coincidences: Reconciling Intuition and Probability Theory
Thomas L. Griffiths & Joshua B. Tenenbaum
Department of Psychology, Stanford University
Abstract:
We argue that the apparent inconsistency between people's intuitions about chance
and the normative predictions of probability theory, as expressed in judgments
about randomness and coincidences, can be resolved by focussing on the evidence
observations provide about the processes that generated them, rather than their
likelihood. This argument is supported by probabilistic modeling of sequence
and number production, together with two experiments that examine people's
judgments about coincidences.
Zuzana Gedeon - December 2, 2004 - Clustering impossibility theorem
An Impossibility Theorem for Clustering
Jon Kleinberg
Department of Computer Science, Cornell University
Abstract:
Although the study of clustering is centered around an intuitively compelling goal,
it has been very difficult to develop a unified framework for reasoning about it
at a technical level, and profoundly diverse approaches to clustering abound in the
research community. Here we suggest a formal perspective on the difficulty in
finding such a unification, in the form of an impossibility theorem: for a set of
three simple properties, we show that there is no clustering function satisfying all
three. Relaxations of these properties expose some of the interesting (and unavoidable)
trade-offs at work in well-studied clustering techniques such as single-linkage,
sum-of-pairs, k-means, and k-median.
Mike Emery and John Paxton - November 18, 2004 - Structure in multi-agent systems
Cougarr Agent Communities
Ronald D. Snyder and Douglas C. MacKenzie
Mobile Intelligence Corporation
Abstract:
The ability to organize agents into abstract groups provides a powerful tool for
agent organization and communication in large multi-agent systems. In this paper
we outline the fundamental characteristics required of a robust and scalable agent
grouping mechanism. These characteristics are then discussed in the context of the
Cougarr community infrastructure. This implementation of agent communities illustrates
the use of distributed state management, distributed event propagation and abstract
messaging in a high-performance agent architecture designed for robustness and
scalability.
Steve Durbin - November 4, 2004 - Question-answering with soft text patterns
Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan, and Tat-Seng Chua
Department of Computer Science, School of Computing, National University of Singapore
Abstract:
Breaking news often contains timely definitions and descriptions of current terms,
organizations and personalities. We utilize such web sources to construct definitions
for such terms. Previous work has identified definitions using hand-crafted rules or
supervised learning that constructs rigid, hard text patterns. In contrast, we
demonstrate a new approach that uses flexible, soft matching patterns to characterize
definition sentences. Our soft patterns are able to effectively accommodate the
diversity of definition sentence structure exhibited in news. We use pseudo-relevance
feedback to automatically label sentences for use in soft pattern generation.
The application of our unsupervised method significantly improves baseline systems
on both the standardized TREC corpus as well as crawled online news articles by 27%
and 30%, respectively, in terms of F measure. When applied to a state-of-art definition
generation system recently fielded in the TREC 2003 definitional question answering task,
it improves the performance by 14%.
Binhai Zhu - October 28, 2004 - Datamining in sparse databases
Efficient discovery of error-tolerant frequent itemsets in high dimensions
Cheng Yang
Dept. of Computer Science, Stanford University
Usama Fayyad and Paul S. Bradley
digiMine, Inc.
Abstract:
We present a generalization of frequent itemsets allowing for the notion of
errors in the itemset definition. We motivate the problem and present an
efficient algorithm that identifies error-tolerant frequent clusters of items
in transactional data (customer purchase data, web browsing data, text, etc.).
The algorithm exploits sparseness of the underlying data to find large groups
of items that are correlated over database records (rows). The notion of transaction
coverage allows us to extend the algorithm and view it as a fast clustering algorithm
for discovering segments of similar transactions in binary sparse data. We evaluate
the new algorithm on three real-world applications: clustering high-dimensional data,
query selectivity estimation and collaborative filtering. Results show that the
algorithm consistently uncovers structure in large sparse databases that other
traditional clustering algorithms fail to find.
Rafal Angryk, Steve Durbin - October 21, 2004 - Uncertain knowledge in databases and multi-agent communication
Current approaches to handling imperfect information in data and knowledge bases
Simon Parsons
Advanced Computation Laboratory, Imperial Cancer Research Fund
Abstract:
This paper surveys methods for representing and reasoning with imperfect information.
It opens with an attempt to classify the different types of imperfection that may
pervade data, and a discussion of the sources of such imperfections. The classification
is then used as a framework for considering work that explicitly concerns the
representation of imperfect information, and related work on how imperfect information
may be used as a basis for reasoning. The work that is surveyed is drawn from both the
field of databases and the field of artificial intelligence. Both of these areas have
long been concerned with the problems caused by imperfect information, and this paper
stresses the relationships between the approaches developed in each.
The Byzantine generals problem
Leslie Lamport, Robert Shostak, and Marshall Pease
SRI International
Abstract:
Reliable computer systems must handle malfunctioning components that give conflicting
information to different parts of the system. This situation can be expressed
abstractly in terms of a group of generals of the Byzantine army camped with their
troops around an enemy city. Communicating only by messenger, the generals must agree
upon a common battle plan. However, one or more of them may be traitors who will try
to confuse the others. The problem is to find an algorithm to ensure that the loyal
generals will reach agreement. It is shown that, using only oral messages, this
problem is solvable if and only if more that two-thirds of the generals are loyal; so
a single traitor can confound two loyal generals. With unforgeable written messages, the
problem is solvable for any number of generals and possible traitors. Applications of
the solutions to reliable computer systems are then discussed.
Doug Warner - October 7, 2004 - Processing uncertain knowledge
How Humans Process Uncertain Knowledge: An Introduction for Knowledge Engineers
Robert F. Hink and David L. Woods
Abstract:
The question of how humans process uncertain information is important to the
development of knowledge-based systems in terms of both knowledge acquisition
and knowledge representation. This article reviews three bodies of psychological
research that address this question: human perception, human probabilistic and
statistical judgement, and human choice behavior. The general conclusion is that
human behavior under uncertainty is often suboptimal and sometimes even fallacious.
Suggestions for knowledge engineers in detecting and obviating such errors are
discussed. The requirements for a system designed to reduce the effects of human
factors in the processing of uncertain knowledge are introduced.
Steve Durbin - September 30, 2004 - Cognitive biases and bounded rationality
Chapter from the very readable CIA publication
Psychology of Intelligence Analysis
Biases in Estimating Probabilities
Daniel Kahneman Nobel Prize lecture
Maps of Bounded Rationality: A Perspective on Intuitive Judgement and Choice
Multiple - September 23, 2004 - Evaluation methods
Overview:
This week, multiple presenters will give short presentations on evaluation
methods of their choice. These may be methods of evaluating competing algorithms,
or methods of evaluating data so as to select the most appropriate algorithm
for a particular task. One method, the ROC curve, is described at the sites below:
The Magnificent ROC
Interpreting Diagnostic Tests
Steve Durbin - September 16, 2004 - Machine learning for computer security
Learning to Detect Malicious Executables in the Wild
Jeremy Z. Kolter and Marcus A. Maloof
Department of Computer Science, Georgetown University
Abstract:
In this paper, we describe the development of a fielded application for detecting
malicious executables in the wild. We gathered 1971 benign and 1651 malicious
executables and encoded each as a training example using n-grams of byte codes
as features. Such processing resulted in more than 255 million distinct n-grams.
After selecting the most relevant n-grams for prediction, we evaluated a variety
of inductive methods, including naive Bayes, decision trees, support vector
machines, and boosting. Ultimately, boosted decision trees outperformed other
methods with an area under the ROC curve of 0.996. Results also suggest that
our methodology will scale to larger collections of executables. To the best
of our knowledge, ours is the only fielded application for this task developed
using techniques from machine learning and data mining.
Lou Glassy - May 6, 2004 - Missing Text Reconstruction
Public PhD Dissertation Defense, Computer Science Dept, Montana State
University.
2:00pm in EPS 108
Abstract:
TBA
Neal Richter - April 22, 2004 - Expectation Maximization Algorithm
The Expectation Maximization Algorithm
Frank Dellaert
College of Computing, Georgia Institute of Technology
Abstract:
An intuitive tutorial of the "Expectation Maximization Algorithm".
The EM algorithm is general method for finding a maximum-likelihood estimate of the parameters of a probability distribution from a data set with incomplete/missing values.
The general EM is a description of a meta-algorithm, which is used to design a particular algorithm. It is used in pattern recognition and information retrieval communities for both unsupervised and semi-supervised learning algorithms as well as other research areas.
Steve Durbin - April 15, 2004 - Natural language interface
A Reliable Natural Language Interface to Household Appliances
Alexander Yates, Oren Etzioni, Daniel Weld
University of Washington, Computer Science
Abstract:
As household appliances grow in complexity and sophistication, they become
harder and harder to use, particularly because of their tiny display screens
and limited keyboards. This paper describes a strategy for building natural
language interfaces to appliances that circumvents these problems. Our
approach leverages decades of research on planning and natural language
interfaces to databases by reducing the appliance problem to the database
problem; the reduction provably maintains desirable properties of the database
interface. The paper goes on to describe the implementation and evaluation of
the EXACT interface to appliances, which is based on this reduction. EXACT
maps each English user request to an SQL query, which is transformed to create
a PDDL goal, and uses the Blackbox planner [13] to map the planning problem to
a sequence of appliance commands that satisfy the original request. Both
theoretical arguments and experimental evaluation show that EXACT is highly reliable.
Steve Durbin - April 8, 2004 - Affect sensing
What Would They Think? A Computational Model of Attitudes
Hugo Liu and Patti Maes
MIT Media Laboratory
Abstract:
A key to improving at any task is frequent feedback from people whose opinions
we care about: our family, friends, mentors, and the experts. However, such
input is not usually available from the right people at the time it is needed
most, and attaining a deep understanding of someone else's perspective
requires immense effort. This paper introduces a technological solution.
We present a novel method for automatically modeling a person's attitudes and
opinions, and a proactive interface called "What Would They Think?" which
offers the just-in-time perspectives of people whose opinions we care about,
based on whatever the user happens to be reading or writing. In the
application, each person is represented by a "digital persona," generated from
an automated analysis of personal texts ( e.g. weblogs and papers written by
the person being modeled) using natural language processing and commonsense-based
textual-affect sensing. In user studies, participants using our application
were able to grasp the personalities and opinions of a panel of strangers more
quickly and deeply than with either of two baseline methods. We discuss the
theoretical and pragmatic implications of this research to intelligent user interfaces.
Neal Richter and Doug Warner - April 1, 2004 - Automatic Classification of Photographic Images
Automatic Classification of Photographic Images
Loof Lirpa and Sourd-muet D'âne
University of DummköpfeeFrühling
and Université Polytechniques des Iimbécil
Abstract:
This paper demonstrates an automatic system for classifying photographic images according
to genre. The system marks pixels using combined
color, texture and temperature properties. These marked regions are then fed to
a specialized classifier, which attempts to group similar objects using geometric
constraints. If the classifier finds a sufficiently complex structure,the system
decides an object type is present and labels the genre appropriately. The
system shows excellent performance on a test set of 565 uncontrolled images
of art, mostly obtained from the internet, and 4289 assorted control images, drawn
from a wide variety of sources.
Bob Wall, Jeff Elser, ... - March 25, 2004 - Tutorial on Support Vector Machines
A Tutorial on nu-Support Vector Machines
Pai-Hsuen Chen*, Chih-Jen Lin*, and Bernhard Schölkopf**
*National Taiwan University and **Max Planck Institute for Biological Cybernetics
Abstract:
We briefly describe the main ideas of statistical learning theory, support
vector machines (SVMs), and kernel feature spaces. We place particular emphasis
on a description of the so-called nu-SVM, including details of the algorithm
and its implementation, theoretical results, and practical applications.
Ken Dyke - March 18, 2004 - Stupid vs. Intelligent Networks
End-to-End Arguments in System Design
J.H. Saltzer, D.P. Reed and D.D. Clark
M.I.T. Laboratory for Computer Science
Abstract:
This paper presents a design principle that helps guide placement of functions
among the modules of a distributed computer system. The principle, called the
end-to-end argument, suggests that functions placed at low levels of a system
may be redundant or of little value when compared with the cost of providing
them at that low level. Examples discussed in the paper include bit error recovery,
security using encryption, duplicate message suppression, recovery from system
crashes, and delivery acknowledgement. Low level mechanisms to support these
functions are justified only as performance enhancements.
Additional article online
Rise of the Stupid Network
Steve Durbin - March 11, 2004 - Faceted Metadata for Search
Flexible Search and Navigation using Faceted Metadata
Jennifer English, Marti Hearst, Rashmi Sinha, Kirsten Swearingen, Ka-Ping Yee
School of Information Management & Systems, University of California, Berkeley
Abstract:
We have developed an innovative search interface that allows non-expert users to
move through large information spaces in a flexible manner without feeling lost.
The design goal was to offer users a "browsing the shelves" experience seamlessly
integrated with focused search. Key to achieving our goal is the explicit exposure
of hierarchical faceted metadata in a manner that is intuitive and inviting to users.
After several iterations of design and testing, the usability results are strikingly
positive. We believe our approach marks a major step forward in search user interfaces
and can serve as a model for web-based collections of up to 100,000 items.
Demo Search site
Fine Arts Search
Software demo and download site
FacetMap
Jeff Elser - March 4, 2004 - Classification with Support Vector Machines
Severe Storm Cell Classification Using Support Vector Machines and Radial Basis Function Approaches
L. Ramirez ^, W. Pedrycz ^, N. Pizzi ^^
^Dept. of Electrical and Computer Engineering, University of Alberta
^^Institute for Biodiagnostics, National Research Council, Canada
Abstract:
Meteorological volumetric data are used to detect thunderstorms that are the cause
of most of the summer severe weathers. There are systems that may convert the
volumetric data into a set of derived products. Based on these derived features,
this work compares three classifiers to determine which approach will best classify
a storm cell data set coming from Environment Canada. The criterion for comparison
is the accuracy in the classification over a testing set. The three approaches
compared are the support vector machine (SVM) classifier, with radial basis function
(RBF) kernel; the classic RBF classifier, with the centres found using the orthogonal
least squares approach; and the hybrid RBF, with the centres corresponding to the
support vectors found using the SVM approach. The results show that the SVM approach
is the best of these approaches, in terms of accuracy, for the storm cell classification.
Background paper on support vector machines
Support-Vector Networks
Corinna Cortes and Vladimir Vapnik
AT&T Research Labs
Chad Nybo - February 26, 2004 - Zero Intelligence Agents in Economics
The Predictive Power of Zero Intelligence in Financial Markets
J. Doyne Farmer [1], Paolo Patelli [1,2], Ilija I. Zovko [1,3]
[1] Santa Fe Institute; [2] Sant'Anna School of Advanced Studies, Pisa; [3] CENDEF, University of Amsterdam
Abstract:
Using data from the London Stock Exchange, we test a model that treats the
statistical mechanics of price formation and the accumulation of stored
supply and demand under the simple assumption that people place orders to
trade at random. The model makes excellent predictions for transaction costs,
price diffusion rates, and a quantity closely related to supply and demand.
Thus, it appears that the price formation mechanism strongly constrains the
market, playing a more important role than the strategic behavior of agents.
The remarkable success of this approach suggests a new and unorthodox approach
to economics.
Brief article in Nature Science Update
Stock market traders show signs of zero intelligence
Extra credit paper (4 pages)
High-Performance Bidding Agents for the Continuous Double Auction
Gerald Tesauro and Rajarshi Das
IBM T.J. Watson Research Center
Steve Durbin - February 19, 2004 - Using Expectation to Select Features
Using Expectation to Guide Processing: A Study of Three Real-World Applications
Shumeet Baluja
Justsystem Pittsburgh Research Center and School of Computer Science, Carnegie Mellon
Abstract:
In many real world tasks, only a small fraction of the available inputs are important
at any particular time. This paper presents a method for ascertaining the relevance of
inputs by exploiting temporal coherence and predictability. The method proposed in
this paper dynamically allocates relevance to inputs by using expectations of their
future values. As a model of the task is learned, the model is simultaneously extended
to create task-specific predictions of the future values of inputs. Inputs which are
either not relevant, and therefore not accounted for in the model, or those which
contain noise, will not be predicted accurately. These inputs can be deemphasized,
and, in turn, a new, improved, model of the task created. The techniques presented in
this paper have yielded significant improvements for the vision-based autonomous
control of a land vehicle, vision-based hand tracking in cluttered scenes, and the
detection of faults in the etching of semiconductor wafers.
Doug Warner - February 12, 2004 - Link-Based Recommendation
On the Recommending of Citations for Research Papers
Sean M. McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K. Lam, Al Mamunur Rashid, Joseph A. Konstan, John Riedl
GroupLens Research Project, Dept. of Computer Science and Engineering, University of Minnesota
Abstract:
Collaborative filtering has proven to be valuable for recommending items in
many different domains. In this paper, we explore the use of collaborative
filtering to recommend research papers, using the citation web between papers
to create the ratings matrix. Specifically, we tested the ability of
collaborative filtering to recommend citations that would be suitable
additional references for a target research paper. We investigated six
algorithms for selecting citations, evaluating them through offline
experiments against a database of over 186,000 research papers contained in
ResearchIndex. We also performed an online experiment with over 120 users to
gauge user opinion of the effectiveness of the algorithms and of the utility
of such recommendations for common research tasks. We found large differences
in the accuracy of the algorithms in the offline experiment, especially when
balanced for coverage. In the online experiment, users felt they received
quality recommendations, and were enthusiastic about the idea of receiving
recommendations in this domain.
Steve Durbin - February 5, 2004 - Dynamical Systems and Cognition
Toward the Evolution of Dynamical Neural Networks for Minimally Cognitive Behavior
Randall D. Beer
Dept. of Computer Engineering and Science and Dept.of Biology, Case Western Reserve University
Abstract:
Current debates regarding the possible cognitive implications of ideas from
adaptive behavior research and dynamical systems theory would benefit greatly
from a careful study of simple model agents that exhibit minimally cognitive
behavior. This paper sketches one such agent, and presents the results of
preliminary experiments on the evolution of dynamical neural networks for
visually-guided orientation, object discrimination and accurate pointing
with a simple manipulator to objects appearing in its field of view.
Neal Richter - January 29, 2004 - Personal Search Engines
Stuff I've Seen: A System for Personal Information Retrieval and Re-Use
Susan Dumais, Edward Cutrell, JJ Cadiz, Gavin Jancke, Raman Sarin, Daniel C.
Robbins, Microsoft Research.
Abstract:
Most information retrieval technologies are designed to facilitate information
discovery.
However, much knowledge work involves finding and re-using previously seen
information. We describe
the design and evaluation of a system, called Stuff I've Seen (SIS) , that
facilitates information
re-use. This is accomplished in two ways. First, the system provides a unified
index of information
that a person has seen, whether it was seen as email, web page, document,
appointment, etc. Second,
because the information has been seen before, rich contextual cues can be
used in the search interface. The system has been used internally by more than
230 employees. We report on both qualitative and quantitative aspects of
system use. Initial findings show that time and people are important retrieval
cues. Users find information more easily using SIS, and use other search tools
less frequently
after installation.
Optional Short Articles
- Surfsaver -- Build Your Own Portable, Personal Search Engine
- Personal Search Engine for the Web and Your Computer
- Copernic Agent: Jack of All Searches
Neal Richter - January 22, 2004 - Topic Segmentation & Search Engines
SETS: Search Enhanced by Topic Segmentation
Mayank Bawa, Gurmeet Singh Manku, & Prabhakar Raghavan
Abstract:
We present SETS, an architecture for efficient search in peer-to-peer networks,
building upon ideas drawn from machine learning and social network theory. The
key idea is to arrange participating sites in a topic-segmented overlay
topology in which most connections are short-distance, connecting pairs of
sites with similar content. Topically focused sets of sites are then joined
together into a single network by long-distance links. Queries are matched and
routed to only the topically closest regions. We discuss a variety of design
issues and tradeoffs that an implementor of SETS would face. We show that SETS
is efficient in network traffic and query processing load.
Steve Durbin - January 15, 2004 - Intelligent interfaces
Automatically Personalizing User Interfaces
Daniel S. Weld, Corin Anderson, Pedro Domingos, Oren Etzioni, Krzysztof Gajos, Tessa Lau, and Steve Wolfman
Dept. of Computer Science and Engineering, University of Washington
Abstract:
Today's computer interfaces are one-size-fits-all. Users with little programming
experience have very limited opportunities to customize an interface to their
task and work habits. Furthermore, the overhead induced by generic interfaces
will be proportionately greater on small form-factor PDAs, embedded applications
and wearable devices. Automatic personalization may greatly enhance user
productivity, but it requires advances in customization (explicit, user-initiated
change) and adaptation (interface-initiated change in response to routine user
behavior). In order to improve customization, we must make it easier for users
to direct these changes. In order to improve adaptation, we must better predict
user behavior and navigate the inherent tension between the dynamism of automatic
adaptation and the stability required in order for theuser to predict the
computers behavior and maintain control. This paper surveys a decade's work on
customization and adaptation at the University of Washington, distilling the
lessons we have learned.
Steve Durbin - December 11, 2003 - Choices in Natural Language Generation
Experiments with discourse-level choices and readability
Sandra Williams, Ehud Reiter and Liesl Osman
Depts. of Computing Science and Medicine and Therapeutics, University of Aberdeen
Abstract:
This paper reports on pilot experiments that are being used, together with corpus
analysis, in the development of a Natural Language Generation (NLG) system, GIRL
(Generator for Individual Reading Levels). GIRL generates reports for individuals
after a literacy assessment. We tested GIRL's output on adult learner readers and
good readers. Our aim was to find out if choices the system makes at the discourse
level have an impact on readability. Our preliminary results indicate that such
choices do indeed appear to be important for learner readers. These will be
investigated further in future larger-scale experiments. Ultimately we intend to
use the results to develop a mechanism that makes discourse-level choices that are
appropriate for individuals' reading skills.
Additional paper (easy read!):
Human Variation and Lexical Choice
Ehud Reiter and Somayaljulu Sripada
Jeff Elser, Leif Wickland, and Ashish Kapoor - December 4, 2003 - Beyond Turing Machines
Turing's Ideas and Models of Computation
Eugene Eberbach, Dina Goldin, and Peter Wegner
Computer Science Depts. of UMass, Univ. of Connecticut, and Brown University
Abstract:
The theory of computation that we have inherited from the 1960's focuses
on algorithmic computation as embodied in the Turing Machine to the exclusion
of other types of computation that Turing had considered. In this chapter we
present new models of computation, inspired by Turing's ideas, that are more
appropriate for today's interactive, networked, and embedded computing systems.
These models represent super-Turing computation, going beyond Turing Machines
and algorithms. We identify three principles underlying super-Turing computation
(interaction with the world, infinity of resources, and evolution of
system) and apply these principles in our discussion of the implications
of super-Turing computation for the future of computer science.
Additional reference (~200 pages):
Study, Implementation, and Evolution of the Artificial Neural Networks Proposed by Alan M. Turing
Steve Durbin - November 20, 2003 - Context-sensitive computing
Out of context: Computer systems that adapt to, and learn from, context
Henry Lieberman and Ted Selker
MIT Media Lab
Abstract:
There is a growing realization that computer systems will need to be increasingly
sensitive to their context. Traditionally, hardware and software were conceptualized
as input/output systems: systems that took input, explicitly given to them by a
human, and acted upon that input alone to produce an explicit output. Now, this
view is seen as being too restrictive. Smart computers, intelligent agent software,
and digital devices of the future will have to operate on data that are not
explicitly given to them, data that they observe or gather for themselves. These
operations may be dependent on time, place, weather, user preferences, or the
history of interaction. In other words, context. But what, exactly, is context?
We look at perspectives from software agents, sensors, and embedded devices, and
also contrast traditional mathematical and formal approaches. We see how each
treats the problem of context and discuss the implications for design of
context-sensitive hardware and software.
Neal Richter - November 13, 2003 - Information Retrieval and Language Modeling
Challenges in Information Retrieval and Language Modeling
James Allan (editor), Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan,
Bruce Croft (editor), Sue Dumais, Norbert Fuhr, Donna Harman, David J. Harper,
Djoerd Hiemstra, Thomas Hofmann, Eduard Hovy, Wessel Kraaij, John Lafferty,
Victor Lavrenko, David Lewis, Liz Liddy, R. Manmatha, Andrew McCallum, Jay
Ponte, John Prager, Dragomir Radev, Philip Resnik, Stephen Robertson, Roni
Rosenfeld, Salim Roukos, Mark Sanderson, Rich Schwartz, Amit Singhal, Alan
Smeaton, Howard Turtle, Ellen Voorhees, Ralph Weischedel, Jinxi Xu, ChengXiang
Zhai
Abstract:
Information retrieval (IR) research has reached a point where it is appropriate
to assess progress and to define a research agenda for the next five to ten
years. This report summarizes a discussion of IR research challenges that took
place at a recent workshop. The attendees of the workshop considered
information retrieval research in a range of areas chosen to give broad
coverage of topic areas that engage information retrieval researchers. Those
areas are retrieval models, cross-lingual retrieval, Web search, user modeling,
filtering, topic detection and tracking, classification, summarization,
question answering, metasearch, distributed retrieval, multimedia retrieval,
information extraction, as well as testbed requirements for future work. The
potential use of language modeling techniques in these areas was also
discussed. The workshop identified major challenges within each of those areas.
Steve Durbin - November 6, 2003 - Text categorization using decision trees and rules
A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization
David E. Johnson, Frank J. Oles, Tong Zhang, and Thilo Goetz
IBM T. J. Watson Research Center
Abstract:
We present a decison-tree-based symbolic rule induction system whose purpose is
to categorize text documents automatically. Our method for rule induction
involves the novel combination of (1) a fast decision tree induction algorithm
especially suited to text data and (2) a new method for converting a decision
tree to a rule set that is simplified, but still logically equivalent to the
original tree. We report experimental results on the use of this system on some
practical problems.
Steve Durbin - October 30, 2003 - Search and prediction of social networks
Identity and Search in Social Networks
Duncan J. Watts*#$, Peter Sheridan Dodds#, M. E. J. Newman$
*Department of Sociology, Columbia University; #Columbia Earth Institute, Columbia University; $Santa Fe Institute
Abstract:
Social networks have the surprising property of being "searchable": Ordinary
people are capable of directing messages through their network of acquaintances
to reach a specific but distant target person in only a few steps. We present a
model that offers an explanation of social network searchability in terms of
recognizable personal identities: sets of characteristics measured along a
number of social dimensions. Our model defines a class of searchable networks
and a method for searching them that may be applicable to many network search
problems, including the location of data files in peer-to-peer networks, pages
on the World Wide Web, and information in distributed databases.
The Link Prediction Problem for Social Networks
David Liben-Nowell* and Jon Kleinberg#
*Laboratory for Computer Science, MIT; #Department of Computer Science, Cornell University
Abstract:
Given a snapshot of a social network, can we infer which new interactions
among its members are likely to occur in the near future? We formalize this
question as the link prediction problem, and develop approaches to link
prediction based on measures of the "proximity" of nodes in a network.
Experiments on large co-authorship networks suggest that information about
future interactions can be extracted from network topology alone, and that
fairly subtle measures for detecting node proximity can outperform more
direct measures.
See also previous papers related to small-world networks:
Introduction to small-world networks
Zuzana Gedeon - October 23, 2003 - Recursive Partitioning with Multiple Trees
Extracting Representative Tree Models
Hugh A. Chipman, Edward I. George and Robert E. McCulloch
Department of Statistics and Actuarial Science, University of Waterloo
Abstract:
A common criticism of many methods for constructing tree models is that a
single tree or nested sequence of trees is produced, and that much uncertainty
about the tree structure is ignored. Recent search algorithms (bumping,
boosting, simulated annealing, MCMC) address this problem by finding a much
richer collection of trees. They lead to an embarrassment of riches, in that
it may be difficult to make sense of the resultant forest. Quite often, the
problem may not be as bad as it seems: although hundreds of distinct trees are
identified, many will differ only at a few nodes. Other trees may have different
topologies, but produce similar partitions of the predictor space. By defining
several distance metrics on trees, we summarize a forest of trees by several
representative trees and associated clusters. A new plot, the added tree plot
is introduced as a means to decide how many trees to examine while simultaneously
adjusting for the goodness-of-fit of the trees considered.
There's also a useful description of the method in the User's Manual for
FIRM (Formal Inference-based Recursive Modeling) package for recursive partitioning
Ashish Kapoor - October 16, 2003 - Genetic Algorithm for VLSI Layout
A Genetic Algorithm for Macro Cell Placement
Henrik Esbensen
Computer Science Department, Aarhus University
Abstract:
A new genetic algorithm for the macro cell placement problem is presented. The
algorithm is based on a generalization of the two-dimensional bin packing
problem. The genetic encoding of a macro cell placement and the corresponding
genetic operators are described. The algorithm has been tested on MCNC
benchmarks, and the quality of the produced placements are comparable to the
best published results.
Longer background paper on VLSI design:
VLSI Cell Placement Techniques
K. Shahookar and P. Mazumder
Department of Electrical Engineering and Computer Sc~ence
University of Michigan
Abstract:
VLSI cell placement problem is known to be NP complete. A wide repertoire of
heuristic algorithms exists in the literature for efficiently arranging the
logic cells on a VLSI chip. The objective of this paper is to present a
comprehensive survey of the various cell placement techniques, with emphasis
on standard ce11and macro placement. Five major algorithms for placement are
discussed: simulated annealing, force-directed placement, rein-cut placement,
placement by numerical optimization, and evolution-based placement. The first
two classes of algorithms owe their origin to physical laws, the third and
fourth are analytical techniques, and the fifth class of algorithms is derived
from biological phenomena. In each category, the basic algorithm is explained
with appropriate examples. Also discussed are the different implementations
done by researchers.
Neal Richter - October 9, 2003 - Machine Learning: Rule Learning and Extraction
Rule Learning and Extraction (Lecture Slides)
William H Hsu
Department of Computing and Information Sciences, Kansas State Univ.
Abstract:
This week's Colloquium will cover material out of Chapter 10 of Tom Mitchell's
Machine Learning book. The slides linked are from a course based upon this
book at Kansas State University.
NOT REQUIRED
All Lecture Slides from Machine Learning and Pattern Recognition,
Fall 2001,
Prof. William H. Hsu
Text Categorization and Relational Learning (1995)
William W. Cohen
Abstract:
We evaluate the first order learning system FOIL on a series of text
categorization problems. It is shown that FOIL usually forms classifiers with
lower error rates and higher rates of precision and recall with a relational
encoding than with a propositional encoding. We show that FOIL's performance
can be improved by relation selection, a first order analog of feature
selection. Relation selection improves FOIL's performance as measured by any of recall, precision, F-measure, or error rate. With an appropriate level of relational selection, FOIL appears to be compedative with or superior to existing propositional techniques.
Jeff Elser - October 2, 2003 - Fuzzy controllers
Some Crisp Thoughts on Fuzzy Logic
Daniel Abramovitch
Hewlett-Packard Laboratories
Abstract:
In the past year I have been inundated with articles on fuzzy logic as well
as encouraged to use it for control systems. After reading some articles
on fuzzy logic control, listening to a seminar by Zadeh, and attending a
one day course on Intelligent Control, I started forming an opinion about how
fuzzy logic control works. I believe that there are some fundamental pieces
of information not provided in most fuzzy logic control papers. When one
realizes what those pieces of information are, one gets a different opinion
about how and when fuzzy logic control works and when it is more practical
than conventional control. I will first state some opinions on fuzzy logic
and try to justify them. Once this is done, I will return to some of the
articles written by proponents of fuzzy logic and use the previous
understanding to shed some light on what is really responsible for the
improved system performance.
Background reading on fuzzy logic:
Introduction to Fuzzy Control
See also an earlier paper about fuzzy clustering:
Mining Web Access Logs Using Relational Competitive Fuzzy Clustering
Zuzana Gedeon - September 25, 2003 - Improving PageRank
Topic-Sensitive PageRank
Taher H. Haveliwala
Computer Science Department, Stanford University
Abstract:
In the original PageRank algorithm for improving the ranking of search-query
results, a single PageRank vector is computed, using the link structure of
the Web, to capture the relative "importance" of Web pages, independent of
any particular search query. To yield more accurate search results, we
propose computing a set of PageRank vectors, biased using a set of
representative topics, to capture more accurately the notion of importance
with respect to a particular topic. By using these (precomputed) biased
PageRank vectors to generate query-specific importance scores for pages at
query time, we show that we can generate more accurate rankings than with a
single, generic PageRank vector. For ordinary keyword search queries, we
compute the topic-sensitive PageRank scores for pages satisfying the query
using the topic of the query keywords. For searches done in context (e.g.,
when the search query is performed by highlighting words in a Web page),
we compute the topic-sensitive PageRank scores using the topic of the
context in which the query appeared.
See also last week's paper for discussion related to PageRank..
Steve Durbin - September 18, 2003 - Link analysis to improve searching
Implicit Link Analysis for Small Web Search
G.-R. Xue et al
Computer Science and Engineering, Shanghai Jiao-Tong University
and Microsoft Research Asia
Abstract:
Current Web search engines generally impose link analysis-based re-ranking
on web-page retrieval. However, the same techniques, when applied directly
to small web search such as intranet and site search, cannot achieve the
same performance because their link structures are different from the global
Web. In this paper, we propose an approach to constructing implicit links by
mining users' access patterns, and then apply a modified PageRank algorithm
to re-rank web-pages for small web search. Our experimental results indicate
that the proposed method outperforms content-based method by 16%, explicit
link-based PageRank by 20% and DirectHit by 14%, respectively.
See also an earlier paper about Google and its PageRank algorithm
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Dustin Lee - September 11, 2003 - Information markets
Avoiding Moral Hazards in Organizational Forecasting
Tad Hogg and Bernardo A. Huberman
Hewlett-Packard Labs
Abstract:
We describe a new mechanism that induces accurate forecasts within an
organization while reducing moral hazards and the stigma associated with
negative opinions. It is based on the notion of identity escrow, whereby
the identity of a forecaster is kept anonymous and only revealed when a
number of his subordinates detect an attitude that is contrary to the
interests of the organization. An analysis of the relative payoffs between
forecasting and production shows that through the use of identity escrows
one can adjust the size of the prediction group so as to ensure both
production and accurate forecasts.
See also a Kuro5hin op-ed piece about the recently defunct Policy Analysis Market
The Policy Analysis Market: Why It is a Great Idea
Louis Glassy - August 21, 2003 - Missing text reconstruction
Missing-Text Reconstruction
Louis Glassy
Department of Computer Science, Montana Tech
Abstract:
Missing-text reconstruction (MTR) is a new application of text-oriented
pattern recognition. The goal of MTR is to fully reconstruct documents
in which fragments of original text are missing. Using context stored as
n-gram models of the document's source language, the current MTR system
makes sets of hypotheses of the missing text, and combines these sets,
using Dempster's Rule of Combination, to form the best-supported reconstruction
of the missing text. A software system (MITRE) was developed to act as a
proof-of-concept for the MTR techniques discussed.
See also a previous colloquium Missing text reconstruction
Steve Durbin - July 31, 2003 - Machine translation
Statistical Phrase-Based Translation
Philipp Koehn, Franz Josef Och, Daniel Marcu
Department of Computer Science, University of Southern California
Abstract:
We propose a new phrase-based translation model and decoding algorithm
that enables us to evaluate and compare several, previously proposed phrase-based
translation models. Within our framework, we carry out a large number of
experiments to understand better and explain why phrase-based models out-perform
word-based models. Our empirical results, which hold for all examined language
pairs, suggest that the highest levels of performance can be obtained through
relatively sim-ple means: heuristic learning of phrase translations from
word-based alignments and lexical weighting of phrase translations. Surprisingly,
learning phrases longer than three words and learning phrases from high-accuracy
wordlevel alignment models does not have a strong impact on performance.
Learning only syntactically motivated phrases degrades the performance of our
systems.
See also the recent Slashdot story Romancing The Rosetta Stone
Zuzana Gedeon - July 24, 2003 - STL
To go back to more of a tutorial, than regular colloquium next week we will cover Standard Template Library. We will should cover some basics and discuss advantages/disadvantages of using STL in production code. If you have any insights into this area please come and share. Some recommended reading/tutorials can be found at:
SGI's very own Standard Template Library Programmer's Guide
Doug Warner - July 10, 2003 - Software patents
Patents for Software-Related
Inventions
Jeffrey R. Kuester and Ann K. Moceyunas
Abstract:
This paper was written in March of 1995 by Jeffrey R. Kuester and Ann K.
Moceyunas for the purpose of Jeff teaching a college class on software
patents. Jeff is a partner with Thomas, Kayden, Horstemeyer & Risley, L.L.P.,
a patent, copyright, and trademark law firm in Atlanta, Georgia, and his
practice focuses on software, electrical, and telecommunication patent law.
Patently Absurd
James Gleick
Abstract:
Once the province of a nuts-and-bolts world, patents are now being applied
to thoughts and ideas in cyberspace. It's a ridiculous phenomenon and a
nightmare for e-commerce.
Steve Durbin - July 3, 2003 - Mining user behavior to improve searching
Optimizing Search Engines
using Clickthrough Data
Thorsten Joachims
Department of Computer Science, Cornell University
Abstract:
This paper presents an approach to automatically optimizing
the retrieval quality of search engines using clickthrough
data. Intuitively, a good information retrieval system should
present relevant documents high in the ranking, with less
relevant documents following below. While previous approaches
to learning retrieval functions from examples exist,
they typically require training data generated from relevance
judgments by experts. This makes them difficult and expensive
to apply. The goal of this paper is to develop a
method that utilizes clickthrough data for training, namely
the query-log of the search engine in connection with the
log of links the users clicked on in the presented ranking.
Such clickthrough data is available in abundance and can
be recorded at very low cost. Taking a Support Vector Machine
(SVM) approach, this paper presents a method for
learning retrieval functions. From a theoretical perspective,
this method is shown to be well-founded in a risk minimization
framework. Furthermore, it is shown to be feasible
even for large sets of queries and features. The theoretical
results are verified in a controlled experiment. It shows
that the method can effectively adapt the retrieval function
of a meta-search engine to a particular group of users, outperforming
Google in terms of retrieval quality after only a
couple of hundred training examples.
Collaborative_Crawling:
Mining User Experiences for Topical Resource Discovery
Charu Aggarwal
IBM Research
Abstract:
The rapid growth of the world wide web had made the problem
of topic specific resource discovery an important one in
recentyears. In this problem, it is desired to find web pages
which satisfy a predicate specified by the user. Such a predicate
could be a keyword query, a topical query, or some arbitrary
contraint. Several techniques such as focussed crawling
and intelligent crawling have recently been proposed for
topic specific resource discovery. All these crawlers are linkage
based, since they use the hyperlink behavior in order to
perform resource discovery. Recent studies have shown that
the topical correlations in hyperlinks are quite noisy and
may not always show the consistency necessary for a reliable
resource discovery process. In this paper, we will approach
the problem of resource discovery from an entirely different
perspective; we will mine the significant browsing patterns
of world wide web users in order to model the likelihood of
web pages belonging to a specified predicate. This user behavior
can be mined from the freely available traces of large
public domain proxies on the world wide web. We refer to
this technique as collaborative crawling because it mines the
collective user experiences in order to find topical resources.
Such a strategy is extremely effective because the topical
consistency in world wide web browsing patterns turns out
to very reliable. In addition, the user-centered crawling system
can be combined with linkage based systems to create
an overall system whichworks more effectively than a system
based purely on either user behavior or hyperlinks.
Steve Durbin - June 26, 2003 - Content-based recommender system
Building Recommender Systems
using a Knowledge Base of Product Semantics
Rayid Ghani and Andrew Fano
Accenture Technology Labs
Abstract:
Online retailers have access to large amounts of transactional data but
current recommender systems tend to be short-sighted in nature and usually
focus on the narrow problem of pushing a set of closely related products
that try to satisfy the user's current need. Most ecommerce recommender
systems analyze a large amount of transactional data without actually having
any idea of what the items in the transactions mean or what they say about
the customers who purchased or browsed those items. In this paper, we present
a case study of a system that recommends items based on a custom-built knowledge
base that consists of products and associated semantic attributes. Our system
first extracts semantic features that characterize the domain of interest,
apparel products in our case, using text learning techniques and populates a
knowledge base with these products and features. The recommender system analyzes
descriptions of products that the user browses or buys and automatically infers
these semantic attributes to build a model of the user. This abstraction allows
us to not only recommend other items in the same class of products that "match"
the user model but also gives us the ability to understand the customer's
"tastes" and recommend items across categories for which traditional collaborative
filtering and contentbased systems are unsuitable. Our approach also allows us to
"explain" the recommendations in terms of qualitative features which, we believe,
enhances the user experience and helps build the user's confidence in the
recommendations.
Steve Durbin - June 19, 2003 - Data mining challenges and cautions
CoIL Challenge 2000 Tasks and Results:
Predicting and Explaining Caravan Policy Ownership
Peter van der Putten (1,2), Michel de Ruiter (1), Maarten van Someren (3)(1)
Sentient Machine Research, Amsterdam
Leiden Institute of Advanced Computer Science, Leiden University
Sociaal Wetenschappelijke Informatica, Universiteit van Amsterdamm
Abstract:
We present the problem tasks of the CoIL Challenge as they were explained to
the participants. Furthermore, a general overview is given of the Challenge results.
COIL CHALLENGE 2000 ENTRY (prediction winner)
Charles Elkan
Department of Computer Science and Engineering
University of California, San Diego
Magical Thinking in Data Mining:
Lessons From CoIL Challenge 2000
Charles Elkan
Department of Computer Science and Engineering
University of California, San Diego
Abstract:
CoIL challenge 2000 was a supervised learning contest that attracted 43 entries.
The authors of 29 entries later wrote explanations of their work. This paper discusses
these reports and reaches three main conclusions. First, naive Bayesian classifiers
remain competitive in practice: they were used by both the winning entry and the next
best entry. Second, identifying feature interactions correctly is important for
maximizing predictive accuracy: this was the difference between the winning classifier
and all others. Third and most important, too many researchers and practitioners in
datamining do not appreciate properly the issue of statistical significance and the
danger of overfitting. Given a dataset such as the one for the CoIL contest, it is
pointless to apply a very complicated learning algorithm, or to perform a very
time-consuming model search. In either case, one is likely to overfit the training
data and to fool oneself in estimating predictive accuracy and in discovering
useful correlations.
Business-oriented data mining competitions:
CoIL
Challenge 2000
KDD
Cup 1998
KDD Cup 2000
Lou Glassy - May 1, 2003 - Missing text reconstruction
Abstract:
The basic idea in missing text reconstruction is simple:
Based on a corpus of training text, construct n-gram tables which
will represent a statistical model of the natural language of the corpus.
Given a document with "holes" -- missing text -- use the statistical model
represented by the n-grams to reconstruct ("fill in") the holes in the document.
Links on Dempster-Shafer theory:
Dempster-Shafer Theory of Evidence
Dempster-Shafer Theory
Link on information theory:
Introduction to the Information Theory
Online book on information theory:
A Short Course in Information Theory: 8 lectures by David J.C. MacKay
Neal Richter - April 24, 2003 - Genetic Algorithms
Adaptive and Self-adaptive
Evolutionary Computations
Peter J. Angeline
Loral Federal Systems
Abstract:
This paper reviews the various studies that have introduced adaptive
and self-adaptive parameters into Evolutionary Computations. A formal
definition of an adaptive evolutionary computation is provided with an
analysis of the types of adaptive and self-adaptive parameter update
rules currently in use. Previous studies are reviewed and placed into
a categorization that helps to illustrate their similarities and
differences.
Theory of Evolutionary
Algorithms: A Birds Eye View
A. E. Eiben and G. Rudolph
Leiden Institute of Advanced Computer Science, Leiden University, The
Netherlands
Department of Computer Science, Dortmund University, Germany
Abstract:
In this paper we consider the most important questions, research
topics and technical tools used in various branches of evolutionary
algorithms. The road map we give is to facilitate the readers'
orientation in evolutionary computation theory. In the meanwhile, this
survey provides key references for further study and evidence that the
field of evolutionary computation is maturing rapidly, having many
important results and even more interesting challenges.
Steve Durbin - April 10, 2003 - Mental Models of Search Engines
Transparent
queries: Investigating Users' Mental Models of Search Engines
Jack Muramatsu and Wanda Pratt
Information and Computer Science, Univ. of California, Irvine
Abstract:
Typically, commercial Web search engines provide very little feedback
to the user concerning how a particular query is processed and
interpreted. Specifically, they apply key query transformations
without the user's knowledge. Although these transformations have a
pronounced effect on query results, users have very few resources for
recognizing their existence and understanding their practical
importance. We conducted a user study to gain a better understanding
of users' knowledge of and reactions to the operation of several query
transformations that web search engines automatically
employ. Additionally, we developed and evaluated Transparent Queries,
a software system designed to provide users with lightweight feedback
about opaque query transformations. The results of the study suggest
that users do indeed have difficulties understanding the operation of
query transformations without additional assistance. Finally, although
transparency is helpful and valuable, interfaces that allow direct
control of query transformations might ultimately be more helpful for
end-users.
Essay by Don Norman on "real world" design in industry:
Applying the
Behavioral, Cognitive, and Social Sciences to Products
Ashish Kapoor - April 3, 2003 - Chatbots
Agents with Faces: The
Effects of Personification of Agents
Tomoko Koda and Pattie Maes
MIT Media Laboratory
Abstract:
It is still an open question whether software agents should be
personified in the interface. How should faces be used to make the
agent more usable and likable? In order to study the effect of faces
and facial expressions, we compared subjects' responses to and
evaluation of different faces and facial expressions of entertaining
agents. The result shows that having faces and facial expressions is
likable and engaging. It takes users' effort to interpret the meaning
the faces. In terms of facial features, realistic faces are better
liked and rated as more intelligent than abstract faces.
Supplementary papers:
Noah and the Turing
Test
A tutorial on natural
language processing
Introduction with links:
Intelligent
Agents: A Primer
Steve Durbin - March 27, 2003 - Learning agents in computational game theory
Learning
dynamics in social dilemmas
Michael W. Macy*# and Andreas Flache##
*Department of Sociology, Cornell University
##Interuniversity Center for Social Science Theory and Methodology,
University of Groningen, The Netherlands
Abstract:
The Nash equilibrium, the main solution concept in analytical game
theory, cannot make precise predictions about the outcome of repeated
mixed-motive games. Nor can it tell us much about the dynamics by
which a population of players moves from one equilibrium to
another. These limitations, along with concerns about the cognitive
demands of forward-looking rationality, have motivated efforts to
explore backward-looking alternatives to analytical game theory. Most
of the effort has been invested in evolutionary models of population
dynamics. We shift attention to a learning-theoretic
alternative. Computational experiments with adaptive agents identify a
fundamental solution concept for social dilemmas—stochastic
collusion—based on a random walk from a self-limiting
noncooperative equilibrium into a self-reinforcing cooperative
equilibrium. However, we show that this solution is viable only within
a narrow range of aspiration levels. Below the lower threshold, agents
are pulled into a deficient equilibrium that is a stronger attractor
than mutual cooperation. Above the upper threshold, agents are
dissatisfied with mutual cooperation. Aspirations that adapt with
experience (producing habituation to stimuli) do not gravitate into
the window of viability; rather, they are the worst of both
worlds. Habituation destabilizes cooperation and stabilizes
defection. Results from the two-person problem suggest that
applications to multiplex and embedded relationships will yield
unexpected insights into the global dynamics of cooperation in social
dilemmas.
Additional papers:
Home page of
Michael Macy
An introduction to game theory:
Game
Theory
Dustin Lee - March 20, 2003 - Hidden Markov models for sequence analysis
Hidden Markov models
for sequence analysis: extension and analysis of the basic
method
Richard Hughey and Anders Krogh
Computer Engineering, Univ. of Calif., Santa Cruz and NORDITA,
Denmark
Abstract:
Abstract Hidden Markov models (HMMs) are a highly effective means of
modeling a family of unaligned sequences or a common motif within a
set of unaligned sequences. The trained HMM can then be used for
discrimination or multiple alignment. The basic mathematical
description of an HMM and its expectation-maximization training
procedure is relatively straight-forward. In this paper, we review the
mathematical extensions and heuristics that move the method from the
theoretical to the practical. Then, we experimentally analyze the
effectiveness of model regularization, dynamic model modification, and
optimization strategies. Finally it is demonstrated on the SH2 domain
how a domain can be found from unaligned sequences using a special
model type. The experimental work was completed with the aid of the
Sequence Alignment and Modeling software suite.
Additional papers:
UCSC's
Sequence Alignment and Modeling System
Tutorial on hidden Markov models:
Introduction
to Hidden Markov Models
Steve Durbin - March 6, 2003 - Dynamic Bayesian networks for an adaptive user interface
Empirically Grounded
Decision-Theoretic Adaptation to Situation-Dependent Resource
Limitations
Thorsten Bohnenberger, Boris Brandherm, Barbara Grossmann-Hutter,
Dominik Heckmann, Frank Wittig
Department of Computer Science, Saarland University, Germany
Abstract:
This article summarizes research on several interrelated general
issues that can arise in the design and development of user modeling
systems: the learning and subsequent adaptation of general user models
on the basis of empirical data; the modeling of temporally variable
properties of users, in particular time pressure and cognitive load;
and the user-adaptive planning of interactions under uncertainty. The
methods and results are integrated and illustrated with a prototype of
a mobile assistance system for travelers in an airport.
Additional papers:
READY project
publications
Tutorial on Bayesian networks:
A Brief
Introduction to Graphical Models and Bayesian Networks
Neal Richter - February 20, 2003 - Web Search Engines
A Taxonomy of Web Search
Andrei Broder
IBM Research
Abstract:
Classic IR (information retrieval) is inherently predicated on users
searching for information, the so-called "information need". But the
need behind a web search is often not informational—it might be
navigational (give me the url of the site I want to reach) or
transactional (show me sites where I can perform a certain
transaction, e.g. shop, download a file, or find a map). We explore
this taxonomy of web searches and discuss how global search engines
evolved to deal with web-specific needs.
Challenges in Web Search
Engines
Monika R. Henzinger, Rajeev Motwani, and Craig Silverstein
Google & Stanford University
Abstract:
This article presents a high-level discussion of some problems in
information retrieval that are unique to web search engines. The goal
is to raise awareness and stimulate research in these areas.
John Paxton - February 20, 2003 - Machine Learning for Sequential Data
Machine Learning for Sequential
Data: : A Review
Thomas G. Dietterich
Oregon State University
Abstract:
Statistical learning problems in many elds involve sequential
data. This paper formalizes the principal learning tasks and describes
the methods that have been developed within the machine learning
research community for addressing these problems. These methods
include sliding window methods, recurrent sliding windows, hidden
Markov models, conditional random elds, and graph transformer
networks. The paper also discusses some open research issues.
Steve Durbin - February 13, 2003 - Reconciling ontologies
Learning to map between
ontologies on the Semantic Web
AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy
Computer Science and Engineering, University of Washington
Abstract:
Ontologies play a prominent role on the Semantic Web. They make
possible the widespread publication of machine understandable data,
opening myriad opportunities for automated information
processing. However, because of the Semantic Web's distributed nature,
data on it will inevitably come from many different
ontologies. Information processing across ontologies is not possible
without knowing the semantic mappings between their elements. Manually
finding such mappings is tedious, error-prone, and clearly not
possible at the Web scale. Hence, the development of tools to assist
in the ontology mapping process is crucial to the success of the
Semantic Web.
Zuzana Gedeon - February 6, 2003 - Association rules in Recommender Algorithms
Evaluation of
Recommender Algorithms for an Internet Information Broker Based on
Simple Association Rules and on the Repeat-Buying Theory
Andreas Geyer-Schulz, Michael Hahsler
Abstract:
Association rules are a widely used technique to generate
recommendations in commercial and research recommender systems. Since
more and more Web sites, especially of retailers, offer automatic
recommender services using Web usage mining, evaluation of recommender
algorithms becomes increasingly important. In this paper we first
present a framework for the evaluation of different aspects of
recommender systems based on the process of discovering knowledge in
databases of Fayyad et al. and then...
Neal Richter - January 23, 2003 - Language Trees and Zipping
Language Trees and Zipping
Dario Benedetto, Emanuele Caglioti, Vittorio Loreto
Physics & Mathemetics Departments, La Sapienza University, Rome,
Italy
Abstract:
In this letter we present a very general method to extract information
from a generic string of characters, e.g. a text, a DNA sequence or a
time series. Based on data-compression techniques, its key point is
the computation of a suitable measure of the remoteness of two bodies
of knowledge. We present the implementation of the method to
linguistic motivated problems, featuring highly accurate results for
language recognition, authorship attribution and language
classification.
Extended Comment on Language
Trees and Zipping
Joshua Goodman
Microsoft Research
Abstract:
This is the extended version of a Comment submitted to Physical Review
Letters. I first point out the inappropriateness of publishing a
Letter unrelated to physics. Next, I give experimental results showing
that the technique used in the Letter is 3 times worse and 17 times
slower than a simple baseline. And finally, I review the literature,
showing that the ideas of the Letter are not novel. I conclude by
suggesting that Physical Review Letters should not publish Letters
unrelated to physics.
On J. Goodman's comment to
Language Trees and Zipping
Dario Benedetto, Emanuele Caglioti, Vittorio Loreto
Physics & Mathemetics Departments, La Sapienza University, Rome,
Italy
Abstract:
Motivated by the recent submission to cond-mat archives by J. Goodman
whose results apparently discredit the approach we have proposed in a
recent paper (Phys. Rev. Lett. 88, 048702 (2002)), we report the
results of the same experiment performed by Goodman using three
different data compression schemes. As a matter of fact the three
zippers display the same efficiency Goodman obtained using Naive
Bayesian Methods and not, as Goodman claimed, an efficiency three
times smaller. We point out the question of the extreme generality of
approaches based on data compression techniques and we list a large
range of potential applications, including those of interest for the
physics community.
Steve Durbin - January 23, 2003 - Rule extraction from neural nets
Understanding Time-Series
Networks: A Case Study in Rule Extraction
Mark W. Craven and Jude W. Shavlik
Computer Sciences Department, University of Wisconsin-Madison
Abstract:
A significant limitation of neural networks is that the
representations they learn are usually incomprehensible to humans. We
have developed an algorithm, called Trepan, for extracting
comprehensible, symbolic representations from trained neural
networks. Given a trained network, Trepan produces a decision tree
that approximates the concept represented by the network. In this
article, we discuss the application of Trepan to a neural network
trained on a noisy time series task: predicting the Dollar-Mark
exchange rate. We present experiments that show that Trepan is able to
extract a decision tree from this network that equals the network in
terms of predictive accuracy, yet provides a comprehensible concept
representation. Moreover, our experiments indicate that decision trees
induced directly from the training data using conventional algorithms
do not match the accuracy nor the comprehensibility of the tree
extracted by Trepan.
Steve Durbin - December 13, 2002 - Data cleansing
Real-world data is dirty: Data
cleansing and the merge/purge problem
Mauricio A. Hernandez and Salvatore J. Stolfo
Department of Computer Science, Columbia University
Abstract:
The problem of merging multiple databases of information about common
entities is frequently encountered in KDD and decision support
applications in large commercial and government organizations. The
problem we study is often called the Merge/Purge problem and is
difficult to solve both in scale and accuracy. Large repositories of
data typically have numerous duplicate information entries about the
same entities that are difficult to cull together without an
intelligent "equational theory" that identifies equivalent items by a
complex, domain-dependent matching process. We have developed a
system for accomplishing thes Data Cleansing task and demonstrate its
use for cleansing lists of names of potential customers in a direct
marketing-type application. Our results for statistically generated
data are shown to be accurate and effective when processing the data
multiple times using different keys for sorting on each successive
pass. Combing results of individual passes using transitive closure
over the independent results, produces far more accurate results at
lower cost. The system provides a rule programming module that is
easy to program and quite good at finding duplicates especially in an
environment with massive amounts of data. This paper details
improvements in our system, and reports on the successful
implementation for a "real-world" database that conclusively validates
our results previously achieved for statistically generated data.
Steve Durbin - November 22, 2002 - Improved information retrieval for short queries
Impact Transformation:
Effective and Efficient Web Retrieval
Vo Ngoc Anh and Alistair Moffat
Department of Computer Science and Software Engineering, University of
Melbourne
Abstract:
We extend the applicability of impact transformation, which is a
technique for adjusting the term weights assigned to documents so as
to boost the effectiveness of retrieval when short queries are applied
to large document collections. In conjunction with techniques called
quantization and thresholding, impact transformation allows improved
query execution rates compared to traditional vector-space similarity
computations, as the number of arithmetic operations can be
reduced. The transformation also facilitates a new dynamic query
pruning heuristic. We give results based upon the TREC web data that
show the combination of these various techniques to yield highly
competitive retrieval, in terms of both effectiveness and efficiency,
for both short and long queries.
Higher Precision for
Two-Word Queries
K. L. Kwok, C. S. Dept., Queens College, CUNY
Abstract:
Queries have specific properties, and may need individualized methods
and parameters to optimize retrieval. Length is one property. We
look at how two-word queries may attain higher precision by re-ranking
using word co-occurrence evidence in retrieved documents.
Co-occurrence within document context is not sufficient, but window
context including sentence context evidence can provide precision
improvements at low recall region of 4 to 10% using initial retrieval
results, and positively affects pseudo-relevance feedback.
Zuzana Gedeon - November 1, 2002 - Cluster validity methods
Cluster validity methods: part
I
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis & Athens
University of Economics and Business
Abstract:
Clustering is an unsupervised process since there are no predefined
classes and no examples that would indicate grouping properties in the
data set. The majority of the clustering algorithms behave differently
depending on the features of the data set and the initial assumptions
for defining groups. Therefore, in most applications the resulting
clustering scheme requires some sort of evaluation as regards its
validity. Evaluating and assessing the results of a clustering
algorithm is the main subject of cluster validity. In this paper we
present a review of the clustering validity and methods. More
specifically, Part I of the paper discusses the cluster validity
approaches based on external and internal criteria.
Cluster validity methods: part II (handed out: no online version
yet)
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis & Athens
University of Economics and Business
Abstract:
Clustering results validation is an important topic in the context of
pattern recognition. We review approaches and systems in this
context. In the current part, we present a review of clustering
validity approaches based on relative criteria. Also we discuss the
results of experimental study based on widely known validity
indices. Finally th epaper illustrates issues that are under-addressed
by th erecent approaches and proposes teh research directions in the
field.
Neal Richter - October 25, 2002 - Random Decision Forests
Random Decision Forests
Tin Kam Ho AT&T Bell Laboratories
Abstract:
Decision trees are attractive classifiers due to their high execution
speed. But trees derived with traditional methods often cannot be
grown to arbitrary complexity for possible loss of generalization
accuracy on unseen data. The limitation on complexity usually means
suboptimal accuracy on training data. Following the principles of
stochastic modeling, we propose a method to construct tree-based
classifiers whose capacity can be arbitrarily expanded for increases
in accuracy for both training and unseen data. The essence of the
method is to build multiple trees in randomly selected subspaces of
the feature space. Trees in different subspaces generalize their
classification in complementary ways, and their combined
classification can be monotonically improved. The validity of the
method is demonstrated through experiments on the recognition of
handwritten digits.
C4.5 Decision Forests
Tin Kam Ho AT&T Bell Laboratories
Abstract:
Much of previous attention on decision trees focuses on the splitting
criteria and optimization of tree sizes. The dilemma between
overfitting and achievingmaximum accuracy is seldom resolved. We
propose a method to construct a decision tree based classi-fier that
maintains highest accuracy on training data and improves on
generalization accuracy as it grows incomplexity. Trees are generated
using the well-known C4.5 algorithm, and the classifier consists of
multiple trees constructed in pseudo-randomly selected subspaces of
the given feature space. We compare themethod to single-tree
classifiers and other forest construction methods by experiments on
four public datasets, where the method's superiority is
demonstrated. A measure is given to describe the similarity
betweentrees in a forest, and is related to the combined
classification accuracy.
Steve Durbin - August 16, 2002 - Time series prediction
Noisy time series
prediction using a recurrent neural network and grammatical
inference
C. Lee Giles, S. Lawrence, A. C. Tsoi*
NEC Research Institute and *University of Wollongong, Australia
Abstract:
Financial forecasting is an example of a signal processing problem
which is challenging due to small sample sizes, high noise,
non-stationarity, and non-linearity. Neural networks have been very
successful in a number of signal processing applications. We discuss
fundamental limitations and inherent difficulties when using neural
networks for the processing of high noise, small sample size
signals. We introduce a new intelligent signal processing method which
addresses the difficulties. The method proposed uses conversion into a
symbolic representation with a self-organizing map, and grammatical
inference with recurrent neural networks. We apply the method to the
prediction of daily foreign exchange rates, addressing difficulties
with non-stationarity, overfitting, and unequal a priori class
probabilities, and we find significant predictability in comprehensive
experiments covering 5 different foreign exchange rates. The method
correctly predicts the direction of change for the next day with an
error rate of 47.1%. The error rate reduces to around 40% when
rejecting examples where the system has low confidence in its
prediction. We show that the symbolic representation aids the
extraction of symbolic knowledge from the trained recurrent neural
networks in the form of deterministic finite state automata. These
automata explain the operation of the system and are often relatively
simple. Automata rules related to well-known behavior such as trend
following and mean reversal are extracted.
Steve Durbin - July 26, 2002 - Automatic thesaurus construction
Induction of Semantic
Classes from Natural Language Text
Dekang Lin and Patrick Pantel
Dept. of Computing Science, University of Alberta
Abstract:
Many applications dealing with textual information require
classification of words into semantic classes (or concepts). However,
manually constructing semantic classes is a tedious task. In this
paper, we present an algorithm, UNICON, for UNsupervised Induction of
CONcepts. Some advantages of UNICON over previous approaches include
the ability to classify words with low frequency counts, the ability
to cluster a large number of elements in a high-dimensional space, and
the ability to classify previously unknown words into existing
clusters. Furthermore, since the algorithm is unsupervised, a set of
concepts may be constructed for any corpus.
Steve Durbin - July 19, 2002 - Innovative Applications of Artificial Intelligence 2002 conference presentation
RightNow eService Center:
Internet customer service using a self-learning knowledge base
Stephen D. Durbin, Doug Warner, J. Neal Richter, Zuzana Gedeon
RightNow Technologies
Abstract:
Delivering effective customer service via the Internet requires
attention to many aspects of information management if it is to be
convenient and satisfying for customers, while at the same time
efficient and low in cost for the company or other organization.
Built around a core FAQ document knowledge base, RightNow Web eService
Center uses a variety of AI techniques to facilitate the construction,
maintenance, and navigation of the knowledge base. These include
collaborative filtering, swarm intelligence, natural language
processing, text clustering, and classification rule learning.
Customers using RightNow Web report dramatic decreases in support
costs and increases in customer satisfaction due to the ease of use
provided by the self-learning features of the knowledge base.
Steve Durbin - June 21, 2002 - Clustering: shared nearest-neighbor algorithm
Finding Topics in
Collections of Documents: A Shared Nearest Neighbor Approach
Levent Erto"z, Michael Steinbach, Vipin Kumar
Department of Computer Science and Engineering, University of
Minnesota
Abstract:
An important task in text mining is finding the dominant topics (and
their associated documents) in a collection of documents. While
traditional clustering techniques, e.g., hierarchical clustering and
K-means, are often used for this task, this paper explores a new
clustering algorithm which is based on a shared nearest neighbor
approach. Unlike traditional clustering algorithms, not all the data
is clustered, but instead, our algorithm discovers groups of documents
representing strong topics and discards all documents that are not
part of a strong topic. We evaluate the performance of our algorithm
on real and synthetic data sets and show that the groups of documents
found by our technique often represent more coherent topics than
those produced by K-means. Additionally, we provide a theoretical
explanation for this behavior.
Related papers from the same group:
A New Shared Nearest
Neighbor Clustering Algorithm and its Applications
Has some helpful figures and applications to other data
The Challenges of
Clustering High Dimensional Data
Longer paper reviewing various techniques
Steve Durbin - May 17, 2002 - Expertise locating
Expert finding for
collaborative virtual environments
Mark Maybury, Ray D'Amore, and David House
MITRE Corporation
Summary:
This short paper describes "Expert Finder and XperNet, two software
programs that automatically profile topics of interest and expertise
associated with employees based on employees' tool use, publications,
project roles, and written communication with others."
Expertise Recommender:
a flexible recommendation system and architecture
David W. McDonald and Mark S. Ackerman
Dept. of Information and Computer Science, Univ. of Calif., Irvine
Abstract:
Locating the expertise necessary to solve difficult problems is a
nuanced social and collaborative problem. In organizations, some
people assist others in locating expertise by making referrals. People
who make referrals fill key organizational roles that have been
identified by CSCW and affiliated research. Expertise locating systems
are not designed to replace people who fill these key organizational
roles. Instead, expertise locating systems attempt to decrease
workload and support people who have no other options. Recommendation
systems are collaborative software that can be applied to expertise
locating. This work describes a general recommendation architecture
that is grounded in a field study of expertise locating. Our expertise
recommendation system details the work necessary to fit expertise
recommendation to a work setting. The architecture and implementation
begin to tease apart the technical aspects of providing good
recommendations from social and collaborative concerns.
Doug Warner - May 3, 2002 - Wavelets
Approximate
Query Processing Using Wavelets
Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, Kyuseok
Shim
Abstract:
Approximate query processing has emerged as a cost-effective approach
for dealing with the huge data volumes and stringent responsetime
requirements of today's Decision Support Systems (DSS). Most work in
this area, however, has so far been limited in its query processing
scope, typically focusing on specific forms of aggregate
queries. Furthermore, conventional approaches based on sampling or
histograms appear to be inherently limited when it comes to
approximating the results of complex queries over high-dimensional DSS
data sets. In this paper, we propose the use of multi-dimensional
wavelets as an effective tool for general-purpose approximate query
processing in modern, high-dimensional applications. Our approach is
based on building wavelet-coefficient synopses of the data and using
these synopses to provide approximate answers to queries. We develop
novel query processing algorithms that operate directly on
thewavelet-coefficient synopses of relational tables, allowing us to
process arbitrarily complex queries entirely in the
wavelet-coefficient domain. This guarantees extremely fast response
times since our approximate query execution engine can dothe bulk of
its processing over compact sets of wavelet coefficients, essentially
postponing the expansion into relational tuples until the end-result
of the query. We also propose a novel wavelet decomposition algorithm
that can build these synopses in an I/O-efficient manner. Finally, we
conduct an extensive experimental study with synthetic as well as
reallife data sets to determine the effectiveness of our wavelet-based
approach compared to sampling and histograms. Our results demonstrate
that our techniques (1) provide approximate answers of better quality
than either sampling or histograms, (2) offer query execution-time
speedups of more than two orders of magnitude, and (3) guarantee
extremely fast synopsisconstruction times that scale linearly with the
size of the data.
TOPIC ISLANDS - A
Wavelet Based Text Visualization System.
Nancy E. Miller, Pak Chung Wong, Mary Brewster, and Harlan Foote.
To appear in IEEE Visualization '98.
Abstract:
We present a novel approach to visualize and explore unstructured
text. The underlying technology, called TOPIC-O-GRAPHY [TM], applies
wavelet transforms to a custom digital signal constructed from words
within a document. The resultant multiresolution wavelet energy is
used to analyze the characteristics of the narrative flow in the
frequency domain, such as theme changes, which is then to the overall
thematic content of the text document using statistical methods. The
thematic characteristics of a document can be analyzed at varying
degrees of detail, ranging from section-sized text partitions to
partitions consisting of a few words. Using this technology, we are
developing a visualization system prototype known as TOPIC ISLANDS
[TM] to browse a document, generate fuzzy document outlines, summarize
text by levels of detail and according to user interests, define
meaningful subdocuments, query text content, and provide summaries of
topic evolution.
Neal Richter - March 22, 2002 - Machine Learning and Natural Language Processing
Machine Learning and Natural
Language Processing
Lluís Marquez
Abstract:
In this report, some collaborative work between the fields of Machine
Learning (ML) and Natural Language Processing (NLP) is presented. The
document is structured in two parts. The first part includes a
superficial but comprehensive survey covering the state-of-the-art of
machine learning techniques applied to natural language learning
tasks. In the second part, a particular problem, namely Word Sense
Disambiguation (WSD), is studied in more detail. In doing so, four
algorithms for supervised learning, which belong to different
families, are compared in a benchmark corpus for the WSD task. Both
qualitative and quantitative conclusions are drawn.
A Machine Learning Approach to POS
Tagging
Lluís Marquez
Abstract:
We have applied the inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (nlp) task
of morphosyntactic disambiguation (Part Of Speech Tagging). The
learning process is supervised and obtains a language model oriented
to resolve pos ambiguities, consisting of a set of statistical
decision trees expressing distribution of tags and words in some
relevant contexts. The acquired decision trees have been directly used
in a tagger that is both relatively simple and fast, and which has
been tested and evaluated on the Wall Street Journal (wsj) corpus with
remarkable accuracy. However, better results can be obtained by
translating the trees into rules to feed a flexible relaxation
labelling based tagger. In this direction we describe a tagger which
is able to use information of any kind (n-grams, automatically
acquired constraints, linguistically motivated manually written
constraints, etc.), and in particular to incorporate the
machine-learned decision trees. Simultaneously, we address the problem
of tagging when only limited training material is available, which is
crucial in any process of constructing, from the scratch, an annotated
corpus. We show that high levels of accuracy can be achieved with our
system in this situation, and report some results obtained when using
it to develop a 5.5 million words Spanish corpus from scratch.
Christina Hayes - February 22, 2002 - Question answering
Christina Hayes and Chris Yahner
Math, Computer Science, MSU
Abstract:
Christina and Chris will give a demonstration of and discuss a
question answering system they built to take natural language
questions about dinosaurs and find the answers in a large text
database. The Q/A system is still under development, but the goal of
this colloquium is to learn about particular problems and discuss some
solutions in this area.
Steve Durbin - February 15, 2002 - Ontologies in practice
Making Ontologies Work for
Resolving Redundancies across Documents
J. O. Everett et al.
Xerox PARC and Fuji-Xerox Palo Alto Laboratory
Abstract:
Knowledge management efforts over the apst decade have produced many
document collections focused on particular domains. As such systems
scale up, they become unwieldy and ultimately unusable if obsolete and
redundant content is not continually identified and removed.
Discovery of Ontologies
from Knowledge Bases
Hendra Suryanto and Paul Compton
School of Computer Science and Engineering, University of New South
Wales
Abstract:
Current approacheds to building knowledge-based systems propose the
development of an ontology as a precursor to building the
problem-solver. This paper outlines an attempt to do the reverse and
discover interesting ontologies from systems built without the
ontology being explicit. [...]
See also previous papers relating to the Xerox Eureka system and to use of WordNet as an ontology.
Alex Dimitrov - February 8, 2002 - Neural Coding and Information Distortion
MSU campus, Room 108 Ag. Sci., 4:00 pm
This is a talk at MSU in the Complex Biological Systems seminar series. The technique discussed is a general method of extracting meaningful information from noisy signals. The following paper describes the algorithm.
Information Distortion and
Neural Coding
Tomas Gedeon, Albert E. Parker, and Alexander G. Dimitrov
Dept. of Mathematical Sciences and Center for Computational Biology,
MSU
Abstract:
Our main interest is the question of how neural ensemble activity
represents sensory stimuli. In this paper we discuss a new approach to
characterizing neural coding schemes. It attempts to describe the
specific stimulus parameters encoded in the neural ensemble activity
and at the same time determines the nature of the neural symbols with
which that information is encoded. This recently developed approach
for the analysis of neural coding minimizes an intrinsic
information-theoretic cost function (the information distortion) to
produce a simple approximation of a coding scheme, which can be
refined as more data becomes available. We study this optimization
problem. The admissible region is a direct product of simplices. We
show that the optimal solution always occurs at a vertex of the
admissible region. This allows us to reformulate the problem as a
maximization problem on the set of vertices and develop an algorithm,
which, under mild conditions, always finds a local extremum. We
compare the performance of this new algorithm to standard optimization
schemes on synthetic cases and on physiological recordings from the
cricket cercal sensory system.
Steve Durbin - January 25, 2002 - Collaborative Knowledge Building with RNW-like Software
Answer Garden 2: Merging
Organizational Memory with Collaborative Help
Mark S. Ackerman, David W. McDonald
Department of Information and Computer Science, University of
California, Irvine
Proceedings of the ACM Conference on Computer-Supported Cooperative
Work (CSCW'96)
Abstract:
This research examines a collaborative solution to a common problem,
that of providing help to distributed users. The Answer Garden 2
system provides a second generation architecture for organizational
and communitymemory applications. After describing the need for
Answer Garden 2's functionality, we describe the architecture of the
system and two underlying systems, the Cafe Construction Kit and
Collaborative Refinery. We also present detailed descriptions of the
collaborative help and collaborative refining facilities in the Answer
Garden 2 system.
Evolution of Contact Point: A
Case Study of a Help Desk and Its Users
Lena Mamykina, Everyday Computing Lab, GVU Center, Georgia Institute
of Technology
Catherine G. Wolf, IBM Thomas J. Watson IBM Research Center
Proceedings of the ACM 2000 Conference on Computer Supported
Cooperative Work, pp. 41-48
Abstract:
This paper describes the evolution of a concept, Contact Point, the
research process through which it evolved, and the work context and
practices which drove its evolution. Contact Point is a web-based
application that helps a business manage its relationships with its
customers. It can also be used within a business as a means for
managing the relationship between parts of the business. In this paper
we describe a study of the applicability of Contact Point to the
technical services organization and field personnel of a medical
device manufacturer. We found that there were opportunities to
potentially reduce call volume through Contact Point. We discovered,
however, that the technical service representatives sometimes filled
roles other than providing information in their telephone
conversations with field personnel. These functions included
reassuring callers that the callers' answers to questions were
correct, providing a rationale for information, and redirecting calls
to other departments. The ability to share a document and collaborate
in real time was viewed as very valuable. We also discovered that the
field personnel need information from a variety of other people in
order to do their jobs. These observations were used to enhance the
next iteration of Contact Point and to develop strategies for the
introduction of Contact Point to users.
Xerox Creates a Knowledge-Sharing
Culture Through Grassroots Efforts
Vicki J. Powers, American Productivity & Quality Center
Abstract:
This article, which is a Case Study of the APQC, contains a
description of the much heralded Eureka system used by Xerox service
reps. Eureka is not available as commercial software, but would be
almost trivial to implement with RightNow Web. Also notable in the
article is the list of domains of knowledge management developed at
Xerox, which brings some definition to the amorphous term "knowledge
management."
Experts Exchange
Check out the above website for an innovative, commercially viable,
worldwide cooperative learning activity that looks like it could be
built with RNW.
Steve Durbin - January 18, 2002 - Natural Language Dialog: Chatbots and Assistants
The unfriendly user: exploring social
reactions to chatterbots
Antonella De Angeli, Graham I. Johnson, and Lynne Coventry
NCR Self-Service, Advanced Technology and Research
Proceedings of the International Conference on Affective Human Factors
Design
Helander, Khalid and Tham, Editors
Asean Academic Press, London, 2001
Abstract:
This paper presents a preliminary evaluation of Alice, a chatterbot
designed in order to elicit anthropomorphic attributions and emotional
reactions from those who chat to 'her'. The analysis is based on both
transcripts of the interaction and user comments collected in a focus
group. Results suggest that the introduction of explicit
anthropomorphism in Human-Computer Interaction (HCI) is a complex
phenomenon, which could generate strong negative reactions from the
part of the user. The finding also demonstrates the importance of
placing the development of user interfaces within a social framework
as the technology tends to establish relationships with users.
Note: You can chat with Alice and read about her at A.L.I.C.E. AI Foundation. Both software
and the AI Markup Language basis for Alice is available there under
GPL.
The Role of a Natural Language
Conversational Interface in Online Sales: A Case Study
Joyce Chai, et al.
IBM T. J. Watson Research Center and MIT Artificial Intelligence
Lab
International Journal of Speech Technology 4, 285-295, 2001
Abstract:
This paper describes the evaluation of a natural language dialog-based
navigation system (HappyAssistant) that helps users access e-commerce
sites to find relevant information about products and services. The
prototype system leverages technologies in natural language processing
and human-computer interaction to create a faster and more intuitive
way of interacting with websites, especially for less experienced
users. The result of a comparative study shows that users prefer the
natural language-enabled navigation two to one over the menu driven
navigation. In addition, the study confirmed the efficiency of using
natural language dialog in terms of the number of clicks and the
amount of time required to obtain the relevant information. In the
case study, as compared to the menu driven system, the average number
of clicks used in the natural language system was reduced by 63.2% and
the average time was reduced by 33.3%.
Doug Warner - January 11, 2002 - Knowledge Gap Analysis.
Review: Knowledge Base Maintenance
Using Knowledge Gap Analysis
Scott Spangler, Jeffrey Kreulen, IBM Almaden Research Center
Proceedings, KDD 2001, p462-466.
Tangential, skim only: Focused
Knowledge Management
John L. Gordon, Michael Edge, NWAIAG, Blackburn College
Applications and Innovations in Intelligent Systems V. Pages
207-219. By SGES Publications. ISBN: 1-899621-19-9
http://www.akri.org/Research/es97/default.htm
Abstract:
Knowledge Management has been an issue within the NWAIAG since 1993
when several problem cases were identified in organisations. These
cases concerned some sort of breakdown in the organisations knowledge
asset. A period of research and analysis of the problems preceded the
formation of a collaborative group of companies intending to develop
tools and systems which support the strategic decision making
process. The project, lead by the NWAIAG aims to raise the profile of
a company's Knowledge Asset, particularly during the planning and
management of change.
In order to help focus the views of a broad range of opinion and experience, the team decided to create a prototype Knowledge Management Tool. This tool, although not intended to be a commercial product, helped to clarify the ideas and opinions of the group and allowed them to develop a common approach. The tool quickly emerged as a bottom up method of implementation for this particular aspect of knowledge asset management. A more complete perspective was developed by refocusing on the demand for information about the knowledge asset, starting at the board room. Furthermore, considerable interest was found in the provision of a graphical visualisation of a companies knowledge asset (or at least part of it) and the dependent relationships which existed within it.
Zuzana Gedeon - December 21, 2001 - Information Bottleneck Method for Clustering.
The Power of Word
Clustering for Text Classification
Noam Slonim and Naftali Tishby
Prepared: January 2001. In proceedings of the European Colloquium on
IR Research, ECIR 2001.
Abstract:
The recently introduced Information bottleneck method provides an
information theoretic framework for extracting features of one
variable, that are relevant for the values of another
variable. several previous works already suggested applying this
method for document clustering, gene expression data analysis,
spectral analysis and more. In this work we present a novel
implementation of this method for supervised text
classification. Specificially, we apply the information bottleneck
method to find word-clusters that preserve the information about
document categories and use these clusters as features for
classification. Previous work used a simmilar clustering procedure to
show thet word-clusters can significantly reduce the feature space
dimensionality, with only a minor change in classification
accuracy. In this work we reproduce these results and go further to
show that when the training sample is small word clusters can yield
significant improvement in classification accuracy (up to 18%) over
the performance using th ewords directly.
(Optional)Multivariate Information
Bottleneck
Noam Slonim and Naftali Tishby
Prepared: January 2001. In proceedings of the European Colloquium on
IR Research, ECIR 2001.
Abstract:
The information bottleneck method is an unsupervised non-parametric
data organization technique. Given a joint distribution P(A,B), this
method constructs a new variable T that extracts partitions, or
clusters, over the values of A that are informative to B. The
information bottleneck has already been applied to document
classification, gene expression, neural code, and spectral
analysis. In this paper, we introduce a general principled framework
for multiwariate extensions of the information bottleneck method
More complete list of papers with local gzipped copies are here.
Zuzana Gedeon - November 30, 2001
Summarization.
(First two papers are handouts they are in proceedings from ICKM 2001
and will be available from ACM library later)
Recent Developments in Text Summarization
Inderjeet Mani
Georgetown University and The MITRE Corporation
Abstract:
With the explosion in the quantity of on-line information in recent
years, demand for text summarization technology is growing. Increased
pressure for technology advances is coming from users of the web,
on-line information sources, and new mobile devices, as well as from
the need for corporate knowledge management. Commercial companies are
increasingly starting to offer text summarization capabilities, often
bundled with information retrieval tools. In this paper, I will
discuss the significance of some recent developments summarization
technology.
Summarization of Discussion Groups
R. Farrell, P. Fairweather, K. Snyder
IBM T.J.Watson Research Center
Abstract:
In this paper we describe an algorithm to generate textual summaries
of discussion groups. Our system combines sentences extracted from
individual postings into variable length summaries by utilizing the
hierachical discourse context provided by discussion threads. We have
incorporated this algorithm into a Web-based application caled IDS
(Interactive discussion Summarizer)
Term Weighting Method
based on Information Gain Ratio for Summarizing Documents retrieved by
IR systems
Tatsunori Mori Miwa Kikuchi Kazufumi Yoshida
Div. of Electrical and Computer Eng., Yokohama National University
Abstract:
This paper proposes a new term weighting method for summarizing
documents retrieved by IR systems. Unlike query-biased summarization
methods, our method utilizes not the information of query, but the
similarity information among original documents by hierarchical
clustering. In order to map the similarity structure of the clusters
into the weight of each word, we adopt the information gain ratio
(IGR) of probabilistic distribution of each word as a term weight. If
the amount of information of a word in a cluster increases after the
cluster is partitioned into sub-clusters, we may consider that the
word contributes to determine the structure of the sub-clusters. The
IGR is a measure to express the degree of such contribution. we will
show the effectiveness of our mathod based on th eIGR by comparison
with other systems
Zuzana Gedeon - November 16, 2001
CIKM 2001
10th International Conference on Information and Knowledge
Management
I will briefly summarize some of the talks, I think are interesting,
and we can then decide on topics to cover in more detail later. Ones
that I found interesting are listed below.
User
navigation for web searching
Using navigation data to improve IR
functions in the context of Web search
Mark H. Hansen, Elizabeth Shriver
Bell-labs, Lucent
Abstract:
As part of the process of delivering content, devices like proxies and
gateways log valuable information about the activities and navigation
patterns of users on the Web. In this study, we consider how this
navigation data can be used to improve Web search. A query posted to a
search engine together with the set of pages accessed during a search
task is known as a search session. We develop a mixture model for the
observed set of search sessions, and propose variants of the classical
EM algorithm for training. The model itself yields a type of
navigation-based query clustering. By implicitly borrowing strength
between related queries, the mixture formulation allows us to identify
the "highly relevant" URLs for each query cluster. Next, we explore
methods for incorporating existing labeled data (the Yahoo! directory,
for example) to speed convergence and help resolve low-traffic
clusters. Finally, the mixture formulation also provides for a simple,
hierarchical display of search results based on the query
clusters. The effectiveness of our approach is evaluated using proxy
access logs for the outgoing Lucent proxy.
Model-based Feedback in the
Language Modeling Approach to Information Retrieval
Chengxiang Zhai
Carnegie Mellon University
Abstract:
This is just my personal summary: feedback into original search using
statistical methods, instead of query expansion using keywords ...,
use results of original query as a new feedback to the search and
compare new results based on KL-divergence to to the results retrieved
originally.
Steve Durbin - November 9, 2001
Introduction to small-world networks
The physics of the Web
Albert-László Barabási
Department of Physics, University of Notre Dame
The physics of
the Web
Abstract:
Statistical mechanics is offering new insights into the structure and
dynamics of the Internet, the World Wide Web and other complex
interacting systems.
The Small-World Phenomenon: an Algorithmic Perspective
Jon Kleinberg
Department of Computer Science, Cornell University
kleinberg-small_world.ps
Abstract:
Long a matter of folklore, the "small-world phenomenon"—the
principle that we are all linked by short chains of
acquaintances—was inaugurated as an area of experimental study in
the social sciences through the pioneering work of Stanley Milgram in
the 1960's. This work was among the first to make the phenomenon
quantitative, allowing people to speak of the "six degrees of
separation" between any two people in the United States. Since then, a
number of network models have been proposed as frameworks in which to
study the problem analytically. One of the most refined of these
models was formulated in recent work of Watts and Strogatz; their
framework provided compelling evidence that the small-world phenomenon
is pervasive in a range of networks arising in nature and technology,
and a fundamental ingredient in the evolution of the World Wide Web.
But existing models are insufficient to explain the striking algorithmic component of Milgram's original findings: that individuals using local information are collectively very effective at actually constructing short paths between two points in a social network. Although recently proposed network models are rich in short paths, we prove that no decentralized algorithm, operating with local information only, can construct short paths in these networks with non-negligible probability. We then define an infinite family of network models that naturally generalizes the Watts-Strogatz model, and show that for one of these models, there is a decentralized algorithm capable of finding short paths with high probability. More generally, we provide a strong characterization of this family of network models, showing that there is in fact a unique model within the family for which decentralized algorithms are effective.
J Kleinberg. The small-world phenomenon: An algorithmic perspective. http://www.cs.cornell.edu/home/kleinber/swn.ps
Doug Warner - November 2, 2001
A Fourier-Spectrum Based Approach to Aggregate and Visualize
Decision Trees for Mobile Applications
Hillol Kargupta and Byung-Hoon Park
kargupta-fourier.pdf
Abstract:
This paper presents a novel Fourier analysis-based technique to
aggregate, transmit, and visualize decision trees in a mobile
environment. Fourier representation of a decision tree has several
useful properties that are particularly useful for mining continuous
data streams from small mobile computing devices. This paper presents
algorithms to compute the Fourier spectrum of a decision tree and the
vice versa. It offers a framework to aggregate decision trees in
their Fourier representations. It also describes a
touch-pad/ticker-based approach to visualize decision trees using
their Fourier spectrum and an implementation for PDAs.
Neal Richter - Oct 26, 2001
Self-Organizing Maps Optimizations & Insights
Read One of these Two:
Genetic Synthesis of Unsupervised Learning Algorithms
Ali DASDAN and Kemal OFLAZER
genetic-synthesis-of-unsupervised.pdf
Abstract:
This paper presents new unsupervised learning algorithms that have
been synthesized using a genetic approach. A set of such learning
algorithms has been compared with the classical Kohonen's Algorithm on
the Self-Organizing Map and has been found to provide a better
performance measure. This study indicates that there exist many
unsupervised learning algorithms that lead to an organization similar
to that of Kohonen's Algorithm, and that genetic algorithms can be
used to search for optimal algorithms and optimal architectures for
the unsupervised learning.
@misc{ das-genetic,
author = "Ali Dasdan",
title = "Genetic Synthesis of Unsupervised Learning Algorithms",
url = "citeseer.nj.nec.com/20882.html" }
GTM: A Principled Alternative to the Self-Organizing Map
Bishop, Svensen, & Williams
bishop97gtm.pdf
Abstract:
The Self-Organizing Map (SOM) algorithm has been extensively studied
and has been applied with considerable success to a wide variety of
problems. However, the algorithm is derived from heuristic ideas and
this leads to a number of significant limitations. In this paper, we
consider the problem of modelling the probability density of data in a
space of several dimensions in terms of a smaller number of latent, or
hidden, variables. We introduce a novel form of latent variable model,
which we call the GTM algorithm (for Generative Topographic Map),
which allows general non-linear transformations from latent space to
data space, and which is trained using the EM
(expectationmaximization). We demonstrate the performance of the GTM
algorithm on simulated data from flow diagnostics for a multi-phase
oil pipeline
Steve Durbin - Oct 19, 2001
Self-organizing maps of document collections
Self-Organizing Maps of Document Collections: A New Approach to
Interactive Exploration
Krista Lagus, Timo Honkela, Samuel Kaski, and Teuvo Kohonen
Neural Networks Research Centre, Helsinki University of Technology
websom.ps
Abstract:
Powerful methods for interactive exploration and search from
collections of free-form textual documents are needed to manage the
ever-increasing flood of digital information. In this article we
present a method, WEBSOM, for automatic organization of full-text
document collections using the self-organizing map (SOM)
algorithm. The document collection is ordered onto a map in an
unsupervised manner utilizing statistical information of short word
contexts. The resulting ordered map where similar documents lie near
each other thus presents a general view of the document space. With
the aid of a suitable (WWWbased) interface, documents in interesting
areas of the map can be browsed. The browsing can also be
interactively extended to related topics, which appear in nearby areas
on the map. Along with the method we present a case study of its
use.
K. Lagus, T. Honkela, S. Kaski, & T. Kohonen. "Self-Organizing Maps of Document Collections: A New Approach to Interactive Exploration," In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1996.
WEBSOM demo site: WEBSOM - A
novel SOM-based approach to free-text mining
Graphical views of medical and financial data, and the Internet: Visual Net
Python - Oct 12, 2001
Python: choice language for prototyping?
No official reading papers.
Suggested: Python book, by O'REILLY
We will be discussing the python features.
Come if you know, or would like to know about this programming
language.
Other web sites of interest:
Python language website has a
wealth of info
Link to Python Tutorial,
Language/Library Reference and more from Python site
Python
Cookbook
Python DevCenter
Zuzana Gedeon - Oct 5, 2001
Finding temporal topics based on graph theory.
Alexandrin Popescul, Gary Flake, Steve Lawrence, Lyle H. Ungar,
C. Lee Giles
Clustering and Identifying Temporal Trends in Document Databases
(2000)
DaimlerChrysler Research & Technology; Stanford University
Abstract:
We introduce a simple and efficient method for clustering and
identifying temporal trends in hyper-linked document databases. Our
method can scale to large datasets because it exploits the underlying
regularity often found in hyper-linked document databases. Because of
this scalability, we can use our method to study the temporal trends
of individual clusters in a statistically meaningful manner. As an
example of our approach, we give a summary of the temporal trends
found in a scientific literature database with thousands of
documents.
popescul_temporal_trends00.ps
@inproceedings{ popescul00clustering,
author = "Alexandrin Popescul and Gary Flake
and Steve Lawrence and Lyle Ungar and C. Lee Giles",
title = "Clustering and Identifying Temporal Trends in Document
Databases",
booktitle = "Advances in Digital Libraries, ADL 2000",
address = "Washington, DC",
pages = "173-182",
year = "2000",
url = "citeseer.nj.nec.com/popescul00clustering.html" }
Erez Hartuv, Ron Shamir
A Clustering Algorithm based on Graph Connectivity (1999)
Dept. of Computer Sciences, Tel-Aviv University
Abstract:
We have developed a novel algorithm for cluster analysis that is based
on graph theoretic techniques. A similarity graph is defined and
clusters in that graph correspond to highly connected subgraphs. A
polynomial algorithm to compute them efficiently is presented. Our
algorithm produces a solution with some provably good properties and
performs well on simulated and real data. Keywords: Algorithms,
Clustering, Minimum cut, Graph connectivity, diameter.
Steve Durbin - Sept 28, 2001 - Conversational access to information
Mehmet H. Goker, Cynthia A. Thompson
Personalized Conversational Case-Based Recommendation
Dept. of Computer Science, NEC Research Institute, Princeton
Abstract:
In this paper, we describe the Adaptive Place Advisor, a user
adaptive, conversational recommendation system designed to help users
decide on a destination, specifically a restaurant. We view the
selection of destinations as an interactive, conversational process,
with the advisory system inquiring about desired item characteristics
and the human responding. The user model, which contains preferences
regarding items, attributes, values, value combinations, and
diversification, is also acquired during the conversation. The system
enhances the user's requirements with the user model and retrieves
suitable items from a case-base. If the number of items found by the
system is unsuitable (too high, too low) the next attribute to be
constrained or relaxed is selected based on the information gain
associated with the attributes. We also describe the current status of
the system and future work.
case-based_recommender.ps
Neal Richter - Sept 21, 2001 - Feature Selection for Text
David D. Lewis
Feature Selection and Feature Extraction for Text Categorization
(1992)
Abstract:
The effect of selecting varying numbers and kinds of features for use
in predicting category membership was investigated on the Reuters and
MUC-3 text categorization data sets. Good categorization performance
was achieved using a statistical classifier and a proportional
assignment strategy. The optimal feature set size for word-based
indexing was found to be surprisingly low (10 to 15 features) despite
the large training sets. The extraction of new text features by
syntactic analysis and feature clustering was investigated on the
Reuters data set. Syntactic indexing phrases, clusters of these
phrases, and clusters of words were all found to provide less
effective representations than individual words.
lewis92feature.pdf
@inproceedings{ lewis92feature,
author = "D. D. Lewis",
title = "{Feature Selection and Feature Extraction for Text
Categorization}",
booktitle = "Proceedings of Speech and Natural Language Workshop",
publisher = "Morgan Kaufmann",
address = "San Mateo, California",
pages = "212-217",
year = "1992",
url = "citeseer.nj.nec.com/lewis92feature.html" }
C. Cardie and D. Pierce.
Error-Driven Pruning of Treebank Grammars for Base Noun Phrase
Identification. ACL/Coling-98, 1998, 218-224. Association for
Computational Linguistics.
Abstract:
This paper addresses the problem of handling skewed class
distributions within the case-based learning (CBL) framework. We first
present as a baseline an informationgain-weighted CBL algorithm and
apply it to three data sets from natural language processing (NLP)
with skewed class distributions. Although overall performance of the
baseline CBL algorithm is good, we show that the algorithm exhibits
poor performance on minority class instances. We then present two CBL
algorithms designed to improve the performance of minority class
predictions. Each variation creates test-case-specific feature weights
by first observing the path taken by the test case in a decision tree
created for the learning task, and then using pathspecific information
gain values to create an appropriate weight vector for use during case
retrieval. When applied to the NLP data sets, the algorithms are shown
to significantly increase the accuracy of minority class predictions
while maintaining or improving overall classification accuracy.
cardie-acl98.ps
Round-Robin DeathMatch! (1 paper each) - September 7, 2001 - Hierarchical Taxonomies using Divisive Partitioning.
Zuzana Gedeon
Hierarchical Taxonomies using Divisive Partitioning.(1998)
Daniel Boley
boley98hierarchical.ps
Abstract:
We propose an unsupervised divisive partitioning algorithm for
document data sets which enjoys many favorable properties. In
particular, the algorithm shows excellent scalability to large data
collections and produces high quality clusters which are competitive
with other clustering methods. The algorithm yields information on the
significant and distinctive words within each cluster, and these words
can be inserted into the naturally occuring hierarchical structure
produced by the algorithm. The result is an automatically generated
hierarchical topical taxonomy of a document set. In this paper, we
show how the algorithm's cost scales up linearly with the size of the
data, illustrate experimentally the quality of the clusters produced,
and show how the algorithm can produce a hierarchical topical
taxonomy. Keywords: unsupervised clustering, hierarchical taxonomy,
scalability, text document data mining
@misc{ boley98hierarchical,
author = "D. Boley",
title = "Hierarchical taxonomies using divisive partitioning",
text = "D. L. Boley. Hierarchical taxonomies using divisive
partitioning. Technical Report TR-98-012, Department of Computer
Science, University of Minnesota, Minneapolis, 1998.",
year = "1998",
url = "citeseer.nj.nec.com/boley98hierarchical.html" }
Steve Durbin
A Procedure for Multi-Class Discrimination and Some Linguistic
Applications by V. Pericliev and R.E. Valdes-Perez
(COLING-ACL), Montreal, 1998
pericliev-coling98.ps
Abstract: The paper describes a novel computational tool for multiple concept learning. Unlike previous approaches like ID3 or C4.5, whose major goal is prediction on unseen instances rathe r than the legibility of the output, our MPD (Maximally Parsimonious Discrimination) program emphasizes the conciseness and intelligibility of the resultant class descriptions, using three intuitive simplicity criteria to this end. We illustrate MPD with some additional application s than those commonly associated with the mentioned algorithms, such as learning verbal case f rames, translational correspondences or morphological rules. These include componential analys is (in lexicology and phonology), language typology, and speech pathology.
Neal Richter
A learner-independent evaluation of the usefulness of statistical
phrases for automated text categorization.
by Maria Fernanda Caropreso, Stan Matwin, and Fabrizio Sebastiani
Text Databases and Document Management: Theory and Practice
caropreso-tc.pdf
Abstract:
In this work we investigate the usefulness of n-grams for document
indexing in text categorization (TC). We call n-gram a set Gk of n
word stems, and we say that Gk occurs in a document Dj when a sequence
of words appears in Dj that, after stop word removal and stemming,
consists exactly of the n stems in Gk, in some order. Previous
researches have investigated the use of n-grams (or some variant of
them) in the context of specific learning algorithms, and thus have
not obtained general answers on their usefulness for TC. In this work
we investigate the usefulness of n-grams in TC independently of any
specific learning algorithm. We do so by applying feature selection to
the pool of all k-grams (k <= n), and checking how many n-grams score
high enough to be selected in the top sigma k-grams. We report the
results of our experiments, using various feature selection measures
and varying values of sigma, performed on the Reuters-21578 standard
TC benchmark. We also report results of making actual use of the
selected n-grams in the context of a linear classifier induced by
means of the Rocchio method.
Round-Robin DeathMatch! (1 paper each) - August 31, 2001
Bikramjit Banerjee
Question Answering by Passage Selection by C.L.A Clarke, et al.
Proceedings of the TREC-9, 2000
Abstract: Not Availiable
trec9-clarke.pdf
Neal Richter
FALCON: Boosting Knowledge for Answer Engines by S. Harabagiu, et
al.
Proceedings of the TREC-9, 2000
trec9-smu-falcon.pdf
Abstract:
This paper presents the features of Falcon, an answer engine that
integrates different forms of syntactic, semantic and pragmatic
knowledge for the goal of achieving better performance. The answer
engine handles question reformulations, finds the expected answer type
from a large hierarchy that incorporates the WordNet semantic net and
extracts answers after performing unifications on the semantic forms
of the question and its candidate answers. To rule out erroneous
answers, it provides a justification option, implemented as an
abductive proof. In TREC-9, Falcon generated a score of 58% for short
answers and 76% for long answers.
Zuzana Gedeon
Representing sentence structure in HMM for information extraction by
S.Ray, M.Craven
IJCAI-01 ie.hmm.ijcai01.pdf
Abstract:
We study the application of Hidden Markov Models (HMMs) to learning
information extractors for n-ary relations from free text. We propose
an approach to representing the grammatical structure of sentences in
the states of the model. We also investigate using an objective
function during HMM training which maximizes the ability of the
learned models to identify the phrases of interest. We evaluate our
methods by deriving extractors for two binary relations in biomedical
domains. Our experiments indicate that our approach learns more
accurate models than several baseline approaches.
Steve Durbin
Scaling Question Answering to the Web by U of Washington.
http://www10.org/cdrom/papers/120/index.html
Abstract:
The wealth of information on the web makes it an attractive resource
for seeking quick answers to simple, factual questions such as ``who
was the first American in space?'' or ``what is the second tallest
mountain in the world?'' Yet today's most advanced web search
services (e.g., Google and AskJeeves) make it surprisingly tedious to
locate answers to such questions. In this paper, we extend
question-answering techniques, first studied in the information
retrieval literature, to the web and experimentally evaluate their
performance.
First we introduce MULDER, which we believe to be the first
general-purpose, fully-automated question-answering system available
on the web. Second, we describe MULDER's architecture, which relies
on multiple search-engine queries, natural-language parsing, and a
novel voting procedure to yield reliable answers coupled with high
recall. Finally, we compare MULDER's performance to that of Google and
AskJeeves on questions drawn from the TREC-8 question track. We find
that MULDER's recall is more than a factor of three higher than that
of AskJeeves. In addition, we find that Google requires 6.6 times as
much user effort to achieve the same level of recall as MULDER.
Neal Richter, Bikramjit Banerjee - August 17, 2001
Review of a few interesting papers from IJCAI-2001
Handout on Question Answering Systems
Tutorial by Dan Molovan
& Sandra Harabagiu
Several Other Papers to Peruse Add Links Later
NLP Driver IR
Automatic Sumamry Generation
Steve Durbin, Zuzana Gedeon - August 10, 2001 - New Zealand Digital Library Project overview
Cancelled due to company party. Have a look for future reference!
The goal of our research program is to explore the potential of
internet-based digital libraries. Our vision is to develop systems
that automatically impose structure on anarchic, uncatalogued,
distributed repositories of information, thereby providing information
consumers with effective tools to locate what they need and to peruse
it conveniently and comfortably.
Project members are actively working on techniques for creating,
managing, and and mainatining collections; extracting metadata from
legacy documents; analysing library usage and user needs; Maori,
Arabic and Chinese language systems; internationalising the library
interface; optical music recognition and musical collections; novel
interfaces for formulating queries and visualising results; novel
interfaces for browsing metadata; text mining for keyphrases,
acronyms, and other metadata; keyphrase extraction and phrase-based
browsing; and other research topics.
Links:
This page has links to a variety of collections in different languages
that make use of the NZDL tools. Documentation, research descriptions,
software downloads(GPL), and publications are available.
New Zealand Digital Library
A very readable review paper by project members is:
Applications of Machine Learning in Information Retrieval
J. Cunningham, J. Littin, and I. H. Wittin
Dept. of Computer Science, Univ. of Waikato, New Zealand
machine_learning_in_IR.pdf
Proceedings of the Ninth National Conference on Artificial
Intelligence.
San Jose CA, AAAI Press, pp. 47-52.
Cunningham, S. J., Litten, J., Witten, I. H. (1997).
Applications of machine learning in Information retrieval.
Working paper 97/6. C.S department, University of Waikato, New
Zealand.
Zuzana Gedeon - July 27, 2001 - Novelty Detection, DA and other new methods.
A Hierarchical Probabilistic Model for Novelty Detection in Text
(1999)
L. Douglas Baker, Thomas Hofmann Andrew K. McCallum, Yiming Yang
baker99novelty.ps
Abstract:
Topic Detection and Tracking (TDT) is a variant of classification in
which the classes are not known or fixed in advance. Consider for
example an incoming stream of news articles or email messages that are
to be classified by topic; new classes must be created as new topics
arise. The problem is a challenging one for machine
learning. Instances of new topics must be recognized as not belonging
to any of the existing classes (detection), and instances of old
topics must be correctly classified (tracking)—often with extremely
little training data per class. This paper proposes a new approach to
TDT based on probabilistic, generative models. Strong statistical
techniques are used to address the many chalenges: hierarchical
shrinkage for sparse data, statistical "garbage collection" for new
event detection, clustering in time to separate the different events
of a common topic, and deterministic annealing for creating the
hierarchy. Preliminary experimental results show promise.
@misc{ baker99hierarchical,
author = "D. Baker and T. Hofmann and A. McCallum and Y. Yang",
title = "A hierarchical probabilistic model for novelty detection in
text",
text = "D. Baker, T. Hofmann, A. McCallum, and Y. Yang. 1999. A
hierarchical probabilistic model for novelty detection in
text. In ICML-99. In submission.",
year = "1999",
url = "citeseer.nj.nec.com/article/baker99hierarchical.html" }
Neal Richter - July 13, 2001 - Adaptive Clustering
Cluster-Based Adaptive Information Retrieval
Jay N. Bhuyany, Jitender S. Deogunz, Vijay V. Raghavan
cluster-based-adaptive-information.ps
Abstract:
This paper discusses the issues involved in the design of a complete
information retrieval system based on useroriented clustering
schemes. Clusters are constructed taking into account the users'
perception of similarity between documents. The system accumulates
feedback from the users and employs it to construct useroriented
clusters. An optimization function to improve the effectiveness of the
clustering process is developed. A retrieval process based on the
clustering scheme is described. The system developed is experimentally
validated and compared with existing systems.
Adaptive Clustering Of Hypermedia Documents
Andrew Johnson, Farshad Fotouhi
johnson96adaptive.pdf
Abstract:
A hypermedia system connects various types of information into a
network where related nodes of information (text, audio, video) are
connected by links. Clustering these nodes is an effective way to
reduce information-overhead, allowing the user to browse through the
clusters as well as the individual nodes. In this paper, we compare
the use of two adaptive algorithms (genetic algorithms, and neural
networks) in clustering hypermedia documents. These clusters allow the
user to index into this overwhelming number of nodes and find needed
information quickly. We base the clustering on the user's paths
through the hypermedia document and not on the content of the nodes or
the structure of the links in the document, thus the clustering
reflects the unique relationships each user sees among the nodes. The
original hypermedia document remains untouched, however each user will
now have a personalized index into this document.
@article{ johnson96adaptive,
author = "Andrew Johnson and Farshad Fotouhi",
title = "Adaptive Clustering of Hypermedia Documents",
journal = "Information Systems",
volume = "21",
number = "6",
pages = "459-473",
year = "1996",
url = "citeseer.nj.nec.com/article/johnson94adaptive.html" }
Bikramjit Banerjee - June 22, 2001 - Scalable BIRCH
Clustering Large Datasets in Arbitrary Metric Spaces
V. Ganti, R. Ramakrishnan and J. Gehrke
birchfm.ps
Abstract:
Clustering partitions a collection of objects into groups called
clusters, such that similar objects fall into the same
group. Similarity between objects is defined by a distance function
satisfying the triangle inequality; this distance function along with
the collection of objects describes a distance space. In a distance
space, the only operation possible on data objects is the computation
of distance between them. All scalable algorithms in the literature
assume a special type of distance space, namely a k-dimensional vector
space, which allows vector operations on objects.We present two
scalable algorithms designed for clustering very large datasets in
distance spaces. Our first algorithm BUBBLE is, to our knowledge, the
first scalable clustering algorithm for data in a distance space. Our
second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number
of calls to the distance function, which maybe computationally very
expensive. Both algorithms make only a single scan over the database
while producing high clustering quality. In a detailed experimental
evaluation, we study both algorithms in terms of scalability and
quality of clustering. We also show results of applying the algorithms
to a real-life dataset.
V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. Clustering large datasets in arbitrary metric spaces. Technical report, University of Wisconsin-Madison, 1998.
Doug Warner - June 15, 2001 - Targeted Marketing
Segmentation-Based Modeling for Advanced Targeted Marketing
C. Apte, E. Bibelnieks, R. Natarajan, E. Pednault, F. Tipu,
D. Campbell, B. Nelson
IBM T.J. Watson Research Center and Fingerhut Business Intelligence
segmentation-model-kdd01-apte.pdf
Abstract:
Fingerhut Business Intelligence (BI) has a long and successful
history of building statistical models to predict consumer behavior.
The models constructed are typically segmentation-based models in
which the target audience is split into subpopulations (i.e., customer
segments) and individually tailored statistical models are then
developed for each segment. Such models are commonly employed in the
direct-mail industry; however, segmentation is often performed on an
ad-hoc basis without directly considering how segmentation affects the
accuracy of the resulting segment models. Fingerhut B I approached IBM
Research with the problem of how to build segmentation-based models
more effectively so as to maximize predictive accuracy. The IBM
Advanced Targeted Marketing—Single Events(TM) (IBM ATM -SE(TM))
solution is the result of IBM Research and Fingerhut BI directing
their efforts jointly towards solving this problem. This paper
presents an evaluation of ATMSE's modeling capabilities using data
from Fingerhut's catalog mailings.
Steve Durbin - June 8, 2001 - Ontologies I
Using WordNet in a Knowledge-Based Approach to Information
Retrieval
R. Richardson and A. F. Smeaton
Dublin City University
richardson-wordnet_in_ir.ps
Abstract:
The application of natural language processing tools and techniques to
information retrieval tasks has long since been identified as
potentially useful for the quality of information
retrieval. Traditionally, IR has been based on matching words or terms
in a query with words or terms in a document. In this paper we
introduce an approach to IR based on computing a semantic distance
measurement between concepts or words and using this word distance to
compute a similarity between a query and a document. Two such semantic
distance measures are presented in this paper and both are benchmarked
on queries and documents from the TREC collection. Although our
results in terms of precision and recall are disappointing, we
rationalise this in terms of our experimental setup and our results
show promise for future work in this area.
Steve Durbin - May 11, 2001 - Topic words
Finding Topic Words for Hierarchical Summarization
Dawn Laurie, W. Bruce Croft, and Arnold Rosenberg
University of Massachusetts, Amherst
croft_topic_words.pdf
Abstract:
Hierarchies have long been used for categorization, summarization, and
access to information. In this paper we define summarization in terms
of a probabilistic language model and use the definition to explore a
new technique for automatically generating topic hierarchies by
applying a graph-theoretic algorithm, which is an approximation of the
Dominating Set Problem. The algorithm efficiently generates terms
according to a language model. We compare the new technique to
previous methods proposed for constructing topic hierarchies including
subsumption and lexical hierarchies, as well as words found using
TF-IDF. Our results show that the new technique performs as well as or
better than these other techniques.
Neal Richter - May 4, 2001 - Search Engines
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page, Founders of Google
google.pdf
Abstract:
In this paper, we present Google, a prototype of a large-scale search
engine which makes heavy use of the structure present in
hypertext. Google is designed to crawl and index the Web efficiently
and produce much more satisfying search results than existing
systems. The prototype with a full text and hyperlink database of at
least 24 million pages is available at http://google.stanford.edu/ To
engineer a search engine is a challenging task. Search engines index
tens to hundreds of millions of web pages involving a comparable
number of distinct terms. They answer tens of millions of queries
every day. Despite the importance of large-scale search engines on the
web, very little academic research has been done on them. Furthermore,
due to rapid advance in technology and web proliferation, creating a
web search engine today is very different from three years ago. This
paper provides an in-depth description of our large-scale web search
engine—the first such detailed public description we know of to
date. Apart from the problems of scaling traditional search techniques
to data of this magnitude, there are new technical challenges involved
with using the additional information present in hypertext to produce
better search results. This paper addresses this question of how to
build a practical large-scale system which can exploit the additional
information present in hypertext. Also we look at the problem of how
to effectively deal with uncontrolled hypertext collections where
anyone can publish anything they want.
Fulcrum SearchServer 4.0
Hummingbird Ltd
SearchServer Technical
Whitepaper
Abstract:
Fulcrum SearchServer is a robust, multi-platform indexing and
retrieval engine that delivers full-function text retrieval
capabilities to the end user either from a custom desktop application
or from the Web. SearchServer features a comprehensive set of advanced
text retrieval capabilities, a scaleable client/server distributed
processing architecture, open, standards-based interfaces and superior
performance. These features are all built on a proven search and
retrieval technology. Long-term compatibility and interoperability
with other system components is assured through Hummingbird's
commitment to industry standards—including use of the familiar SQL
query and data manipulation model and the Open Database Connectivity
(ODBC) standard. SearchServer accesses text in its original format and
location, ensuring seamless integration with existing corporate
infrastructures such as word processors and databases. The Hummingbird
application development products, used in conjunction with other
popular development tools, enable rapid prototyping, development and
deployment of SearchServer-based custom application solutions.
Bikramjit Banerjee - April 27, 2001 - Automated answering of questions
An Intelligent System for Question Answering
Sanda M. Harabagiu, Dan I. Moldovan
reno.ps
Abstract:
This paper introduces a system intended for question answering based
on abductive inference. The system uses a large knowledge base
structured around WordNet. Texts and questions are uniformly processed
and semantic paths between concepts are established using a marker
propagation method. The paths bring forward inferences and
implicatures that are otherwise difficult to extract. Implicit
inferences contribute to the information provided by explicit
inferences, producing answers for a variety of question types.
Zuzana Gedeon - April 20, 2001 - Frequent sets - an approach for meaningfull labeling?
Finding All Maximal Frequent Sequences in Text
Helena Ahonen-Myka
ham_icml99.ps
Abstract:
In this paper we present a novel algorithm for discovering maximal
frequent sequences in a set of documents, i.e., such sequences of
words that are frequent in the document collection and, moreover, that
are not contained in any other longer frequent sequence. A sequence is
considered to be frequent if it appears in at least oe documents, when
oe is the frequency threshold given. Our approach combines bottom-up
and greedy methods, and, hence, is able to extract maximal sequences
without considering all the frequent subsequences of them. This is a
necessity, since maximal frequent sequences in documents may be rather
long. When we have found the set of maximal frequent sequences for
each...
Helena Ahonen. Finding all maximal frequent sequences in text. In ICML99 Workshop, Machine Learning in Text Data Analysis, Bled, Slovenia, 1999.
CLOSET: An Efficient Algorithm for Mining Frequent Closed
Itemsets
Jian Pei, Jiawei Han, and Runying Mao
pei00closet.ps
Abstract:
Association mining may often derive an undesirably large set of
frequent itemsets and association rules. Recent studies have proposed
an interesting alternative: mining frequent closed itemsets and their
corresponding rules, which has the same power as association mining
but substantially reduces the number of rules to be presented. In this
paper, we propose an efficient algorithm, CLOSET, for mining closed
itemsets, with the development of three techniques: (1) applying a
compressed, frequent pattern tree FP-tree structure for mining closed
itemsets without candidate generation, (2) developing a single prefix
path compression technique to identify frequent closed itemsets
quickly, and (3)...
Neal Richter - March 30, 2001 - Summarization & Data Mining
Automatic Labeling of Document Clusters
Alexandrin Popescul, Lyle H. Ungar (U of Penn.)
labeling_KDD00.pdf
Abstract:
Automatically labeling document clusters with words which indicate
their topics is difficult to do well. The most commonly used method,
labeling with the most frequent words in the clusters, ends up using
many words that are virtually void of descriptive power even after
traditional stop words are removed. Another method, labeling with the
most predictive words, often includes rather obscure words. We present
two methods of labeling document clusters motivated by the model that
words are generated by a hierarchy of mixture components of varying
generality. The first method assumes existence of a document hierarchy
(manually constructed or resulting from a hierarchical clustering
algorithm) and uses a 2 test of significance to detect different word
usage across categories in the hierarchy. The second method selects
words which both occur frequently in a cluster and effectively
discriminate the given cluster from the other clusters. We compare
these methods on abstracts of documents selected from a subset of the
hierarchy of the Cora search engine for computer science research
papers. Labels produced by our methods showed superior results to the
commonly employed methods.
@misc{ popescul-automatic,
author = "Alexandrin Popescul and Lyle H. Ungar",
title = "Automatic Labeling of Document Clusters",
url = "citeseer.nj.nec.com/popescul00automatic.html" }
A Mutually Beneficial Integration of Data Mining and Information
Extraction
Un Yong Nahm and Raymond J. Mooney (U of Texas)
discotex-aaai-00.ps
Abstract:
Text mining concerns applying data mining techniques to unstructured
text. Information extraction (IE) is a form of shallow text
understanding that locates specific pieces of data in natural language
documents, transforming unstructured text into a structured
database. This paper describes a system called DiscoTEX, that combines
IE and data mining methodologies to perform text mining as well as
improve the performance of the underlying extraction system. Rules
mined from a database extracted from a corpus of texts are used to
predict additional information to extract from future documents,
thereby improving the recall of IE. Encouraging results are presented
on applying these techniques to a corpus of computer job postings from
an Internet newsgroup.
Zuzana Gedeon - March 23, 2001 - Deterministic Annealing
Deterministic Annealing for Clustering, Compression,
Classification, Regression, and Related Optimization Problems
K. Rose
rose98.pdf
Abstract:
The deterministic annealing approach to clustering and its extensions
has demonstrated substantial performance improvements over standard
supervised and unsupervised learning methods in a variety of important
applications including compression, estimation, pattern recognition
and classification, and statistical regression. The method offers
three important features: 1) the ability to avoid many poor local
optima; 2) applicability to many different structures/architectures;
and 3) the ability to minimize the right cost function even when its
gradient vanish almost everyvhere, as is the case of the empirical
classification error...
Notes:
This paper describes deterministic annealing method and its
aplicability for various clustering problems. It has been used in
speech recognition and vector quantization. It is long paper (30
pages) so we should skip most of the details there. read first 3 pages
(introduction + part A. in chapter II. up to (principled
derivation...). On page 9 (pp.2218) is an algorithm description.
Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86:11:2210-2239, 1998.
Clustering algorithms for cricket neural data.
Zuzana's MS project report
zuzana-ms.ps
Abstract:
In our paper we describe the deterministic annealing approach to
clustering, test it on two and three dimensional data sets with and
without the noise, and propose a novel application of this method to
clusering of cricket neural data.
Steve Durbin - March 16, 2001 - Machine Translation
Toward a Scoring Function for Quality-Driven Machine Translation
Douglas A. Jones and Gregory M. Rusk
jones-coling-2000.ps
Abstract:
We describe how we constructed an automatic scoring function for
machine translation quality; this function makes use of arbitrarily
many pieces of natural language processing software that has been
designed to process English language text. By machine-learning values
of functions available inside the software and by constructing
functions that yield values based upon the software output, we are
able to achieve preliminary, positive results in machine-learning the
difference between human-produced English and machine-translation
English. We suggest how the scoring function may be used for MT
system development.
Doug Warner - March 2, 2001 - Summarization
Query-Relevant Summarization Using FAQs
Adam Berger, Vibhu O. Mittal
berger2000.ps
Abstract:
This paper introduces a statistical model for query-relevant
summarization: succinctly characterizing the relevance of a document
to a query. Learning parameter values for the proposed model requires
a large collection of summarized documents, which we do not have, but
as a proxy, we use a collection of FAQ (frequently-asked question)
documents. Taking a learning approach enables a principled,
quantitative evaluation of the proposed system, and the results of
some initial experiments—on a collection of Usenet FAQs and on a
FAQ-like set of customer-submitted questions to several large retail
companies—suggest the plausibility of learning for summarization.
@misc{ vibhu-queryrelevant,
author = "Adam Berger Vibhu",
title = "Query-Relevant Summarization using FAQs",
url = "citeseer.nj.nec.com/341776.html" }
For review, see also July 28, 2000: Summarization.
Steve Durbin - February 23, 2001 - Shallow Parsing
A Rule-Based Approach to Prepositional Phrase Attachment
Disambiguation
Eric Brill, Phillip Resnik
brill-prep_phrase_attachment.ps
Abstract:
In this paper, we describe a new corpus-based approach to
prepositional phrase attachment disambiguation, and present results
comparing performance of this algorithm with other corpus-based
approaches to this problem.
[This illustrates adaptation of the Brill method to problems similar to those of identifying negation and adverbial multiplication as needed for Emotix. See also description of Brill method.]
E. Brill and P. Resnik. A rule-based approach to prepositional phrase attachment disambiguation. In 15th International Conference on Computational Linguistics (COLING94) , Kyoto, Japan, 1994.
Bikramjit Banerjee - February 16, 2001 - Distributional Clustering
Distributional Clustering of Words for Text Classification
L. Douglas Baker, Andrew McCallum
clustering-sigir98s.ps
Abstract:
This paper applies Distributional Clustering (Pereira et al. 1993) to
document classification. The approach clusters words into groups based
on the distribution of class labels associated with each word. Thus,
unlike some other unsupervised dimensionality-reduction techniques,
such as Latent Semantic Indexing, we are able to compress the feature
space much more aggressively, while still maintaining high document
classification accuracy. Experimental results obtained on three
real-world data sets show that we can reduce the feature
dimensionality by three orders of magnitude and lose only 2%
accuracy—significantly better than Latent Semantic Indexing
(Deerwester et al. 1990), class-based clustering (Brown et al. 1992),
feature selection by mutual information (Yang and Pederson 1997), or
Markovblanket-based feature selection (Koller and Sahami 1996). We
also show that less aggressive clustering sometimes results in
improved classification accuracy over classification without
clustering.
@inproceedings{ baker98distributional,
author = "L. Douglas Baker and Andrew K. McCallum",
title = "Distributional clustering of words for text
classification",
booktitle = "Proceedings of {SIGIR}-98,
21st {ACM} International Conference on Research and Development in
Information Retrieval",
publisher = "ACM Press, New York, US",
address = "Melbourne, AU",
editor = "W. Bruce Croft and Alistair Moffat
and Cornelis J. van Rijsbergen and Ross Wilkinson
and Justin Zobel",
pages = "96-103",
year = "1998",
url = "citeseer.nj.nec.com/baker98distributional.html" }
Neal Richter - February 9, 2001 - User Profiling & Adaptive Web Sites
Mining Web Access Logs Using Relational Competitive Fuzzy
Clustering
Olfa Nasraoui, Hichem Frigui, Anupam Joshi, Raghu Krishnapuram
nasraoui-1999a.pdf
Abstract:
The proliferation of information on the World-Wide Web has made the
personalization of this information space a necessity. An important
part of Web personalization is to mine typical user profiles from the
vast amount of historical data stored in access logs. In this paper,
we define the notion of a "user session" and a new distance measure
between two web sessions that captures the organization of a web
site. A competitive agglomeration clustering algorithm which can
automatically cluster data into a parsimonious number of components is
used to analyze server access logs and obtain typical session profiles
of users.
@misc{ nasraoui-mining,
author = "O. Nasraoui and H. Frigui and A. Joshi and
R. Krishnapuram",
title = "Mining Web Access Logs Using Relational Competitive Fuzzy
Clustering",
text = "O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram,
Mining Web Access Logs Using Relational Competitive Fuzzy
Clustering, to be presented at the Eight International Fuzzy
Systems Association World Congress - IFSA 99, Taipei, August
99.",
url = "citeseer.nj.nec.com/286359.html" }
Improving the Effectiveness of a Web Site with Web Usage Mining
Myra Spiliopoulou, Carsten Pohlek, Lukas C. Faulstich
spiliop-1999.ps
Abstract:
Effective web presence is for many companies indispensable for their
success to the global market. In recent years, several methods have
been developed for measuring and improving the effectiveness of
commercial sites. However, they mostly concentrate on web page design
and on access analysis. In this study, we propose a methodology of
assessing the quality of a web site in turning its users into
customers. Our methodology is based on the discovery and comparison of
navigation patterns of customers and non-customers. This comparison
leads to rules on how the site's topology should be improved. We
further propose a technique for dynamically adapting the site
according to those rules.
Doug Warner - February 2, 2001 - Visualizing Graphs
Multilevel Visualization of Clustered Graphs
Peter Eades, Qingwen Feng
eades-fang97.ps
Abstract:
Clustered graphs are graphs with recursive clustering structures over
the vertices. This type of structure appears in many systems. Examples
include CASE tools, management information systems, VLSI design tools,
and reverse engineering systems. Existing layout algorithms represent
the clustering structure as recursively nested regions in the
plane. However, as the structure becomes more and more complex, two
dimensional plane representations tend to be insufficient. In this
paper, firstly, we describe some two dimensional plane drawing
algorithms for clustered graphs; then we show how to extend two
dimensional plane drawings to three dimensional multilevel
drawings. We consider two conventions: straight-line convex drawings
and orthogonal rectangular drawings; and we show some examples.
@inproceedings{ eades96multilevel,
author = "Peter Eades and Qing-Wen Feng",
title = "Multilevel Visualization of Clustered Graphs",
booktitle = "Proc. Graph Drawing, {GD}",
number = "1190",
month = "18-20~",
publisher = "Springer-Verlag",
address = "Berlin, Germany",
pages = "101-112",
year = "1996",
url = "citeseer.nj.nec.com/eades97multilevel.html" }
Darin Rambo - January 26, 2001 - User Interface Insights
Darin will be giving handouts
Bikram Banerjee - January 19, 2001 - Optimization for Hierarchical Clustering
Iterative Optimization and Simplification of Hierarchical
Clusterings
Doug Fisher
fisher96a.ps
Abstract:
Clustering is often used for discovering structure in data. Clustering
systems differ in the objective function used to evaluate clustering
quality and the control strategy used to search the space of
clusterings. Ideally, the search strategy should consistently
construct clusterings of high quality, but be computationally
inexpensive as well. In general, we cannot have it both ways, but we
can partition the search so that a system inexpensively constructs a
`tentative' clustering for initial examination, followed by iterative
optimization, which continues to search in background for improved
clusterings. Given this motivation, we evaluate an inexpensive
strategy for creating initial clusterings, coupled with several
control strategies for iterative optimization, each of which
repeatedly modifies an initial clustering in search of a better
one. One of these methods appears novel as an iterative optimization
strategy in clustering contexts. Once a clustering has been
constructed it is judged by analysts—often according to task-specific
criteria. Several authors have abstracted these criteria and posited a
generic performance task akin to pattern completion, where the error
rate over completed patterns is used to `externally' judge clustering
utility. Given this performance task, we adapt resampling-based
pruning strategies used by supervised learning systems to the task of
simplifying hierarchical clusterings, thus promising to ease
post-clustering analysis. Finally, we propose a number of objective
functions, based on attribute-selection measures for decision-tree
induction, that might perform well on the error rate and simplicity
dimensions.
@article{ fisher96iterative,
author = "Doug Fisher",
title = "Iterative Optimization and Simplification of Hierarchical
Clusterings",
journal = "Journal of Artificial Intelligence Research",
volume = "4",
pages = "147-180",
year = "1996",
url = "citeseer.nj.nec.com/fisher95iterative.html" }
Neal Richter - January 12, 2001 - Interactive Clustering
Relevance and Reinforcement in Interactive Browsing
Anton Leuski
leouski-ir-208.ps
Abstract:
We consider the problem of browsing the top ranked portion of the
documents returned by an information retrieval system. We describe an
interactive relevance feedback agent that analyzes the inter-document
similarities and can help the user to locate the interesting
information quickly. We show how such an agent can be designed and
improved by using neural networks and reinforcement learning. We
demonstrate that its performance significantly exceeds the performance
of the traditional relevance feedback approach.
@misc{ leuski-relevance,
author = "Anton Leuski",
title = "Relevance and Reinforcement in Interactive Browsing",
url = "citeseer.nj.nec.com/leuski00relevance.html" }
A. Leuski. Relevance and reinforcement in interactive browsing.
In Proceedings of Ninth International Conference
on Information and Knowledge Management (CIKM'00),
pages 119-126, November 2000.
Interactive Cluster Visualization for Information Retrieval
James Allan, Anton V. Leouski, and Russell C. Swan
allan-leuski-97.ps
Abstract:
This study investigates the ability of cluster visualization to help a
user rapidly identifyrelevant documents. It provides added support for
the truth of the Cluster Hypothesis on retrieved documents and shows
that clustering of relevant documents is readily visible. The study
then shows the visual effect of a technique similar to relevance
feedback and shows how to enhance that effect to further help the user
locate relevant material.
J. Allan, A. Leouski, and R. Swan. Interactive Cluster Visualization for Information Retrieval. Technical Report IR-116, Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, 1997.
Steve Durbin - January 5, 2001 - Part of Speech Tagging
Some Advances in Transformation-Based Part of Speech Tagging
Eric Brill
brill-aaai94-tagger.ps
Abstract:
Most recent research in trainable part of speech taggers has explored
stochastic tagging. While these taggers obtain high accuracy,
linguistic information is captured indirectly, typically in tens of
thousands of lexical and contextual probabilities. In (Brill 1992), a
trainable rule-based tagger was described that obtained performance
comparable to that of stochastic taggers, but captured relevant
linguistic information in a small number of simple non-stochastic
rules. In this paper, we describe a number of extensions to this
rulebased tagger. First, we describe a method for expressing lexical
relations in tagging that stochastic taggers are currently unable to
express. Next, we show a rule-based approach to tagging unknown
words. Finally, we show how the tagger can be extended into a k-best
tagger, where multiple tags can be assigned to words in some cases of
uncertainty.
Brill E. "Some advances in transformation-based part of speech tagging". In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94). 1994.
Doug Warner - December 15, 2000 - LAPART Neural Networks
Acquiring rule sets as a product of learning in a logical neural
architecture
Healy, M.J.; Caudell, T.P.
Electronic copy pending, Paper copy available
Abstract:
Envisioning neural networks as systems that learn rules calls forth
the verification issues already being studied in knowledge-based
systems engineering, and complicates these with neural-network
concepts such as nonlinear dynamics and distributed memories. We show
that the issues can be clarified and the learned rules visualized
symbolically by formalizing the semantics of rule-learning in the
mathematical language of two-valued predicate logic. We further show
that this can, at least in some cases, be done with a fairly simple
logical model. We illustrate this with a combination of two example
neural-network architectures, LAPART, designed to learn rules as
logical inferences from binary data patterns, and the stack interval
network, which converts real-valued data into binary patterns that
preserve the semantics of the ordering of real values. We discuss the
significance of the formal model in facilitating the analysis of the
underlying logic of rule-learning and numerical data
representation. We provide examples to illustrate the formal model,
with the combined stack interval/LAPART networks extracting rules from
numerical data.
Bikram Banerjee - December 8, 2000 - Similarity Assessment
General-Purpose Case-Based Planning: Methods and Systems
Ralph Bergmann, Héctor Muñoz-Avila, and Manuela
Veloso
ai-com.ps
First Paragraph:
Planning consists of the construction of some course of actions to
achieve a specified set of goals, starting from an initial
situation. For example, determining a plan of actions (often called
operators) for transporting a certain set of goods from a source
location to some destination provided that there are different means
of transportation is a typical planning problem in the transportation
domain. Many planning problems of practical interest also appear in
engineering domains, e.g., the domain of process planning.
M. Veloso, H. Munoz-Avila, and R. Bergmann. General-purpose case-based planning: Methods and systems. AI Communications, 9(3):128-137, 1996.
(OPTIONAL) Constructive Similarity Assessment: Using Stored
Cases to Define New Situations
David B. Leake
p-92-01.ps
Abstract:
A fundamental issue in case-based reasoning is similarity assessment:
determining similarities and differences between new and retrieved
cases. Many methods have been developed for comparing input case
descriptions to the cases already in memory. However, the success of
such methods depends on the input case description being sufficiently
complete to reflect the important features of the new situation, which
is not assured. In case-based explanation of anomalous events during
story understanding, the anomaly arises because the current situation
is incompletely understood; consequently, similarity assessment based
on matches between known current features and old cases is likely to
fail because of gaps in the current case's description.
Our solution to the problem of gaps in a new case's description is an approach that we call constructive similarity assessment. Constructive similarity assessment treats similarity assessment not as a simple comparison between fixed new and old cases, but as a process for deciding which types of features should be investigated in the new situation and, if the features are borne out by other knowledge, added to the description of the current case. Constructive similarity assessment does not merely compare new cases to old: using prior cases as its guide, it dynamically carves augmented descriptions of new cases out of memory.
@techreport{ leake92constructive,
author = "David Leake",
title = "Constructive Similarity Assessment: Using Stored Cases to
Define New Situations",
year = "1992",
url = "citeseer.nj.nec.com/60240.html" }
Neal Richter - December 1, 2000 - Genetic Algorithms for Learning Bayesian Networks
Learning Bayesian Networks from Incomplete Data using Evolutionary
Algorithms
James W. Myers, Kathryn B. Laskey, Kenneth A. DeJong
MKD-99.ps
Abstract:
This paper describes an evolutionary algorithm approach to learning
Bayesian networks from incomplete data. This problem is characterized
by a huge solution space with a highly multi-modal
landscape. State-of-the-art approaches all involve using deterministic
approaches such as the expectation-maximization algorithm. These
approaches are guaranteed to find local maxima, but do not explore the
landscape for other modes. Our approach evolves the structure of the
network and the missing data. We use a factorial design to choose a
good set of parameters for selection, crossover, and mutation. We show
that our algorithm produces accurate results for a classification
problem with missing data.
Zuzana Gedeon - November 17, 2000 - Suffix Trees
A practical Suffix-Tree Implementation for String
Searches
Bogdan Dorohonceanu and Craig Nevill-Manning
Paper Copy Only
See also:
Direct link to the article with the abstract and source code for their
implementation of Suffix trees with Java:
http://www.ddj.com/articles/2000/0007/0007toc.htm
http://www.ddj.com/ftp/2000/2000_07/aa700.txt
http://www.ddj.com/ftp/2000/2000_07/aa700.zip
Suffix trees are used for string searches. Our authors describe how to build a generalized suffix tree data structure using as few hardware resources as possible while still approaching the time complexity derived in theory. Additional resources include aa700.txt (listings) and aa700.zip (source code).
Revisit: Grouper: A Dynamic Clustering
Interface to Web Search Results
Oren Zamir and Oren Etzioni
Revisit: Fast and Intuitive Clustering of Web
Documents
Oren Zamir Oren Etzioni Omid Madani Richard M. Karp
Gary Boone - 9:00am - 10:00am - Thursday, November 9, 2000 - Text Analysis
Extreme Dimensionality Reduction for Text Learning:
Cluster-generated Feature Spaces
Gary Boone
gary_boone_thesis.ps
Abstract
An exciting array of information management applications would be
possible if computers could effectively understand text-based
information. Significant progress has been made by employing
statistical and machine learning techniques, but fundamental
limitation of these vector space approaches is the large
dimensionality of the resulting learning space. With typical domain
vocabularies and therefore representational dimensionalities
containing 10,000-40,000 words, applicable learning techniques are
limited. Feature selection techniques can reduce these values, but
performance typically suffers as words are eliminated below a few
thousand feature words.
Cluster-generated Feature Spaces is a new method that generates features with fast and accurate performance with fewer than 100 features and in many cases less than 10. The approach emphasizes learning a new feature space into which training and query data are subsequently projected. In contrast to feature selection approaches, no feature words are discarded. As a result, dimensionality can be reduced while avoiding the problem of "feature invisibility" which occurs when no feature words are present in a document. The method has the advantages that it is faster than comparable techniques, results in equal or more accurate performance of classification tasks, and enables the application of a broad range of learning techniques in the new, low dimensional space.
Performance results will be shown for email, newsgroup, and web page classification tasks. Comparisons to other common techniques will be given, including Naive Bayes, Information Retrieval clustering and non-clustering techniques, Latent Semantic Analysis, and Support Vector Machines.
By clustering training data in text learning tasks, we can create more effective features than commonly used term features. These cluster features dramatically reduce the dimensionality of the learning space, allowing the use of inductive algorithms that are intractable in high dimensions. The result is equal or better accuracy in classification tasks and faster classification times. Additionally, because no features are discarded in the feature generation process, the problem of feature invisibility is avoided. The process operates by first creating k cluster features, then projecting the training data into the k-dimensional cluster features space. A small k allows inductive algorithms with compact representations and rapid performance, such as Bayesian neural nets, to be used. Applying the cluster features approach to classification tasks for standardized newsgroup and real-world email data sets, we can reduce the learning spaces to fewer than 10 dimensions.
Steve Durbin - November 3, 2000 - Clicktrack Data Analysis
I will consider ways we might extract more information from user clicktrack data in order—adapt as appropriate to go beyond consensus recommendations based on an average of all previous user activity. I'll also indicate how visualization tools like Mathematica may be helpful in developing effective algorithms. A summary and paper describing one approach is at http://www.ics.uci.edu/~icadez/projects/weblog/.
Visualization of Navigation Patterns on a Web Site Using
Model-Based Clustering
Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and
Steven White
vis-of-nav-pattern.ps
Abstract
We present a new methodology for visualizing navigation patterns on a
Web site. In our approach, we rst partition site users into clusters
such that only users with similar navigation paths through the site
are placed into the same cluster. Then, for each cluster, we display
these paths for users within that cluster. The clustering approach we
employ is model based (as opposed to distance based) and partitions
users according to the order in which they request Web pages. In
particular, we cluster users by learning a mixture of first-order
Markov models using the Expectation Maximization algorithm. Our
algorithm scales linearly with both number of users and number of
clusters, and our implementation easily handles millions of users and
thousands of clusters in memory.
Lou Glassy - October 20, 2000 - Markov Chains
The Practice of Programming
Brian W. Kernighan and Rob Pike
Chapter 3: Design and Implementation
No online version.
Calculating Markov Chains using text.. can be used to produce readable unique text.
Neal Richter - October 13, 2000 - Learning Belief Networks
Learning Accurate Belief Nets
Wei Zhou and Russell Greiner
learn-bn.ps
Abstract
Bayesian belief nets (BNs) are typically used to answer a range of
queries, where each answer requires computing the probability of a
particular hypothesis given some specified evidence. An effective
BN-learning algorithm should, therefore, learn an accurate BN, which
returns the correct answers to these specific queries. This report
first motivates this objective, arguing that it makes effective use of
the data that is encountered, and that it can be more appropriate than
the typical "maximum likelihood" algorithms for learning BNs. We then
describe several different learning situations, which differ based on
how the query information is presented. Based on our analysis of the
inherent complexity of these tasks, we define three algorithms for
learning the best CPtables for a given BN-structure, and then
demonstrate empirically that these algorithms work effectively.
@misc{ zhou99learning,
author = "W. Zhou and R. Greiner",
title = "Learning accurate belief nets",
text = "W. Zhou and R. Greiner. Learning accurate belief
nets. Technical report, UofAlberta CS, 1999.",
year = "1999",
url = "citeseer.nj.nec.com/article/zhou99learning.html" }
Darin Rambo - October 6, 2000 - Interface/Usability
The Trials and Tribulations of Building an Adaptive User
Interface
Benhamin Korvemaker & Russel Greiner
greiner.ps
Abstract
As every user has his own ideosyncracies and preferences, an interface
that is honed for one user may be problematic for another. To
accommodate a diverse range of users, many computer applications
therefore include an interface that can be customized—e.g., by
adjusting parameters, or defining macros. This allows each user to
have his "own" version of the interface, honed to his specific
preferences. However, most such interfaces require the user to perform
this customization by hand—a tedious process that requires the user
to be aware of his personal preferences. We are therefore exploring
adaptive interfaces, that can autonomously determine the user's
preference, and adjust the interface appropriately. This paper reports
a series of experiments towards building such an adaptive
interface—here a Unix shell that can predict the user's next
command based on his previous interactions, and use this to simplify
the user's future interactions. After summarizing the Davison/Hirsh
(1998) work (for learning "command stubs"), we then explore several
ways of extending and improving this system; e.g., to predict entire
command lines, to use various other types of information, etc.
@misc{ korvemaker99trials,
author = "B. Korvemaker and R. Greiner",
title = "The trials and tribulations of building an adaptive user
interface",
text = "Benjamin Korvemaker and Russell Greiner. The trials and
tribulations of building an adaptive user
interface. http://www.cs.ualberta.ca/greiner/PAPERS/AdaptUI.ps,
1999.",
year = "1999",
url = "citeseer.nj.nec.com/113705.html" }
Edited Adaptive Hypermedia: Combining Human and Machine
Intelligence to Achieve Filtered Information
Kristina Hook, SICS, Asa Rudstrom and Annika Waern.
hyper.ps
Abstract
We discuss a novel approach to filtering of hypermedia information
based on an information broker and user environment coupled
together. The advantage of the proposed approach, edited adaptive
hypermedia , is that it combines human expertise with machine
intelligence in order to achieve high quality of the filtered
information provided to the end users.
@misc{ hook97edited,
author = "K. Hook and A. Rudstrom and A. Waern",
title = "Edited adaptive hypermedia: Combining human and machine
intelligence to achieve filtered information",
text = "K. Hook, Asa Rudstrom, and A. Waern. Edited adaptive
hypermedia: Combining human and machine intelligence to achieve
filtered information. In Workshop on Flexible Hypertext,
Apr. 1997.",
year = "1997",
url = "citeseer.nj.nec.com/hook97edited.html" }
More stuff just for giggles
Lumiere Project: Bayesian Reasoning for Automated Assistance a.k.a. "the annoying little microsoft helper thing"
Doug Warner - September 29, 2000 - Knowledge Representation
Similarity, Structure, and Knowledge: A Representational
Approach to Assessment
Peder J. Johnson, Timothy E. Goldsmith, Kathleen W. Teague
No Abstract, Paper copy only
Bikramjit Banerjee - September 22, 2000 - Master's Thesis
Fast Convergence of Concurrent Reinforcement Learners
Bikramjit Banerjee
Abstract
When several agents learn concurrently, the payoff received by an
agent is dependent on the behaviors of the other agents. As the other
agents learn and adapt their strategies, the reward of one agent
becomes non-stationary. Such a non-stationary environment violates the
assumptions underlying the convergence-proofs of single-agent
Reinforcement t-Learning techniques. As a result, the standard RL
techniques like Q-learning are not guaranteed to converge in a
multi-agent environment. Furthermore, the focus of convergence for
multiple concurrent learners is on an equilibrium strategy-profile
rather than the optimal strategies for individual agents. M. Littman
introduced the minimax-Q algorithm that guarantees convergence to
equilibrium Q-values in the limit, in zero-sum games. This was
extended to general-sum domains by Hu&Wellman. These two techniques
are shown to be identical in zero-sum games. The notion of minimax-Q
is extended to the general-sum domains and studied experimentally
compared to the Hu-Wellman algorithm. Popular speed-up techniques like
Qlambda and SARSA are used to augment minimax-Q algorithm and
experimentally verified to expedite minimax-Q. The convergence of the
new SARSA-minimax-Q algorithm, to minimax-Q values, is proved under
appropriate assumptions. Finally, we look beyond Nash-Equilibria as
the only acceptable point of convergence in general-sum games. We
experimentally show that even naive learners can get higher pay offs,
depending on the game-structure and argue in favor of coordination
mechanisms to attempt convergence to non-Nash-Equilibrium but
higher-payoff (also called non-myopic) equilibrium points.
bbanerjee.ps
Neal Richter - September 1, 2000 - Grouper Web Clustering/Searching
Grouper: A Dynamic Clustering Interface to Web Search
Results
Oren Zamir and Oren Etzioni
Abstract
Users of Web search engines are often forced to sift through the long
ordered list of document "snippets" returned by the engines. The IR
community has explored document clustering as an alternative method of
organizing retrieval results, but clustering has yet to be deployed on
most major search engines. The NorthernLight search engine organizes
its output into "custom folders" based on pre-computed document
labels, but does not reveal how the folders are generated or how well
they correspond to users' interests.
In this paper, we introduce Grouper—an interface to the
results of the HuskySearch meta-search engine, which dynamically
groups the search results into clusters labeled by phrases extracted
from the snippets. In addition, we report on the first empirical
comparison of user Web search behavior on a standard ranked-list
presentation versus a clustered presentation. By analyzing HuskySearch
logs, we are able to demonstrate substantial differences in the number
of documents followed, and in the amount of time and effort expended
by users accessing search results through these two interfaces.
zamir-www8.ps
@article{ zamir99grouper,
author = "Oren Zamir and Oren Etzioni",
title = "Grouper: A Dynamic Clustering Interface to Web Search
Results",
journal = "WWW8 / Computer Networks",
volume = "31",
number = "11-16",
pages = "1361-1374",
year = "1999",
url = "citeseer.nj.nec.com/zamir99grouper.html" }
Fast and Intuitive Clustering of Web Documents
Oren Zamir Oren Etzioni Omid Madani Richard M. Karp
Abstract
Conventional document retrieval systems (e.g., Alta Vista) return long
lists of ranked documents in response to user queries. Recently,
document clustering has been put forth as an alternative method of
organizing the results of a retrieval [4]. A person browsing the
clusters can discover patterns that would be overlooked in the
traditional ranked-list presentation.
In this context, a document clustering algorithm has two key requirements. First, the algorithm ought to produce clusters that are easy-to-browse—a user needs to determine at a glance whether the contents of a cluster are of interest. Second, the algorithm has to be fast even when applied to thousands of documents with no preprocessing.
This paper describes several novel clustering methods, which
intersect the documents in a cluster to determine the set of words (or
phrases) shared by all the documents in the cluster. We report on
experiments that evaluate these intersection-based clustering methods
on collections of snippets returned from Web search engines. First, we
show that word-intersection clustering produces superior clusters and
does so faster than standard techniques. Second, we show that our O(n
log n) expected time phrase-intersection clustering method produces
comparable clusters and does so more than two orders of magnitude
faster than word-intersection.
zamir-kdd97_paper.ps
@inproceedings{ zamir97fast,
author = "Oren Zamir and Oren Etzioni and Omid Madani and Richard
M. Karp",
title = "Fast and Intuitive Clustering of Web Documents",
booktitle = "Knowledge Discovery and Data Mining",
pages = "287-290",
year = "1997",
url = "citeseer.nj.nec.com/zamir97fast.html" }
Ganesh Prabu - August 25, 2000 - Classifying for Hierarchically Clustering - Finding Closest Documents
Hierarchically classifying documents using very few words
Daphne Koller, Mehran Sahami
Abstract
The proliferation of topic hierarchies for text documents has resulted
in a need for tools that automatically classify new documents within
such hierarchies. Existing classification schemes which ignore the
hierarchical structure and treat the topics as separate classes are
often inadequate in text classification where the there is a large
number of classes and a huge number of relevant features needed to
distinguish between them. We propose an approach that utilizes the
hierarchical topic structure to decompose the classification task into
a set of simpler problems, one at each node in the classification
tree. As we show, each of these smaller problems can be solved
accurately by focusing only on a very small set of features, those
relevant to the task at hand. This set of relevant features varies
widely throughout the hierarchy, so that, while the overall relevant
feature set may be large, each classifier only examines a small
subset. The use of reduced feature sets allows us to utilize more
complex (probabilistic) models, without encountering many of the
standard computational and robustness difficulties.
Classifier
@inproceedings{ koller97hierarchically,
author = "D. Koller and M. Sahami",
title = "Hierarchically Classifying Documents Using Very Few
Words",
booktitle = "Proceedings of the Fourteenth International
Conference on Machine Learning ({ICML}-97)",
year = "1997",
url = "citeseer.nj.nec.com/article/koller97hierarchically.html" }
An Optimal Algorithm for closest pair Maintenance
Sergei N. Bespamyatnikh
Abstract
Given a set S of n points in k-dimensional space , and an Lt metric
the dynamic closest pair problem is defined as follows: find a closest
pair of S after each update of S (the insertion or the deletion of a
point). For fixed dimension k and fixed metric Lt, we give a data
structure of size O(n) that maintains a closest pair of S in O(log
n)time per insertion and deletion
Distance
Sergei N. Bespamyatnikh. An optimal algorithm for closest pair maintenance. In Proc. 11th Annu. ACM Sympos. Comput. Geom., pages 152-161, 1995.
Lou Glassy - August 11, 2000 - Information Space Topology
Chapter 3, Lou's Master's Thesis
by Lou Glassy {Montana State University}
lou_thesis.ps
Neal Richter - August 4, 2000 - E-Mail Management/Routing
Concept Features in Re:Agent, an Intelligent Email Agent
by Gary Boone {Georgia Institute of Technology}
Abstract:
An important issue in the application of machine learning techniques
to information management tasks is the nature of features extracted
from textual information. We have created an intelligent email agent
that can learn actions such as filtering, prioritizing, downloading to
palmtops, and forwarding email to voice-mail using automatic feature
extraction. Our agent's new feature extraction approach is based on
first learning concepts present within the mail, then using these
concepts as features for learning actions to perform on the
messages. What features should be chosen? This paper describes the
concept features approach and considers two sources for learning
conceptual features: groups defined by the user and groups defined by
the agent's task. Additionally, features may be defined by vectorized
examples or keywords. Experimental results are provided for an email
sorting task.
@misc{ boone-concept,
author = "Gary Boone",
title = "Concept Features in Re:Agent, an Intelligent Email Agent",
url = "citeseer.nj.nec.com/123300.html" }
Boone, Gary. (1998).
Concept features in Re:Agent, an intelligent email agent.
Proceedings of the Second International Conference on Autonomous
Agents.
New York: ACM Press, 141-148.
A Comparison of Classifiers and Document Representations for the
Routing Problem
by Heinrich Schutze, David A. Hully, Jan O. Pedersen {XEROX
Research}
Abstract:
In this paper, we compare learning techniques based on statistical
classification to traditional methods of relevance feedback for the
document routing problem. We consider three classification tech-niques
which have decision rules that are derived via explicit error
minimization: linear discriminant analysis, logistic regression, and
neural networks. We demonstrate that the classifiers perform 10-15%
better than relevance feedback via Rocchio expansion for the TREC-2
and TREC-3 routing tasks.
Error minimization is difficult in high-dimensional feature spaces
because the convergence process is slow and the models are prone to
over-fitting. We use two different strategies, latent semantic
in-dexing and optimal term selection, to reduce the number of
features. Our results indicate that features based on latent semantic
indexing are more effective for techniques such as linear discriminant
analysis and logistic regression, which have no way to protect against
over fitting. Neural networks perform equally well with either set of
features and can take advantage of the additional information
available when both feature sets are used as input.
sigir95.ps
@inproceedings{ schutze95comparison,
author = "Hinrich Schutze and David A. Hull and Jan O. Pedersen",
title = "A Comparison of Classifiers and Document Representations
for the Routing Problem",
booktitle = "Research and Development in Information Retrieval",
pages = "229-237",
year = "1995",
url = "citeseer.nj.nec.com/schutze95comparison.html" }
Doug Warner - July 28, 2000 - Summarization
Ultra-Summarization: A Statistical Approach to Generating Highly
Condensed Non-Extractive Summaries
by Michael J. Witbrock et. al.
Abstract:
None.
p315-witbrock.pdf
Learning and Representing Topic
by Thomas Hofmann
Abstract:
This paper presents a novel statistical mixture model for natural
language learning in information retrieval. The described learning
architecture is based on word occurrence statistics and extracts
hierarchical relations between groups of documents as well as an
abstractive organization of keywords. To train the model we derive a
generalized, annealed version of the Expectation-Maximization (EM)
algorithm for maximum likelihood estimation. The benefits of the model
for interactive information retrieval and automated cluster
summarization are experimentally investigated.
conald98_summizaration.ps
@misc{ mixture-learning,
author = "Hierarchical Mixture",
title = "Learning and Representing Topic",
url = "citeseer.nj.nec.com/119738.html" }
Ganesh Prabu - July 21, 2000 - Efficient Clustering
BIRCH: An Efficient Data Clustering Method for Very Large
Databases
by Tian Zhang et. al.
Abstract:
Finding useful patterns in large datasets has attracted considerable
in terest recently, and one of the most widely studied problems in
this area is the identification of clusters, or densely populated
regions, in a multi-dimensiona l dataset. Prior work does not
adequately address the problem of large datasets and minimization of
I/O costs.
This paper presents a data clustering method named BIRCH (Balanced Iterative Red ucing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constr aints). BIRCH can typically find a good clustering with a single scan of the dat a, and improve the quality further with a few additional scans. BIRCH is also th e first clustering algorithm proposed in the database area to handle "noise" (da ta points that are not part of the underlying pattern) effectively.
We evaluate BIRCH's time/space efficiency, data input order
sensitivity, and clustering quality through several experiments. We
also present a performance compa risons of BIRCH versus CLARANS, a
clustering method proposed recently for large datasets, and show that
BIRCH is consistently superior.
BIRCH.ps
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, 1996
CURE: An Efficient Clustering Algorithm for Large
Databases
by Sudipto Guha et. al.
Abstract:
Clustering, in data mining, is useful for discovering groups and
identifying interesting distributions in the underlying
data. Traditional clustering a lgorithms either favor clusters with
spherical shapes and similar sizes, or are very fragile in the
presence of outliers. We propose a new clustering algorithm called
CURE that is more robust to outliers, and identifies clusters having
non-spherical shapes and wide variances in size. CURE achieves this
by representing each cluster by a certain fixed number of points that
are generated by selecting well scattered points from the cluster and
then shrinking them toward the center of the cluster by a specified
fraction. Having more than one representative point per cluster allows
CURE to adjust well to the geometry of non-spherical shap es and the
shrinking helps to dampen the effects of outliers. To handle large da
tabases, CURE employs a combination of random sampling and
partitioning. A rando m sample drawn from the data set is first
partitioned and each partition is part ially clustered. The partial
clusters are then clustered in a second pass to yie ld the desired
clusters. Our experimental results confirm that the quality of
clusters produced by CURE is much better than those found by existing
algorithms. Furthermore, they demonstrate that random sampling and
partitioning enable CURE to not only outperform existing algorithms
but also to scale well for large data bases without sacrificing
clustering quality.
CURE.ps
(Note: This is a large file and should be printed on a fast printer or
during a time when printer demand is low.)
@inproceedings{ guha98cure,
author = "Sudipto Guha and Rajeev Rastogi and Kyuseok Shim",
title = "{CURE}: an efficient clustering algorithm for large
databases",
pages = "73-84",
year = "1998",
url = "citeseer.nj.nec.com/article/guha98cure.html" }
Neal Richter - July 14, 2000 - Text Categorization & Data Mining
Combining Classifiers in Text Categorization
by Leah S. Larkey and W. Bruce Croft.
Abstract:
Three different types of classifiers were investigated in the context
of a text categorization problem in the medical domain: the automatic
assignment of ICD9 codes to dictated inpatient discharge
summaries. K-nearest-neighbor, relevance feedback, and Bayesian
independence classifiers were applied individually and in
combination. A combination of different classifiers produced better
results than any single type of classifier. For this specific medical
categorization problem, new query formulation and weighting methods
used in the k-nearest-neighbor classifier improved performance.
cohen-96.ps
@inproceedings{ larkey96combining,
author = "Leah S. Larkey and W. Bruce Croft",
title = "Combining classifiers in text categorization",
booktitle = "Proceedings of {SIGIR}-96,
19th {ACM} International Conference on Research and Development
in Information Retrieval",
publisher = "ACM Press, New York, US",
address = "Z{\"{u}}rich, CH",
editor = "Hans-Peter Frei and Donna Harman and Peter
Sch{\"{a}}uble and Ross Wilkinson",
pages = "289-297",
year = "1996",
url = "citeseer.nj.nec.com/larkey96combining.html" }
A Framework for Comparing Text Categorization Approaches
by Isabelle Moulinier.
Abstract:
For the past few years, text categorization has emerged as an
application domain to machine learning techniques. Several approaches
have already been proposed. This paper does not present yet another
technique. It is rather an attempt to unify the approaches encountered
so far. Moreover this state-of-the-art enables us to stress a
shortcoming in earlier research: the lack of evaluation of inductive
learners in the categorization process. We present a first attempt to
remedy this lack. We expose an experimental framework, that fits in
with our unified view of text categorization methods. This framework
allows us to conduct a set of tentative experiments in order to assess
which characteristics allow a learner to perform well on the text
categorization task.
moulinier.ps
@misc{ moulinier96framework,
author = "I. Moulinier",
title = "A Framework for Comparing Text Categorization Approaches",
text = "Isabelle Moulinier. A Framework for Comparing Text
Categorization Approaches. In AAAI Spring Symposium on Machine
Learning and Information Access, Stanford University, March
1996.",
year = "1996",
url = "citeseer.nj.nec.com/moulinier96framework.html" }
