Excerpt_Artificial Intelligence: A Modern Approach 3rd Edition
Different people approach Al with different goals in mind, Two important questions to ask are: Are you concerned with thinking or behavior? Do you want to model humans or work from an ideal standard?
In this book, we adopt the view that intelligence is concerned mainly with rational action. Ideally, an intelligent agent takes the best possible action in a situation. We study the problem of building agents that are intelligent in this sense.
Philosophers (going back to 400 B.C.) made AI conceivable by considering the ideas that the mind is in some ways like a machine, that it operates on knowledge encoded in some internal language, and that thought can be used to choose what actions to take.
Mathematicians provided the tools to manipulate statements of logical certainty as well as uncertain, probabilistic statements. They also set the groundwork for understanding computation and reasoning about algorithms.
Economists formalized the problem of making decisions that maximize the expected outcome to the decision maker.
Neuroscientists discovered some facts about how the brain works and the ways in which it is similar to and different from computers.
Psychologists adopted the idea that humans and animals can be considered information- processing machines. Linguists showed that language use fits into this model.
Computer engineers provided the ever-more-powerful machines that make AI applications possible.
Control theory deals with designing devices that act optimally on the basis of feedback from the environment. Initially, the mathematical tools of control theory were quite different from AI, but the fields are coming closer together.
The history of Al has had cycles of success, misplaced optimism. and resulting cutbacks in enthusiasm and funding. There have also been cycles of introducing new creative approaches and systematically refining the best ones.
AI has advanced more rapidly in the past decade because of greater use of the scientific method in experimenting with and comparing approaches.
Recent progress in understanding the theoretical basis for intelligence has gone hand in hand with improvements in the capabilities of real systems. The subfields of AI have become more integrated, and AI has found common ground with other disciplines.
2. Intelligent Agents
An agent is something that perceives and acts in an environment. The agent ftmction for an avail. specifics the action taken by the agent in response to any percept sequence.
The performance measure evaluates the behavior of the agent in an environment A rational agent acts so as to maximize the expected value of the performance measure, given the percept sequence it has seen so far.
A task environment specification includes the performance measure, the external environment, the actuators. and the sensors. In designing an agent, the first step must always be to specify the task environment as fully as possible.
Task environments vary along several significant dimensions. They can be fully or partially observable, single-agent or multiagent, deterministic or stochastic, episodic or sequential, static or dynamic, discrete or continuous, and known or unknown.
The agent program implements the agent function. There exists a variety of basic agent-program designs reflecting the kind of information made explicit and used in the decision process. The designs vary in efficiency, compactness, and flexibility. The appropriate design of the agent program depends on the nature of the environment.
Simple reflex agents respond directly to percepts, whereas model-based reflex agents maintain internal state to track aspects of the world that are not evident in the current percept. Goal-based agents act to achieve their goals, and utility-based agents try to maximize their own expected “happiness.”
All agents can improve their performance through learning.
3. Solving Problems by Searching
Before an agent can start searching for solutions, a goal must be identified and a well-defined problem must be formulated.
A problem consists of five parts: the initial state, a set of actions, a transition model describing the results of those actions, a goal test function, and a path cost function. The environment of the problem is represented by a state space. A path through the state space from the initial state to a goal state is asolution.
Search algorithms treat states and actions as atomic: they do not consider any internal structure they might possess.
A general TREE-SEARCH algorithm considers all possible paths to find a solution, whereas a GRAPH-SEARCH algorithm avoids consideration of redundant paths.
Search algorithms are judged on the basis of completeness, optimality, time complexity, and space complexity. Complexity depends on h, the branching factor in the state space, and d, the depth of the shallowest solution.
Uninformed search methods have access only to the problem definition. The basic algorithms are as follows:
*Breadth-first search expands the shallowest nodes first; it is complete, optimal for unit step costs. but has exponential space complexity.
*Uniform-cost search expands the node with lowest path cast, g(n), and is optimal for general step costs.
*Depth-first search expands the deepest unexpanded node first. It is neither complete nor optimal, but has linear space complexity. Depth limited search adds a depth bound.
*Iterative deepening search calls depth-first search with increasing depth limits until a goal is found. It is complete, optimal for unit step costs, has time complexity comparable to breadth-first search, and has linear space complexity.
** Bidirectional search can enormously reduce time complexity, but it is not always applicable and may require too much space.
Informed search methods may have access to a heuristic function h(n) that estimates the cost of a solution from n. * The generic best-first search algorithm selects a node for expansion according to an evaluation function.
*Greedy best-first search expands nudes with minimal h(n). It is not optimal but is often efficient.
* **A\ search expands nodes with minimal f(n) = g(n) + h(n). A is complete and optimal, provided that h(n) is admissible (for TREE-SEARCH) or consistent (for GRAPH-SEARCH). The space complexity of A is still prohibitive.
*RBFS (recursive best-first search) and SMA\ (simplified memory-bounded A*) are robust, optimal search algorithms that use limited amounts of memory; given enough time, they can solve problems that A* cannot solve because it runs out of memory.
The performance of heuristic search algorithms depends on the quality of the heuristic function. One can sometimes construct good heuristics by relaxing the problem definition, by storing precomputed solution costs for subproblems in a pattern database, or by learning from experience with the problem class.
4. Beyond Classical Search
Local search methods such as hill climbing operate on complete-state formulations, keeping only a small number of nodes in memory. Several stochastic algorithms have been developed, including simulated annealing, which returns optimal solutions when given an appropriate cooling schedule.
Many local search methods apply also to problems in continuous spaces. Linear programming and convex optimization problem; obey certain restrictions on the shape of the state space and the nature of the objective function, and admit polynomial-time algorithms that are often extremely efficient in practice.
A genetic algorithm is a stochastic hill-climbing search in which a large population of states is maintained. New states are generated by mutation and by crossover, which combines pairs of states from the population.
In nondeterministic environments, agents can apply AND—OR search to generate Contingent plans that reach the goal regardless of which outcomcs occur during execution.
When the environment is partially observable, the belief state represents the set of possible states that the agent might be in.
Standard search algorithms can be applied directly to belief-state space to solve sensorless problems, and belief-state AND—OR search can solve general partially observable problems. Incremental algorithms that construct solutions state by state within a belief state are often more efficient.
Exploration problems arise when the agent has no idea about the states and actions of its environment. For safely explorable environments, online search agents can build a map and find a goal if one exists. Updating heuristic estimates from experience provides an effective method to escape from local minima.
5. Adversarial Search
A game can be defined by the initial state (how the board is set up), the legal actions in each state, the result of each action, a terminal test (which says when the game is over), and a utility function that applies to terminal states.
In two-player zero-sum games with perfect information, the minimax algorithm can select optimal moves by a depth-first enumeration of the game tree.
The alpha—beta search algorithm computes the same optimal move as minimax, but achieves much greater efficiency by eliminating subtrees that are provably irrelevant.
Usually, it is not feasible to consider the whole game tree (even with alpha—beta), so we need to cut the search off at some point and apply a heuristic evaluation function that estimates the utility of a state.
Many game programs precompute tables of best moves in the opening and endgame so that they can look up a move rather than search.
Games of chance can be handled by an extension to the minimax algorithm that evaluates a chance node by taking the average utility of all its children, weighted by the probability of each child.
Optimal play in games of imperfect information, such as Kriegspiel and bridge, requires reasoning about the current and future belief states of each player. A simple approximation can be obtained by averaging the value of an action over each possible configuration of missing information.
Programs have bested even champion human players at games such as chess, checkers, and Othello. Humans retain the edge in several games of imperfect information, such as poker, bridge, and Kriegspiel, and in games with very large branching factors and little good heuristic knowledge, such as Go.
6. Constraint Satisfaction Probllems
Constraint satisfaction problems (CSPs) represent a state with a set of variable/value pairs and represent the conditions for a solution by a set of constraints on the variables. Many important real-world problems can be described as CSPs.
A number of inference techniques use the constraints to infer winch variable/value pairs are consistent and which are not. These include node, arc, path, and k-consistency.
Backtracking search, a form of depth-first search, is commonly used for solving CSPs. Inference can be interwoven with search.
The minimtun-remaining-values and degree heuristics are domain-independent methods for deciding which variable to choose next in a backtracking search. The least-constraining-value heuristic helps in deciding which value to try first for a given variable. Backtracking occurs when no legal assignment can be found for a variable. Conflict-directed backjumping backtracks directly to the source of the problem.
Local search using the min-conflicts heuristic has also been applied to constraint satisfaction problems with great success.
The complexity of solving a CSP is strongly related to the structure of its constraint graph. Tree-structured problems can be solved in linear time. Cutset conditioning can reduce a general CSP to a tree-structured one and is quite efficient if a small cutset can be found. Tree decomposition techniques transform the CSP into a tree of subproblems and are efficient if the tree width of the constraint graph is small.
7. Logical Agents
Intelligent agents need knowledge about the world in order to reach good decisions.
Knowledge is contained in agents in the form of sentences in a knowledge representation language that are stored in a knowledge base.
A knowledge-based agent is composed of a knowledge base and an inference mechanism. It operates by storing sentences about the world in its knowledge base, using the inference mechanism to infer new sentences, and using these sentences to decide what action to take.
A representation language is defined by its syntax, which specifies the structure of sentences, and its semantics, which defines the truth of each sentence in each possible world or model.
The relationship of entailment between sentences is crucial to our understanding of reasoning. A sentence α entails another sentence β if it is true in all worlds where is is true. Equivalent definitions include the validity of the sentence α = β and the unsatisfiability of the sentence it α ¬ β.
Inference is the process of deriving new sentences from old ones. Sound inference algorithms derive only sentences that are entailed; complete algorithms derive all sentences that are entailed.
Propositional logic is a simple language consisting of proposition symbols and logical connectives. It can handle propositions that are known true, known false, or completely unknown.
The set of possible models, given a fixed propositional vocabulary, is finite, so entailment can be checked by enumerating models. Efficient model-checking inference algorithms for propositional logic include backtracking and local search methods and can often solve large problems quickly.
Inference rules are patterns of sound inference that can be used to find proofs. The resolution rule yields a complete inference algorithm for knowledge bases that arc expressed in conjunctive normal form. Forward chaining and backward chaining are very natural reasoning algorithms for knowledge bases in Horn form.
Local search methods such as WALKSAT can be used to find solutions. Such algo- rithms are sound but not complete.
Logical state estimation involves maintaining a logical sentence that describes the set of possible states consistent with the observation history. Each update step requires inference using the transition model of the environment, which is built from successor-state axioms that specify how each fluent changes.
Decisions within a logical agent can be made by SAT solving: finding possible models specifying future action sequences that reach the goal. This approach works only for fully observable or sensorless environments.
Propositional logic does not scale to environments of unbounded size because it lacks the expressive power to deal concisely with time, space, and universal patterns of relationships among objects.
8. First-Order Logic
Knowledge representation languages should be declarative, compositional, expressive, context independent, and unambiguous.
Logics differ in their ontological commitments and epistemological commitments. While propositional logic commits only to the existence of facts, firsi-order logic commits to the existence of objects and relations and thereby gains expressive power.
The syntax of first-order logic builds on that of propositional logic. It adds terms to represent objects, and has universal and existential quantifiers to construct assertions about all or some of the possible values of the quantified variables.
A possible world, or model, for first-order logic includes a set of objects and an interpretation that maps constant symbols to objects, predicate symbols to relations among objects, and function symbols to functions on objects.
An atomic sentence is true just when the relation named by the predicate holds between the objects named by the terms. Extended interpretations, which map quantifier variables to objects in the model, define the truth of quantified sentences.
Developing a knowledge base in first-order logic requires a careful process of analyzing the domain, choosing a vocabulary, and encoding the axioms required to support the desired inferences.
9. Inference in First-Order Logic
A first approach uses inference rules (universal instantiation and existential instan- tiation) to propositionalize the inference problem. Typically, this approach is slow, unless the domain is small.
The use of unification to identify appropriate substitutions for variables eliminates the instantiation step in first-order proofs, making the process more efficient in many cases.
A lifted version of Modus Ponens uses unification to provide a natural and powerful inference nile, generalized Modus Ponens. The forward-chaining and backward-chaining algorithms apply this rule to sets of definite clauses.
Generalized Modus Ponens is complete for definite clauses, although the entailment problem is semidecidable. For Datalog knowledge bases consisting of function-free definite clauses, entailment is decidable.
Forward chaining is used in deductive databases, where it can be combined with relational database operations. It is also used in production systems, which perform efficient updates with very large rule sets Forward chaining is complete for Datalog and runs in polynomial time.
Backward chaining is used in logic programming systems, which employ sophisticated compiler technology to provide very fast inference. Backward chaining suffers from redundant inferences and infinite loops; these can be alleviated by memoization.
Prolog, unlike first-order logic, uses a closed world with the unique names assumption and negation as failure. These make Prolog a more practical programming language, but bring it further from pure logic.
The generalized resolution inference rule provides a complete proof system for first-order logic, using knowledge bases in conjunctive normal form.
Several strategies exist for reducing the search space of a resolution system without compromising completeness. One of the most important issues is dealing with equality; we showed how demodulation and paramodulation can be used.
Efficient resolution-based theorem provers have been used to prove interesting mathematical theorems and to verify and synthesize software and hardware.
10. Classical Planning
Planning systems are problem-solving algorithms that operate on explicit propositional or relational representations of states and actions. These representations make possible the derivation of effective heuristics and the development of powerful and flexible algorithms for solving problems.
PDDL, the Planning Domain Definition Language, describes the initial and goal states as conjunctions of literals, and actions in terms of their preconditions and effects.
State-space search can operate in the forward direction (progression) or the backward direction (regression), Effective heuristics can be derived by subgoal independence assumptions and by various relaxations of the planning problem.
A planning graph can be constructed incrementally, starting from the initial state, Each layer contains a superset of all the literals or actions that could occur at that time step and encodes mutual exclusion (mutex) relations among literals or actions that cannot cooccur Planning graphs yield useful heuristics for state-space and partial-order planners and can be used directly in the GRAPHPLAN algorithm.
Other approaches include first-order deduction over situation calculus axioms; encoding a planning problem as a Boolean satisfiability problem or as a constraint satisfaction problem; and explicitly searching through the space of partially ordered plans.
Each of the major approaches to planning has its adherents, and there is as yet no con- sensus on which is best. Competition and cross-fertilization among the approaches have resulted in significant gains in efficiency for planning systems.
11. Planning and Acting in the Real World
Many actions consume resources, such as money, gas, or raw materials. It is convenient to treat these resources as numeric measures in a pool rather than try to reason about. say, each individual coin and bill in the world. Actions can generate and consume resources, and it is usually cheap and effective to check partial plans for satisfaction of resource constraints before attempting further refinements.
Time is one of the most important resources. It can be handled by specialized scheduling algorithms, or scheduling can be integrated with planning.
Hierarchical task network (HTN) planning allows the agent to take advice from the domain designer in the form of high-level actions (HLAs) that can be implemented in various ways by lower-level action sequences. The effects of HLAs can be defined with angelic semantics, allowing provably correct high-level plans to be derived without consideration of lower-level implementations. HTN methods can create the very large plans required by many real-world applications.
Standard planning algorithms assume complete and correct information and deterministic, fully observable environments. Many domains violate this assumption.
Contingent plans allow the agent to sense the world during execution to dccidc what branch of the plan to follow, hi some cases, sensorless or conformant planning can be used to construct a plan that works without the need for perception. Both conformant and contingent plans can be constructed by search in the space of belief states. Efficient representation or computation of belief states is a key problem.
An online planning agent uses execution monitoring and splices in repairs as needed to recover from unexpected situations, which can be due to nondeterministic actions, exogenous events, or incorrect models of the environment.
Multiagent planning is necessary when there are other agents in the environment with which to cooperate or compete. Joint plans can be constructed, but must be augmented with some form of coordination if two agents are to agree on which joint plan to execute.
This chapter extends classic planning to cover nondeterministic environments (where outcomes of actions are uncertain), but it is not the last word on planning. Chapter 17 describes techniques for stochastic environments (in which outcomes of actions have probabilities associated with them): Markov decision processes, partially observable Markov decision processes, and game theory. In Chapter 21 we show that reinforcement learning allows an agent to learn how to behave from past successes and failures.
12. Knowledge Representation
Large-scale knowledge representation requites a general-purpose ontology to organize and tie together the various specific domains of knowledge.
A general-purpose ontology needs to cover a wide variety of knowledge and should be capable, in principle, of handling any domain.
Building a large, general-purpose ontology is a significant challenge that has yet to be fully realized, although current frameworks seem to be quite robust.
We presented an upper ontology based on categories and the event calculus. We covered categories, subcategories, parts, structured objects, measurements, substances, events, time and space, change, and beliefs.
Natural kinds cannot be defined completely in logic, but properties of natural kinds can be represented.
Actions, events, and time can be represented either in situation calculus or in more expressive representations such as event calculus. Such representations enable an agent to construct plans by logical inference.
We presented a detailed analysis of the Internet shopping domain, exercising the general ontology and showing how the domain knowledge can be used by a shopping agent.
Special-purpose representation systems, such as semantic networks and description logics, have been devised to help in organizing a hierarchy of categories. Inheritance is an important form of inference, allowing the properties of objects to be deduced from their membership in categories.
The closed-world assumption, as implemented in logic programs, provides a simple way to avoid having to specify lots of negative information. It is best interpreted as a default that can be overridden by additional information.
Nonmonotonic logics, such as circumscription and default logic, are intended to capture default reasoning in general.
Truth maintenance systems handle knowledge updates and revisions efficiently.
13. Quantifying Uncertainty
Uncertainty arises because of both laziness and ignorance. It is inescapable in complex, nondeterministic, or partially observable environments.
Probabilities express the agent’s inability to reach a definite decision regarding the truth of a sentence. Probabilities summarize the agent’s beliefs relative to the evidence.
Decision theory combines the agent’s beliefs and desires, defining the best action as the one that maximizes expected utility.
Basic probability statements include prior probabilities and conditional probabilities over simple and complex propositions.
The axioms of probability constrain the possible assignments of probabilities to propositions. An agent that violates the axioms must behave irrationally in some cases.
The full joint probability distribution specifies the probability of each complete assignment of values to random variables. It is usually too large to create or use in its explicit form, but when it is available it can be used to answer queries simply by adding up entries for the possible worlds corresponding to the query propositions.
Absolute independence between subsets of random variables allows the full joint distribution to be factored into smaller joint distributions, greatly reducing its complexity. Absolute independence seldom occurs in practice.
Bayes’ rule allows unknown probabilities to be computed from known conditional probabilities, usually in the causal direction. Applying Bayes’ rule with many pieces of evidence runs into the same scaling problems as does the full joint distribution.
Conditional independence brought about by direct causal relationships in the domain might allow the full joint distribution to be factored into smaller, conditional distributions. The naive Bayes model assumes the conditional independence of all effect variables, given a single cause variable, and grows linearly with the number of effects.
A wumpus-world agent can calculate probabilities for unobserved aspects of the world, thereby improving on the decisions of a purely logical agent. Conditional independence makes these calculations tractable.
14. Probabilistic Reasoning
A Bayesian network is a directed acyclic graph whose nodes correspond to random variables; each node has a conditional distribution for the node, given its parents.
Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
A Bayesian network specifies a full joint distribution; each joint entry is defined as the product of the corresponding entries in the local conditional distributions. A Bayesian network is often exponentially smaller than an explicitly enumerated joint distribution.
Many conditional distributions can be represented compactly by canonical families of distributions. Hybrid Bayesian networks, which include both discrete and continuous variables, use a variety of canonical distributions.
Inference in Bayesian networks means computing the probability distribution of a set of query variables, given a set of evidence variables. Exact inference algorithms, such as variable elimination, evaluate sums of products of conditional probabilities as efficiently as possible.
In polytrees (singly connected networks), exact inference takes time linear in the size of the network. In the general case, the problem is intractable.
Stochastic approximation techniques such as likelihood weighting* and Markov chain Monte Carlo** can give reasonable estimates of the true posterior probabilities in a network and can cope with much larger networks than can exact algorithms.
Probability theory can be combined with representational ideas from first-order logic to produce very powerful systems for reasoning under uncertainty, Relational probability models (RPMs) include representational restrictions that guarantee a well-defined probability distribution that can be expressed as an equivalent Bayesian network. Open universe probability models handle existence and identity uncertainty, defining probabilty distributions over the infinite space of first-order possible worlds.
Various alternative systems for reasoning under uncertainty have been suggested. Generally speaking, truth-functional systems are not well suited for such reasoning.
15. Probabilistic Reasoning over Time
The changing state of the world is handled by using a set of random variables to represent the state at each point in time.
Representations can be designed to satisfy the Markov property, so that the future is independent of the past given the present. Combined with the assumption that the process is stationary -— that is, the dynamics do not change over time – this greatly simplifies the representation.
A temporal probability model can he thought of as containing a transition model describing the state evolution and a sensor model describing the observation process.
The principal inference tasks in temporal models are filtering, prediction, smoothing, and computing the most likely explanation. Each of these can be achieved using simple, recursive algorithms whose rim time is linear in the length of the sequence.
Three families of temporal models were studied in more depth: hidden Markov models, Kalman filters, and dynamic Bayesian networks (which include the other two as special cases).
Unless special assumptions are made, as in Kalman filters, exact inference with many stare variables is intractahle. In practice, the particle filtering algorithm seems to he an effective approximation algorithm.
When trying to keep track of many objects, uncertainty arises as to which observations belong to which objects – the data association problem. The number of association hypotheses is typically intractably large, but MCMC and particle filtering algorithms for data association work well in practice.
16. Making Simple Decisions
Probability theory describes what an agent should believe on the basis of evidence, utility theory describes what an agent wants, and decision theory puts the two together to describe what an agent should do.
We can use decision theory to build a system that makes decisions by considering all possible actions and choosing the one that leads to the best expected outcome. Such a system is known as a rational agent.
Utility theory shows that an agent whose preferences between lotteries are consistent with a set of simple axioms can be described as possessing a utility function; further-more, the agent selects actions as if maximizing its expected utility.
Multiattribute utility theory deals with utilities that depend on several distinct attributes of states. Stochastic dominance is a particularly useful technique for making unambiguous decisions, even without precise utility values for attributes.
Decision networks provide a simple formalism for expressing and solving decision problems. They are a natural extension of Bayesian networks, containing decision and utility nodes in addition to chance nodes.
Sometimes, salving a problem involves finding more information before making a decision. The value of information is defined as the expected improvement in utility compared with making a decision without the information.
Expert systems that incorporate utility information have additional capabilities compared with pure inference systems. In addition to being able to make decisions, they can use the value of information to decide which questions to ask, if any; they can recommend contingency plans; and they can calculate the sensitivity of their decisions to small changes in probability and utility assessments.
17. Making Complex Decisions
Sequential decision problems in uncertain environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MDP is a policy that associates a decision with every stale that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed.
The utility of a state is the expected utility of the state sequences encountered when an optimal policy is executed, starting in that state. The value iteration algorithm for solving MDPs works by iteratively solving the equations relating the utility of each state to those of its neighbors.
Policy iteration alternates between calculating the utilities of states under the current policy and improving the current policy with respect to the current utilities.
Partially observable MDPs, or POMDPs, are much more difficult to solve than are MDPs. They can be solved by conversion to an MOP in the continuous space of belief states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and therefore make better decisions in the future.
A decision-theoretic agent can be constructed for POMDP environments. The agent uses a dynamic decision network to represent the transition and sensor models, to update its belief state, and to project forward possible action sequences.
Game theory describes rational behavior for agents in situations in which multiple agents interact simultaneously. Solutions of games are Nash equilibria – strategy profiles in which no agent has an incentive to deviate from the specified strategy.
Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.
18. Learning From Examples
Learning takes many forms, depending on the nature of the agent, the component to be improved, and the available feedback.
If the available feedback provides the correct answer for example inputs, then the learning problem is called supervised learning. The task is to learn a function y = h(x). Learning a discrete-valued function is called classification; learning a continuous function is called regression.
Inductive learning involves finding a hypothesis that agrees well with the examples. Ockham’s razor suggests choosing the simplest consistent hypothesis. The difficulty of this task depends on the chosen representation.
Decision trees can represent all Boolean fractions. The information-gain heuristic provides an efficient method for finding a simple, consistent decision tree.
The performance of at learning algorithm is measured by the learning curve, which shows the prediction accuracy on the test set as a function of the training-set size.
When there are multiple models to choose from, cross-validation can be used to select a model that will generalize well.
Sometimes not all errors are equal. A loss function tells us how bad each error is; the goal is then to minimize loss over a validation set.
Computational learning theory analyzes the sample complexity and computational complexity of inductive learning. There is a tradeoff between the expressiveness of the hypothesis language and the ease of learning.
Linear regression is a widely used model. The optimal parameters of a linear regression model can he found by gradient descent search, or computed exactly.
A linear classifier with a hard threshold – also known as a perceptron – can be trained by a simple weight update rule to fit data that are linearly separable. In other cases, the rule fails to converge.
19. Knowledge in Learning
The use of prior knowledge in learning leads to a picture of cumulative learning, in which learning agents improve their learning ability as they acquire more knowledge.
Prior knowledge helps learning by eliminating otherwise consistent hypotheses and by “filling in” the explanation of examples, thereby allowing for shorter hypotheses. These contributions often result in faster teaming from fewer examples.
Understanding the different logical roles played by prior knowledge, as expressed by entailment constraints, helps to define a variety of learning techniques.
Explanation-based learning (EBL) extracts general rules from single examples by explaining the examples and generalizing the explanation. It provides a deductive method for turning first-principles knowledge into useful, efficient, special purpose expertise.
Relevance-based learning (RBL) uses prior knowledge in the form of determinations to identify the relevant attributes, thereby generating a reduced hypothesis space and speeding up learning. RBL also allows deductive generalizations from single examples.
Knowledge-based inductive learning (KBIL) finds inductive hypotheses that explain sets of observations with the help of background knowledge.
Inductive logic programming (ILP) techniques perform KBIL on knowledge that is expressed in first-order logic. ILP methods can learn relational knowledge that is not expressible in attribute-based systems,
1LP can be done with a top-down approach of refining a very general rule or through a bottom-up approach of inverting the deductive process.
1LP methods naturally generate new predicates with which concise new theories can be expressed and show promise as general-purpose scientific theory formation systems.
20. Learning Probabilistic Models
Bayesian learning methods formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham’s razor, but quickly becomes intractable for complex hypothesis spaces.
Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.
Maximum-likelihood learning simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum-likelihood solutions can be found easily in closed form. Naive Bayes learning is a particularly effective technique that scales well.
When some variables are hidden, local maximum likelihood solutions can be found using the EM algorithm. Applications include clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models.
Learning the structure of Bayesian networks is an example of model selection. This usually involves a discrete search in the space of structures. Some method is required for trading off model complexity against degree of fit.
Nonparametric models represent a distribution using the collection of data points. Thus, the number of parameters grows with the training set. Nearest-neighbors methods look at the examples nearest to the point in question, whereas kernel methods form a distance-weighted combination of all the examples.
21. Reinforcement Learning
The overall agent design dictates the kind of information that must be learned. The three main designs we covered were the model-based design, using a model P and a utility function U; the model-free design, using an action-utility function Q; and the reflex design, using a policy r.
Utilities can be learned using three approaches: *Direct utility estimation uses the total observed reward-to-go for a given state as direct evidence for learning its utility.
*Adaptive dynamic programming (ADP) learns a model and a reward function from observations and then uses value or policy iteration to obtain the utilities or an optimal policy. ADP makes optimal use of the local constraints on utilities of states imposed through the neighborhood structure of the environment.
\ Temporal-difference (TD) methods update utility estimates to match those of successor states. They can be viewed as simple approximations to the ADP approach that can learn without requiring a transition model. Using a learned model to generate pseudoexperiences can, however, result in faster learning.
Action-utility functions, or Q-functions, can be learned by an ADP approach or a TD approach. With TD, Q-learning requires no model in either the learning or action-selection phase. This simplifies the learning problem but potentially restricts the ability to learn in complex environments, because the agent cannot simulate the results of possible courses of action.
When the learning agent is responsible for selecting actions while it learns, it must trade off the estimated value of those actions against the potential for learning useful new information. An exact solution of the exploration problem is infeasible, but some simple heuristics do a reasonable job.
In large state spaces, reinforcement learning algorithms must use an approximate functional representation in order to generalize over states. The temporal-difference signal can be used directly to update parameters in representations such as neural networks.
Policy-search methods operate directly on a representation of the policy, attempting to improve it based on observed performance. The variation in the performance in a stochastic domain is a serious problem; for simulated domains this can be overcome by fixing the randomness in advance.
22. Natural Language Processing
Probabilistic language models based on n-grams recover a surprising amount of information about as language. They can perform well on such diverse tasks as language identification, spelling correction, genre classification, and named-entity recognition.
These language models can have millions of features, so feature selection and preprocessing of the data to reduce noise is important.
Text classification can be done with naive Bayes n-gram models or with any of the classification algorithms we have previously discussed. Classification can also be seen as a problem in data compression.
Information retrieval systems use a very simple language model based on bags of words, yct still manage to perform well in tcrms of recall and precision on very large corpora of text. On Web corpora, link-analysis algorithms improve performance.
Question answering can be handled by an approach based on information retrieval, for questions that have multiple answers in the corpus. When more answers are available in the corpus, we can use techniques that emphasize precision rather than recall.
Information-extraction systems use a more complex model that includes limited notions of syntax and semantics in the form of templates. They can be built from finite-state automata, HMMs, or conditional random fields, and can be learned from examples.
In building a statistical language system, it is best to devise a model that can make good use of available data, even if the model seems overly simplistic.
23. Natural Language for Communication
Formal language theory and phrase structure grammars (and in particular, context. free grammar) are useful tools for dealing with some aspects of natural language. The probabilistic context-free grammar (PCFG) formalism is widely used.
Sentences in a context-free language can be parsed in O(n3) time by a chart parser such as the CYK algorithm, which requires grammar rules to be in Chomsky Normal Form.
A treebank can be used to learn a grammar. It is also possible to learn a grammar from an unparsed corpus of sentences, but this is less successful.
A lexicalized PCFG allows us to represent that some relationships between words are mare common than others.
It is convenient to augment a grammar to handle such problems as subject–verb agreement and pronoun case. Definite clause grammar (DCG) is a formalism that allows for augmentations. With DCG, parsing and semantic interpretation (and even generation) can be done using logical inference.
Semantic interpretation can also be handled by an augmented grammar.
Ambiguity is a very important problem in natural language understanding; most sentences have many possible interpretations, but usually only one is appropriate. Disam-biguation relies on knowledge about the world, about the current situation, and about language use.
Machine translation systems have been implemented using a range of techniques, from full syntactic and semantic analysis to statistical techniques based on phrase frequencies. Currently the statistical models are most popular and most successful.
Speech recognition systems are also primarily based on statistical principles. Speech systems are popular and useful, albeit imperfect. Together, machine translation and speech recognition are two of the big successes of natural language technology. One reason that the models perform well is that large corpora are available—both translation and speech are tasks that are performed in the wild” by people every day. In contrast, tasks like parsing sentences have been less successful, in part because no large corpora of parsed sentences are available in the wild” and in part because parsing is not useful in and of itself.
The process of image formation is well understood in its geometric and physical aspects. Given a description of a three-dimensional scene, we can easily produce a picture of it from some arbitrary camera position (the graphics problem). Inverting the process by going from an image to a description of the scene is more difficult.
To extract the visual information necessary for the tasks of manipulation ; navigation, and recognition, intermediate representations have to be constructed. Early vision image-processing algorithms extract primitive features from the image, such as edges and regions.
There are various cues in the image that enable one to obtain three-dimensional in- formation about the scene: motion, stereopsis, texture, shading, and contour analysis. Each of these cues relies en background assumptions about physical scenes to provide nearly unambiguous interpretations.
Object recognition in its full generality is a very hard problem. We discussed brightness-based and feature-based approaches. We also presented a simple algorithm for pose estimation. Other possibilities exist.
Robots are equipped with sensors fur perceiving their environment and effectors with which they can assert physical forces on their environment. Most robots are either manipulators anchored at fixed locations or mobile robots that can move.
Robotic perception concerns itself with estimating decision-relevant quantities from sensor data. To do so, we need an internal representation and a method for updating this intemal representation over time. Common examples of hard perceptual problems include localization, mapping, and object recognition.
Probabilistic filtering algorithms such as Kalman filters and particle filters arc useful for robot perception. These techniques maintain the belief state, a posterior distribution over state variables.
The planning of robot motion is usually done in configuration space, where each point specifies the location and orientation of the robot and its joint angles.
Configuration space search algorithms include cell decomposition techniques, which decompose the space of all configurations into finitely many cells, and skeletonization techniques, which project configuration spaces into lower-dimensional manifolds. The motion planning problem is then solved using search in these simpler structures.
A path found by a search algorithm can be executed by using the path as the reference trajectory for a PID controller. Controllers are necessary in robotics to accommodate small perturbations; path planning alone is usually insufficient.
Potential field techniques navigate robots by potential functions, defined over the distance to obstacles and the goal location. Potential field techniques may get stuck in local minima but they can generate motion directly without the need for path planning.
Sometimes it is easier to specify a robot controller directly, rather than deriving a path from an explicit model of the environment. Such controllers can often be written as simple finite state machines.
26. Philosophical Foundations
Philosophers use the term weak AI for the hypothesis that machines could possibly behave intelligently. and strong AI for the hypothesis that such machines would count as having actual minds (as opposed to simulated minds).
Alan Turing rejected the question “Can machines think” and replaced it with a behavioral test. He anticipated many objections to the possibility of thinking machines. Few AI researchers pay attention to the Turing Test, preferring to concentrate on their systems’ performance on practical tasks, rather than the ability to imitate humans.
There is general agreement in modem times that mental states are brain states.
Arguments for and against strong AI are inconclusive. Few mainstream AI researchers believe that anything significant hinges on the outcome of the debate.
Consciousness remains a mystery.
We identified six potential threats to society posed by AI and related technology. We concluded that some of the threats are either unlikely or differ little from threats posed by “unintelligent” technologies. One threat in particular is worthy of further consideration: that ultraintelligent machines might lead to a future that is very different from today – we may not like it, and at that paint we may not have a choice. Such considerations lead inevitably to the conclusion that we must weigh carefully, and soon, the possible consequences of AI research.