Learning from Observations
Chapter 18, Sections 1–3

Outline
♦ Learning agents
♦ Inductive learning
♦ Decision tree learning
♦ Measuring learning performance

Learning
Learning is essential for unknown environments,
i.e., when the designer lacks omniscience

Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent's decision mechanisms to improve performance

Learning agents
[Figure: architecture of a learning agent. The performance element acts on the environment through sensors and effectors; the critic compares behaviour against a performance standard and gives feedback to the learning element, which makes changes to the performance element's knowledge and sets learning goals for the problem generator, which proposes experiments.]

Learning element
Design of the learning element is dictated by
♦ what type of performance element is used
♦ which functional component is to be learned
♦ how that functional component is represented
♦ what kind of feedback is available

Example scenarios:

  Performance element   Component           Representation             Feedback
  Alpha-beta search     Eval. fn.           Weighted linear function   Win/loss
  Logical agent         Transition model    Successor-state axioms     Outcome
  Utility-based agent   Transition model    Dynamic Bayes net          Outcome
  Simple reflex agent   Percept-action fn   Neural net                 Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards

Inductive learning (a.k.a. Science)
Simplest form: learn a function from examples (tabula rasa)

f is the target function
An example is a pair (x, f(x)), e.g., a tic-tac-toe position x labelled with its value f(x) = +1

Problem: find a hypothesis h
such that h ≈ f
given a training set of examples

(This is a highly simplified model of real learning:
– Ignores prior knowledge
– Assumes a deterministic, observable "environment"
– Assumes examples are given
– Assumes that the agent wants to learn f—why?)
Inductive learning method
Construct/adjust h to agree with f on the training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure sequence: the same (x, f(x)) training points fitted by a series of increasingly complex hypotheses, each consistent with the data to a different degree]

Ockham's razor: maximize a combination of consistency and simplicity
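The consistency/simplicity trade-off can be seen in a few lines of code. The following is a minimal sketch, not part of the slides: it fits polynomials of increasing degree to a handful of made-up (x, f(x)) points using NumPy and reports how well each hypothesis fits the training data.

```python
# Minimal sketch (illustrative data, assumes NumPy is installed):
# fit polynomials of increasing degree and compare training error.
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical observed inputs
ys = np.array([0.1, 0.9, 2.2, 2.8, 4.1])   # hypothetical observed f(x) values

for degree in (1, 2, 4):
    h = np.poly1d(np.polyfit(xs, ys, degree))   # least-squares polynomial hypothesis
    err = float(np.sum((h(xs) - ys) ** 2))      # 0 means h is consistent with the data
    print(f"degree {degree}: training error = {err:.4f}")

# The degree-4 polynomial passes through all five points (consistent but complex);
# the degree-1 line is simpler but only approximately consistent. Ockham's razor
# is about trading these off.
```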
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

  Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est     WillWait
  X1        T    F    F    T   Some   $$$     F     T   French    0–10      T
  X2        T    F    F    T   Full   $       F     F   Thai     30–60      F
  X3        F    T    F    F   Some   $       F     F   Burger    0–10      T
  X4        T    F    T    T   Full   $       F     F   Thai     10–30      T
  X5        T    F    T    F   Full   $$$     F     T   French     >60      F
  X6        F    T    F    T   Some   $$      T     T   Italian   0–10      T
  X7        F    T    F    F   None   $       T     F   Burger    0–10      F
  X8        F    F    F    T   Some   $$      T     T   Thai      0–10      T
  X9        F    T    T    F   Full   $       T     F   Burger     >60      F
  X10       T    T    T    T   Full   $$$     F     T   Italian  10–30      F
  X11       F    F    F    F   None   $       F     F   Thai      0–10      F
  X12       T    T    T    T   Full   $       F     F   Burger   30–60      T

Classification of examples is positive (T) or negative (F)

Decision trees
One possible representation for hypotheses
E.g., here is the "true" tree for deciding whether to wait:

[Figure: the "true" decision tree. The root tests Patrons? (None → No, Some → Yes, Full → a test on WaitEstimate?), with further tests on Alternate?, Hungry?, Reservation?, Bar?, Fri/Sat?, and Raining? lower in the tree]

Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:

  A  B  A xor B
  F  F     F
  F  T     T
  T  F     T
  T  T     F

[Figure: the corresponding decision tree, testing A and then B, with one leaf per truth-table row]

Trivially, there is a consistent decision tree for any training set
with one path to leaf for each example (unless f nondeterministic in x)
but it probably won't generalize to new examples

Prefer to find more compact decision trees
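To make the attribute-based representation concrete, here is a small Python sketch (not from the slides); the dict-of-attributes encoding for examples, the (attribute, {value: subtree}) encoding for trees, and the toy tree itself are illustrative choices rather than anything prescribed by the chapter.

```python
# Illustrative encodings (assumed, not from the slides): an example is a dict
# of attribute values; a tree is either a class label or (attribute, branches).
x1 = {'Alt': 'T', 'Bar': 'F', 'Fri': 'F', 'Hun': 'T', 'Pat': 'Some',
      'Price': '$$$', 'Rain': 'F', 'Res': 'T', 'Type': 'French', 'Est': '0-10'}

# A toy tree in this format, loosely modelled on the trees in the slides.
toy_tree = ('Pat', {
    'None': 'F',
    'Some': 'T',
    'Full': ('Hun', {'F': 'F',
                     'T': ('Type', {'French': 'T', 'Italian': 'F',
                                    'Thai': 'T', 'Burger': 'T'})}),
})

def classify(tree, example):
    """Follow the path from root test to leaf dictated by the example's values."""
    if isinstance(tree, str):          # leaf: a class label
        return tree
    attribute, branches = tree
    return classify(branches[example[attribute]], example)

print(classify(toy_tree, x1))   # -> 'T' (X1 is a positive example)
```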
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses

More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent w/ training set
⇒ may get worse predictions

Decision tree learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose "most significant" attribute as root of (sub)tree

function DTL(examples, attributes, default) returns a decision tree
   if examples is empty then return default
   else if all examples have the same classification then return the classification
   else if attributes is empty then return Mode(examples)
   else
      best ← Choose-Attribute(attributes, examples)
      tree ← a new decision tree with root test best
      for each value vi of best do
         examplesi ← {elements of examples with best = vi}
         subtree ← DTL(examplesi, attributes − best, Mode(examples))
         add a branch to tree with label vi and subtree subtree
      return tree

(a runnable Python sketch of this procedure is given after these slides)

Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally)
"all positive" or "all negative"

[Figure: the 12 restaurant examples split by Patrons? (None/Some/Full) versus by Type? (French/Italian/Thai/Burger); the Patrons? split yields nearly pure subsets, the Type? split does not]

Patrons? is a better choice—gives information about the classification
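A quick sanity check of the hypothesis-space counts above; this snippet is an addition, not part of the slides (Python integers are arbitrary precision, so the exact value is easy to reproduce).

```python
n = 6
print(2 ** (2 ** n))   # 18446744073709551616 distinct Boolean functions / decision trees
print(3 ** n)          # 729 purely conjunctive hypotheses over 6 attributes
```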
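Here is the runnable Python sketch of DTL promised above. It follows the pseudocode directly, using the dict/tuple encodings from the earlier sketch; the hard-coded 'WillWait' target, the helper names, and the choice to branch only over attribute values seen in the examples are simplifications, and Choose-Attribute is left as a parameter (an information-gain version is sketched at the end of these notes).

```python
from collections import Counter

def mode(examples, target='WillWait'):
    """Most common classification among the examples (Mode in the pseudocode)."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose_attribute, target='WillWait'):
    """Decision-tree learning following the DTL pseudocode above.
    Trees are either a class label or an (attribute, {value: subtree}) pair."""
    if not examples:
        return default
    classes = {e[target] for e in examples}
    if len(classes) == 1:                      # all examples agree
        return classes.pop()
    if not attributes:
        return mode(examples, target)
    best = choose_attribute(attributes, examples)
    branches = {}
    # For simplicity, branch only over values of `best` seen in the examples.
    for v in {e[best] for e in examples}:
        exs_v = [e for e in examples if e[best] == v]
        branches[v] = dtl(exs_v, [a for a in attributes if a != best],
                          mode(examples, target), choose_attribute, target)
    return (best, branches)
```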
Information
Information answers questions

The more clueless I am about the answer initially,
the more information is contained in the answer

Scale: 1 bit = answer to Boolean question with prior ⟨0.5, 0.5⟩

Information in an answer when prior is ⟨P1, …, Pn⟩ is
   H(⟨P1, …, Pn⟩) = Σi=1..n −Pi log2 Pi
(also called entropy of the prior)

Information contd.
Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example
E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit

An attribute splits the examples E into subsets Ei, each of which (we hope)
needs less information to complete the classification

Let Ei have pi positive and ni negative examples
⇒ H(⟨pi/(pi+ni), ni/(pi+ni)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is
   Σi [(pi + ni)/(p + n)] · H(⟨pi/(pi+ni), ni/(pi+ni)⟩)

For Patrons?, this is 0.459 bits, for Type this is (still) 1 bit
⇒ choose the attribute that minimizes the remaining information needed
(a worked computation of these two numbers is sketched at the end of these notes)

Example contd.
Decision tree learned from the 12 examples:

[Figure: the learned tree. Patrons? is the root test (None → No, Some → Yes, Full → Hungry?); if Hungry? = No then No, otherwise Type? (French → Yes, Italian → No, Burger → Yes, Thai → Fri/Sat?, giving No/Yes)]

Substantially simpler than "true" tree—a more complex hypothesis isn't justified by small amount of data

Performance measurement
How do we know that h ≈ f? (Hume's Problem of Induction)
1) Use theorems of computational/statistical learning theory
2) Try h on a new test set of examples
   (use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

[Figure: learning curve for DTL on the restaurant data; % correct on the test set rises from roughly 0.4 toward 1 as the training set grows from 0 to 100 examples]

Performance measurement contd.
Learning curve depends on
– realizable (can express target function) vs. non-realizable
  non-realizability can be due to missing attributes
  or restricted hypothesis class (e.g., thresholded linear function)
– redundant expressiveness (e.g., loads of irrelevant attributes)

[Figure: schematic learning curves; the realizable case approaches 100% correct, redundant expressiveness slows learning, and the non-realizable case plateaus below 100%]

Summary
Learning needed for unknown environments, lazy designers

Learning agent = performance element + learning element

Learning method depends on type of performance element, available feedback,
type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis
that is approximately consistent with training examples

Decision tree learning using information gain

Learning performance = prediction accuracy measured on test set
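The 0.459-bit and 1-bit figures quoted above can be reproduced from the 12-example table. The following Python sketch is an addition to the slides (the helper names and the hard-coded 'WillWait' target are assumptions); it computes the remaining information for the Patrons? and Type? splits and defines a minimum-remainder Choose-Attribute that can be plugged into the DTL sketch above.

```python
from math import log2

def entropy(ps):
    """H(<p1,...,pn>) = sum of -p*log2(p), with 0*log2(0) taken as 0."""
    return sum(-p * log2(p) for p in ps if p > 0)

def remainder(split, p, n):
    """Expected bits still needed after a split, given (pi, ni) counts per branch."""
    return sum((pi + ni) / (p + n) * entropy([pi / (pi + ni), ni / (pi + ni)])
               for pi, ni in split)

# (positive, negative) counts per branch, read off the 12-example table above.
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
types   = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger

print(round(remainder(patrons, 6, 6), 3))     # 0.459 bits
print(round(remainder(types, 6, 6), 3))       # 1.0 bit

def choose_attribute(attributes, examples, target='WillWait'):
    """Pick the attribute whose split leaves the least remaining information."""
    p = sum(e[target] == 'T' for e in examples)
    n = len(examples) - p
    def rem(a):
        split = [(sum(1 for e in examples if e[a] == v and e[target] == 'T'),
                  sum(1 for e in examples if e[a] == v and e[target] == 'F'))
                 for v in {e[a] for e in examples}]
        return remainder(split, p, n)
    return min(attributes, key=rem)
```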
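Finally, a minimal sketch of the train/test methodology behind the learning curve. It assumes the dtl, mode, and choose_attribute helpers from the sketches above; the predict fall-back for unseen attribute values and the number of trials are arbitrary choices.

```python
import random

def predict(tree, example, default='F'):
    """Classify with an (attribute, {value: subtree}) tree, falling back to a
    default label when the example has an attribute value unseen in training."""
    while not isinstance(tree, str):
        attribute, branches = tree
        if example[attribute] not in branches:
            return default
        tree = branches[example[attribute]]
    return tree

def learning_curve(examples, attributes, trials=20, target='WillWait'):
    """% correct on a held-out test set as a function of training-set size."""
    curve = []
    for m in range(1, len(examples)):
        correct = total = 0
        for _ in range(trials):
            exs = examples[:]                  # shuffle a copy, keep caller's order
            random.shuffle(exs)
            train, test = exs[:m], exs[m:]
            tree = dtl(train, list(attributes), mode(train), choose_attribute)
            correct += sum(predict(tree, e) == e[target] for e in test)
            total += len(test)
        curve.append((m, correct / total))
    return curve
```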