ID3 & Entropy
Thursday, 18 February 2010, posted by Thiên Thần CNTT at 23:23
ID3
ID3, an early technique by the influential Ross Quinlan, shaped a large part of subsequent research on decision trees and is a useful starting point for understanding basic decision tree construction.

Splitting Criteria
A fundamental part of any algorithm that constructs a decision tree from a dataset is the method by which it selects attributes at each node of the tree. In the previous exercise, you may have found it was better to place certain attributes (such as 'District') higher up the tree in order to produce a short tree.

Q. Why is this?

A. Some attributes split the data up more purely than others: their values correspond more consistently with instances that have particular values of the target attribute (the one we want to predict) than the values of other attributes do. We might therefore say that such attributes have some underlying relationship with the target attribute. But how can this be quantified? Essentially, we would like some measure that lets us compare attributes with each other, so that we can decide to put those that split the data more purely higher up the tree.
Entropy
A measure from Information Theory used in the ID3 algorithm, and in many other decision tree construction algorithms, is Entropy. Informally, the entropy of a dataset can be thought of as how disordered it is. Entropy is related to information in the sense that the higher the entropy, or uncertainty, of some data, the more information is required to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes, at which point the subset we are left with is pure, i.e. it has zero entropy and contains instances of only one class (all instances have the same value for the target attribute).

We measure the entropy of a dataset, S, with respect to one attribute, in this case the target attribute, with the following calculation:

Entropy(S) = -Σi pi log2(pi)

where pi is the proportion of instances in the dataset that take the ith value of the target attribute.
Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:
Entropy(S) = Entropy([9/14 responses, 5/14 no responses])
= -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits
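As a quick check of this arithmetic, here is a minimal Python sketch (the function name and the hard-coded fractions are purely illustrative) that computes the entropy of a list of class proportions:

```python
from math import log2

def entropy(proportions):
    """Entropy in bits of a list of class proportions that sum to 1."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# The marketing example: 9/14 responses, 5/14 no responses.
print(entropy([9/14, 5/14]))  # ~0.940 bits
```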
Now we want a quantitative way of seeing the effect of splitting the dataset on a particular attribute (which is part of the tree building process). We can use a measure called Information Gain, which calculates the reduction in entropy (gain in information) that would result from splitting the data on an attribute, A:

Gain(S, A) = Entropy(S) - Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)

where v ranges over the values of A, Sv is the subset of instances of S for which A takes the value v, and |S| is the number of instances in S.

For the marketing data, this gives, for example:
- Gain(S, House Type) = 0.049 bits
- Gain(S, Income) = 0.151 bits
- Gain(S, Previous Customer) = 0.048 bits
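To make the Gain(S, A) formula concrete, here is a small Python sketch. The entropy helper below works directly on a list of target values rather than on precomputed proportions, and the four rows are made-up toy data (not the actual marketing dataset), included only to show that an attribute such as 'District' can split the data more purely than another:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of target-attribute values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    labels = [row[target] for row in rows]
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Toy rows, invented for illustration only.
rows = [
    {"District": "Urban", "Income": "High", "Outcome": "Responded"},
    {"District": "Urban", "Income": "High", "Outcome": "Responded"},
    {"District": "Rural", "Income": "High", "Outcome": "Nothing"},
    {"District": "Rural", "Income": "Low",  "Outcome": "Nothing"},
]
print(information_gain(rows, "District", "Outcome"))  # 1.0 bit (perfect split)
print(information_gain(rows, "Income", "Outcome"))    # ~0.31 bits
```

On the toy rows, 'District' separates the classes perfectly and so has the higher gain; comparing attributes in exactly this way is what ID3 does at every node.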
With this node evaluation technique we can proceed recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure, with zero entropy. This is exactly how ID3 and its variants work.
Decision Tree Construction Algorithm
We can now present the basic ID3 algorithm in pseudo-code:

Input: A data set, S
Output: A decision tree
1. If all the instances have the same value for the target attribute, return a decision tree that is simply this value (not really a tree, more of a stump).
2. Else:
  - Compute the Gain values (see above) for all attributes, select an attribute with the highest value, and create a node for that attribute.
  - Make a branch from this node for every value of the attribute.
  - Assign each possible value of the attribute to a branch.
  - Follow each branch by partitioning the dataset to only those instances in which the branch's value is present, and then go back to step 1.
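As a rough illustration of how these steps map onto code, here is a Python sketch of the recursion. It reuses the information_gain function and the toy rows from the earlier sketch, and it is only an outline of the idea, not a faithful reimplementation of Quinlan's ID3 (no handling of numeric attributes, missing values, and so on):

```python
from collections import Counter

# Assumes information_gain(rows, attribute, target) from the earlier sketch is in scope.

def id3(rows, attributes, target):
    """Recursive ID3 sketch. Returns either a class value (a leaf) or a
    nested dict of the form {attribute: {value: subtree, ...}}."""
    labels = [row[target] for row in rows]
    # Step 1: every instance has the same target value -> return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on -> fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the highest information gain and make a node.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # One branch per value of the chosen attribute; recurse on each partition.
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# With the toy rows from the earlier sketch, this yields something like:
# id3(rows, ["District", "Income"], "Outcome")
# -> {'District': {'Urban': 'Responded', 'Rural': 'Nothing'}}
```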
Now go and play about with this method interactively at the site below; you can see how easy it is to construct a small decision tree with this entropy measure as a guide.

http://decisiontrees.net/