Entropy for "Outlook" over all its values: the value (or contribution to information) of an attribute is calculated as follows. And what is "best"?
The misclassification rate is obviously different in the sense that it is linear; however, it shares many of the same properties: it is maximized at p = 1/2, equals zero at p = 0 and p = 1, and it is concave.
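These shared properties can be confirmed numerically; below is a minimal Python sketch (the function names are my own) evaluating all three impurity measures for the two-class case:

```python
import math

def misclass(p):
    # Misclassification error: 1 - max(p, 1-p)
    return 1 - max(p, 1 - p)

def gini(p):
    # Gini index for a two-class node: 1 - p^2 - (1-p)^2
    return 1 - p**2 - (1 - p)**2

def entropy(p):
    # Binary cross-entropy: -p log2 p - (1-p) log2 (1-p), with 0 log 0 = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# All three are maximized at p = 1/2 and vanish at p = 0 and p = 1:
print(misclass(0.5), gini(0.5), entropy(0.5))  # 0.5 0.5 1.0
print(misclass(0.0), gini(0.0), entropy(0.0))  # 0.0 0.0 0.0
```

Note entropy reaches 1 at p = 1/2 while the other two reach 0.5, which is why cross-entropy is often rescaled by 1/2 when the curves are plotted together.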

Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
Gini squares the probability, while entropy multiplies the probability by its log. Higher-entropy partitioning (a large number of small partitions) is penalized! We choose the simpler model. Which tree size corresponds to the lowest cross-validated classification error rate?

Plot the expressions for misclassification error, Gini index, and cross-entropy as functions of \(p\). Error(C0=9, C1=1) = 1 - max(0.9, 0.1) = 1 - 0.9 = 0.1. These approaches are discussed further below; equivalently, find the lowest impurity measure after splitting (M). Linearly scan these values, each time updating the count matrix and computing the Gini index; choose the split position that has the least Gini index. The smallest Gini index gives you the highest gain. To enforce bagging, and not variable subsampling, we set mtry equal to the number of covariates in our model, which is 19.
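The linear-scan procedure described above can be sketched as follows (Python, with hypothetical toy data; the helper names are mine):

```python
def gini_counts(counts):
    # Gini index of a node given its class counts
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Linearly scan candidate split positions on a numeric attribute,
    updating the left/right count matrix at each step, and return the
    (weighted Gini, threshold) pair with the least Gini index."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in set(labels)}
    right = {c: labels.count(c) for c in set(labels)}
    best = (float("inf"), None)
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1   # incremental count-matrix update
        right[y] -= 1
        if pairs[i + 1][0] == v:
            continue   # can only split between distinct values
        weighted = ((i + 1) * gini_counts(list(left.values()))
                    + (n - i - 1) * gini_counts(list(right.values()))) / n
        if weighted < best[0]:
            best = (weighted, (v + pairs[i + 1][0]) / 2)
    return best

# Hypothetical toy data: a perfect split exists between 30 and 40
print(best_split([10, 20, 30, 40, 50, 60], [0, 0, 0, 1, 1, 1]))  # (0.0, 35.0)
```

Because the counts are updated incrementally, each candidate threshold costs O(1) after the initial sort.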

We apply this tree to our test set, and observe it obtains about 20% test error. Compare the training error rates between the pruned and unpruned trees.
Q4e.

The space of possible decision trees is exponentially large. Q5d.

In this case we sacrifice some bias for a significantly lower-variance model; it does appear to be slightly over-pruned, however. Ans 5f,g: We investigate the variable importance by plotting our boosting model, using the summary() command. Choose the split that achieves the most reduction, which maximizes GAIN. entropy(p1, p2, ..., pn) = p1*b1 + p2*b2 + ... + pn*bn, where bi = -log2 pi; note the case of a simple (binary) split into two classes.

Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter λ. These are all career statistics; so the number of career at-bats, runs batted in, hits, walks, and runs the player has made are, in respective order, the most influential pieces of information pertaining to his salary. The gains for the other attributes follow from a similar calculation: so choose "Outlook" as the "best" attribute. Given a collection of records, build a model that can be used to predict a designated class variable. Which variables appear to be the most important predictors in the boosted model? entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn.
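As a quick check of the entropy formula (a Python sketch; the assignment's own code is in R):

```python
import math

def entropy(probs):
    # entropy(p1,...,pn) = -p1*log2(p1) - ... - pn*log2(pn), with 0*log(0) = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0, the maximum for two classes
print(round(entropy([9/14, 5/14]), 3))  # 0.94, i.e. info([9,5]) from the gain example
```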

gain(attr) = (information before split) - (information after split); gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]). Q4d. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. Here's a graphical breakdown of the attributes. We see the five most influential variables are CAtBat, CRBI, CHits, CWalks, CRuns.
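The gain("Outlook") computation above can be reproduced numerically; a small Python sketch (helper names are mine):

```python
import math

def info(counts):
    # info([c1,...,ck]) = entropy of the class distribution at a node
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    # gain = (information before split) - weighted (information after split)
    n = sum(parent)
    after = sum(sum(ch) / n * info(ch) for ch in children)
    return info(parent) - after

# The "Outlook" split from the example: [9,5] -> [2,3], [4,0], [3,2]
g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(g, 3))  # 0.247
```

The pure child [4,0] contributes zero entropy, which is what makes "Outlook" the best attribute here.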

Which is higher? This can help cancel out the above problem. That is, what attribute provides the best split? Step 5: Let Dt be the set of training records that reach a node t. Choosing which attribute to use for splitting is considered below. Clearly the highly loyal customers stay highly loyal. Q4h. Customer ID has the highest information gain because the entropy for all the children is zero. Q5a. Step 4

If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is null, specify the classification for new instances following this decision path. Q5e. Recursively apply this procedure to each subset.

The leftmost binary split has no solid or definitive determination. Pick one of the terminal nodes, and interpret the information displayed.
Q4i. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

Q4f. Static: discretize once at the beginning; consider all possible splits and find the best cut.


Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations. Parent node, p, is split into k partitions, where ni is the number of records in partition i.

Multi-way: use as many partitions as there are values in the nominal/binary case; versus binary, which can have variations on different combinations of groups. For example, multi-way: use as many partitions as distinct values; binary split: What is the test error rate? With a split of [0,6]: P(C1) = 0/6 = 0 and P(C2) = 6/6 = 1, so Entropy([0,6]) = -0 log2 0 - 1 log2 1 = 0 - 0 = 0.000. With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, so Entropy([1,5]) = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65. With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, so Entropy([2,4]) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92. With a split of [3,3]: P(C1) = 1/2 and P(C2) = 1/2, so Entropy([3,3]) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.00. Parent node, p, is split into k partitions. Ans: We compare boosting's minimum test error of 0.456 with the test error of a simple linear model, and also with lasso regression. Nodes with a purer class distribution are preferred. Step 2: Fit a tree to the training data, with Purchase as the response and the other variables except for Buy as predictors.
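The four worked entropies (0.000, 0.65, 0.92, 1.00) can be reproduced with a few lines of Python:

```python
import math

def entropy_counts(counts):
    # Entropy of a node from raw class counts, with 0*log(0) = 0
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# The four splits worked above, from purest to most mixed:
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy_counts(counts), 3))
```

The values rise monotonically from the pure split [0,6] to the evenly mixed split [3,3], matching the hand calculations.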
Computationally inefficient!

This is quite surprising; it is not based upon the current year's statistics. Q5c.

Learn/build a model that maps each attribute set x into one of the predefined class labels y. Features extracted from email message header and content. Features extracted from telescope images; elliptical, spiral, or irregular-shaped galaxies. An induction process builds the model (learn); a deduction process tests the model for accuracy/quality. Input training set used in the classification model to predict an output. Use a combination of different classifiers or variations of the same classifier. Click through the steps to see the animation. Step 0: Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation.

GINI(C0=9, C1=1) = 1 - (0.9)^2 - (0.1)^2 = 0.18; Entropy(C0=5, C1=5) = -(0.5)log2(0.5) - (0.5)log2(0.5) = -0.5*(-1) - 0.5*(-1) = 1
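A quick numerical check of these index values (a Python sketch; function names are mine):

```python
def gini(counts):
    # Gini index from raw class counts: 1 - sum of squared class fractions
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    # Misclassification error: 1 - fraction of the majority class
    n = sum(counts)
    return 1 - max(counts) / n

print(round(gini([9, 1]), 2))   # 0.18, as computed above
print(round(error([9, 1]), 1))  # 0.1
print(gini([5, 5]))             # 0.5, the two-class maximum
```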

Divides values into two subsets. Different ways of handling continuous attributes are listed below. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches.

Step 3 Which is higher?

Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure. We now use boosting to predict Salary in the Hitters data set. Q4j.
The gains were not shown. Classifiers tend to have biases in the modeling process that ensemble classifiers may be able to work around. So we need a measure of node impurity. Calculating a log requires much more work computationally. We create the pruned tree and compare training/testing errors. Step 1
Unpruned: Train = 15.8%, Test = 20%.
Ans 4d,e: Looking at the last terminal node, we see that if LoyalCH is high (greater than 0.765) then the customer almost always (95% of the time) purchases Citrus Hill orange juice. Q4b.
We learned of three measures of impurity used by classification trees. Compute the impurity measure (P) before splitting; compute the impurity measure (M) after splitting, by computing the impurity measure of each child node. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations. Apply the cv.tree() function to the training set in order to determine the optimal size tree.

Gini index: Q5g. For the collection, or a subset of the collection called the training set: each record of the collection (training set) is characterized by a tuple (x,y), where x is the attribute set and y is the class label.

We see it is minimized when the size is equal to 2 or 7.
We see the Gini index and the cross-entropy are similar: they both maximize at p = 1/2 and both are concave.
Q4c. Q4a. Q4g. This problem involves the OJ data set, which is part of the ISLR package.
Error(C0=5, C1=5) = 1 - max(0.5, 0.5) = 1 - 0.5 = 0.5
Use discretization to form an ordinal categorical attribute.

Now recompute gain on the remaining three attributes and repeat the process within the "Outlook" categories. The Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT. Each splitting value has a count matrix associated with it. Entropy(t) = -Σj p(j|t) log2 p(j|t) is the entropy for a given node t over classes j. How many terminal nodes does the tree have?
Pruned: Train = 18.3%, Test = 21.5%.
Compare the test error rates between the pruned and unpruned trees.
We plot the tree with the following command. Use the child link values to further subdivide the instances into subclasses. Only three variables end up being used in the tree's construction: LoyalCH, SalesPriceMM, PriceDiff; so the customer's loyalty to Citrus Hill, the price of Minute Maid OJ, and the price difference form the crucial deciding factors in a customer's purchases. The observed test MSE of 0.49 is considerably better than that of linear or lasso regression; however, it is still slightly greater than boosting's 0.45. The resulting plot is displayed.
Q5f. Are the curves similar?


then use an attribute test to split the data into smaller subsets. Ans: The three impurity measures are plotted, with cross-entropy scaled by a factor of 1/2 to bring it to the same range as the Gini index. The middle one seems to have fairly predictable outcomes based on car type. ni is the number of records in partition i.

We apply bagging to the training set, using random forests specifically.
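The bagging idea itself, fitting each tree on a bootstrap resample and averaging the predictions, can be sketched in pure Python with one-split regression stumps (the actual answer uses R's randomForest with mtry = 19; everything below is an illustrative toy):

```python
import random

def fit_stump(data):
    """Fit a one-split regression stump on (x, y) pairs: pick the
    threshold that minimizes the squared error of left/right means."""
    data = sorted(data)
    best = None
    for i in range(1, len(data)):
        left, right = data[:i], data[i:]
        ml = sum(y for _, y in left) / len(left)
        mr = sum(y for _, y in right) / len(right)
        sse = (sum((y - ml) ** 2 for _, y in left)
               + sum((y - mr) ** 2 for _, y in right))
        if best is None or sse < best[0]:
            best = (sse, (data[i - 1][0] + data[i][0]) / 2, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_predict(data, x, n_trees=25, seed=0):
    # Bagging: fit each stump on a bootstrap resample, average predictions
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return sum(s(x) for s in stumps) / n_trees

# Toy data: y is a step function of x, so a stump fits it well
data = [(x, 0.0) for x in range(5)] + [(x, 1.0) for x in range(5, 10)]
print(bagged_predict(data, 2), bagged_predict(data, 8))
```

Setting mtry to the full number of covariates, as the answer does with 19, is exactly what turns a random forest into plain bagging: every tree may consider every variable at every split.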
So choose "Humidity" as a best attribute for the sunny case.

Entropy(C0=9, C1=1) = -(0.9)log2(0.9) - (0.1)log2(0.1) = -0.9(-0.152) - 0.1(-3.322) = 0.469. Misclassification error index: Which attributes to select? This example magically chooses a good one.

Ans 4f-k: We compute the cross-validation error whilst pruning the tree, and plot it against tree size. Extremely fast at classifying unknown records. Robust to noise (especially when methods to avoid overfitting are employed). Can easily handle redundant or irrelevant attributes (unless the attributes are interacting). And "Windy" for the rainy case. Much repetition of work.
Entropy is used in the ID3 and C4.5 decision tree algorithms.

Linear regression obtains a slightly better MSE than lasso, which is unsurprising as we have a much larger number of samples than covariates, so we don't really need to enforce sparsity. But we know this is not a predictor attribute. The rightmost split, using ID, is entirely useless for prediction; albeit perfect on the training data, real future IDs will be out of range. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes. n = number of records at parent node p. Choose the attribute that minimizes the weighted average Gini index of the children.
Number of possible splitting values = number of distinct values. Class counts in each of the partitions, A < v and A >= v. For each possible partition v, scan the database to gather the count matrix and compute its Gini index. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis. That is, p(j|t) is the relative frequency of class j at node t. For the 2-class problem (p, 1 - p): choose the attribute test condition that produces the highest gain: do we choose attribute A or B for the decision? Nevertheless, the MSE of linear regression, 0.62, is vastly greater than boosting's 0.45.
Ans 5a-d: We prepare the data and apply boosting over a range of shrinkage parameters, plotting both the train and test MSE values resulting from each proposed shrinkage. Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
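The effect of the shrinkage grid can be illustrated with a toy least-squares boosting loop (pure Python with one-split stumps; the actual answer uses R's gbm with 1000 trees, so this is only a sketch of the mechanism):

```python
def fit_stump(xs, rs):
    """One-split regression stump fit to residuals rs over inputs xs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(xs)):
        left = [rs[i] for i in order[:k]]
        right = [rs[i] for i in order[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        t = (xs[order[k - 1]] + xs[order[k]]) / 2
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def boost(xs, ys, shrinkage, n_trees=50):
    # Each round fits a stump to the current residuals and adds a
    # shrunken copy of its predictions; returns the training MSE.
    pred = [0.0] * len(xs)
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        pred = [p + shrinkage * stump(x) for x, p in zip(xs, pred)]
    return sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)

xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5
# With a fixed number of trees, larger shrinkage fits the training
# data more aggressively, so the training MSE shrinks as lam grows:
for lam in (0.01, 0.1, 0.5):
    print(lam, boost(xs, ys, lam))
```

This is why the assignment pairs the shrinkage grid with a test-set MSE plot: training MSE alone always favors larger λ (or more trees), while the test MSE reveals where overfitting begins.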

That is, p(j|t) is the relative frequency of class j at node t. The maximum is log(nc), when records are equally distributed among all classes, implying least information; the minimum is 0.0, when all records belong to one class, implying most information. Entropy-based computations are quite similar to the Gini index computations, but more computationally intensive.
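A small check of these bounds (Python sketch):

```python
import math

def entropy(probs):
    # -sum pi*log2(pi), with the convention 0*log(0) = 0
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h + 0.0  # normalize -0.0 to 0.0 for pure nodes

nc = 4
print(entropy([1 / nc] * nc), math.log2(nc))  # 2.0 2.0: uniform gives log2(nc)
print(entropy([1.0, 0.0, 0.0, 0.0]))          # 0.0: a pure node, the minimum
```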