Entropy and Information Gain
Segmentation

If the segmentation is done using values of known variables when the target is unknown, then these segments can be used to predict the value of the target variable.

Example
“Middle-aged professionals who reside in New York City on average have a churn rate of 5%”
 Definition of the segment
 “middle-aged professionals who reside in New York City”
 The predicted value of the target variable for the segment
 “a churn rate of 5%”
 Questions?
 How can we judge whether a variable contains important information about the target variable? How much?
Example: a binary (two class) classification problem
[Source: Data Science for Business]
 A set of people to be classified
 Predictor attributes
 Two types of heads: square and circular
 Two types of bodies: rectangular and oval
 Two colors of bodies: gray and white
 Target variable
 Loan write-off: yes or no
 The label over each head represents the value of the target variable (write-off or not)
 Question
 Which attribute would best segment these people into groups that are homogeneous with respect to the target variable?
 A group is pure
 If every member of a group has the same value for the target, then the group is pure.
 A group is impure
 If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
 In real data, we seldom expect to find a variable that will make the segments pure.
 Complications
 Attributes rarely split a group perfectly.
 Not all attributes are binary; many attributes have three or more distinct values.
 Some attributes take on numeric values (continuous or integer).
 How well does each attribute split a set of examples into segments?
 Most common splitting criterion: information gain
 It is based on a purity measure called entropy (Claude Shannon, 1948).
Entropy
 Entropy
 A measure of disorder that can be applied to a set, such as one of our individual segments

 entropy = -p_1 log2(p_1) - p_2 log2(p_2) - ...

 The properties i = 1, 2, ... are the values of the target variable within the set
 p_i is the probability (the relative percentage) of property i within the set
 p_i = 1: all members of the set have property i
 p_i = 0: no members of the set have property i
[Source: Data Science for Business]
 Entropy Example
 Consider a set S of 10 people with seven of the non-write-off class and three of the write-off class
 entropy(S) = -0.7 × log2(0.7) - 0.3 × log2(0.3) ≈ 0.7 × 0.51 + 0.3 × 1.74 ≈ 0.88
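The entropy formula and this example can be sketched in Python (the function name and the list-of-probabilities interface are illustrative, not from the source):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy of a class distribution, in bits.

    Classes with probability 0 contribute nothing
    (the limit of p * log2(p) as p -> 0 is 0).
    """
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Set S: 10 people, 7 non-write-off (p = 0.7) and 3 write-off (p = 0.3)
print(round(entropy([0.7, 0.3]), 2))  # -> 0.88
```

A pure set has entropy 0; a 50/50 two-class set has the maximal entropy of 1 bit.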
Information Gain (IG)

How informative is an attribute with respect to our target?

Information Gain
 Measurement of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
 It measures the change in entropy due to any amount of new information being added.
 In the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.

Information Gain Formula
 IG(parent, children) = entropy(parent) - [p(c_1) × entropy(c_1) + ... + p(c_k) × entropy(c_k)]

 The attribute we split on has k different values
 Parent set: the original set of examples
 Children sets: the result of splitting on the attribute values
 c_i: the i-th child set
 p(c_i): the proportion of instances belonging to the i-th child set
[Source: Data Science for Business]
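A minimal sketch of this formula in Python, assuming each set is represented as a list of class labels (the names `entropy` and `information_gain` are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the entropies of the child sets,
    each weighted by its proportion of the parent's instances."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A perfectly separating split on a 50/50 parent gains 1 bit:
print(information_gain(list("aabb"), [list("aa"), list("bb")]))  # -> 1.0
```

A split that leaves the class mix unchanged in every child gains 0 bits.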
 Information Gain Example
 Parent set
 30 instances divided into two classes (• and ★)
 •: 16
 ★: 14
 High entropy: entropy(parent) ≈ 0.99
 1st child set
 Criterion: BALANCE < 50M
 13 instances → entropy ≈ 0.39
 •: 12
 ★: 1
 2nd child set
 Criterion: BALANCE >= 50M
 17 instances → entropy ≈ 0.79
 •: 4
 ★: 13
 Information Gain
 IG = entropy(parent) - [13/30 × entropy(1st child) + 17/30 × entropy(2nd child)]
 ≈ 0.99 - (0.43 × 0.39 + 0.57 × 0.79) ≈ 0.37
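The arithmetic of this example can be checked in Python (a sketch; `entropy` here takes raw class counts rather than probabilities):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) computed from raw class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# BALANCE example: parent 16 vs 14; 1st child (BALANCE < 50M) 12 vs 1;
# 2nd child (BALANCE >= 50M) 4 vs 13.
parent = entropy([16, 14])   # ~0.997
child1 = entropy([12, 1])    # ~0.391
child2 = entropy([4, 13])    # ~0.787
ig = parent - (13 / 30 * child1 + 17 / 30 * child2)
print(round(ig, 2))  # -> 0.38 (rounding the intermediate entropies gives 0.37)
```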