# Entropy and Information Gain

## Segmentation

• If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable.

• Example

“Middle-aged professionals who reside in New York City on average have a churn rate of 5%”

• Definition of the segment
• “middle-aged professionals who reside in New York City”
• the predicted value of the target variable for the segment
• “a churn rate of 5%”
• Questions?
• How can we judge whether a variable contains important information about the target variable? How much?

## Example: a binary (two class) classification problem [Source: Data Science for Business]

• A set of people to be classified
• Predictor attributes
• Two types of heads: square and circular
• Two types of bodies: rectangular and oval
• Two colors of bodies: gray and white
• Target variable
• Loan write-off: yes or no
• The label over each head represents the value of the target variable (write-off or not)
• Question
• Which of the attributes would be best to segment these people into groups?
• The best attribute creates groups that are homogeneous with respect to the target variable
• A group is pure
• If every member of a group has the same value for the target, then the group is pure.
• A group is impure
• If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
• In real data, we seldom expect to find a variable that will make the segments pure.
• Complications
• Attributes rarely split a group perfectly.
• Not all attributes are binary; many attributes have three or more distinct values.
• Some attributes take on numeric values (continuous or integer)
• How well does each attribute split a set of examples into segments?
• Most common splitting criterion: information gain
• it is based on a purity measure called entropy (Claude Shannon, 1948)

## Entropy

• Entropy
• a measure of disorder that can be applied to a set, such as one of our individual segments
• $entropy = -p_1 \times log_2 (p_1) - p_2 \times log_2 (p_2) - ... = - \sum_{i=1}^{k} p_i \times log_2 (p_i)$
• $k$: the number of different properties within the set
• $p_i$ is the probability (the relative percentage) of property $i$ within the set
• $p_i = 1$: all members of the set have property $i$
• $p_i = 0$: no members of the set have property $i$ [Source: Data Science for Business]

• Entropy Example
• consider a set $S$ of 10 people, seven of the non-write-off class and three of the write-off class
• $p(non\text{-}write\text{-}off) = 7 / 10 = 0.7$
• $p(write\text{-}off) = 3 / 10 = 0.3$
• $entropy(S) = -0.7 \times log_2 (0.7) - 0.3 \times log_2 (0.3) \simeq 0.7 \times 0.51 + 0.3 \times 1.74 \simeq 0.88$
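As a quick check of the numbers above, here is a minimal Python sketch of the entropy formula (the helper name `entropy` and the use of `math.log2` are illustrative choices, not code from the source):

```python
import math

def entropy(proportions):
    # Shannon entropy: -sum_i p_i * log2(p_i).
    # Terms with p_i == 0 are skipped, since p * log2(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Set S: 7 of 10 people in the non-write-off class, 3 in the write-off class
print(entropy([7 / 10, 3 / 10]))  # ~0.881
```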

## Information Gain (IG)

• How informative is an attribute with respect to our target?

• Information Gain

• Measurement of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
• It measures the change in entropy due to any amount of new information being added
• In the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.
• Information Gain Formula

• Attribute we split on has $k$ different values
• Parent set: the original set of examples
• $k$ children sets: the result of splitting on the attribute values
• $c_i$: the $i$th child set
• $p(c_i)$: the proportion of instances belonging to the $i$th child set
• $IG(parent, children) = entropy(parent) - [p(c_1) \cdot entropy(c_1) + p(c_2) \cdot entropy(c_2) + ...] = entropy(parent) - \sum_{i=1}^{k} p(c_i) \cdot entropy(c_i)$ [Source: Data Science for Business]
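A hedged sketch of this formula in Python, reusing the `entropy` helper from the entropy section (the function name `information_gain` and the counts-based interface are assumptions for illustration, not code from the source):

```python
def information_gain(parent_counts, children_counts):
    # IG(parent, children) = entropy(parent) - sum_i p(c_i) * entropy(c_i)
    # parent_counts: class counts in the parent set, e.g. [16, 14]
    # children_counts: class counts per child set, e.g. [[12, 1], [4, 13]]
    n_parent = sum(parent_counts)
    parent_entropy = entropy([c / n_parent for c in parent_counts])

    weighted_children_entropy = 0.0
    for child_counts in children_counts:
        n_child = sum(child_counts)
        p_child = n_child / n_parent  # p(c_i): proportion of instances in child i
        child_entropy = entropy([c / n_child for c in child_counts])
        weighted_children_entropy += p_child * child_entropy

    return parent_entropy - weighted_children_entropy
```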

• Information Gain Example
• Parent set
• 30 instances are divided into two classes ($\bullet$ and $\star$)
• $\bullet$: 16
• $\star$: 14
• $entropy(parent) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.53 \times 0.9 + 0.47 \times 1.1 \simeq 0.99$
• High entropy
• 1st child set
• Criterion: BALANCE < 50M
• 13 instances → $p(BALANCE < 50M) = 13 / 30 \simeq 0.43$
• $\bullet$: 12
• $\star$: 1
• $entropy(\text{1st child}) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.92 \times 0.12 + 0.08 \times 3.7 \simeq 0.39$
• 2nd child set
• Criterion: BALANCE >= 50M
• 17 instances → $p(BALANCE \geq 50M) = 17 / 30 \simeq 0.57$
• $\bullet$: 4
• $\star$: 13
• $entropy(\text{2nd child}) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.24 \times 2.1 + 0.76 \times 0.39 \simeq 0.79$
• Information Gain
• $IG = entropy(parent) - [p(BALANCE < 50M) \times entropy(\text{1st child}) + p(BALANCE \geq 50M) \times entropy(\text{2nd child})] \simeq 0.99 - [0.43 \times 0.39 + 0.57 \times 0.79] \simeq 0.37$
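Plugging the example's counts into the sketches above (illustrative code, not from the source) reproduces the result; the unrounded value is closer to 0.38, and the slide's 0.37 comes from rounding the intermediate entropies:

```python
# Parent set: 16 dots (•) and 14 stars (★)
parent = [16, 14]
# Children from the BALANCE split: [dots, stars] in each child set
children = [[12, 1],   # BALANCE <  50M
            [4, 13]]   # BALANCE >= 50M

print(information_gain(parent, children))  # ~0.38
```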