
Entropy and Information Gain
Segmentation
 Questions?
 How can we judge whether a variable contains important information about the target variable, and if so, how much?
Example: a binary (two class) classification problem
[Source: Data Science for Business]
 A set of people to be classified
 Predictor attributes
 Two types of heads: square and circular
 Two types of bodies: rectangular and oval
 Two colors of bodies: gray and white
 Target variable
 Loan writeoff: yes or no
 The label over each head represents the value of the target variable (writeoff or not)
 Question
 Which of the attributes would be best to segment these people into groups?
 The best attribute produces groups that are as homogeneous as possible with respect to the target variable.
 A group is pure
 If every member of a group has the same value for the target, then the group is pure.
 A group is impure
 If at least one member of a group has a different value for the target variable than the rest of the group, then the group is impure.
 In real data, we seldom expect to find a variable that will make the segments pure.
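The definitions of pure and impure groups above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original example: the function names `is_pure` and `class_counts` and the two label lists are hypothetical, invented only to show the idea.

```python
from collections import Counter


def is_pure(labels):
    """Return True if every member of the group has the same target value."""
    # A set keeps only the distinct values; one distinct value means pure.
    return len(set(labels)) <= 1


def class_counts(labels):
    """Count how many members of the group fall into each target class."""
    return dict(Counter(labels))


# Hypothetical groups of writeoff labels (assumed data, for illustration only).
group_a = ["yes", "yes", "yes"]
group_b = ["yes", "no", "yes", "yes"]

print(is_pure(group_a))       # True: all members share the same target value
print(is_pure(group_b))       # False: one member differs from the rest
print(class_counts(group_b))  # {'yes': 3, 'no': 1}
```

The class counts show how far an impure group is from being pure, which is the more common situation in real data.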
 Complications
 Attributes rarely split a group perfectly.
 Not all attributes are binary; many attributes have three or more distinct values.
 Some attributes take on numeric values (continuous or integer).
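The complications above can be made concrete with a small sketch of how segmentation might handle a multi-valued attribute and a numeric one. Everything here is an assumption made for illustration: the records, the attribute names `residence` and `age`, the threshold of 35, and the helper functions `segment` and `segment_numeric` do not come from the original example.

```python
from collections import defaultdict


def segment(records, attribute, target="writeoff"):
    """Group target values by each distinct value of a categorical attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[target])
    return dict(groups)


def segment_numeric(records, attribute, threshold, target="writeoff"):
    """Split target values into two groups around a numeric threshold."""
    low_key, high_key = f"<= {threshold}", f"> {threshold}"
    groups = {low_key: [], high_key: []}
    for record in records:
        key = low_key if record[attribute] <= threshold else high_key
        groups[key].append(record[target])
    return groups


# Hypothetical people (assumed data, for illustration only).
people = [
    {"residence": "own", "age": 25, "writeoff": "yes"},
    {"residence": "rent", "age": 41, "writeoff": "no"},
    {"residence": "other", "age": 33, "writeoff": "yes"},
    {"residence": "rent", "age": 52, "writeoff": "no"},
]

# An attribute with three distinct values yields three segments.
print(segment(people, "residence"))

# A numeric attribute needs a split point before it can define segments.
print(segment_numeric(people, "age", 35))
```

A numeric attribute has many possible split points, so the choice of threshold is itself part of deciding how to segment.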