Entropy and Information Gain

Segmentation

  • If the segmentation is done using values of variables that will be known even when the target is not, then the resulting segments can be used to predict the value of the target variable.

  • Example

    “Middle-aged professionals who reside in New York City on average have a churn rate of 5%”

    • Definition of the segment
      • “middle-aged professionals who reside in New York City”
    • the predicted value of the target variable for the segment
      • “a churn rate of 5%”
  • Questions?
    • How can we judge whether a variable contains important information about the target variable? How much?

Example: a binary (two class) classification problem

Binary Classification Problem

[Source: Data Science for Business]

  • A set of people to be classified
  • Predictor attributes
    • Two types of heads: square and circular
    • Two types of bodies: rectangular and oval
    • Two colors of bodies: gray and white
  • Target variable
    • Loan write-off: yes or no
    • The label over each head represents the value of the target variable (write-off or not)
  • Question
    • Which of the attributes would be best to segment these people into groups?
  • Homogeneous with respect to the target variable
    • A group is pure
      • If every member of a group has the same value for the target, then the group is pure.
    • A group is impure
      • If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
    • In real data, we seldom expect to find a variable that will make the segments pure.
  • Complications
    • Attributes rarely split a group perfectly.
    • Not all attributes are binary; many attributes have three or more distinct values.
    • Some attributes take on numeric values (continuous or integer)
  • How well does each attribute split a set of examples into segments?
    • Most common splitting criterion: information gain
    • It is based on a purity measure called entropy (Claude Shannon, 1948)

Entropy

  • Entropy
    • a measure of disorder that can be applied to a set, such as one of our individual segments (a short computational sketch follows this list)
    • entropy = -p_1 \times log_2 (p_1) - p_2 \times log_2 (p_2) - ... = - \sum_{i=1}^{k} p_i \times log_2 (p_i)
      • k properties within the set
      • p_i is the probability (the relative percentage) of property i within the set
      • p_i = 1: all members of the set have property i
      • p_i = 0: no members of the set have property i
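
The entropy formula above maps directly to a few lines of Python. The sketch below is a minimal illustration, not from the source: the function name entropy and the use of the math module are my own choices. Terms with p_i = 0 are skipped, since p × log_2(p) → 0 as p → 0.

    import math

    def entropy(proportions):
        # proportions: the relative percentages p_i of each property i in the set
        # (terms with p_i == 0 are skipped because p * log2(p) -> 0 as p -> 0)
        return -sum(p * math.log2(p) for p in proportions if p > 0)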

Binary Classification Problem

[Source: Data Science for Business]

  • Entropy Example
    • consider a set S of 10 people with seven of the non-write-off class and three of the write-off class
    • p(non\text{-}write\text{-}off) = 7 / 10 = 0.7
    • p(write\text{-}off) = 3 / 10 = 0.3
    • entropy(S) = -0.7 \times log_2 (0.7) - 0.3 \times log_2 (0.3) \simeq 0.7 \times 0.51 + 0.3 \times 1.74 \simeq 0.88 (checked in the sketch below)
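
Assuming the entropy() sketch above, the same figure can be reproduced; the exact decimals differ from the slide only by rounding.

    print(entropy([0.7, 0.3]))  # ≈ 0.881, matching the ≈ 0.88 computed above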

Information Gain (IG)

  • How informative an attribute is with respect to our target?

  • Information Gain

    • Measurement of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
    • It measures the change in entropy due to any amount of new information being added
    • In the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.
  • Information Gain Formula

    • Attribute we split on has k different values
    • Parent set: the original set of examples
    • k child sets: the result of splitting on the attribute's values
      • c_i: the i-th child set
      • p(c_i): the proportion of instances belonging to the i-th child set
    • IG(parent, children) = entropy(parent) - [p(c_1) \cdot entropy(c_1) + p(c_2) \cdot entropy(c_2) + ...] = entropy(parent) - \sum_{i=1}^{k} p(c_i) \cdot entropy(c_i) (see the sketch after this list)
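
Below is a hedged sketch of this formula, reusing the entropy() helper from the entropy section; representing each set as a plain list of class labels is my own assumption about the data layout, not something the source specifies.

    from collections import Counter

    def entropy_of_labels(labels):
        # entropy of a set of instances, given their target-variable labels
        n = len(labels)
        return entropy([count / n for count in Counter(labels).values()])

    def information_gain(parent_labels, children_labels):
        # IG(parent, children) = entropy(parent) - sum_i p(c_i) * entropy(c_i)
        n = len(parent_labels)
        weighted = sum(len(c) / n * entropy_of_labels(c) for c in children_labels)
        return entropy_of_labels(parent_labels) - weighted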

Information Gain

[Source: Data Science for Business]

  • Information Gain Example
    • Parent set
      • 30 instances are divided into two classes (\bullet and \star)
        • \bullet: 16
        • \star: 14
      • entropy(parent) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.53 \times 0.9 + 0.47 \times 1.1 \simeq 0.99
      • High entropy
    • 1st child set
      • Criterion: BALANCE < 50M
      • 13 instances → p(BALANCE < 50M) = 13 / 30 \simeq 0.43
        • \bullet: 12
        • \star: 1
      • entropy(1st child) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.92 \times 0.12 + 0.08 \times 3.7 \simeq 0.39
    • 2nd child set
      • Criterion: BALANCE >= 50M
      • 17 instances → p(BALANCE >= 50M) = 17 / 30 \simeq 0.57
        • \bullet: 4
        • \star: 13
      • entropy(2nd child) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.24 \times 2.1 + 0.76 \times 0.39 \simeq 0.79
    • Information Gain
      • IG = entropy(parent) - [p(BALANCE < 50M) \times entropy(1st child) + p(BALANCE >= 50M) \times entropy(2nd child)] \simeq 0.99 - [0.43 \times 0.39 + 0.57 \times 0.79] \simeq 0.37 (reproduced in the sketch below)
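
The BALANCE example above can be reproduced with the sketches defined earlier (the label strings are arbitrary stand-ins for \bullet and \star; exact decimals differ slightly from the slide's rounded figures).

    parent   = ["dot"] * 16 + ["star"] * 14  # 30 instances
    child_lo = ["dot"] * 12 + ["star"] * 1   # BALANCE < 50M, 13 instances
    child_hi = ["dot"] * 4  + ["star"] * 13  # BALANCE >= 50M, 17 instances

    print(entropy_of_labels(parent))     # ≈ 0.997
    print(entropy_of_labels(child_lo))   # ≈ 0.391
    print(entropy_of_labels(child_hi))   # ≈ 0.787
    print(information_gain(parent, [child_lo, child_hi]))  # ≈ 0.381 (≈ 0.37 with the slide's rounding)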
