• Entropy and Information Gain

    Segmentation

    • If the segmentation is done using values of variables that will be known when the target is not known, then the resulting segments can be used to predict the value of the target variable.

    • Example

      “Middle-aged professionals who reside in New York City on average have a churn rate of 5%”

      • Definition of the segment
        • “middle-aged professionals who reside in New York City”
      • the predicted value of the target variable for the segment
        • “a churn rate of 5%”
    • Questions?
      • How can we judge whether a variable contains important information about the target variable? How much?

    Example: a binary (two class) classification problem

    Binary Classification Problem

    [Source: Data Science for Business]

    • A set of people to be classified
    • Predictor attributes
      • Two types of heads: square and circular
      • Two types of bodies: rectangular and oval
      • Two colors of bodies: gray and white
    • Target variable
      • Loan write-off: yes or no
      • The label over each head represents the value of the target variable (write-off or not)
    • Question
      • Which of the attributes would be best to segment these people into groups?
    • Homogeneous with respect to the target variable
      • A group is pure
        • If every member of a group has the same value for the target, then the group is pure.
      • A group is impure
        • If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
      • In real data, we seldom expect to find a variable that will make the segments pure.
    • Complications
      • Attributes rarely split a group perfectly.
      • Not all attributes are binary; many attributes have three or more distinct values.
      • Some attributes take on numeric values (continuous or integer)
    • How well does each attribute split a set of examples into segments?
      • Most common splitting criterion: information gain
      • It is based on a purity measure called entropy (Claude Shannon, 1948).

    Entropy

    • Entropy
      • a measure of disorder that can be applied to a set, such as one of our individual segments
      • entropy = -p_1 \times log_2 (p_1) - p_2 \times log_2 (p_2) - ... = - \sum_{i=1}^{k} p_i \times log_2 (p_i)
        • k properties within the set
        • p_i is the probability (the relative percentage) of property i within the set
        • p_i = 1: all members of the set have property i
        • p_i = 0: no members of the set have property i
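
    A minimal sketch in Python of the entropy formula above (an illustration under these definitions, not code from the book; the function name entropy and the example counts are hypothetical). Terms with p_i = 0 are skipped, since 0 \times log_2 (0) is treated as 0.

        import math

        def entropy(counts):
            """Entropy of a set, given the count of each property (class) in it."""
            total = sum(counts)
            # p_i = count_i / total; properties with zero count contribute nothing
            return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

        print(entropy([5, 5]))   # maximally impure 50/50 split -> 1.0
        print(entropy([9, 1]))   # nearly pure set -> ~0.47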

    Binary Classification Problem

    [Source: Data Science for Business]

    • Entropy Example
      • consider a set S of 10 people with seven of the non-write-off class and three of the write-off class
      • p(non\text{-}write\text{-}off) = 7 / 10 = 0.7
      • p(write\text{-}off) = 3 / 10 = 0.3
      • entropy(S) = -0.7 \times log_2 (0.7) - 0.3 \times log_2 (0.3) \simeq 0.7 \times 0.51 + 0.3 \times 1.74 \simeq 0.88
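
    The same arithmetic can be checked directly (a small sketch, not the book's code):

        import math

        p_non_write_off = 7 / 10
        p_write_off = 3 / 10
        entropy_S = -p_non_write_off * math.log2(p_non_write_off) \
                    - p_write_off * math.log2(p_write_off)
        print(round(entropy_S, 2))   # 0.88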

    Information Gain (IG)

    • How informative is an attribute with respect to our target?

    • Information Gain

      • Measurement of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
      • It measures the change in entropy due to any amount of new information being added
      • In the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.
    • Information Gain Formula

      • Attribute we split on has k different values
      • Parent set: the original set of examples
      • k child sets: the result of splitting on the attribute values
        • c_i: the ith child set
        • p(c_i): the proportion of instances belonging to the ith child set
      • IG(parent, children) = entropy(parent) - [p(c_1) \cdot entropy(c_1) + p(c_2) \cdot entropy(c_2) + ...] = entropy(parent) - \sum_{i=1}^{k} p(c_i) \cdot entropy(c_i)
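
    A minimal Python sketch of this formula under the definitions above (the helper names entropy and information_gain are hypothetical, not from the book). Each child set's entropy is weighted by the proportion p(c_i) of the parent's instances it contains.

        import math

        def entropy(counts):
            """Entropy of a set, given the count of each class in it."""
            total = sum(counts)
            return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

        def information_gain(parent_counts, children_counts):
            """IG(parent, children) = entropy(parent) - sum_i p(c_i) * entropy(c_i)."""
            parent_total = sum(parent_counts)
            weighted_children = sum(
                (sum(child) / parent_total) * entropy(child) for child in children_counts
            )
            return entropy(parent_counts) - weighted_children

        # Splitting a perfectly impure parent into two pure children recovers one full bit:
        print(information_gain([5, 5], [[5, 0], [0, 5]]))   # 1.0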

    Information Gain

    [Source: Data Science for Business]

    • Information Gain Example
      • Parent set
        • 30 instances are divided into two classes (\bullet and \star)
          • \bullet: 16
          • \star: 14
        • entropy(parent) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.53 \times 0.9 + 0.47 \times 1.1 \simeq 0.99
        • High entropy
      • 1st child set
        • Criterion: BALANCE < 50M
        • 13 instances -> p(BALANCE < 50M) = 13 / 30 \simeq 0.43
          • \bullet: 12
          • \star: 1
        • entropy(1st child) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.92 \times 0.12 + 0.08 \times 3.7 \simeq 0.39
      • 2nd child set
        • Criterion: BALANCE >= 50M
        • 17 instances -> p(BALANCE >= 50M) = 17 / 30 \simeq 0.57
          • \bullet: 4
          • \star: 13
        • entropy(2nd child) = - p(\bullet) \times log_2 (p(\bullet)) - p(\star) \times log_2 (p(\star)) \simeq 0.24 \times 2.1 + 0.76 \times 0.39 \simeq 0.79
      • Information Gain
        • IG = entropy(parent) - [p(BALANCE < 50M) \times entropy(1st child) + p(BALANCE >= 50M) \times entropy(2nd child)] \simeq 0.99 - [0.43 \times 0.39 + 0.57 \times 0.79] \simeq 0.37
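
    The worked example can be reproduced with the same kind of sketch (class counts taken from the example above; the exact values differ slightly from the rounded intermediates shown):

        import math

        def entropy(counts):
            """Entropy of a set, given the count of each class in it."""
            total = sum(counts)
            return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

        parent       = [16, 14]   # 16 bullet-class, 14 star-class instances
        child_lt_50m = [12, 1]    # BALANCE < 50M
        child_ge_50m = [4, 13]    # BALANCE >= 50M

        weighted = (sum(child_lt_50m) / sum(parent)) * entropy(child_lt_50m) \
                 + (sum(child_ge_50m) / sum(parent)) * entropy(child_ge_50m)
        ig = entropy(parent) - weighted

        print(round(entropy(parent), 3))        # 0.997, the ~0.99 above
        print(round(entropy(child_lt_50m), 2))  # 0.39
        print(round(entropy(child_ge_50m), 2))  # 0.79
        print(round(ig, 2))                     # 0.38 (~0.37 with the rounded values above)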
