Some intuitions behind the Information Gain, Gain Ratio and Symmetrical Uncertainty calculated by the FSelectorRcpp package, which can be a good proxy for correlation between unordered factors.
I'm a big fan of using FSelectorRcpp
in the exploratory phase to get an overview of the data. The main workhorse is the information_gain
function, which calculates… information gain. But how do we interpret the output of this function?
To understand this, you need to know a bit about entropy
. A good place to start is its Wikipedia page -
https://en.wikipedia.org/wiki/Entropy_(information_theory). If you don't know anything about entropy in information theory, please start there.
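As a minimal illustration (my own example, not from the package), Shannon's entropy of a discrete distribution boils down to a one-liner in R; note that the natural logarithm is used throughout this post, so values are in nats:
# entropy of a fair "coin": two outcomes, each with probability 1/2
p <- c(0.5, 0.5)
-sum(p * log(p)) # log(2), about 0.693 nats
# a very skewed distribution is much more predictable, so its entropy is lower
p <- c(0.99, 0.01)
-sum(p * log(p)) # about 0.056 nats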
Now let's go to the code. To calculate entropy in FSelectorRcpp
, all variables must be categorical (factor
or character
). By default information_gain
automatically discretizes numeric values using the so-called MDL
algorithm (it's not the topic of this post, so it won't be covered here). But I'll go step by step and discretize all the values on my own.
library(FSelectorRcpp)
disc <- discretize(Species ~ ., iris)
head(disc)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 (-Inf,5.55] (3.35, Inf] (-Inf,2.45] (-Inf,0.8] setosa
## 2 (-Inf,5.55] (2.95,3.35] (-Inf,2.45] (-Inf,0.8] setosa
## 3 (-Inf,5.55] (2.95,3.35] (-Inf,2.45] (-Inf,0.8] setosa
## 4 (-Inf,5.55] (2.95,3.35] (-Inf,2.45] (-Inf,0.8] setosa
## 5 (-Inf,5.55] (3.35, Inf] (-Inf,2.45] (-Inf,0.8] setosa
## 6 (-Inf,5.55] (3.35, Inf] (-Inf,2.45] (-Inf,0.8] setosa
Then calculating information_gain
looks like this:
# calling the information_gain on iris
# would give the same result
# information_gain(Species ~ ., iris)
information_gain(Species ~ ., disc)
## attributes importance
## 1 Sepal.Length 0.4521286
## 2 Sepal.Width 0.2672750
## 3 Petal.Length 0.9402853
## 4 Petal.Width 0.9554360
The theory tells us that information gain is defined as \(H(Class) + H(Attribute) - H(Class, Attribute)\), where \(H(X)\) is Shannon's entropy and \(H(X, Y)\) is the joint Shannon entropy of the variables \(X\) and \(Y\).
So now let's calculate the information gain step by step:
# function to calculate Shannon's entropy (natural logarithm, so values are in nats)
entropy <- function(x) {
  n <- table(x)     # counts of each level
  p <- n / sum(n)   # empirical probabilities
  -sum(p * log(p))
}
x <- entropy(disc$Sepal.Length) # H(Attribute)
y <- entropy(disc$Species) # H(Class)
# This step is quite fun, because to calculate the joint entropy you can
# just glue the values together (think a little bit about the equation from
# Wikipedia and it will become obvious).
xy <- entropy(paste(disc$Sepal.Length, disc$Species)) # H(Class, Attribute)
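Why does gluing the values work? paste() turns every pair of values into a single label, so its table is exactly the contingency table of the two variables. As a sanity check (my own sketch, not part of the package), the same joint entropy can be computed directly from that table:
n <- table(disc$Sepal.Length, disc$Species)
p <- n/sum(n)
p <- p[p > 0] # drop empty cells (0 * log(0) is treated as 0)
-sum(p*log(p)) # should equal xy from above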
So the final information gain is equal to:
x + y - xy
## [1] 0.4521286
Note that \(H(X) + H(Y) = H(X, Y)\) when there's no relation between \(X\) and \(Y\) (in that case the information gain is zero).
entropy(disc$Species)
## [1] 1.098612
set.seed(123)
# sample function used to destroy relation between variables
entropy(paste(sample(disc$Species), sample(disc$Species)))
## [1] 2.178778
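With the relation destroyed, the information gain computed from such shuffled columns should land close to zero (any small positive value is just sampling noise). A quick sketch reusing the entropy() helper from above:
a <- sample(disc$Species)
b <- sample(disc$Species)
entropy(a) + entropy(b) - entropy(paste(a, b)) # approximately 0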
Gain ratio and Symmetrical Uncertainty
FSelectorRcpp
allows you to use two other methods to calculate feature importance, both based on entropy and the information gain measure:
- Gain ratio - defined as \((H(Class) + H(Attribute) - H(Class, Attribute)) / H(Attribute)\).
- Symmetrical Uncertainty - equal to \(2 * (H(Class) + H(Attribute) - H(Class, Attribute)) / (H(Attribute) + H(Class))\).
Both scale the information gain into the \([0,1]\) range (zero when there's no relation, and one for perfect dependency).
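Both measures can also be reproduced by hand from the x, y and xy values computed earlier for Sepal.Length (just a sanity check of the definitions above; the results should match the Sepal.Length rows of the outputs below):
ig <- x + y - xy # information gain
ig / x           # gain ratio: divide by H(Attribute)
2 * ig / (x + y) # symmetrical uncertainty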
information_gain(Species ~ ., disc, type = "gainratio")
## attributes importance
## 1 Sepal.Length 0.4196464
## 2 Sepal.Width 0.2472972
## 3 Petal.Length 0.8584937
## 4 Petal.Width 0.8713692
information_gain(Species ~ ., disc, type = "symuncert")
## attributes importance
## 1 Sepal.Length 0.4155563
## 2 Sepal.Width 0.2452743
## 3 Petal.Length 0.8571872
## 4 Petal.Width 0.8705214
Note that because both values are defined on the \([0,1]\) range, they can serve as a proxy for correlation between two unordered factors (which is sometimes useful).
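For example, nothing stops you from putting one discretized attribute on the left-hand side of the formula to get such a correlation-like score between two factors (a minimal sketch; the choice of columns is purely illustrative):
# "correlation" between two unordered factors:
# 0 means no relation, values close to 1 mean strong dependency
information_gain(Petal.Width ~ Petal.Length, disc, type = "symuncert")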
Other resources:
- https://victorzhou.com/blog/information-gain/ - information gain from the Decision Trees perspective.