Some intuitions behind the Information Gain, Gain ratio and Symmetrical Uncertain calculated by the FSelectorRcpp package, that can be a good proxy for correlation between unordered factors.
I a big fan of using FSelectorRcpp in the exploratory phase to get some overview about the data. The main workhorse is the information_gain function which calculates… information gain. But how to interpret the output of this function?
To understand this, you need to know a bit about entropy.
New release of FSelectorRcpp (0.2.1) is on CRAN. I described near all the new functionality here. The last thing that we added just before release is an extract_discretize_transformer. It can be used to get a small object from the result of discretize function to transform the new data using estimated cutpoints. See the example below.
library(FSelectorRcpp) set.seed(123) idx <- sort(sample.int(150, 100)) iris1 <- iris[idx, ] iris2 <- iris[-idx, ] disc <- discretize(Species ~ .
The main purpose of the FSelectorRcpp package is the feature selection based on the entropy function. However, it also contains a function to discretize continuous variable into nominal attributes, and we decided to slightly change the API related to this functionality, to make it more user-friendly.
EDIT: Updated version (0.2.1) is on CRAN. It can be installed using: install.packages("FSelectorRcpp") The dev version can be installed using devtools:
devtools::install_github("mi2-warsaw/FSelectorRcpp", ref = "dev")