I have recently been asked the question: “Why do I receive a negative percent
variance explained in a random forest regression?” Besides the obvious answer,
“because your model is crap”, I thought I would explain the mechanism at work
here so that nobody assumes randomForest is producing erroneous results. For
poorly supported models it is, in fact, possible to receive a negative percent
variance explained.
Generally, explained variance (R²) is defined as the explained sum of squares
over the total sum of squares, which cannot be negative:
R² = sum((ŷ - mean(y))²) / sum((y - mean(y))²)
However, as indicated by Breiman (2001) and the R randomForest documentation,
the (regression only) “pseudo R-squared” is derived as:
R² = 1 - (mean squared error) / var(y)
This quantity can mathematically be negative: whenever the mean squared error
exceeds var(y), the ratio is greater than 1. A simple interpretation of a
negative R² (rsq) is that you would be better off predicting every sample as
the overall estimated mean, which indicates very poor model performance.
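To make the arithmetic concrete, here is a minimal sketch in plain Python (not the randomForest code itself) that evaluates both formulas on a toy example. The function names and the data are hypothetical, and var(y) is taken as the population variance (dividing by n) so that predicting the mean scores exactly 0; R's var() divides by n - 1, which would shift the values slightly.

```python
from statistics import mean, pvariance


def explained_ss_ratio(y, y_hat):
    """Explained sum of squares over total sum of squares (never negative)."""
    y_bar = mean(y)
    ss_explained = sum((p - y_bar) ** 2 for p in y_hat)
    ss_total = sum((v - y_bar) ** 2 for v in y)
    return ss_explained / ss_total


def pseudo_r2(y, y_hat):
    """Pseudo R-squared: 1 - MSE / var(y).

    Assumption: var(y) is the population variance (denominator n).
    """
    mse = mean((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1 - mse / pvariance(y)


# Hypothetical toy data: observed values and two sets of predictions.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
bad_predictions = [5.0, 4.0, 3.0, 2.0, 1.0]   # worse than guessing the mean
mean_predictions = [mean(y)] * len(y)         # always predict the mean

print(pseudo_r2(y, bad_predictions))           # -3.0
print(pseudo_r2(y, mean_predictions))          # 0.0
print(explained_ss_ratio(y, bad_predictions))  # 1.0
```

The badly wrong predictions have a mean squared error of 8 against a variance of 2, so the pseudo R² is 1 - 4 = -3, while always predicting the mean scores 0. The sum-of-squares ratio for the same bad predictions is still non-negative, which is why only the pseudo R² form can report a negative value.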