Getting Genetics Done: More on Exploring Correlations in R

Tuesday, August 28, 2012

More on Exploring Correlations in R

About a year ago I wrote a post about producing scatterplot matrices in R. These are handy for quickly getting a sense of the correlations that exist in your data. Recently someone asked me to pull out some relevant statistics (correlation coefficient and p-value) into tabular format to publish beside a scatterplot matrix. The built-in cor() function will produce a correlation matrix, but what if you want p-values for those correlation coefficients? Also, instead of a matrix, how might you get these statistics in tabular format (variable i, variable j, r, and p, for each i-j combination)? Here's the code (you'll need the PerformanceAnalytics package to produce the plot).

The cor() function will produce a basic correlation matrix. 12 years ago Bill Venables provided a function on the R help mailing list for replacing the upper triangle of the correlation matrix with the p-values for those correlations (based on the known relationship between t and r). The cor.prob() function will produce this matrix.
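A minimal sketch of this idea follows; it is not necessarily the exact code that was originally embedded in this post. It relies on the relationship t = r * sqrt(df / (1 - r^2)) with df = n - 2, and on the fact that t squared follows an F distribution with 1 and df degrees of freedom. The sketch assumes complete numeric data (no missing values), and the mtcars columns are used only for illustration.

```r
# Sketch: correlations in the lower triangle, p-values in the upper triangle.
# Assumes X is a numeric matrix or data frame with no missing values.
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)                 # mask for the upper triangle
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)             # F = t^2
  R[above] <- pf(Fstat, 1, dfr, lower.tail = FALSE)  # two-sided p-value
  R[row(R) == col(R)] <- NA                # diagonal carries no information
  R
}

# Illustrative data: four numeric columns from the built-in mtcars data set
m <- cor.prob(mtcars[, c("mpg", "disp", "hp", "wt")])
print(m)
```

Each upper-triangle entry should agree with the p-value that cor.test() reports for the same pair of variables.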

Next, the flattenSquareMatrix() function will "flatten" this matrix to four columns: one column for variable i, one for variable j, one for their correlation, and another for their p-value (thanks to Chris Wallace on StackOverflow for helping out with this one).
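The flattening step could be sketched roughly as below. This is an illustrative version under the assumption that the input holds correlations in the lower triangle and p-values in the upper triangle; the small matrix and its values are made up for the example.

```r
# Sketch: turn a square matrix (r below the diagonal, p above it)
# into one row per variable pair.
flattenSquareMatrix <- function(m) {
  if (!is.matrix(m) || nrow(m) != ncol(m)) stop("Must be a square matrix.")
  if (!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],   # lower-triangle value, read via the transpose
             p = m[ut],        # upper-triangle value
             stringsAsFactors = FALSE)
}

# Hypothetical input: the numbers are invented purely to show the layout
vars <- c("a", "b", "c")
m <- matrix(c(NA,   0.01, 0.20,
              0.80, NA,   0.03,
              0.30, 0.70, NA),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
flat <- flattenSquareMatrix(m)
print(flat)
```

Reading the correlation through t(m) keeps each r paired with the p-value for the same two variables.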

Finally, the chart.Correlation() function from the PerformanceAnalytics package produces a very nice scatterplot matrix, with histograms, kernel density overlays, absolute correlations, and significance asterisks (0.05, 0.01, 0.001):
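A call along the following lines should produce such a plot. The choice of mtcars columns is an assumption made for illustration, and the guard simply skips the plot when PerformanceAnalytics is not installed.

```r
# Illustrative data: four numeric columns from the built-in mtcars data set
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Draw the scatterplot matrix only if the package is available
if (requireNamespace("PerformanceAnalytics", quietly = TRUE)) {
  PerformanceAnalytics::chart.Correlation(dat, histogram = TRUE, method = "pearson")
}
```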

18 comments:

Evandro, August 28, 2012 at 6:34 PM
Amazing graphic! I'll put this in my preference list. Thanks for sharing. :)
ndronen, September 2, 2012 at 11:33 PM
Thanks for the post. chart.Correlation is very useful. Here's a little piece I wrote about using the correlation dimension to get a feeling for the distortions caused by groups of highly correlated variables, assuming one is looking for (groups of high) correlations as something to eliminate.
Unknown, September 6, 2012 at 2:49 AM
Thanks for sharing Stephen! Your blog is always a solid read.
Anonymous, September 14, 2012 at 5:34 AM
Was there any particular reason for reinventing cor.test()?
Stephen Turner, September 14, 2012 at 8:30 AM
cor() gives you correlations for all pairwise combinations of numeric vectors, and the cor.prob() function above extends this to give you both the correlation and the cor.test() p-value for each pairwise combination.
Mark, September 17, 2012 at 2:09 PM
Hey, Stephen! This is amazing! I'm a total statistics noob, and I'm confused about what information the plots in the lower half of the circle are actually giving. Any help?
Mark, September 18, 2012 at 12:17 PM
Hey, Stephen!

After working a little more with the chart.Correlation function, I've encountered a number of issues:

1. I'm working with a somewhat large matrix of traits, such that when the chart is generated, each cell is super small. Is there any way to increase the absolute size of the chart, so that the data are actually visible?

2. Similarly, some of my traits have long-ish names, and I was wondering whether there would be a way to wrap the text in the histogram cells (I could change the names of the traits, of course)...

3. Finally, I'm noticing that my correlation chart is not a symmetrical matrix in the end (there are several extra columns that don't correspond to any additional rows, and it's unclear what trait correlation coefficients are being displayed in them). I'm wondering whether this has something to do with missing data in my dataset (I got several "the standard deviation is zero" warnings, and also an "Error in cor.test.default(x, y) : not enough finite observations" message).

Any help or suggestions that you might have would be extremely appreciated! (And I'd be more than happy to send you my data and/or pictures of the chart I have so far privately).
Mark, September 20, 2012 at 12:03 PM
Thanks so much, Stephen! Now I've got a place to start! Much appreciated.
Akshata, May 25, 2014 at 11:42 PM
Great piece of code...thanks!!! Is there a way to show the sign of the correlations instead of absolute values?
Stephen Turner, May 27, 2014 at 8:34 AM
Hm, you'll probably want to modify the code in the chart.Correlation function in the PerformanceAnalytics package. I didn't write this code.
Teena, February 13, 2015 at 5:14 PM
Thank you for the beautiful scatterplot matrix code! For my agricultural data, I would like to replace the lowess overlay line in the lower left scatterplots with a linear regression line and show the slope (m= ) and intercept (y=). Is this possible?