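Principal component analysis (PCA) is a mathematical method for dimensionality reduction.
The PCA dimensionality-reduction procedure:
1) Standardize the data
2) Compute the covariance matrix
3) Sort the eigenvectors
4) Build the projection matrix
5) Transform the data
Compute the covariance matrix of the sample data across its dimensions, then solve for the eigenvalues and corresponding eigenvectors of that covariance matrix. Sort the eigenvectors by their eigenvalues in descending order and assemble them into a new matrix, called the eigenvector matrix or projection matrix. Then use this projection matrix to transform the sample data. Keeping only the first K dimensions of the transformed data achieves the dimensionality reduction.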
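The five steps can be sketched by hand in base R. This is only an illustrative sketch on a small simulated matrix; the names x, x.std, proj and K are made up for the example and do not come from the cases in this post. The result is cross-checked against prcomp(), whose components may differ in sign.

```r
# Minimal sketch of PCA "by hand" on hypothetical simulated data
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)  # 200 samples, 5 variables
x[, 2] <- x[, 1] + 0.3 * x[, 2]                    # make two variables correlated

# 1) Standardize: center each column and scale it to unit variance
x.std <- scale(x, center = TRUE, scale = TRUE)

# 2) Covariance matrix of the standardized data
cov.mat <- cov(x.std)

# 3) Eigen decomposition; for a symmetric matrix eigen() returns
#    the eigenvalues (and matching eigenvectors) in decreasing order
eig <- eigen(cov.mat, symmetric = TRUE)

# 4) Projection matrix: the first K eigenvectors as columns
K <- 2
proj <- eig$vectors[, 1:K]

# 5) Transform the data into the K-dimensional space
x.reduced <- x.std %*% proj

# Cross-check against prcomp(); compare absolute values because
# the sign of each component is arbitrary
ref <- prcomp(x, center = TRUE, scale. = TRUE)
max.diff <- max(abs(abs(x.reduced) - abs(ref$x[, 1:K])))
```

Here max.diff should be numerically close to zero, and eig$values should match ref$sdev^2.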
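Case 1
Creating the dataset
Use R to simulate a microarray data matrix with 10000 rows (10000 genes) and 100 columns (100 samples), filled with normally distributed random values with mean 0.
chip.data <- matrix(rnorm(10000 * 100, mean = 0), nrow = 10000, ncol = 100)
Output: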
1.jpg
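2. Among the 10000 genes, assume 100 differ between the two groups: 50 up-regulated and the other 50 down-regulated.
1) Generate a random permutation of the integers 1 to 1000 to use as row indices.
2) Create a 50*10 matrix of normally distributed values with mean 2, and assign it to chip.data in the rows given by the first 50 indices (1:50) from the previous step and the first 10 columns; this is the up-regulated set.
3) Build the 50 down-regulated genes the same way, using mean -2 and the next 50 indices (51:100) so that the up-regulated rows are not overwritten.
PCA plotting
diff.index <- sample(1:1000, 1000)
chip.data[diff.index[1:50], 1:10] <- rnorm(50 * 10, mean = 2)
chip.data[diff.index[51:100], 1:10] <- rnorm(50 * 10, mean = -2)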
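A quick way to confirm that the spiked-in genes behave as intended is to rebuild the matrix with a fixed seed and compare group means. The sketch below is a hypothetical check, not part of the original workflow: the names chip.sim, up and down are invented here, and it assumes the down-regulated genes take indices 51:100 of the permutation so the two sets do not overlap.

```r
# Reproducible re-run of the simulation with a sanity check (hypothetical names)
set.seed(42)
n.genes <- 10000
n.samples <- 100
chip.sim <- matrix(rnorm(n.genes * n.samples, mean = 0),
                   nrow = n.genes, ncol = n.samples)

# Random permutation of 1..1000; the first 100 entries pick the differential genes
idx <- sample(1:1000, 1000)
up <- idx[1:50]      # rows to be up-regulated
down <- idx[51:100]  # rows to be down-regulated (assumed non-overlapping)

# Spike in the signal in the first 10 samples only
chip.sim[up, 1:10] <- rnorm(50 * 10, mean = 2)
chip.sim[down, 1:10] <- rnorm(50 * 10, mean = -2)

# Group means: roughly +2, -2, and 0 for the untouched background genes
up.mean <- mean(chip.sim[up, 1:10])
down.mean <- mean(chip.sim[down, 1:10])
background.mean <- mean(chip.sim[-idx[1:100], ])
```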
Using the princomp function
PCA statistics
Description
princomp performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
## Default S3 method:
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
subset = rep_len(TRUE, nrow(as.matrix(x))), ...)
chip.data.pca <- princomp(chip.data)
Display the chip.data values
Display the summary statistics
> chip.data
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -8.764830e-01 -2.585436e+00 1.7486665932 0.6825088090 0.8905718598 2.2543743674
[2,] 2.756559e+00 9.191507e-01 1.7224333465 2.5164729313 0.3655551313 0.3940460436
[3,] 9.754316e-01 -9.121371e-01 -0.0534088859 0.4711108467 -0.6567994543 -0.9404594391
[4,] -1.443449e+00 6.328793e-01 0.7067575122 -2.0083705142 -0.0641474431 0.5404051953
[5,] -1.678596e+00 -4.086325e-01 -0.6946972480 0.9941794052 1.9677986393 0.4281278343
[6,] 2.318705e+00 2.574536e+00 2.4483722951 3.7352614791 0.6849518201 2.5269332706
[7,] 1.368299e+00 -6.396757e-01 -0.3016863422 -0.9881343210 0.7250075490 -1.1474935276
[8,] 4.547110e-01 -1.388434e+00 0.5724884590 1.3446862438 0.2708813623 0.0768302649
[9,] -3.320154e-01 1.015236e+00 0.0524039788 0.8327729956 1.5803932962 -1.1469311968
[10,] 1.442150e+00 -1.005228e+00 0.9377764607 1.5061633084 -0.7742683227 -1.9687078752
The first 10 principal components already account for 0.99733790 of the variance in the data.
> summary(chip.data.pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 3.240085 3.2099856 3.1956557 3.1691590 3.1505363 3.13960683 3.11757677 3.10222437 3.07273039 3.05572866
Proportion of Variance 0.105799 0.1038424 0.1029174 0.1012178 0.1000317 0.09933886 0.09794967 0.09698734 0.09515192 0.09410186
Cumulative Proportion 0.105799 0.2096414 0.3125588 0.4137765 0.5138082 0.61314710 0.71109677 0.80808411 0.90323603 0.99733790
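Standard deviation # standard deviation of each component
Proportion of Variance # proportion of variance explained (contribution)
Cumulative Proportion # cumulative proportion of variance explained
Plotting
1) Set the colours for the two groups of 100 differential genes. The colours of the two groups can be changed by replacing the codes 2 and 7 with other integers in the range 1:10.
2) plot3d(3D data for the x, y and z axes, group colours, plot type, radius)
Below, type = "s" draws the points as spheres.
The result is an interactive 3D plot; rotate it with the mouse and zoom in or out until the view is clearest.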
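The three rows of the summary table can be recomputed from the sdev component of a princomp object. The sketch below uses a small hypothetical matrix m rather than chip.data, and the 0.9 cut-off is an arbitrary choice for illustration.

```r
# Recompute the summary() rows by hand on hypothetical data
set.seed(7)
m <- matrix(rnorm(500 * 4), nrow = 500, ncol = 4)
fit <- princomp(m)

# Variance of each component is the squared standard deviation
variances <- fit$sdev^2

# Proportion of Variance: each component's share of the total variance
prop.var <- variances / sum(variances)

# Cumulative Proportion: running total of the shares
cum.prop <- cumsum(prop.var)

# Smallest number of components whose cumulative share reaches 90%
# (the 0.9 threshold is an assumption for this example)
n.keep <- which(cum.prop >= 0.9)[1]
```

The same calculation works for a prcomp object, which also stores its component standard deviations in sdev.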
colour<-c(rep(2,50),rep(7,50))
library(rgl)
plot3d(chip.data.pca$loadings[,1:3],col=colour,type="s",radius = 0.025)
2.jpg
plot3d(chip.data.pca$loadings[, 1 : 3], col = colour, type = "l", radius = 0.025)
Result drawn with lines:
3.jpg
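When reading these plots it helps to know what the loadings component contains. For a princomp fit on a matrix with n rows and p columns, loadings is a p x p matrix (one row per column of the input), while scores is n x p (one row per input row). The sketch below illustrates this on a smaller hypothetical matrix; the names small, small.pca and pts are invented for the example.

```r
# Shapes of loadings and scores in a princomp object (hypothetical data)
set.seed(3)
small <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)  # 1000 rows, 20 columns
small.pca <- princomp(small)

# loadings: one row per input column (variable)
loadings.dim <- dim(small.pca$loadings)  # 20 x 20

# scores: one row per input row (observation)
scores.dim <- dim(small.pca$scores)      # 1000 x 20

# The first three columns of the loadings give one 3D point per input column,
# which is the kind of matrix passed to plot3d() above
pts <- small.pca$loadings[, 1:3]
```

By the same reasoning, with the 10000 x 100 chip.data matrix the plotted loadings contain 100 points, one per sample column.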
Case 2
Loading the packages and dataset
About the dataset
rm(list=ls())
library(pca3d)
library(rgl)
data(metabo)
head(metabo)
4.jpg
136 observations, 425 metabolic variables
Metabolic profiles in tuberculosis.
Description
Relative abundances of metabolites from serum samples of three groups of individuals
Details
A data frame with 136 observations on 425 metabolic variables.
PCA computation
Serum samples from three groups of individuals were compared: tuberculin skin test negative (NEG), positive (POS) and clinical tuberculosis (TB).
Using the prcomp function
1) Drop the first column of the dataset (the group labels), use the remaining columns as the data, and standardize them
Principal Components Analysis
Description
Performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.
## Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
tol = NULL, rank. = NULL, ...)
2) Use the first column of the dataset as the grouping factor
Compute the results
metabo.pca <- prcomp(metabo[,-1], scale.=TRUE)
groups <- factor(metabo[,1])
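The same pattern (drop the label column, scale, keep the labels as a factor) can be tried without installing pca3d. The sketch below uses R's built-in iris dataset purely as a stand-in; it is not the metabo data, and the names iris.pca, iris.groups and iris.scores are invented for the example.

```r
# Same prcomp workflow on a built-in dataset (stand-in for metabo)
data(iris)

# Column 5 (Species) holds the group labels; PCA runs on the other columns.
# scale. = TRUE standardizes each variable to unit variance before the PCA.
iris.pca <- prcomp(iris[, -5], scale. = TRUE)

# Group labels as a factor, for colouring or grouping in plots
iris.groups <- factor(iris[, 5])

# summary() returns the importance table as a matrix with three rows:
# Standard deviation, Proportion of Variance, Cumulative Proportion
iris.importance <- summary(iris.pca)$importance

# Coordinates of each observation on the first two components
iris.scores <- iris.pca$x[, 1:2]
```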
Plotting
> summary(metabo.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 5.86992 5.38923 4.74978 4.11434 3.88969 3.81589 3.30208 3.09675 2.9872 2.9157 2.80259 2.71364 2.60341 2.56392
Proportion of Variance 0.08146 0.06866 0.05333 0.04002 0.03577 0.03442 0.02578 0.02267 0.0211 0.0201 0.01857 0.01741 0.01602 0.01554
Cumulative Proportion 0.08146 0.15012 0.20345 0.24347 0.27924 0.31366 0.33944 0.36211 0.3832 0.4033 0.42187 0.43928 0.45530 0.47084
Using pca3d
pca3d(PCA result, groups, whether to show confidence ellipses, the ellipse confidence level (default 0.95), whether to show the separating planes)
pca2d {pca3d} R Documentation
Show a three- or two-dimensional plot of a prcomp object
Description
Show a three- or two-dimensional plot of a prcomp object or a matrix, using different symbols and colors for groups of data
Usage
pca3d(pca, components = 1:3, col = NULL, title = NULL, new = FALSE,
axes.color = "grey", bg = "white", radius = 1, group = NULL,
shape = NULL, palette = NULL, fancy = FALSE, biplot = FALSE,
biplot.vars = 5, legend = NULL, show.scale = FALSE,
show.labels = FALSE, labels.col = "black", show.axes = TRUE,
show.axe.titles = TRUE, axe.titles = NULL, show.plane = TRUE,
show.shadows = FALSE, show.centroids = FALSE, show.group.labels = FALSE,
show.shapes = TRUE, show.ellipses = FALSE, ellipse.ci = 0.95)
pca3d(metabo.pca, group = groups, show.ellipses = TRUE, ellipse.ci = 0.75)
The result is an interactive 3D plot; rotate it with the mouse and zoom in or out until the view is clearest.
5.jpg
Remove the surrounding separating planes
pca3d(metabo.pca, group = groups, show.ellipses = TRUE, ellipse.ci = 0.75, show.plane = FALSE)
Result:
6.jpg
Source: http://www.jianshu.com/p/b6fdd2176ccc