用R语言检验数据正态性
本文为两大块,1.检验单变量正态分布;2.检验多元变量正态性。
本次需要用到数据集hflights,这是关于2011年离开休斯顿机场航班的数据,关于该数据集的描述如下:
This dataset contains all flights departing from Houston airports IAH (George Bush Intercontinental)
and HOU (Houston Hobby). The data comes from the Research and Innovation Technology Administration
at the Bureau of Transporation statistics:
http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0
先来看下数据概要,可以看到数据集很大,包含21个变量,共227496条数据。变量包含年月日、起飞和到达时间、延误时间、航班是否被取消等等。
str(hflights)
'data.frame': 227496 obs. of 21 variables:
$ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
$ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
$ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
$ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
$ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
$ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
$ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
$ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
$ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
$ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
$ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
$ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
$ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
$ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
$ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
$ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : chr "" "" "" "" ...
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
本次分析只用这个数据集的一个子集,即前往纽约肯尼迪国际机场(JFK)的航班,我们感兴趣的变量有2个:按分钟计时的滑行时间(TaxiIn)和离地时间(TaxiOut)。
library(hflights) # 载入hflights包
JFK <- hflights[which(hflights$Dest == 'JFK'), c('TaxiIn','TaxiOut')]
# 提取子集,目的地为JFK机场,需要滑行时间和离地时间两列数据
查看提取出来的子集JFK的前6行
head(JFK)
TaxiIn TaxiOut
57719 6 23
57720 12 11
58343 6 12
58344 9 31
58983 3 12
58984 5 9
单变量正态性
QQ图
par(mfrow = c(1,2)) # 设置图形参数,让生成的图像排成1行2列
qqnorm(JFK$TaxiIn, ylab = 'TaxiIn') # 做滑行时间的QQ图,将y轴命名为TaxiIn
qqline(JFK$TaxiIn) # 增加趋势直线,利于比较
qqnorm(JFK$TaxiOut, ylab = 'TaxiOut') # 做离地时间的QQ图,将y周命名为TaxiOut
qqline(JFK$TaxiOut)
观察QQ图发现,数据没有服从正态分布。
Shapiro-Wilk正态检验
使用shapiro.test函数即可
shapiro.test(JFK$TaxiIn)
Shapiro-Wilk normality test
data: JFK$TaxiIn
W = 0.83869, p-value < 2.2e-16
原假设H0是数据服从正态分布,这里p<0.05,拒绝原假设,数据服从正态分布不成立。
多元变量正态性
需要使用到MVN包,且数据中不能含有缺失值,这一点可以使用na.omit函数解决。
Mardia
mvn的使用方式如下:
mvn(data, subset = NULL, mvnTest = c("mardia", "hz", "royston", "dh",
"energy"), covariance = TRUE, tol = 1e-25, alpha = 0.5, scale = FALSE,
desc = TRUE, transform = "none", R = 1000, univariateTest = c("SW",
"CVM", "Lillie", "SF", "AD"), univariatePlot = "none",
multivariatePlot = "none", multivariateOutlierMethod = "none",
showOutliers = FALSE, showNewData = FALSE)
这里我们只使用其中3个参数:data, mvnTest, multivariatePlot。
- data:数据集
- mvnTest:检验多元变量正态性的方法
- multivariatePlot:作图
JFK_1 <- na.omit(JFK) # 将无缺失值数据保存在JFK_1中
mvn(JFK_1, mvnTest = c("mardia"), multivariatePlot = c("qq"))
$multivariateNormality
Test Statistic p value Result
1 Mardia Skewness 2351.95662019081 0 NO
2 Mardia Kurtosis 124.671344974903 0 NO
3 MVN <NA> <NA> NO
$univariateNormality
Test Variable Statistic p value Normality
1 Shapiro-Wilk TaxiIn 0.8387 <0.001 NO
2 Shapiro-Wilk TaxiOut 0.6452 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
TaxiIn 677 6.939439 3.278159 6 2 41 5 8 2.603622 18.23401
TaxiOut 677 14.521418 8.395277 12 5 86 10 16 3.713008 19.25249
Henze-Zirkler
mvn(JFK_1, mvnTest = c("hz"))
$multivariateNormality
Test HZ p value MVN
1 Henze-Zirkler 42.26252 0 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Shapiro-Wilk TaxiIn 0.8387 <0.001 NO
2 Shapiro-Wilk TaxiOut 0.6452 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
TaxiIn 677 6.939439 3.278159 6 2 41 5 8 2.603622 18.23401
TaxiOut 677 14.521418 8.395277 12 5 86 10 16 3.713008 19.25249
Royston
mvn(JFK_1, mvnTest = c("royston"))
$multivariateNormality
Test H p value MVN
1 Royston 264.1686 4.330916e-58 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Shapiro-Wilk TaxiIn 0.8387 <0.001 NO
2 Shapiro-Wilk TaxiOut 0.6452 <0.001 NO