Statistical Modeling and R Software (统计建模与R软件): Chapter 5 Exercises 5.1 to 5.11

Chapter 5: Hypothesis Testing

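Problem type: the phrase "mean platelet count of normal men" shows that this is a test on a population mean.

H0: the painters' mean platelet count equals that of normal men (mu = 225)    H1: it differs (mu != 225)
> x=c(220, 188, 162, 230, 145, 160, 238,
+  188, 247, 113, 126, 245, 164, 231, 256, 183, 190, 158, 224, 175)
> t.test(x,mu=225)
The output is: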

        One Sample t-test

data:  x
t = -3.4783, df = 19, p-value = 0.002516
alternative hypothesis: true mean is not equal to 225
95 percent confidence interval:
 172.3827 211.9173
sample estimates:
mean of x 
   192.15 

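Main arguments of t.test():
t.test(x,y=NULL,alternative=c('two.sided','less','greater'),mu=0,conf.level=0.95)
x and y are numeric vectors. alternative selects the alternative hypothesis and defaults to 'two.sided' (a two-sided test). mu is the hypothesized mean (or the hypothesized difference in means for two samples) and defaults to 0. conf.level is the confidence level of the interval, not the significance level, and defaults to 0.95.

p-value = 0.002516 < 0.05, so H0 is rejected: the painters' mean platelet count differs from that of normal men. The sample mean 192.15 is below 225 and the whole 95% confidence interval (172.38, 211.92) lies below 225, so the painters' platelet count is lower than that of normal men.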
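
Since the real question is whether the painters' count is lower, the one-sided alternative can also be tested directly. The following is a minimal sketch that reuses the data above; the names two_sided and one_sided are illustrative only.

```r
# Platelet counts of the 20 painters (data from the exercise above)
x <- c(220, 188, 162, 230, 145, 160, 238,
       188, 247, 113, 126, 245, 164, 231, 256, 183, 190, 158, 224, 175)

# Two-sided test, as in the transcript above
two_sided <- t.test(x, mu = 225)

# One-sided test: H1 is "the true mean is less than 225"
one_sided <- t.test(x, mu = 225, alternative = "less")

# The t statistic is negative, so the one-sided p-value
# is half of the two-sided one
print(two_sided$p.value)
print(one_sided$p.value)
```

A one-sided alternative should be chosen before looking at the data, not after seeing which direction the sample mean falls.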


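Problem type: computing a probability, P{X > x} with x = 1000.

> x=c(1067 ,919 ,1196 ,785,1126 ,936 , 918, 1156 , 920 ,948)
> pnorm(1000,mean(x),sd(x))  # normal distribution function at 1000, with mean and sd estimated from the sample
[1] 0.5087941

The value 0.5087941 is a probability, not a p-value: P{X <= 1000} = 0.5087941, so P{X > 1000} = 1 - 0.5087941 = 0.4912059.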
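
The upper-tail probability can also be requested from pnorm() directly through its lower.tail argument, which avoids the manual subtraction. A small sketch under the same normal model; the name p_upper is illustrative.

```r
# Sample data from the exercise above
x <- c(1067, 919, 1196, 785, 1126, 936, 918, 1156, 920, 948)

# Upper-tail probability P{X > 1000} under a normal model whose
# mean and standard deviation are estimated from the sample
p_upper <- pnorm(1000, mean = mean(x), sd = sd(x), lower.tail = FALSE)
print(p_upper)
```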


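Background note on choosing the hypotheses: the null hypothesis H0 is the default claim. It carries the equality sign and is rejected only when the data give strong evidence against it. The alternative hypothesis H1 is the claim the study hopes to support, so the research question normally goes into H1. Decide H1 first.

This exercise compares two treatments for anemia. The question is whether one works better than the other, so the alternative hypothesis is that the two treatments differ.

H0: the two treatments do not differ    H1: the two treatments differ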
> A=c(113,120,138,120,100,118,138,123)
> B=c(138,116,125,136,110,132,130,110)
> t.test(A,B,paired=TRUE)  # paired=TRUE treats A and B as paired data; the default is FALSE

        Paired t-test

data:  A and B
t = -0.65127, df = 7, p-value = 0.5357
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.628891   8.878891
sample estimates:
mean of the differences 
                 -3.375 

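p-value = 0.5357 > 0.05, so H0 cannot be rejected: the data show no significant difference between the two treatments.

An equivalent approach: let Z = A - B and run a one-sample t-test on Z. For paired data this is preferable to an unpaired two-sample test, and it is exactly the paired t-test, so the result is identical:
> t.test(A-B)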
        One Sample t-test

data:  A - B
t = -0.65127, df = 7, p-value = 0.5357
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -15.628891   8.878891
sample estimates:
mean of x 
   -3.375
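
To see where these numbers come from, the paired t statistic can be computed by hand from the within-pair differences. This is a sketch; the names d, t_manual and p_manual are illustrative.

```r
# Paired measurements from the exercise above
A <- c(113, 120, 138, 120, 100, 118, 138, 123)
B <- c(138, 116, 125, 136, 110, 132, 130, 110)

d <- A - B          # within-pair differences
n <- length(d)

# t statistic: mean difference divided by its standard error
t_manual <- mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

print(c(t = t_manual, p = p_manual))
```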


(1) H0: the sample comes from a normal population    H1: the sample does not come from a normal population

1. Shapiro-Wilk (W) normality test
> x=c(-0.7,-5.6,2,2.8,0.7,3.5,4,5.8,7.1,-0.5,2.5,-1.6,1.7,3,0.4,4.5,4.6,2.5,6,-1.4)
> y=c(3.7,6.5,5,5.2,0.8,0.2,0.6,3.4,6.6,-1.1,6,3.8,2,1.6,2,2.2,1.2,3.1,1.7,-2)                    
> shapiro.test(x) 

        Shapiro-Wilk normality test

data:  x
W = 0.9699, p-value = 0.7527


> shapiro.test(y)

        Shapiro-Wilk normality test

data:  y
W = 0.97098, p-value = 0.7754

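Result: the p-values for x and y are both greater than 0.05, so H0 cannot be rejected; both x and y are treated as coming from normal populations.

2. Kolmogorov-Smirnov test

> ks.test(x,"pnorm",mean(x),sd(x)) 
>  # Aside: to test against an exponential distribution, use ks.test(x,"pexp",rate), where rate is the exponential rate parameter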

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.10652, p-value = 0.9771
alternative hypothesis: two-sided

Warning message:
In ks.test(x, "pnorm", mean(x), sd(x)) :
  ties should not be present for the Kolmogorov-Smirnov test


> ks.test(y,"pnorm",mean(y),sd(y))

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.11969, p-value = 0.9368
alternative hypothesis: two-sided

Warning message:

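Result: the p-values for x and y are both greater than 0.05, so H0 cannot be rejected; both x and y are treated as coming from normal populations. The ties warning appears because the K-S test assumes a continuous distribution, under which repeated values should not occur, while x contains 2.5 twice and y contains 2 twice.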

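3. Pearson goodness-of-fit test

> sort(x)  # sort x in ascending order to choose the interval boundaries
 [1] -5.6 -1.6 -1.4 -0.7 -0.5  0.4  0.7  1.7  2.0  2.5  2.5  2.8  3.0  3.5
[15]  4.0  4.5  4.6  5.8  6.0  7.1
> cut(x,br=c(-6,-3,0,3,6,9))  # assign each value to an interval; br (short for breaks) holds the boundaries, like the x-axis bins of a histogram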
 [1] (-3,0]  (-6,-3] (0,3]   (0,3]   (0,3]   (3,6]   (3,6]   (3,6]   (6,9]  
[10] (-3,0]  (0,3]   (-3,0]  (0,3]   (0,3]   (0,3]   (3,6]   (3,6]   (0,3]  
[19] (3,6]   (-3,0] 
Levels: (-6,-3] (-3,0] (0,3] (3,6] (6,9]
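> t=table(cut(x,br=c(-6,-3,0,3,6,9)))  # count the observations in each interval
> t
 
 (-6,-3]  (-3,0]   (0,3]   (3,6]   (6,9] 
      1       4       8       6       1 
> p=pnorm(c(-3,0,3,6,9),mean(x),sd(x))  # cumulative probabilities F(b) at the upper endpoints b of the five intervals
> p
[1] 0.04894712 0.24990009 0.62002288 0.90075856 0.98828138

> chisq.test(t,p=p)  # p must hold the probability of each interval
Error in chisq.test(t, p = p) : probabilities must sum to 1.
The call fails because p holds cumulative probabilities, not interval probabilities, so its entries do not sum to 1. The probability of each interval is the difference of successive cumulative values: p[1], p[2]-p[1], p[3]-p[2], p[4]-p[3], with the last cell taken as 1-p[4]. The two outer cells then absorb the tails and the five probabilities sum to 1.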
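
A corrected version of the goodness-of-fit call can be sketched as follows. It turns the cumulative probabilities into interval probabilities before passing them to chisq.test(). The names breaks, counts, cum_p, cell_p and fit are illustrative, and counts replaces the name t, which would otherwise mask R's transpose function.

```r
x <- c(-0.7, -5.6, 2, 2.8, 0.7, 3.5, 4, 5.8, 7.1, -0.5,
       2.5, -1.6, 1.7, 3, 0.4, 4.5, 4.6, 2.5, 6, -1.4)

breaks <- c(-6, -3, 0, 3, 6, 9)
counts <- table(cut(x, breaks = breaks))   # observed frequency per interval

# Cumulative probabilities at the upper endpoints of the first four intervals
cum_p <- pnorm(breaks[2:5], mean = mean(x), sd = sd(x))

# Interval probabilities: successive differences, with both tails
# folded into the outer cells so that the vector sums to 1
cell_p <- c(cum_p[1], diff(cum_p), 1 - cum_p[4])

# Some expected counts are below 5, so chisq.test() warns that the
# chi-squared approximation may be inaccurate; the warning is silenced here
fit <- suppressWarnings(chisq.test(counts, p = cell_p))
print(fit)
```

chisq.test() uses 4 degrees of freedom here. Because the mean and standard deviation were estimated from the same data, a stricter analysis would reduce the degrees of freedom for the estimated parameters, so the reported p-value is best read as approximate.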

(2) Testing the means of the two groups

H0: the two group means are equal    H1: the two group means differ

1. t-test assuming equal variances:
> t.test(x,y,var.equal=TRUE)

        Two Sample t-test

data:  x and y
t = -0.64187, df = 38, p-value = 0.5248
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.326179  1.206179
sample estimates:
mean of x mean of y 
    2.065     2.625 


2. t-test without assuming equal variances:
> t.test(x,y)  # var.equal defaults to FALSE, which gives the Welch test

        Welch Two Sample t-test

data:  x and y
t = -0.64187, df = 36.086, p-value = 0.525
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.32926  1.20926
sample estimates:
mean of x mean of y 
    2.065     2.625 


3. Paired t-test:
> t.test(x-y)

        One Sample t-test

data:  x - y
t = -0.64644, df = 19, p-value = 0.5257
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -2.373146  1.253146
sample estimates:
mean of x 
    -0.56

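Result: all three tests give p-values greater than 0.05, so H0 cannot be rejected; there is no significant difference between the two group means.

(3) Testing whether the variances are equal

H0: the variances are equal    H1: the variances differ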

> var.test(x,y)

        F test to compare two variances

data:  x and y
F = 1.5984, num df = 19, denom df = 19, p-value = 0.3153
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6326505 4.0381795
sample estimates:
ratio of variances 
          1.598361

Result: the p-value is greater than 0.05, so H0 cannot be rejected; the two groups can be treated as having equal variances.
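
The F statistic reported by var.test() is simply the ratio of the two sample variances, and the two-sided p-value is twice the smaller tail probability of the F distribution. A sketch that reproduces both by hand; the names f_manual and p_manual are illustrative.

```r
x <- c(-0.7, -5.6, 2, 2.8, 0.7, 3.5, 4, 5.8, 7.1, -0.5,
       2.5, -1.6, 1.7, 3, 0.4, 4.5, 4.6, 2.5, 6, -1.4)
y <- c(3.7, 6.5, 5, 5.2, 0.8, 0.2, 0.6, 3.4, 6.6, -1.1,
       6, 3.8, 2, 1.6, 2, 2.2, 1.2, 3.1, 1.7, -2)

# F statistic: ratio of the sample variances
f_manual <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller of the two tail probabilities
p_lower <- pf(f_manual, df1, df2)
p_manual <- 2 * min(p_lower, 1 - p_lower)

print(c(F = f_manual, p = p_manual))
```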


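(1) H0: the sample follows a normal distribution    H1: the sample does not follow a normal distribution

Normality is checked with the K-S test:

> a=c(126,125,136,128,123,138,142,116,110,108,115,140)
> b=c(162,172,177,170,175,152,157,159,160,162)
> ks.test(a,"pnorm",mean(a),sd(a))  # test a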

        One-sample Kolmogorov-Smirnov test

data:  a
D = 0.14644, p-value = 0.9266
alternative hypothesis: two-sided


> ks.test(b,"pnorm",mean(b),sd(b))  # test b

        One-sample Kolmogorov-Smirnov test

data:  b
D = 0.22216, p-value = 0.707
alternative hypothesis: two-sided

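Warning message:      # b contains the value 162 twice; the K-S test assumes there are no ties
In ks.test(b, "pnorm", mean(b), sd(b)) :
  ties should not be present for the Kolmogorov-Smirnov test

Result: p-value = 0.9266 > 0.05 for a and p-value = 0.707 > 0.05 for b, so H0 cannot be rejected for either sample; both are treated as normally distributed.

(2) Testing the variances

H0: the variances are equal    H1: the variances differ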

> var.test(a,b)

        F test to compare two variances

data:  a and b
F = 1.9646, num df = 11, denom df = 9, p-value = 0.32
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5021943 7.0488630
sample estimates:
ratio of variances 
          1.964622

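Result: p-value = 0.32 > 0.05, so H0 cannot be rejected; a and b can be treated as having equal variances.

(3) Testing the means

H0: the means are equal    H1: the means differ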

> t.test(a,b,var.equal=TRUE)

        Two Sample t-test

data:  a and b
t = -8.8148, df = 20, p-value = 2.524e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -48.24975 -29.78358
sample estimates:
mean of x mean of y 
 125.5833  164.6000

Result: p-value = 2.524e-08 < 0.05, so H0 is rejected; the new-drug group and the control group differ.
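
Steps (2) and (3) above can be chained so that the outcome of the variance test decides which form of the t-test is run. A minimal sketch, assuming a 5% level; the names alpha, equal_var and res are illustrative.

```r
a <- c(126, 125, 136, 128, 123, 138, 142, 116, 110, 108, 115, 140)
b <- c(162, 172, 177, 170, 175, 152, 157, 159, 160, 162)

alpha <- 0.05   # assumed significance level

# Step 1: F test; treat the variances as equal unless H0 is rejected
equal_var <- var.test(a, b)$p.value > alpha

# Step 2: pooled t-test if the variances look equal, Welch test otherwise
res <- t.test(a, b, var.equal = equal_var)

print(equal_var)
print(res$p.value)
```

Running the Welch test unconditionally is a common alternative, since it does not depend on the outcome of a preliminary test.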


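Problem type: hypothesis test on a binomial proportion.

H0: the proportion of elderly residents in the city is 14.7% (p = 0.147)    H1: the proportion is not 14.7% (p != 0.147)

> binom.test(57,400,p=0.147)  # binom.test(number of successes, number of trials, p = proportion under H0)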

        Exact binomial test

data:  57 and 400
number of successes = 57, number of trials = 400, p-value = 0.8876
alternative hypothesis: true probability of success is not equal to 0.147
95 percent confidence interval:
 0.1097477 0.1806511
sample estimates:
probability of success 
                0.1425

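Result: p-value = 0.8876 > 0.05, so H0 cannot be rejected; the data are consistent with the view that 14.7% of the city's population is elderly.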


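Problem type: chicks are either female or male, and the natural sex ratio is 1:1, so each sex has probability 1/2. This is a hypothesis test on a binomial proportion, where p is the proportion of female chicks.

H0: p = 0.5    H1: p > 0.5

> binom.test(178,328,p=0.5,alternative="greater")  # method 1: number of successes, number of trials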

        Exact binomial test

data:  178 and 328
number of successes = 178, number of trials = 328, p-value = 0.06794
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.4957616 1.0000000
sample estimates:
probability of success 
             0.5426829 



> binom.test(c(178,150),p=0.5,alternative="greater") 
>  # method 2: c(number of successes, number of failures)

        Exact binomial test

data:  c(178, 150)
number of successes = 178, number of trials = 328, p-value = 0.06794
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.4957616 1.0000000
sample estimates:
probability of success 
             0.5426829 
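Both calls give the same result; only the way the counts are passed differs.

Result: p-value = 0.06794 > 0.05, so H0 cannot be rejected; at the 5% level the data do not show that the treatment increases the proportion of female chicks.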
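
The exact one-sided p-value is the probability of seeing 178 or more female chicks out of 328 when p = 0.5, and it can be computed directly from the binomial distribution. A sketch; the names n_female, n_total, p_tail and p_sum are illustrative.

```r
n_female <- 178   # observed number of female chicks
n_total  <- 328   # total number of chicks

# P{X >= 178} for X ~ Binomial(328, 0.5), via the upper tail of the CDF
p_tail <- pbinom(n_female - 1, size = n_total, prob = 0.5, lower.tail = FALSE)

# The same probability, summing the point probabilities explicitly
p_sum <- sum(dbinom(n_female:n_total, size = n_total, prob = 0.5))

print(c(p_tail = p_tail, p_sum = p_sum))
```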


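A Pearson chi-squared test checks whether the counts follow a specified distribution:

H0: the counts follow the law of independent assortment    H1: they do not

> chisq.test(c(315,101,108,32),p=c(9/16,3/16,3/16,1/16))  # p gives the hypothesized cell probabilities; the default is equal probabilities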

        Chi-squared test for given probabilities

data:  c(315, 101, 108, 32)
X-squared = 0.47002, df = 3, p-value = 0.9254

Result: p-value = 0.9254 > 0.05, so H0 cannot be rejected; the data are consistent with the law of independent assortment.
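
The Pearson statistic is the sum of (observed - expected)^2 / expected over the cells, compared with a chi-squared distribution with (number of cells - 1) degrees of freedom. A sketch that reproduces the result by hand; the names observed, prob, expected, chi_sq and p_value are illustrative.

```r
observed <- c(315, 101, 108, 32)
prob     <- c(9, 3, 3, 1) / 16          # 9:3:3:1 ratio under H0

# Expected counts under H0
expected <- sum(observed) * prob

# Pearson chi-squared statistic
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with (number of cells - 1) degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

print(c(chi_sq = chi_sq, p = p_value))
```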


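A Pearson chi-squared test is used here; the Poisson parameter is estimated by the sample mean.

H0: X follows a Poisson distribution    H1: X does not follow a Poisson distribution

> x =c(0, 1, 2, 3, 4, 5)
> y =c(92, 68, 28, 11, 1, 0)
> # The last groups have frequencies below 5, which makes the chi-squared approximation unreliable, so the groups for 3, 4 and 5 are merged into one
> y =c(92, 68, 28, 12)
> # Cumulative Poisson probabilities P{X <= k} for k = 0,...,5; the long mean(...) expression is the sample mean of the raw data
> q =ppois(x, mean(c(rep(0, 92), rep(1, 68), rep(2, 28), rep(3, 11), rep(4, 1), rep(5, 0))))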
> 
> q
[1] 0.4470879 0.8069937 0.9518558 0.9907271 0.9985500 0.9998094
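> chisq.test(c(92, 68, 28, 12), p = c(q[1], q[2] - q[1], q[3] - q[2], 1 - q[3]))  # ppois() returns cumulative probabilities P{X <= k}; successive differences give P{X = k}, and 1 - q[3] is P{X >= 3} for the merged cell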

        Chi-squared test for given probabilities

data:  c(92, 68, 28, 12)
X-squared = 0.91132, df = 3, p-value = 0.8227



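Result: p-value = 0.8227 > 0.05, so H0 cannot be rejected; X can be treated as following a Poisson distribution.

An incorrect approach, shown for contrast: ks.test() expects raw observations from a continuous distribution, but here it is given the frequency counts, so its output is meaningless.
> ks.test(c(92, 68, 28, 11,10,0),'ppois',mean(c(rep(0, 92), rep(1, 68), rep(2, 28), rep(3, 11), rep(4, 1), rep(5, 0))))

        One-sample Kolmogorov-Smirnov test

data:  c(92, 68, 28, 11, 10, 0)
D = 0.83333, p-value = 4.287e-05
alternative hypothesis: two-sided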
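
For the Poisson goodness-of-fit test above, the cell probabilities can also be taken from dpois() directly, which avoids differencing the cumulative values. A sketch; the names values, freq, lambda, observed, cell_p, fit and p_adjusted are illustrative, and the adjusted p-value rests on the assumption that one degree of freedom is given up for the estimated mean.

```r
values <- 0:5
freq   <- c(92, 68, 28, 11, 1, 0)

# Sample mean, used as the estimate of the Poisson parameter
lambda <- sum(values * freq) / sum(freq)

# Merge the sparse groups 3, 4 and 5 into a single ">= 3" cell
observed <- c(freq[1:3], sum(freq[4:6]))

# P{X = 0}, P{X = 1}, P{X = 2} and P{X >= 3}
cell_p <- c(dpois(0:2, lambda), ppois(2, lambda, lower.tail = FALSE))

fit <- chisq.test(observed, p = cell_p)
print(fit)

# chisq.test() uses df = 3; with one estimated parameter a stricter
# analysis would use df = 4 - 1 - 1 = 2
p_adjusted <- pchisq(unname(fit$statistic), df = length(observed) - 2,
                     lower.tail = FALSE)
print(p_adjusted)
```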


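A two-sample K-S test is used.

H0: the two distributions are the same    H1: the two distributions differ

# K-S test of whether two samples come from the same distribution: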
> x=c(2.36,3.14,7.52,3.48,2.76,5.43,6.54,7.41)
> y=c(4.38,4.25,6.53,3.28,7.21,6.55)
>  ks.test(x,y)

        Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.375, p-value = 0.6374
alternative hypothesis: two-sided

Result: p-value = 0.6374 > 0.05, so H0 cannot be rejected; there is no evidence that the two samples come from different populations.
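
The statistic D is the largest vertical gap between the two empirical distribution functions. The sketch below computes it by hand. It assumes the third value of x is 7.52; because D depends only on the ordering of the pooled data, any value above max(y) in that position gives the same D. The names Fx, Fy, pts, D_manual and D_ks are illustrative.

```r
x <- c(2.36, 3.14, 7.52, 3.48, 2.76, 5.43, 6.54, 7.41)  # third value assumed to be 7.52
y <- c(4.38, 4.25, 6.53, 3.28, 7.21, 6.55)

Fx <- ecdf(x)   # empirical distribution function of x
Fy <- ecdf(y)   # empirical distribution function of y

# The largest gap can only occur at an observed data point
pts <- sort(c(x, y))
D_manual <- max(abs(Fx(pts) - Fy(pts)))

D_ks <- unname(ks.test(x, y)$statistic)
print(c(D_manual = D_manual, D_ks = D_ks))
```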


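Test of independence for contingency-table data.

The question is whether using the monitoring device affects the caesarean-section rate.

H0: the two factors are independent (no effect)    H1: the two factors are not independent (there is an effect)

> x = c(358,2492,229,2745)
> dim(x)=c(2,2)  # reshape x into a 2 x 2 matrix, filled column by column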
> chisq.test(x)

        Pearson's Chi-squared test with Yates' continuity correction

data:  x
X-squared = 37.414, df = 1, p-value = 9.552e-10

Result: p-value = 9.552e-10 < 0.05, so H0 is rejected; use of the monitoring device is related to the caesarean-section rate.
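
For a 2 x 2 table chisq.test() applies Yates' continuity correction by default. The sketch below compares the corrected and uncorrected statistics and prints the expected counts; the names tab, with_yates and without_yates are illustrative.

```r
# 2 x 2 table from the exercise above, filled column by column
tab <- matrix(c(358, 2492, 229, 2745), nrow = 2)

with_yates    <- chisq.test(tab)                   # default: correct = TRUE
without_yates <- chisq.test(tab, correct = FALSE)  # plain Pearson statistic

print(with_yates$statistic)
print(without_yates$statistic)

# Expected counts under independence; the approximation is usually
# considered adequate when all of them are at least 5
print(with_yates$expected)
```

With counts this large the correction changes the statistic only slightly, and the conclusion is the same either way.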


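Test of independence for contingency-table data.

H0: the two factors are independent    H1: the two factors are not independent

> y=matrix(c(45 ,12 ,10 ,46 ,20, 28, 28,23 ,30 ,11 ,12,35),nrow=4,ncol=3)  # matrix() fills column by column by default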
> 
> y
     [,1] [,2] [,3]
[1,]   45   20   30
[2,]   12   28   11
[3,]   10   28   12
[4,]   46   23   35
> chisq.test(y)

        Pearson's Chi-squared test

data:  y
X-squared = 36.043, df = 6, p-value = 2.705e-06

p-value = 2.705e-06 < 0.05, so H0 is rejected; A and B are not independent.
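
After a significant result, the expected counts and standardized residuals stored in the chisq.test() result show which cells drive the dependence. A sketch; the name fit is illustrative.

```r
# 4 x 3 table from the exercise above, filled column by column
y <- matrix(c(45, 12, 10, 46, 20, 28, 28, 23, 30, 11, 12, 35),
            nrow = 4, ncol = 3)

fit <- chisq.test(y)

# Expected counts under independence
print(round(fit$expected, 2))

# Standardized residuals: cells with a large absolute value
# deviate most from what independence would predict
print(round(fit$stdres, 2))
```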
