Sampling Distributions and Central Limit Theorem in R (Repost)
The Central Limit Theorem (CLT) and the concept of the sampling distribution are critical for understanding why statistical inference works. Every ASQ Certified Six Sigma Black Belt (CSSBB) exam includes at least a handful of problems that require you to invoke the CLT. The CLT says that if you take many repeated samples of size n from a population and calculate the mean (or sum) of each one, the collection of those means will be approximately normally distributed once n is reasonably large… and the shape of the source distribution hardly matters, as long as it has a finite variance! (The Cauchy distribution, one of the options in the code below, does not, so it is the exception.)
I wrote some R code to help illustrate this principle for my students. This code lets you choose a sample size (n), a source distribution, and parameters for that source distribution, and then plots the sampling distributions of the mean, sum, and variance. (Note: when the population is normal, the sampling distribution of the variance is a scaled Chi-square distribution: (n-1)s²/σ² follows a Chi-square distribution with n-1 degrees of freedom!)
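The Chi-square remark can be checked by simulation. The following is a minimal sketch, assuming a normal population with mean 20 and standard deviation 3; the variable names and the seed are illustrative choices, not part of the original code.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n     <- 10                      # sample size
sigma <- 3                       # population standard deviation (assumed)
r     <- 10000                   # number of replications

# r samples of size n from a normal population, one sample per row
samples <- matrix(rnorm(n * r, mean = 20, sd = sigma), nrow = r)
s2      <- apply(samples, 1, var)          # sample variance of each row
scaled  <- (n - 1) * s2 / sigma^2          # candidate Chi-square(n - 1) variable

# A Chi-square(k) variable has mean k and variance 2k
print(c(mean = mean(scaled), var = var(scaled)))   # compare with 9 and 18

# Simulated versus theoretical quantiles
probs <- c(0.1, 0.5, 0.9)
print(rbind(simulated   = quantile(scaled, probs),
            theoretical = qchisq(probs, df = n - 1)))
```

If the two rows of quantiles are close, the scaled sample variance is behaving like a Chi-square variable with n-1 degrees of freedom. For non-normal populations the match is generally not exact.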
sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
  r <- 10000  # Number of replications/samples - DO NOT ADJUST
  # This produces a matrix of observations with
  # n columns and r rows. Each row is one sample:
  my.samples <- switch(src.dist,
    "E" = matrix(rexp(n*r,param1),r),           # param1 = rate
    "N" = matrix(rnorm(n*r,param1,param2),r),   # mean, standard deviation
    "U" = matrix(runif(n*r,param1,param2),r),   # minimum, maximum
    "P" = matrix(rpois(n*r,param1),r),          # lambda
    "C" = matrix(rcauchy(n*r,param1,param2),r), # location, scale
    "B" = matrix(rbinom(n*r,param1,param2),r),  # size, probability
    "G" = matrix(rgamma(n*r,param1,param2),r),  # shape, rate
    "X" = matrix(rchisq(n*r,param1),r),         # degrees of freedom
    "T" = matrix(rt(n*r,param1),r),             # degrees of freedom
    stop("src.dist must be one of E, N, U, P, C, B, G, X, T"))
  # One statistic per row, i.e. per sample:
  all.sample.sums <- apply(my.samples,1,sum)
  all.sample.means <- apply(my.samples,1,mean)
  all.sample.vars <- apply(my.samples,1,var)
  par(mfrow=c(2,2))  # 2 x 2 grid of plots
  hist(my.samples[1,],col="gray",main="Distribution of One Sample")
  hist(all.sample.sums,col="gray",main="Sampling Distribution\nof the Sum")
  hist(all.sample.means,col="gray",main="Sampling Distribution\nof the Mean")
  hist(all.sample.vars,col="gray",main="Sampling Distribution\nof the Variance")
}
There are 9 population distributions to choose from: exponential (E), normal (N), uniform (U), Poisson (P), Cauchy (C), binomial (B), gamma (G), Chi-square (X), and Student’s t (T). You have to provide either one or two parameters, depending on the distribution you select. For example, a normal distribution requires the mean and standard deviation, which describe where it’s centered and how fat or thin it is (that’s two parameters). A Chi-square distribution requires only the degrees of freedom (that’s one parameter). You can find out exactly which distributions require which parameters here: http://en.wikibooks.org/wiki/R_Programming/Probability_Distributions.
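One caveat about the Cauchy option (C): the Cauchy distribution has no finite mean or variance, so its sample means should not settle down as n grows. The following is a rough sketch of that behaviour; the helper name iqr.of.means and the seed are made up for this illustration.

```r
set.seed(2)   # arbitrary seed, only for reproducibility
r <- 10000    # number of replications

# Interquartile range of r sample means, each computed from
# n draws of a standard Cauchy distribution (location 0, scale 1)
iqr.of.means <- function(n) {
  samples <- matrix(rcauchy(n * r, location = 0, scale = 1), nrow = r)
  IQR(rowMeans(samples))
}

sizes <- c(10, 50, 100)
iqrs  <- sapply(sizes, iqr.of.means)
print(data.frame(n = sizes, iqr.of.sample.means = iqrs))
```

The spread of the sample means should stay roughly constant (near 2, the interquartile range of a standard Cauchy distribution) instead of shrinking with n, which is why the histograms for "C" never look normal.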
Here is an example that draws from an exponential distribution with a mean of 1/1. (param1 is the rate, so the mean is 1/param1: you specify the number you want in the denominator of the mean.)
sdm.sim(50,src.dist="E",param1=1)
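The code above produces a 2×2 grid of four histograms: one individual sample, followed by the sampling distributions of the sum, the mean, and the variance.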
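Because the exponential parameter is a rate rather than a mean, it can help to confirm the relationship numerically before choosing param1. This is a small sketch; the rates and the number of draws are arbitrary choices for illustration.

```r
set.seed(3)   # arbitrary seed, only for reproducibility
rates <- c(0.5, 1, 2, 4)

# Average of 100,000 draws for each rate; the theoretical mean is 1 / rate
simulated.means <- sapply(rates, function(rate) mean(rexp(100000, rate = rate)))

print(data.frame(rate        = rates,
                 theoretical = 1 / rates,
                 simulated   = simulated.means))
```

So to simulate a population with mean 4, for example, you would pass param1 = 1/4 (that is, 0.25).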
You aren’t allowed to change the number of replications in this simulation because of the nature of the sampling distribution: it’s a theoretical model that describes the distribution of a statistic over an infinite number of samples. A simulation can only approximate it with a finite number of replications, so if you vary the number of replications, you’ll see the mean of the simulated sampling distribution bounce around before it settles on the mean of the population. This is just an artifact of the simulation process, not a characteristic of the sampling distribution itself. Watkins et al. have a great description of this effect that all statistics instructors should be aware of. I chose 10,000 for the number of replications because 1) it’s close enough to infinity that the mean of the simulated sampling distribution is very nearly the mean of the population, but 2) it’s far enough away from infinity to not crash your computer, even if you only have 4GB or 8GB of memory.
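The effect of the number of replications can be seen outside of sdm.sim with a short experiment. This sketch assumes an exponential population with rate 1 and n = 50; the helper mean.of.means and the particular replication counts are illustrative only.

```r
set.seed(4)   # arbitrary seed, only for reproducibility
n <- 50       # sample size

# Mean of the simulated sample means, using r replications of size n
mean.of.means <- function(r) {
  samples <- matrix(rexp(n * r, rate = 1), nrow = r)
  mean(rowMeans(samples))
}

reps   <- c(10, 100, 1000, 10000)
result <- sapply(reps, mean.of.means)

# The population mean is 1 / rate = 1
print(data.frame(replications         = reps,
                 mean.of.sample.means = result,
                 abs.error            = abs(result - 1)))
```

With only a few replications the result can land noticeably away from 1; with 10,000 it should be very close, which is the motivation for fixing r at that value.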
Here are some more examples to try. You can see that as you increase your sample size (n), the shapes of the sampling distributions become more and more normal, and the variance of the sample mean decreases, constraining your estimate of the population mean more and more.
sdm.sim(10,src.dist="E",1)
sdm.sim(50,src.dist="E",1)
sdm.sim(100,src.dist="E",1)
sdm.sim(10,src.dist="X",14)
sdm.sim(50,src.dist="X",14)
sdm.sim(100,src.dist="X",14)
sdm.sim(10,src.dist="N",param1=20,param2=3)
sdm.sim(50,src.dist="N",param1=20,param2=3)
sdm.sim(100,src.dist="N",param1=20,param2=3)
sdm.sim(10,src.dist="G",param1=5,param2=5)
sdm.sim(50,src.dist="G",param1=5,param2=5)
sdm.sim(100,src.dist="G",param1=5,param2=5)
Reposted from: http://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29
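To put numbers on how the spread of the sample mean shrinks in the examples above, the standard deviation of the simulated means can be compared with the theoretical value σ/√n. This is a sketch under the assumption of an exponential population with rate 1 (so σ = 1); the helper sd.of.means is hypothetical and not part of sdm.sim.

```r
set.seed(5)   # arbitrary seed, only for reproducibility
r     <- 10000
sizes <- c(10, 50, 100)

# Standard deviation of r simulated sample means for sample size n
sd.of.means <- function(n) {
  samples <- matrix(rexp(n * r, rate = 1), nrow = r)
  sd(rowMeans(samples))
}

simulated <- sapply(sizes, sd.of.means)

# For rate = 1 the population standard deviation is 1
print(data.frame(n           = sizes,
                 simulated   = simulated,
                 theoretical = 1 / sqrt(sizes)))
```

Note that the same argument applied to the sum gives a standard deviation of σ√n, so the histogram of the sum gets wider, not narrower, as n increases.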