Stratified Sampling: Animated Statistics Using R -- Yihui XIE

Divide the population into subsets sharing common characteristics and do simple random sampling without replacement (SRSWOR) within the subsets.

Ideas

Stratified sampling is a commonly used probability sampling method that is often more precise than simple random sampling because it can reduce sampling error. A stratum is a subset of the population whose members share at least one common characteristic. Examples of strata might be males and females, or managers and non-managers. The researcher first identifies the relevant strata and their actual representation in the population. Random sampling is then used to select a sufficient number of subjects from each stratum. "Sufficient" refers to a sample size large enough for us to be reasonably confident that the sample from each stratum represents that part of the population. Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the other strata.
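To make the claim about sampling error concrete, here is a minimal simulation sketch. The population, the stratum means, the sample sizes and the names pop, srs_mean and strat_mean are all made up for illustration; the gain shown here depends on the strata differing a lot from each other.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical population: 3 strata of 100 units with very different means
pop = data.frame(y = c(rnorm(100, 0), rnorm(100, 10), rnorm(100, 20)),
    stratum = gl(3, 100))

# Mean of a simple random sample of 30 units, without replacement
srs_mean = function() mean(sample(pop$y, 30))

# Mean of a stratified sample: 10 units from each stratum (SRSWOR).
# The strata are equally sized, so the plain mean of the 30 values
# is the stratified estimate.
strat_mean = function() {
    mean(unlist(tapply(pop$y, pop$stratum, sample, size = 10)))
}

srs = replicate(1000, srs_mean())
strat = replicate(1000, strat_mean())

# Spread of the two estimators across repeated samples
c(srs = var(srs), stratified = var(strat))
```

With strata this different from each other, the stratified means should vary far less from sample to sample than the simple random sample means. When the strata are similar, the advantage shrinks.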

Implementations in R

As before, the only function we really need is sample(). The rest of the work is just to define the strata and do SRSWOR within each stratum. If the data are well organized, we can use tapply() to do the sampling. For example:

> (dat = data.frame(x = 1:15, stratum = gl(3, 5)))
    x stratum
1   1       1
2   2       1
3   3       1
4   4       1
5   5       1
6   6       2
7   7       2
8   8       2
9   9       2
10 10       2
11 11       3
12 12       3
13 13       3
14 14       3
15 15       3

> attach(dat)
> (tapply(x, stratum, sample, size = 2))
$`1`
[1] 1 4

$`2`
[1]  9 10

$`3`
[1] 12 11

> detach(dat)

I just sampled 2 elements from each stratum in the above example.
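The same idea can be written without attach(), and so that whole rows of the data frame are returned rather than only the values of x. This is one possible sketch, not the only way to do it; sample_rows and smp are hypothetical names introduced here.

```r
dat = data.frame(x = 1:15, stratum = gl(3, 5))

# Hypothetical helper: draw n rows without replacement from one stratum
# by sampling row indices, so every column of the row is kept
sample_rows = function(d, n) d[sample(nrow(d), n), , drop = FALSE]

# Split by stratum, sample within each piece, then stack the pieces again
smp = do.call(rbind, lapply(split(dat, dat$stratum), sample_rows, n = 2))
smp

# Check: 2 rows per stratum
table(smp$stratum)
```

Sampling row indices also sidesteps a known quirk of sample(): when its first argument is a single number such as 7, sample(7, 1) draws from 1:7 instead of returning 7.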

Animations for Stratified Sampling

Every rectangle stands for a stratum; I sampled 3 elements from each stratum.

R code:

# 100 points on a 10 x 10 grid; each row of the grid is one stratum
x = cbind(rep(1:10, 10), gl(10, 10))
par(mar = rep(0.5, 4), xaxs = "i", yaxs = "i")
for (i in 1:100) {
    plot(x, axes = FALSE, ann = FALSE, type = "n", xlim = c(0.5, 10.5),
        ylim = c(0.5, 10.5))
    # one horizontal band per stratum, in alternating colors
    rect(rep(0.5, 10), seq(0.5, 10, 1), rep(10.5, 10), seq(1.5,
        11, 1), col = c("beige", "white")[rep(1:2, 5)])
    # the whole population
    points(x, pch = 19, col = "blue")
    # circle the 3 units sampled without replacement from each stratum
    points(x[as.vector(replicate(10, sample(10, 3))) + rep(seq(0,
        90, 10), each = 3), ], col = "red", cex = 3, lwd = 2)
    Sys.sleep(1)
}

Please take some time to consider what this expression means: as.vector(replicate(10, sample(10, 3))) + rep(seq(0, 90, 10), each = 3).
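If you want to check your answer, one way to unpack the expression is to compute its pieces separately. The names pos, offset and idx are introduced here only for illustration and do not appear in the animation code.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# 3 within-stratum positions (1..10) for each of the 10 strata:
# a 3 x 10 matrix, one column per stratum
pos = replicate(10, sample(10, 3))

# Row offset of each stratum in x:
# stratum k occupies rows 10 * (k - 1) + 1 to 10 * k
offset = rep(seq(0, 90, 10), each = 3)

# as.vector() reads the matrix column by column, so the positions
# line up with the offsets stratum by stratum
idx = as.vector(pos) + offset

# Each column holds the 3 row indices sampled inside one stratum
matrix(idx, nrow = 3)
```

Each column of the printed matrix stays inside its own block of 10 rows (1 to 10, 11 to 20, and so on), which is exactly SRSWOR of size 3 within every stratum.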