back to Introduction

Calculating Population Size

Recall that we can estimate the population size $N$ from the number of individuals caught in the first round $M$, the number caught in the second round $C$, and the number of the second round that had already been captured $R$ using the equation $N=\frac{MC}{R}$. If we did a version of our experiment capturing and marking 50 individuals in our first trapping, then 150 in the second trapping, of which 3 were marked, we would calculate the number of individuals as follows:

M <- 50
C <- 150
R <- 3
est_N <- M * C / R
est_N
## [1] 2500

That works, but if we were doing this many times, we could easily make a mistake with the formula at some point, and we would like to avoid that if possible. The best way to do that is to write a new function in R to perform the calculation.

Functions in R

Functions are one of the real strengths of R (and really any programming language), since they allow you to run the same analysis repeatedly with different sets of input data, without having to copy and paste large chunks of code, avoiding copying errors, and without cluttering up your workspace with all of the intermediate results. To define a function, you use the function() command, which takes as its arguments the names of the arguments that your new function will take. The commands that make up the function are then put in curly braces, {}, and can use the arguments you just defined as variables within the function. The last line of your function should usually be a return() statement. Whatever you put in that statement will be the output of the function.

To be more explicit, here is an example of a function that calculates the estimated population size.

recapturePopSize <- function(first, second, recaught){
  N_est <- first * second / recaught
  return(N_est)
}

recapturePopSize(first = 50, second = 150, recaught = 3)
## [1] 2500

Note that just as with the built-in functions, if we give the arguments in the correct order, we don’t actually have to name them when we use the function: recapturePopSize(50, 150, 3) would have worked as well as the more explicit version above.

Simulating the experiment

Now let’s see how well our estimate of the population size from a capture-recapture experiment actually works. Let’s assume that the true population size is exactly 3000 lizards. On the first trip we capture and mark 50. When we rerelease them, then the total population will consist of 50 marked individuals and 2950 unmarked lizards, which we will represent as a vector of strings: 50 with the value of "marked" and the remaining 2950 "unmarked". On our second trip we will capture 150 individuals, just as in the example above. We will simulate this capture using the sample() function, which picks elements from a list randomly without replacement (unless you tell it differently). We will sample 150 individuals from the population vector, then count the number of those that are "marked".

# create the population vector with marked & unmarked 
marked_pop <- rep(c("marked", "unmarked"), c(50, 2950))
# sample from that population
trapped <- sample(marked_pop, 150)
n_marked <- sum(trapped == "marked")
n_marked
## [1] 2
recapturePopSize(50, 150, n_marked)
## [1] 3750

Doing this one time doesn’t tell us much about the distribution of our estimate, so we can write a function to do the simulation, with arguments that specify the total population size and the number of individuals captured in the first and second trappings.

simRecapture <- function(popsize, first, second){
  population <- rep(c("marked", "unmarked"), c(first, popsize - first))
  trapped <- sample(population, second)
  return ( sum(trapped == "marked") )
}

simRecapture(3000, 50, 150)
## [1] 2

Then we can use a function called replicate() to call this function many times with the same arguments. Each time it runs it will choose a different random sample from the population, so we will get different results. Let’s run it 10 times for now, then calculate the estimated population sizes for each sample using our recapturePopSize() function. Notice that this function works just as we might have hoped when we give it a vector rather than a single value, even though we didn’t do anything special when we wrote it.

sim_recaptures <- replicate(10, simRecapture(3000, 50, 150))
sim_recaptures
##  [1] 3 3 4 4 3 0 1 1 4 1
recapturePopSize(50, 150, sim_recaptures)
##  [1] 2500 2500 1875 1875 2500  Inf 7500 7500 1875 7500

It turns out that making the population vector every time is pretty slow, so if we were to do a lot of simulations, this would be pretty slow to calculate. Luckily, we can take advantage of a function built into R that can do the same kind of sampling much more quickly. The second capture is very much like the binomial sampling that we have seen before, in that we are sampling individuals of two kinds (marked and unmarked) from a population where we know what the frequency of each kind, but unlike with a binomial distribution, we are capturing individuals without replacement. This makes it into something called a hypergeometric distribution, which has the following probability function, where the variables are the same as those we defined earlier: To draw randomly from this distribution in R, we use the rhyper() function (r for random, hyper for hypergeometric). This takes four arguments:
nn - the number of times we want to draw from the distribution (the number of trials or simulations)
m - the number of marked individuals ($M$ in the equation above)
n - the number of unmarked individuals ($N-M$ in the equation above)
k - the number of individuals to choose for each sample ($C$ above)

To create a set of 10 samples with this function which is equivalent to the functions above, we would do the following:

sim_recaptured <- rhyper(nn = 10, m = 50, n = 2950, k = 150)
sim_recaptured
##  [1] 4 5 3 4 1 1 5 2 0 1

We can wrap the simulation and estimation into a single function to do the simulation and estimation in one step, then return a data frame with each row representing all of the parameters for each simulation and its results.

simRecapture2<- function(popsize, first, second, reps = 1){
  recaught <- rhyper(nn = reps, m = first, n = popsize - first, k = second)
  pop_est <- recapturePopSize(first, second, recaught)
  return(data.frame(popsize, 
                    first, 
                    second, 
                    recaught, 
                    pop_est)
         )
}
simRecapture2(3000, 50, 150, reps = 5)
##   popsize first second recaught pop_est
## 1    3000    50    150        3    2500
## 2    3000    50    150        4    1875
## 3    3000    50    150        0     Inf
## 4    3000    50    150        0     Inf
## 5    3000    50    150        3    2500

A Better Estimate

It turns out that this simple estimate of the population size is somewhat biased. This is perhaps easiest to see if you think about the case when you don’t capture any marked individuals (you may have some of those in the data you simulated). Then $R = 0$, and the population is estimated to be of infinite size, which is never a good thing. A somewhat better estimator is the Schnabel method:

Write a function to estimate the number of individuals in the population using the Schnabel method.
a. Using as the true population size the number of lizards you estimated were in the box, simulate 1000 experiments where you capture 100 individuals in the first trapping and 100 in the second. Be sure to store the data from these (and the following) simulations, as you will need to use them for later problems as well. Generate a histogram of the estimated population sizes that you calculated with the Schnabel method.
b. How do your results change if you captured 150 individuals in the first trapping? What about if you caught 150 in the second (and 50 in the first)? It may be helpful to combine all of your results into a single data frame, and use ggplot2/qlot() to make all of the histograms together.
c. For each of the three experimental designs above, calculate the mean squared error of the Schnabel estimate ($\mathrm{MSE} = \frac{\sum(\hat{N}_i - N)^2}{n}$, where $\hat{N}_i$ is your estimate from the each simulation, $N$ is the population size you simulated, and $n$ is the number of simulations. Which gave the best estimate of the actual population size? Do you think it is better to put more effort into your first or second sample?

Back to Experimental Design

Using your results from the above simulations, and any more simulations you might want to do (recording all of your results!), decide as a class what sampling scheme we should use. What should we do to get the most accurate results?

Capturing Lizards

Now that the class has agreed on the sampling scheme, lets do it! Each of you will come up and “catch” some lizards, marking each with a single dot on its belly. If we were doing a real experiment, we would be remiss not to take some measurements of the lizards. Measurements on toy lizards are a bit silly, but do record how many of each color you capture. After all of the lizards are marked, we will throw them back in the box and mix thoroughly. Then you can catch the second sample, again recording how many of each color were caught, and how many were marked (by color and total). Do not return these lizards to the box. We will keep them for later experiments.

Estimate the following quantities from the pooled class data:
a. The total number of lizards in the box.
b. The number of lizards of each color in the box.
c. The proportion of lizards that are each color.
d. There are at least two ways to do the second two calculations. Which do you think is more accurate? If you are motivated, design and perform some simulations to test your intuition.

Next

Now that we have an initial estimate of the lizard population, we need to think about how confident we are in that estimate, and ways we might improve our estimate.

Make sure you have saved all of your work from this week. You will need some of your results next time. Error and likelihood