Understanding Species’ Abundance. Bikas K Sinha, Retired Faculty, Indian Statistical Institute, Kolkata. RU Workshop: 18/04/2012
ABSTRACT We consider a multi-species assemblage in an infinite population with unknown and possibly heterogeneous species' abundance levels. A random sample of fixed size n has been drawn and has realized a certain species distribution. At this stage, there are two interesting inference problems: (i) prediction of the unknown collective abundance, as a proportion, of all hitherto unrealized species; (ii) assessment of the number of additional units to be sampled [in terms of the sample size n] to realize a certain number of hitherto unobserved species.
ABSTRACT…..contd. Finite population analogue of this problem is also worth discussing. I propose to review the available literature in this fascinating area of research.
Formulation…. Measurement of biodiversity is an important issue in ecological studies and planning wildlife conservation. The conventional method of measurement involves sampling in various forms such as line transects or quadrats. Many field studies appear to be rather ad hoc in their design, especially with respect to amount of effort put in. A crucial question in this context is how much sampling effort should be considered enough.
Formulation….. Two quantitative aspects of diversity are widely regarded as central to its measurement. These are: Species Richness (number of species of a taxon in a given geographical area) and Species Evenness (differences in relative abundance). The two can be combined in various ways to form diversity indices (see e.g. Patil and Taillie (1982))
Formulation…. Earlier studies (Gore and Paranjpe 1997) indicate that for estimation of diversity indices, a sample of 1000 units should suffice. On the other hand, the effort needed to estimate species richness is one order of magnitude higher. Consider a virtually infinite population with an underlying species’ distribution which is naturally unknown. It makes sense to ‘estimate’ the number of [distinct] likely available species PROVIDED ALL ARE ‘Equally Abundant’! Studies along this direction appeared from the 1950s through the 1980s.
Formulation…. More realistically, we can put a cap ‘K’ on the number of species and investigate the nature of their unknown and most likely heterogeneous abundance distribution given by $P_1, P_2, \ldots$; the labels are ‘hypothetical’ in nature. A random sample of fixed size n has been drawn and has realized a certain species distribution. We may ‘re-label’ the realized species as #1, #2, …, #$k_n$, say, with a decreasing observed abundance distribution $p_1 \ge p_2 \ge \ldots$
Formulation….. At this stage, there are two interesting inference problems : (i) Prediction of the unknown proportion of the collective abundance of all hitherto unrealized species (ii) Assessment of the quantum of additional units to be sampled [in terms of the sample size n] to realize a certain number of hitherto unobserved species.
Approach….. Note: As n tends to infinity, $k_n$ tends to K and the $p_i$’s tend to cover all the $P_i$’s up to a permutation. Finite Sample Study: $I(i; s_n) = 1$ if the species labelled ‘i’ has not been realized in the sample $s_n$ of size n, and 0 otherwise [i = 1 to K]. The unexplored collective species’ abundance parameter is given by $\Theta(s_n) = \sum_i P_i I(i; s_n)$, and we want to predict $\Theta(s_n)$ based on the given realization. Note that $\Theta(s_n)$ is a sample-dependent random quantity.
Solution….. Concept of Unbiased Prediction: $U(s_n)$ is an Unbiased Predictor of $\Theta(s_n)$ iff $E[U(s_n) - \Theta(s_n)] = 0$. Note: $E[\Theta(s_n)] = \Theta_n$, say, given by $\Theta_n = \sum_i P_i (1 - P_i)^n$. Interpretation of $\Theta_n$: having drawn a random sample of size n and having encountered some species, it is the chance that the next observation, randomly drawn from the same population, will produce a NEW species.
SOLUTION…. Estimation of $\Theta_n$ in the exact sense: each summand $P_i (1 - P_i)^n$ is a polynomial of degree (n+1) in $P_i$, with last term $(-1)^n P_i^{n+1}$. It is therefore impossible to estimate unbiasedly based on a sample of size n. Ignoring each summand’s last term in $\Theta_n$, $U(s_n)$ has a routine form obtained by term-by-term estimation for each ‘i’. The result is a Biased Predictor, which may be OK in most cases, even for moderate n.
Unbiased Prediction : Second Stage Sampling….. Go for additional ‘m’ observations ($s^*_m$) and mix them with $s_n$ to form ‘updated’ frequency counts of ‘already observed’ and possibly ‘new’ species categories. We are then able to define $U(s_n \cup s^*_m)$ for unbiased prediction of $\Theta(s_n)$, i.e., unbiased estimation of $\Theta_n$. Theoretical Results & Illustrative Examples follow.
Theory of Estimation….. $\Theta_n = \sum_i P_i (1 - P_i)^n = \sum_i \Theta_{i,n}$, where $\Theta_{i,n}$ is a polynomial of degree (n+1) in $P_i$. Use the standard formula for unbiased estimation of a polynomial function based on (n+m; m ≥ 1) observations: $\hat{\Theta}_{i,n} = U_i(s_n \cup s^*_m)$ for each i. These can be combined over all i’s as $\sum_i U_i(\cdot)$ and expressed in a nice form as $\hat{\Theta}_n = U(\cdot)$.
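To make the interpretation of $\Theta_n$ concrete, here is a minimal Python sketch that evaluates $\Theta_n = \sum_i P_i (1 - P_i)^n$ for a known abundance vector. The function name and the ten-species community are hypothetical, chosen only for illustration; in practice the $P_i$ are unknown, which is why a predictor is needed.

```python
def expected_unseen_mass(abundances, n):
    """Theta_n = sum_i P_i * (1 - P_i)**n for abundance proportions P_i."""
    if abs(sum(abundances) - 1.0) > 1e-9:
        raise ValueError("abundances must sum to 1")
    return sum(p * (1.0 - p) ** n for p in abundances)


# Hypothetical community (an assumption, not data from the talk):
# one dominant species, three moderately common ones, six rare ones.
abundances = [0.5, 0.1, 0.1, 0.1] + [0.2 / 6] * 6

for n in (10, 100, 1000):
    # Chance that draw number n+1 yields a species not seen in the first n draws.
    print(n, expected_unseen_mass(abundances, n))
```

The printed values shrink as n grows, since every term $P_i (1 - P_i)^n$ decreases in n.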
Formulae….. $U(\cdot) = \sum_{j=1}^{m} \binom{m-1}{j-1} V_{j,n+m} \big/ \binom{n+m}{j}$, where $V_{j,n+m}$ = # of species categories each with frequency ‘j’ in the combined sample of size n+m. Starr [1979, Annals of Statistics]. Special Case m = 1: $U(\cdot) = V_{1,n+1} / (n+1)$. Robbins [1968, AMS]. Note: $U(\cdot)$ is the UMVUE for $\Theta_n$. Clayton & Frees [JASA, 1987]; Nayak [JSPI, 1992]
Illustrative Examples…. K = 50; n = 100; m = 5 addl. units in Stage II
Species Freq. Counts Before / After Stage II
Label   Before   Added   After
1       47       +2      49
2       33       +2      35
3       11       +0      11
4       9        +0      9
5       -        +1      1
Computations…. $V_{1,105} = 1$; the rest are all 0’s. $U(\cdot) = 1/105 < 1\%$. Very little chance of discovery of new species at this stage.
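A short derivation shows why the weights in $U$ are the right ones. It is a sketch for a single species with abundance $P$ under sampling with replacement; the symbol $X$ for that species' count in the combined sample is introduced here and is not in the original slides.

```latex
% X ~ Binomial(n+m, P) is the count of one species in the combined sample.
\begin{align*}
P(1-P)^n
  &= P(1-P)^n \,\bigl[P + (1-P)\bigr]^{m-1}
   = \sum_{j=1}^{m} \binom{m-1}{j-1} P^{j} (1-P)^{n+m-j}, \\
E\bigl[\mathbf{1}\{X = j\}\bigr]
  &= \binom{n+m}{j} P^{j} (1-P)^{n+m-j}, \\
\text{hence}\quad
E\Biggl[\sum_{j=1}^{m} \frac{\binom{m-1}{j-1}}{\binom{n+m}{j}}\,
        \mathbf{1}\{X = j\}\Biggr]
  &= P(1-P)^n .
\end{align*}
```

Summing over all species replaces the indicator by the frequency-of-frequency count $V_{j,n+m}$, which gives the stated form of $U$. With $m = 0$ the expansion is unavailable, which matches the earlier remark that a sample of size n alone cannot estimate a polynomial of degree n+1 unbiasedly.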
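The computation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `starr_predictor` and its argument layout are illustrative assumptions, and exact fractions are used only to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def starr_predictor(combined_counts, n, m):
    """U = sum_{j=1..m} C(m-1, j-1) * V_j / C(n+m, j).

    combined_counts: per-species frequencies in the combined sample of size n+m.
    V_j is the number of species observed exactly j times in that sample.
    """
    if sum(combined_counts) != n + m:
        raise ValueError("counts must add up to n + m")
    v = Counter(c for c in combined_counts if c > 0)  # v[j] = V_j, 0 if absent
    return sum(
        Fraction(comb(m - 1, j - 1) * v[j], comb(n + m, j))
        for j in range(1, m + 1)
    )


# Frequencies after Stage II in the illustrative example: 49, 35, 11, 9, 1.
u = starr_predictor([49, 35, 11, 9, 1], n=100, m=5)
print(u, float(u))
```

Only species seen at most m times in the combined sample contribute, so here the single species with frequency 1 determines the value.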
Back Tracking…… If we have maintained a record of all the 100 incoming observations, we may use the last 5 to develop a prediction formula for the initial sample of 95. Also we can randomly select a subset of 5 and do the same exercise. Repeated sampling will produce a prediction distribution for an additional sample of size 5 and an initial sample of size 95.
Computations…. K = 50; n_ = 95 [deleting last 5]; n = 100
Species Freq. Counts
Label   Total of 95   Added   Total of 100
1       45            +2      47
2       31            +2      33
3       11            +0      11
4       8             +1      9
$V_{j,100} = 0$ for all 1 ≤ j ≤ 5, so $U(\cdot) = 0$: very low chance of discovery of new species after 95 draws.
Finite Population Search…. Measurement of Species’ Richness [# Species]. Prediction of Species’ Abundance [Prop. Unexpl.]. Natural Community: Trees / Birds / Mussels. Gore, A.P. and Paranjpe, S.A. (2001). A Course in Mathematical and Statistical Ecology, Kluwer Academic Publishers, London. Empirical findings on sample size: 1000 for Species’ Abundance; 10,000 for Species’ Richness [for a Bird Community with 500 Species].
Prediction of Species’ Abundance: Finite Population Inference
N = number of units in the finite population
K = cap on the number of distinct species
$N_1, N_2, \ldots, N_K$ = species-specific sizes
n = sample size under SRSWR/SRSWOR sampling
$s_n$ = sample of n units out of N units, drawn With / Without Replacement
WLOG: the observed species are 1, 2, …, $k_n$ with frequency counts $n_1, n_2, \ldots, n_{k_n}$, so that $n = \sum_i n_i$; $n_i > 0$
Finite Population Inference. Unexplored Species Abundance = $\sum_{j > k_n} (N_j / N)$. We want to predict $\Theta(s_n) = \sum_{j > k_n} (N_j / N)$, which excludes the abundance of all those $k_n$ species captured by $s_n$.
SRSWR (N, n) : Inference….. Under SRSWR(N, n) the SAME RESULTS HOLD. Define $P_i = N_i / N$; i = 1, 2, …, K. As before, $\Theta(s_n) = \sum_i P_i I(i; s_n)$ and we want to predict $\Theta(s_n)$. The Unbiased Predictor calls for m [≥ 1] additional sample units and is given by $U(\cdot) = \sum_{j=1}^{m} \binom{m-1}{j-1} V_{j,n+m} \big/ \binom{n+m}{j}$, where $V_{j,n+m}$ = # of species categories each with frequency ‘j’ in the combined sample of size n+m.
SRSWOR (N, n) : Results. Use $I(i, s_n) = 1$ if Species # i is not represented in $s_n$. $\Theta(s_n) = \sum_i P_i I(i, s_n)$ = combined proportion of units of unobserved species. $\Theta_n = E[\Theta(s_n)] = \sum_i P_i E[I(i, s_n)]$. SRSWR: $E[I(i, s_n)] = (1 - P_i)^n$. SRSWOR: $E[I(i, s_n)] = \binom{N-N_i}{n} \big/ \binom{N}{n}$, so $\Theta_n = \sum_i P_i \binom{N-N_i}{n} \big/ \binom{N}{n}$. P[Discovery of New Species in next draw] is given by $\sum_i \frac{N_i}{N-n} \binom{N-N_i}{n} \big/ \binom{N}{n} = N\Theta_n/(N-n) = \Theta^*_n$, say.
SRSWOR (N, n) : Results. The needed additional units are based on SRSWOR(N-n, m). Theorem: the UMVUE of $\Theta^*_n$ is given by $\sum_{j=1}^{m} \binom{m-1}{j-1} V_{j,n+m} \big/ \binom{n+m}{j}$, where $V_{j,n+m}$ = # of species categories each with frequency ‘j’ in the combined sample of size n+m. Note: estimate of $\Theta_n$ = (1 - f) × estimate of $\Theta^*_n$, where f = n/N = sampling fraction.
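The SRSWOR quantities above are easy to evaluate exactly when the species sizes are known. The sketch below is illustrative only: the function names and the four-unit toy population are assumptions, not part of the talk.

```python
from fractions import Fraction
from math import comb


def theta_n_wor(sizes, n):
    """Theta_n = sum_i (N_i/N) * C(N - N_i, n) / C(N, n) under SRSWOR."""
    N = sum(sizes)
    return sum(
        Fraction(Ni, N) * Fraction(comb(N - Ni, n), comb(N, n)) for Ni in sizes
    )


def theta_star_wor(sizes, n):
    """Theta*_n = N * Theta_n / (N - n): chance that draw n+1 is a new species."""
    N = sum(sizes)
    if not 0 <= n < N:
        raise ValueError("need 0 <= n < N so that a next draw exists")
    return Fraction(N, N - n) * theta_n_wor(sizes, n)


# Toy population (hypothetical): N = 4 units in species of sizes 2, 1, 1.
sizes = [2, 1, 1]
print(theta_n_wor(sizes, 1), theta_star_wor(sizes, 1))
```

For this toy population the second value can be checked by hand: if the first unit drawn belongs to the size-2 species, two of the remaining three units are new species; otherwise all three remaining units are.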
SRSWR (N, n ) : Distinguishable UNITS ? So far….tacit assumption: units within species are indistinguishable... so frequency counts V_ j, n+m’s were Relevant and informative…… For finite populations of Within-Species Distinguishable Units……WR Sampling…scope of repeated units….use of distinct units will improve estimation results. SINHA & SENGUPTA (1993) : CSA Bulletin, 43, 75-84.
SRSWR : Data Analysis….. Notations
n = initial SRSWR (N, n) sample size
$k_n$ = number of distinct species observed initially [WLOG: 1, 2, …, $k_n$]
m = additional SRSWR (N, m) sample size
$\Theta(s_n) = \sum_{j > k_n} [N_j / N]$ = abundance of unobserved species
$U(\cdot)$ = Unbiased Predictor of $\Theta(s_n)$ based on $[s_n \cup s^*_m]$
d = number of distinct units in $[s_n \cup s^*_m]$
$S_d$ = set of d distinct units
Rao-Blackwellization…. Improved Estimator = $E[U(\cdot) \mid S_d]$. Recall the expression $U = \sum_i U_i(s_n \cup s^*_m)$. The conditional expectation of the ith term, $E[U_i(s_n \cup s^*_m) \mid S_d]$, is evaluated as $\sum_{j=1}^{m} \binom{m-1}{j-1} [\Delta^{d_i} 0^{j}] [\Delta^{d-d_i} 0^{n+m-j}] \big/ \Delta^{d} 0^{n+m}$, where $\Delta$ = Delta (difference) Operator and $d_i$ = number of distinct units from the ith species category in the combined sample of size n+m.
Special Cases…. Below are the expressions for the ith term. m = 1: $\Delta^{d-1} 0^{n} / \Delta^{d} 0^{n+1}$ if $d_i = 1$; 0 otherwise. m = 2: $2\Delta^{d-2} 0^{n} / \Delta^{d} 0^{n+2}$ if $d_i = 2$; $[\Delta^{d-1} 0^{n+1} + \Delta^{d-1} 0^{n}] / \Delta^{d} 0^{n+2}$ if $d_i = 1$; 0 otherwise. And so on.
Illustrative Examples… SRSWR(N, n) : Distinguishable Units within each species
Population size N = 1000
K = cap on the no. of species = 20
Initial sample size n = 50
$k_n$ = observed no. of species = 6
Freq. counts of obs. species: 21, 13, 9, 4, 2, 1
Addl. sample size m = 5
Revised freq. counts: 21+2, 13+1, 9+0, 4+0, 2+0, 1+0, 0+1, 0+1 [2 new species are observed]
Computations…. Species-specific distinct units in the combined sample of size 55: 4, 3, 2, 1, 1, 1, 1. d = 13; $d_1 = 4$, $d_2 = 3$, etc. We need computations of $U_i(\cdot)$ for i = 1 to 7, conditional on the sets holding the $d_i$’s fixed. Case i = 1; $d_1 = 4$: $\sum_{j=1}^{5} \binom{4}{j-1} [\Delta^{4} 0^{j}] [\Delta^{9} 0^{55-j}] \big/ \Delta^{13} 0^{55}$. Etc.
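The case i = 1 above can be evaluated exactly with integer arithmetic. This is a sketch under the usual reading of the notation, namely that $\Delta^{d} 0^{n} = \sum_{k=0}^{d} (-1)^{d-k} \binom{d}{k} k^{n}$ (the "differences of zero"); the function names `delta_zero` and `rb_term` are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def delta_zero(d, n):
    """Differences of zero: Delta^d 0^n = sum_k (-1)^(d-k) * C(d, k) * k^n.

    This equals the number of ways n draws can cover exactly d distinct units.
    """
    return sum((-1) ** (d - k) * comb(d, k) * k ** n for k in range(d + 1))


def rb_term(d_i, d, n, m):
    """Conditional expectation of the i-th term given the d distinct units.

    sum_{j=1..m} C(m-1, j-1) * Delta^{d_i} 0^j * Delta^{d-d_i} 0^{n+m-j},
    divided by Delta^d 0^{n+m}.
    """
    total = n + m
    numerator = sum(
        comb(m - 1, j - 1) * delta_zero(d_i, j) * delta_zero(d - d_i, total - j)
        for j in range(1, m + 1)
    )
    return Fraction(numerator, delta_zero(d, total))


# Case i = 1 of the example: d_1 = 4, d = 13, n = 50, m = 5.
term_1 = rb_term(d_i=4, d=13, n=50, m=5)
print(float(term_1))
```

Since $\Delta^{4} 0^{j} = 0$ for j < 4, only the j = 4 and j = 5 terms of the sum contribute in this case.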
Open Issues…. Actual Study in Western India…… Peninsular India [Western Ghats]. This biodiversity hotspot (area 200,000 sq.km) is home to some 480 bird species. Earlier study suggests observing about 10,000 birds to estimate species richness of the Western Ghats. But it leaves open the issue of distributing the total effort over time and space……fresh study began with Anil Gore – Environmental Statistician with UniPune….
Improved Sampling Strategies. Planning a Field Study on a Smaller Scale for a reliable count of bird species. Reference Site: Silent Valley Nat’l Park, Kerala. Three Habitats: Evergreen [EV], Semi-Evergreen [SE] and Teak Plantations [P]. Study of Avian Diversity: the typical sampling unit is a transect, which leads to Transect Sampling. Coverage: 2 yrs x 12 months x 3 habitats x 2 visits [visits: Morning & Afternoon] = 144 visits
Baseline Data… Every visit : One transect was covered over a period of two hours. This could be either in the morning or in the evening. Thus the total number of transects covered = 144 Total numbers of birds seen is 4898 and in all 180 distinct species were seen. Earlier checklists show that there are 185 species in that area. This constitutes the baseline data or the universe for simulation.
Sampling Strategy…. We have a matrix of 180 rows corresponding to species seen and 144 columns corresponding to transects traversed. Entries in the matrix are number of individuals of the species recorded on the transects. In the simulation study we shall propose different strategies and compare their performance on the basis of estimate of species richness. We try to answer the question “Can we arrive at a good estimate of species richness with less effort than that put in the reference data?”
Check List…. The sample data set will be a list of sample transects out of the 144 mentioned above, the time of observation and the species abundances as recorded on each sampled transect. Based on the data and method of sampling, we will provide an estimate of the species richness and then compare different sampling strategies with the baseline scenario. Target: Coverage of 80% of the Species
SRSWR (144, n) : 1000 Runs
Table 1. Species Richness Estimates Based On Simple Random Sampling (With Replacement)
# Transects (n)   24        48        72        96        120       144
Mean              110       137       151       159       165       169
Minimum           84        119       132       138       148       154
Maximum           127       154       166       171       177       179
Stdev             6.33      5.81      5.16      4.78      4.38      4.00
MSE               4936.04   1866.67   883.156   460.282   254.51    146.17
Data Analysis….. The minimum number of species seen increases from 84 (46.67% of 180) with 24 transects to 154 (85.56% of 180) with 144 transects. The maximum number of species seen increases from 127 (70.6%) at 24 transects to 179 (99.4%) at 144 transects. SD decreases from 6.3 at 24 transects to 4.0 at 144. The table confirms that there is underestimation of species richness. The extent of bias decreases considerably as effort increases. Our target of 80% is reached easily with 72 transects. This is only half of the actual effort put in.
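The resampling exercise behind Table 1 can be sketched as follows. The baseline species-by-transect matrix is not reproduced in these slides, so the code runs on a tiny made-up matrix; the function names are assumptions, and measuring MSE against the number of rows (180 species in the study) is an assumption that appears consistent with the tabulated values.

```python
import random
import statistics


def richness_in_sample(matrix, transect_ids):
    """Number of species (rows) with at least one individual on the sampled transects."""
    return sum(1 for row in matrix if any(row[t] > 0 for t in transect_ids))


def simulate_srswr(matrix, n, runs=1000, seed=0):
    """Draw n transects with replacement, `runs` times, and summarise observed richness."""
    rng = random.Random(seed)
    n_transects = len(matrix[0])
    target = len(matrix)  # richness of the baseline "universe"
    estimates = [
        richness_in_sample(matrix, [rng.randrange(n_transects) for _ in range(n)])
        for _ in range(runs)
    ]
    return {
        "mean": statistics.mean(estimates),
        "min": min(estimates),
        "max": max(estimates),
        "stdev": statistics.stdev(estimates),
        "mse": sum((e - target) ** 2 for e in estimates) / runs,
    }


# Hypothetical 4-species x 5-transect count matrix, for illustration only.
toy = [
    [3, 0, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 4, 1],
    [1, 1, 1, 1, 1],
]
for n in (1, 3, 5):
    print(n, simulate_srswr(toy, n, runs=200))
```

Because observed richness can never exceed the baseline richness, the simulated mean sits below the target, which is the underestimation the table shows.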
Species Accumulation Curve…. Not much improvement after 72 efforts
Effect of Intra-Day Division of Efforts on Species Count
Morning : Evening   Mean Richness   S.D.
100:0               133             3.8
67:33               152             4.9
50:50               151             5.2
33:67               147             5.1
0:100               144             4.4
Conclusion….. Barring the first row, all other estimates are fairly close to each other. Common practice among ornithologists is to distribute equal efforts between morning and evening. In view of the above results it seems reasonable to stick to the convention.
Choice of Season of the Year It is common to concentrate efforts in migratory season obviously because migratory species are not observable in the other season. If all effort (72 transects) is put in migratory season, we get to see on an average 143 species (S.D.=4.4). We further tried adding a small unit of effort (24 transects) in non-migratory season. This improved the estimate to 158 (S.D. = 4.4). Thus an improvement of 10% was possible. Hence our recommendation is that most effort indeed should be put in the migratory season.
HABITATS…. The issue is how to divide total effort among available habitats. Since the aim is species accumulation, it seems intuitively obvious that allotment of effort should be related to the number of species that use a particular habitat. We happen to have an overall picture of Western Ghats as a whole. This provides relevant data for the three habitats under consideration.
Habitat…. Species Richness by Habitat Type
Habitat Type                Species Count
EVF (Evergreen Forest)      145
MDC (Semi Evergreen)        183
Teak Plantation [Manmade]   215
In the above table the numbers of species in these three habitats are in the proportion 27:34:39 (145:183:215). So effort can also be divided in the same proportion.
Sequence to Follow….. Which sequence of habitats should be followed in this study? We first compare the performance of different sequences of habitats. This comparison is based on species accumulation. A sequence is preferred if the corresponding species accumulation curve rises faster and becomes flat quickly. In this case no new species are seen in the last few transects in the habitat observed last. By this criterion, the sequence EVF-MDC-Manmade seems the best. Here in fact new species are hardly seen after 82 transects.
Cycle Sampling : A New Study…. In view of the above considerations we propose the following sampling strategy. List the habitats to be studied. Traverse one transect in each habitat. This completes one cycle of field work. Now take up the corresponding analysis. This consists of generating species accumulation curves for different sequences of habitats. Our interest is to see if any particular habitat is redundant at any stage [in view of the species already discovered up till the previous stage]
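Dividing effort in proportion to habitat richness is simple arithmetic, sketched below. The largest-remainder rounding rule and the function name are assumptions made so that the allocations are whole transects adding up to the total; the slides do not say how rounding was handled.

```python
def allocate_effort(species_counts, total_transects):
    """Split total_transects across habitats in proportion to species counts.

    Uses largest-remainder rounding (an assumption) so allocations are integers.
    """
    total = sum(species_counts.values())
    shares = {h: divmod(total_transects * c, total) for h, c in species_counts.items()}
    alloc = {h: q for h, (q, _) in shares.items()}
    leftover = total_transects - sum(alloc.values())
    # Hand the remaining transects to the habitats with the largest remainders.
    for h in sorted(shares, key=lambda h: shares[h][1], reverse=True)[:leftover]:
        alloc[h] += 1
    return alloc


counts = {"EVF": 145, "MDC": 183, "Teak": 215}
print(allocate_effort(counts, 144))
print(allocate_effort(counts, 72))
```

The same function can be rerun with any total effort, for example the 72 transects that reached the 80% target under simple random sampling.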
Cycle Sampling….. A habitat will be regarded as redundant if it fails to add any species in this accumulation curve. Next take a cycle of one transect per habitat replacing a redundant habitat by any other habitat left out earlier. Cycle sampling continues till the total number of transects traversed reaches the predetermined limit or accumulation curve reaches a plateau or all habitats are dropped as redundant, whichever happens earlier.
Cycle Sampling….. The cycle sampling strategy was adopted in this study. In each cycle, 4 transects of each chosen habitat were sampled. At the end of the first cycle (12 transects), teak plantation yielded 6 new species (1.5/transect). Hence there was no redundancy and the entire sequence was repeated. At the end of the second cycle (24 transects), it turned out that teak plantation yielded 3 new species in 4 transects (0.75/transect). This was below the threshold of 1 new species / transect. Hence the unrewarding habitat, namely, teak plantation, was dropped. From then on each cycle consisted of 8 transects only.
Cycle Sampling….. At the end of cycle 6, yield from habitat MDC fell below threshold. Hence it was dropped. In the next cycle EVF also failed to remain above threshold. Thus with 60 transects we terminate the sampling exercise. We have observed 158 (88% of 180) species at this effort level, which is 42% of 144 transects traversed in the reference data set.
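The stopping logic described above can be sketched in Python. This is a simplified illustration, not the study's actual procedure: it assumes each habitat's transects are available as an ordered list of species sets, applies the threshold of 1 new species per transect, checks the effort limit only between cycles, and omits the replacement of a dropped habitat by one left out earlier. All names are hypothetical.

```python
def cycle_sampling(transects_by_habitat, per_cycle=4, threshold=1.0, max_transects=144):
    """Adaptive cycle sampling: drop a habitat once its yield falls below threshold.

    transects_by_habitat: dict mapping habitat -> list of sets of species seen,
    one set per transect; dict order is the visiting sequence within a cycle.
    """
    seen = set()
    used = 0
    active = list(transects_by_habitat)
    position = {h: 0 for h in active}
    history = []  # (habitat, transects walked, new species found) per visit
    while active and used < max_transects:
        still_active = []
        for h in active:
            start = position[h]
            batch = transects_by_habitat[h][start:start + per_cycle]
            if not batch:
                continue  # no transects left in this habitat: treat as dropped
            position[h] += len(batch)
            used += len(batch)
            new = set().union(*batch) - seen
            seen |= new
            history.append((h, len(batch), len(new)))
            if len(new) / len(batch) >= threshold:
                still_active.append(h)  # still rewarding, keep for next cycle
        active = still_active
    return seen, used, history


# Tiny made-up data set: species are integers, two habitats, 2 transects per cycle.
data = {
    "A": [{1, 2}, {3}, {3}, {4}, {5}, {6}],
    "B": [{1}, {2}, {7}, {8}],
}
seen, used, history = cycle_sampling(data, per_cycle=2, threshold=1.0)
print(sorted(seen), used, history)
```

In the toy run, habitat B adds nothing new in the first cycle and is dropped, and habitat A is dropped after the second cycle, so sampling stops well before either list of transects is exhausted.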
Cycle Sampling….. We have suggested an adaptive sampling strategy that is dynamic and adjusts decisions at a point of time according to the accumulated information available at that point. This strategy called cycle sampling seems capable of saving effort to a substantial extent.
Thanks…. This is the end of my Technical Presentation…. Bikas K Sinha RU Workshop April 18, 2012