Raymond J. Carroll Department of Statistics Member, Center for Statistical Bioinformatics Director, Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll The Interface of Functional and Longitudinal Data
My Charge Please feel free to talk about anything you wish (Dangerous) Your thinking about longitudinal data and perhaps functional data from a wider perspective Goals of the workshop are to inspire new researchers, and to take stock of where the interface of longitudinal-functional data and dynamics is headed
What I Want to Talk about Mother and joey, Tidbinbilla (outside Canberra), September 2010
What I Want to Talk about Namadji National Park July 2005
What I Will Talk About I will talk about some of the problems I have worked on No technical solutions, the other speakers look to be providing them Investigators think marginally, statisticians think of random effects
Some Observations In my work, there is a tension between: providing answers to my collaborators that they can understand; developing new general methodology that is publishable in statistics and can solve more general problems; and thinking about parts of the actual problem that my collaborators would not have thought about. It's easy to get caught up in either of the first two.
Some Observations When I am simply providing answers to stated questions, I find themes similar to the distinction between marginal models such as GEE and nonlinear mixed effects models for longitudinal data. GEE is simply easier. Most scientists think marginally because they are uncomfortable with the idea of variability.
What I Will Talk About Think what the typical smart biologist knows about statistics. t-tests, ANOVA, simple linear regression All the focus is on the mean, none on the variability
Some Observations What we have to do is to deliver the analysis the data collectors can understand, and teach them about variability Pictures work wonders: functions are no harder to understand than histograms, and understanding variability can help investigators tell stories
Some Observations We need to advance the field of statistics Deeper understanding of the underlying process, through random effects modeling, often helps inform future studies and helps investigators tell their story
An Old Colon Carcinogenesis Project Experiment with 2 lipids (fish oil and corn oil) with and without butyrate (a fatty acid) supplementation, with p27 or MGMT repair measured as the response. Longitudinal, maybe even dynamic, hierarchical and functional. Hierarchical because each of the treatment groups has multiple samples, and each of them has multiple functions. Functional because of the biology.
Colon Cancer Data Jeff Morris Ciprian Crainiceanu Ana-Maria Staicu Naisyin Wang Veera B Yehua Li
Functional The colonic crypts have cells, near the bottom (x=0) are the stem cells, near the top (x=1) are the differentiated cells
MGMT Repair Enzyme, 1 crypt MGMT curve in one crypt. Original analysis found large diet effects
MGMT Repair Enzyme, 1 crypt The large diet effects on the MGMT repair enzyme are real. There are also large diet effects on apoptosis
MGMT Repair Enzyme, 1 crypt What do biologists do (define original analysis)? They simplify the data so that they can do ANOVA, duh! They average all the responses (p27 or MGMT, about 200 observations in each analysis) in the bottom third, middle third and top third. Then they run 3 ANOVAs.
MGMT Repair Enzyme, 1 crypt They then tell a story about all the ANOVAs they have done. We all smile about this, but my collaborator (Joanne Lupton) just got elected into the U. S. National Academy of Science.
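The "average by thirds, then ANOVA" recipe described on this slide can be sketched in a few lines. This is only an illustration of the idea, not the investigators' code: the function names `third_means` and `one_way_anova_f`, and the convention that cell position is a relative depth in [0, 1] with 0 at the crypt bottom, are assumptions made for the example.

```python
from statistics import mean


def third_means(positions, responses):
    """Average the responses in the bottom, middle and top third of a crypt.

    positions: relative cell positions in [0, 1] (0 = crypt bottom), assumed.
    Raises statistics.StatisticsError if any third contains no observations.
    """
    bins = {"bottom": [], "middle": [], "top": []}
    for x, y in zip(positions, responses):
        if x < 1 / 3:
            bins["bottom"].append(y)
        elif x < 2 / 3:
            bins["middle"].append(y)
        else:
            bins["top"].append(y)
    return {name: mean(vals) for name, vals in bins.items()}


def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In use, one would compute `third_means` for each crypt, collect (say) the bottom-third averages by diet group, and pass those groups to `one_way_anova_f`, repeating for the middle and top thirds. Everything about the shape of the curve within a third, and the correlation between crypts from the same animal, is discarded by this reduction.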
MGMT Repair Enzyme, 1 crypt I like to think that our more nuanced analyses help her tell her stories, which is hopefully not wishful thinking!
MGMT Repair Enzyme, 1 crypt Wavelet functional coefficients for apoptotic index in the top 1/3 of the crypt, for fish oil and for corn oil. From Morris and Carroll (2006): fish-oil-fed animals who had a large amount of apoptosis near their lumenal surface also had high levels of the DNA repair enzyme MGMT near their lumenal surface, meaning that the two major mechanisms for dealing with DNA damage were correlated. This relationship was not so strong for corn-oil-fed animals.
MGMT Repair Enzyme, the story We did a full-blown wavelet-based functional mixed model analysis to get these conclusions. Could it have been done marginally? Probably yes, but then that's dull. However, we (a) know much more about the pattern of variability and (b) built up methods and software that can be used in a wide variety of settings.
Longitudinal Colon carcinogenesis is a localized phenomenon. The crypts closest to one another are highly correlated
Colon Cancer Data The locality hypothesis says that colon cancer starts because of highly localized damage. Longitudinal and hierarchical FDA can tell us many things about this hypothesis, e.g., where is localized damage more likely to occur? While most research focuses on the proximal and distal portions of the colon, FDA reveals that there is as much or more in the middle
Colon Cancer Data Lots of fun fitting this longitudinal, hierarchical functional data set What did the investigators want to know? They were interested in how correlated neighboring crypts are, consistent with the locality hypothesis.
Colon Cancer Data The Bayesian analysis gives them strong point-wise evidence (can supplement with FDR) Allows summary measures
Colon Cancer Data Acknowledging the longitudinal nature led to much more precise inferences. This is the interaction function between diet and treatment: guess which one allows for locality?
Cell Signaling Data Myometrial cells meant to mimic what goes on near birth were either exposed to dioxin (TCDD) or not exposed. They were then exposed to a hormone, oxytocin, that stimulates calcium ion (Ca²⁺) signaling. The Ca²⁺ signal was observed at many pixels of each cell for 512 time points (85 minutes).
Cell Signaling Data Josue Martinez Jianhua Huang
Cell Signaling Data The cells were segmented, and intensities of the signals were obtained for each pixel, each cell and all time points. Roughly 25 cells in each treatment group (control and TCDD). Hierarchical because of pixels within cells within treatments.
Cell Signaling Data Functional because pixels are measured over time Possibly different levels of spatial because the cells are in spatial alignment Lots of preprocessing: cell segmentation, adjustment for saturation, and more
Cell Signaling Data First two minutes of the experiment for the TCDD treated plate. Next come two movies of the data.
Cell Signaling Data All cells (Control and TCDD), at a basal state in which the cells were cultured, 0-4 minutes and 40-80 minutes after oxytocin exposure
Cell Signaling Data All cells (Control and TCDD), at a low estrogen state, just before pregnancy (note the delayed response due to TCDD)
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration
Cell Signaling Data All cells (Control and TCDD), at a high estrogen state, near full-term in pregnancy, after normalization and registration. Areas under the curve (p < 0.001)
Cell Signaling Data You should see that in this analysis, we have not made use of the structure of the data. We have thought like GEE people, and indeed reduced the comparison of control and TCDD to single numbers, e.g., peak time and area under the curve. We did lots of dimension reduction (4 weighted SVD) to get here
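The two single-number summaries named on this slide, peak time and area under the curve, are easy to make concrete. The sketch below assumes each cell has already been reduced to one intensity curve sampled at known times; the function names are hypothetical, and the trapezoidal rule is one reasonable choice for the area, not necessarily the one used in the analysis.

```python
def area_under_curve(times, values):
    """Trapezoidal-rule area under a sampled curve."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


def peak_time(times, values):
    """Time at which the sampled curve is largest (the first such time on ties)."""
    return times[max(range(len(values)), key=values.__getitem__)]
```

With one such pair of numbers per cell, comparing control and TCDD becomes a two-sample problem. That is the marginal, GEE-style way of thinking: simple to explain, but the pixel-within-cell structure plays no role.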
Cell Signaling Data There was a lot of work to get the data into a format for analysis. Question: what can hierarchical, possibly spatial FDA do for us here, and given the structure, how should an analysis proceed? I feel that there is a lot more that we can learn about the process by thinking more deeply about the modeling.
Bat Chirp Data Bats of the same species, residing in Austin (city bats) and College Station (Aggie bats)
Bat Chirp Data The chirp is mainly composed of frequencies that start at about 40 kilohertz (kHz) and slowly decrease to 20 kHz from 0 to 8 milliseconds into the chirp. The bat then transitions to predominant frequencies at 60 kHz that slowly decrease back down to 40 kHz and then rise up to 60 kHz towards the end of the chirp. Frequencies above 80 kHz are harmonics of the fundamental signal.
Bat Chirp Data It seems clear to me that this is an inherently functional problem. Trying to reduce it to a single number to do a t-test seems difficult to contemplate, but it is not impossible. People have tried t-tests and classification based on measures such as duration, start frequency, end frequency, etc.
Bat Chirp Data One could simply take each pixel of the spectrogram and do t-tests, with FDR control. This would ignore the replicate data, would ignore the correlated nature of the data, would do no dimension reduction, etc. What did the biologist want to know? Kisi Bohn
Bat Chirp Data She wanted to know if the bats from the same species (City Bats and Aggie Bats) have evolved different vocalizations. What did we want to do: Answer her question precisely, and let her tell a story (the marginal question, imprecisely framed). Use all the data. Understand the variability.
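For reference, the FDR step of the pixel-wise approach might look like the following. This is a sketch of the standard Benjamini-Hochberg step-up procedure, chosen here as an assumption since the slide does not name a method; it takes the per-pixel p-values as given (the t-tests themselves are not shown), and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected
```

The procedure's guarantee relies on independence or a form of positive dependence among the tests. Neighboring spectrogram pixels are strongly correlated, and the replicate chirps per bat are ignored, which are exactly the objections raised on this slide.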
Bat Chirp Data We wavelet transformed the spectrograms, fit a 2-D hierarchical WFFM, transformed back, and did analysis of the results (see next)
Bat Chirp Data Difference in mean spectrogram inferred from model. Red favors College Station, Blue favors Austin This could be done without random effects
Bat Chirp Data White areas are those in which the spectrograms differ by 1.5 fold or more, with a global FDR control of 15%. Hard to do legitimately without random effects?
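One way a "1.5-fold difference with global FDR control" rule can be computed from a Bayesian fit is sketched below. The talk does not say which procedure was used, so this is an assumption: it supposes that posterior draws of the log2 difference between the two mean spectrograms are available at each pixel, and flags the largest set of pixels whose average posterior probability of not exceeding the fold threshold is at most the target rate. Both function names are hypothetical.

```python
from math import log2


def posterior_fold_prob(samples, fold=1.5):
    """Fraction of posterior draws with |log2 difference| >= log2(fold)."""
    threshold = log2(fold)
    return sum(abs(s) >= threshold for s in samples) / len(samples)


def flag_with_global_fdr(probs, alpha=0.15):
    """Flag the largest set of locations whose estimated false discovery rate,
    the mean of (1 - prob) over the flagged set, is at most alpha."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    running = 0.0
    keep = 0
    for count, idx in enumerate(order, start=1):
        running += 1.0 - probs[idx]
        # Pixels are visited from most to least probable, so the running
        # average never decreases; remember the last count that still passes.
        if running / count <= alpha:
            keep = count
    flagged = [False] * len(probs)
    for idx in order[:keep]:
        flagged[idx] = True
    return flagged
```

The posterior draws are where the random effects matter: they carry the bat-to-bat and chirp-to-chirp variability into the probabilities, which a purely marginal pixel-wise analysis does not provide.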
Frequency Agile Lidar Data This is a recent project from Bani Mallick's group. Here is a comic describing the process.
LIDAR Data Bani Mallick and his student Swarup De Peter Hall and Aurore Delaigle
Frequency Agile Lidar Data There is a transmitted signal. There is background. There is a received signal, which is then background corrected. For each time (100+) and wavelength (19), we see 625 observations across the physical range of observation, i.e., equally spaced functional data with noise.
Four samples at same time and wavelength. Background corrected only
Frequency Agile Lidar Data Four samples at same time and wavelength. Background corrected, truncated at zero and normalized
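The three preprocessing steps in this caption can be illustrated for a single range profile. The details are assumptions made for the sketch: background correction is taken to be pointwise subtraction, and normalization is taken to mean scaling the profile to sum to one. The slides do not specify either, and the function name is hypothetical.

```python
def preprocess(received, background):
    """Background-correct, truncate at zero, and normalize one range profile.

    Assumes correction is pointwise subtraction and normalization is
    division by the profile total; both are illustrative choices.
    """
    corrected = [r - b for r, b in zip(received, background)]
    truncated = [max(c, 0.0) for c in corrected]
    total = sum(truncated)
    if total == 0:
        return truncated  # nothing above background; leave as zeros
    return [v / total for v in truncated]
```

For example, `preprocess([3, 1, 5], [1, 2, 1])` corrects to `[2, -1, 4]`, truncates to `[2, 0, 4]`, and returns proportions of one third, zero and two thirds.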
Data For aerosol type a = 1,2, and sample i=1,…,n within type, we observe background corrected received data Here t = time, w = wavelength and x = distance. This is hierarchical: there are samples within types
Data For aerosol type a = 1,2, and sample i=1,…,n within type, we observe This is functional: there are bivariate space-time curves over distance x and time t It is longitudinal, over wavelength
Approaches For aerosol type a = 1,2, and sample i=1,…,n within type, we observe There are a vast number of approaches possible The fun thing to do is to build a hierarchical, longitudinal, space-time model Doing this is not trivial, will advance the field, will allow sharing of data, will allow understanding of variability, etc.
Approaches The investigators want things far more boring They want to know if there are differences between the two types of samples (biological and non-biological), sigh.
Approaches Both simple questions can be handled by a model-based approach, of course. But they can also be answered by much simpler, ad hoc, dimension reduction-based and not particularly innovative approaches We will have to decide what to do!
Conclusions Functional, hierarchical and longitudinal data are the wave of the future. I have given 4 examples of functional data that are either hierarchical or longitudinal Analyzing data like this is great fun!
Conclusions The questions I have raised are about the goals of such studies. If investigators only think marginally, they miss out. If we do not think marginally, we have less influence
Conclusions Marginal approaches are often much faster to implement, and easier to explain. I'd like speakers at this conference to help me by indicating why powerful random effects models are better than marginal approaches.
Advertisement TAMU has a full professor opening in computational statistics as broadly defined. Startup funding is at least $750,000
Other Acknowledgments I gratefully acknowledge financial support from the U. S. National Cancer Institute (R37- CA057030) and King Abdullah University of Science and Technology (KAUST, Award Number KUS-CI-016-04).
Approaches There is a deconvolution aspect to this problem that is fairly unique Along with the received signal, there is a transmitted signal There is thought to be a true signal
Approaches The deconvolution equation is Here, is supposed to be white noise over x Should one use or ?
Approaches It turns out that there are no systematic differences across treatment for or for So differences across treatments in the received signal reflect differences in the true signal, and vice-versa Is deconvolution a good idea? It is a heck of a lot of work, and the model assumptions are stringent
Approaches We think deconvolution here is not only harder than simply using the observed data, but less efficient because of the excess noise induced by deconvolution The Mallick group has made great progress on attacking this in a systematic, functional, hierarchical, Bayesian manner