ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments
Maria J. Nueda, Alberto Ferrer, Ana Conesa, ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments, Biostatistics, Volume 13, Issue 3, July 2012, Pages 553–566, https://doi.org/10.1093/biostatistics/kxr042
Transcriptomic profiling experiments that aim to identify responsive genes in specific biological conditions are commonly set up under defined experimental designs that try to assess the effects of factors and their interactions on gene expression. Data from these controlled experiments, however, may also contain sources of unwanted noise that can distort the signal under study, affect the residuals of applied statistical models, and hamper data analysis. Commonly, normalization methods are applied to transcriptomics data to remove technical artifacts, but these are normally based on general assumptions of transcript distribution and largely ignore both the characteristics of the experiment under consideration and the coordinated nature of gene expression. In this paper, we propose a novel methodology, ARSyN, for the preprocessing of microarray data that takes these last 2 aspects into account. By combining analysis of variance (ANOVA) modeling of gene expression values and multivariate analysis of estimated effects, the method identifies the nonstructured part of the signal associated with the experimental factors (the noise within the signal) and the structured variation of the ANOVA errors (the signal of the noise). By removing these noise fractions from the original data, we create a filtered data set that is rich in the information of interest and includes only the random noise required for inferential analysis. In this work, we focus on multifactorial time course microarray (MTCM) experiments with 2 factors: one quantitative, such as time or dosage, and the other qualitative, such as tissue, strain, or treatment. However, the method can be used in other situations such as experiments with only one factor or more complex designs with more than 2 factors. The filtered data obtained after applying ARSyN can be further analyzed with the appropriate statistical technique to obtain the biological information required.
To evaluate the performance of the filtering strategy, we have applied different statistical approaches for MTCM analysis to several real and simulated data sets, also studying the efficiency of these techniques. By comparing the results obtained with the original and ARSyN filtered data and also with other filtering techniques, we can conclude that the proposed method increases the statistical power to detect biological signals, especially in cases where there are high levels of structural noise. Software for ARSyN is freely available at http://www.ua.es/personal/mj.nueda.
Time course microarray (TCM) experiments analyze time-dependent transcriptional changes along one or more series of data. The TCM design is employed when the dynamics of gene expression changes are to be studied as a response to a drug treatment, for their association with a genetic background, or simply as a consequence of development or aging. If a second factor, such as diversity of treatments, strains, or environment, is present in the study, we are dealing with a multifactorial time course microarray (MTCM) experiment. Examples of such controlled multifactorial experiments can be found in the fields of toxicology (Heijne and others, 2003), agronomy (Brumós and others, 2009), biomedicine (Agudo and others, 2008), and ecology (Svendsen and others, 2008), to cite just a few. Although recent advances in sequencing technologies have created alternatives to microarrays for transcriptome profiling, the relatively high costs of sequencing platforms rule out their use in complex transcriptomics experiments such as the MTCM, in which a large number of conditions and samples are required. In these circumstances, microarrays continue to be the preferred option for genome-wide gene expression analysis. Typically, in MTCM designs, time constitutes one factor, a variable of quantitative nature, while the other factors are either quantitative or qualitative (dose/level, treatment, strain, etc.). Statistical analysis of this kind of data is more complicated than that of simple case–control studies. In MTCM, not only are significant changes at different factor levels and interactions sought, but the identification of patterns of transcriptional regulation is also frequently pursued. Several methodologies for the analysis of TCM have been proposed so far (Conesa and others, 2006; Tai and Speed, 2006; Storey and others, 2005) that apply different statistical strategies for the modeling of time-dependent gene expression and the identification of significant changes.
One of the aspects that has received most attention in methodological studies of microarray data analysis is the treatment of noise. Although microarrays have greatly improved in technical quality and reproducibility over the years, microarray data are still highly noise prone and are affected by random and systematic sources of error that obscure the transcriptional signal. The first step in the analysis of microarray data is usually normalization, whose aim is to adjust data from different arrays to a common baseline and distribution. This data treatment addresses sources of technical variation such as hybridization efficiency, starting messenger RNA concentrations, or different physical properties of labeling molecules. Normalization methods have been established over the last decade (Do and Choi, 2006). However, not all sources of technical noise are removed by normalization. This is because most of the current normalization methods are designed to center and scale the data assuming general invariability for all observations and ignoring the particular sample hybridized in each array (Yang and others, 2002). When exploring normalized microarray data using common clustering techniques, it is still not infrequent to observe artifacts associated with identifiable factors such as the array type, the lab, or the date of execution, generally referred to as “batch effects.” Moreover, other types of systematic biases that are not as traceable as the batch effects might also be embedded in the data. All these elements represent sources of structured noise that reduce statistical power when assessing differential expression.
Batch effects are present in many data sets and can seriously hinder statistical analysis. This technical problem has been recently reviewed within the framework of the MAQC-II Project, which studied the quality of microarray data for their application as a molecular prediction tool (MAQC-Consortium, 2010). This project resulted in an extensive evaluation of the batch effect and of the existing batch-removal strategies (Luo and others, 2010). Some methodologies for removing batch effects require large batch sizes, such as singular value decomposition (Alter and others, 2000) and distance weighted discrimination (Benito and others, 2004). Empirical Bayes methods have been claimed to be more flexible and robust to outliers since the batch bias is considered common across all genes in each batch (Johnson and others, 2007). A requirement for the application of all these strategies is the prior identification of the batches, generally understood as the groups of samples affected by the same noise level, and this is not always possible. When systematic noise is associated with an array or spatial effects, the experiment design may be the key for correcting this noise (Leek and Storey, 2007). Moreover, the co-regulation mechanism that underlies gene expression implies that transcriptomics data have an inherent correlation structure. Taking this covariance structure into account is, likewise, an effective way to enhance data analysis.
In this paper, we propose a novel strategy named ARSyN (ASCA [ANOVA simultaneous component analysis] removal of systematic noise). ARSyN is based on the ASCA model developed by Smilde and others (2005) to remove structural noise from microarray data sets. ASCA combines analysis of variance (ANOVA) and principal components analysis (PCA) to analyze multifactorial omics data sets. So far, ASCA has been used for exploratory analysis (Jansen and others, 2005; Brumós and others, 2009) and for the identification of responsive genes in transcriptomics (Nueda and others, 2007). In the present work, we take advantage of the data decomposition provided by the ASCA model to develop a novel statistical framework for the preprocessing of microarray data. In brief, ARSyN uses the PCAs of the ANOVA parameters and residuals in the ASCA model to identify and separate noise from signal in microarray data. After this decomposition, the data elements of interest are joined back together to reconstruct a filtered gene expression matrix which is free of structural biases. The filtered matrix has 2 main advantages:
It extracts the relevant gene expression variation related to the controlled variables in the experimental design. This is obtained from the main principal components (PCs) of the ANOVA parameters.
It is free of structural noise that can be associated with batch effects or with other nontraceable sources of variation. This is identified in the main PCs of the residuals of the ANOVA model.
Although ARSyN relies on the ASCA model, it is a different methodology in scope and statistical realization. While ASCA has been used for descriptive analysis and for the identification of differentially expressed genes and focuses on the analysis of the ANOVA parameters, ARSyN is a preprocessing strategy that renders a noise-reduced expression matrix. The processed data can then be submitted to statistical analysis with any dedicated methodology for (M)TCM.
We have analyzed how ARSyN improves the performance of 3 time course methods: maSigPro (Conesa and others, 2006), timecourse (Tai and Speed, 2006, 2009), and EDGE (Storey and others, 2005). We have employed synthetic data to investigate the effects of the proposed methodology on different types of noise and relationships between samples. Our results demonstrate that ARSyN effectively removes structural (but not random) noise in both independent and longitudinal multifactorial data sets. Finally, we assess the usability of the filtering approach from a biological point of view through the application to 2 experimental scenarios. Furthermore, we compare ARSyN with current batch-removal methods: ComBat (Johnson and others, 2007) and surrogate variable analysis (SVA) (Leek and Storey, 2007).
2.1 The ASCA model
2.2 ARSyN: the filtering strategy
2.3 Data sets
2.3.1 Simulated data
ARSyN has been evaluated in 11 different scenarios in order to resemble situations with different types and magnitudes of noise. These scenarios are simulated as independent TCM experiments (6 cases) and also longitudinal TCM experiments (5 cases). A detailed description of these simulated data has been included in the supplementary material, available at Biostatistics online, and a brief description in Table 1.
Table 1. Description of simulated data sets
2.3.2 Experimental data
Two real transcriptomics examples were chosen to evaluate the biological consistency of the proposed method. The first was the toxicogenomic study by Heijne and others (2003), which investigates the effect of the hepatotoxicant bromobenzene in rats. This data set consists of 3 time points (6, 12, and 48 h after administration of the drug), 5 experimental groups (1 untreated group; 1 placebo, corn oil; and 3 different doses of bromobenzene: low, medium, and high), and 2665 genes. The second was a stress study in plants which investigates the transcriptional response to 3 different abiotic stressors (salt, cold, and heat) in the potato using the National Science Foundation (NSF) 10k potato array (Rensink and others, 2005). This data set has 4 series (1 control and 3 types of stress: heat, salt, and cold), 3 time points, 3 replicates per experimental condition, and 9993 genes.
2.4 The evaluation approach
The general strategy for evaluating the performance of ARSyN was to apply a statistical method for TCM data (maSigPro, EDGE, and timecourse) to the different data sets before and after ARSyN filtering and to compare results in terms of feature selection. The maSigPro approach (Conesa and others, 2006) is a regression-based method that uses a polynomial model to fit gene expression dynamics and dummy variables to differentiate between experimental groups or series. timecourse is based on the empirical Bayes procedure to study one- and two-sample longitudinal series (Tai and Speed, 2006), and recently, the method has been adapted to multiple conditions (Tai and Speed, 2009). Finally, EDGE (Storey and others, 2005) uses B-splines–based models to analyze both independent and longitudinal data. These methods are described in the supplementary material, available at Biostatistics online. In the case of simulated data, we have used sensitivity (true positives detected/real true positives) and specificity (true negatives detected/real true negatives) as measures of quality. A good selection of genes is obtained when both measures are close to 1. In the case of experimental data, as the truly differentially expressed genes are unknown, these metrics cannot be used. Instead, we have applied a functional enrichment (FE) analysis (Al-Shahrour and others, 2007) to evaluate the biological consistency of the results. FE assesses whether specific cellular functions are overrepresented within a set of significant genes and is a well-established methodology for interpreting and evaluating transcriptomic data. Additionally, we have compared our results to those obtained by current batch-removal methods. We have chosen ComBat (Johnson and others, 2007) and SVA (Leek and Storey, 2007), which were recently recommended by Luo and others (2010) and Leek and others (2010), respectively.
Both methods were applied to the simulated studies, and ComBat was also applied to the toxicogenomic experimental data. ComBat could not be applied to the NSF potato stress data because this data set has no defined batch effect. SVA was not applied to the real data sets, to simplify results, as the (M)TCM methods used in this paper cannot be directly applied with SVA.
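The two quality measures used for the simulated data can be written out as a short sketch. The function name, the gene identifiers, and the toy selection below are hypothetical and serve only to illustrate the ratios defined above (true positives detected/real true positives, and true negatives detected/real true negatives).

```python
def sensitivity_specificity(selected, truly_de, all_genes):
    """Return (sensitivity, specificity) for one gene selection.

    selected  : genes declared significant by a method
    truly_de  : genes simulated as differentially expressed
    all_genes : every gene in the data set
    """
    selected, truly_de, all_genes = set(selected), set(truly_de), set(all_genes)
    real_negatives = all_genes - truly_de
    true_positives = len(selected & truly_de)
    true_negatives = len(real_negatives - selected)
    sensitivity = true_positives / len(truly_de)
    specificity = true_negatives / len(real_negatives)
    return sensitivity, specificity


# Hypothetical toy data: 10 genes, 4 of them truly differentially expressed.
genes = [f"g{i}" for i in range(10)]
truth = {"g0", "g1", "g2", "g3"}
picked = {"g0", "g1", "g2", "g9"}  # 3 true positives, 1 false positive
sens, spec = sensitivity_specificity(picked, truth, genes)
```

In this toy case the selection recovers 3 of the 4 true positives and wrongly selects 1 of the 6 true negatives, so both measures fall below 1.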
3.1 Simulation studies
Several data sets were generated for each of the analysis scenarios designed in each simulation study. In order to highlight the balance between signal and noise introduced in each analysis scenario, we show in Table 2 the amount of variation simulated and explained in each ASCA submodel. In general, we observe that residual variation increased as higher noise was modeled. The percentage of explained variance in the $X_{a}$ and $X_{b+ab}$ submodels was more or less constant across scenarios, whereas the explained variance in the $X_{abg}$ submodel was strongly associated with the presence of structural noise. This result confirms the ability of the $X_{abg}$ submodel to capture the systematic noise embedded in the data. Next, we simulated 50 data sets for each scenario, obtained filtered data by ARSyN, and applied maSigPro and timecourse to all the data sets. Only 10 simulations were run with EDGE as this software is only accessible from a graphical user interface and could not be integrated into high-performance scripting pipelines. However, the stability of the results in all cases made this simplification acceptable.
Table 2. Percentage of variation simulated, and explained with ASCA, in each submodel for different scenarios from one of the simulated independent (a) and one of the longitudinal (b) data sets
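How the variation splits across submodels can be illustrated with a minimal sketch. It assumes a design with time as the first factor and treatment as the second, so that the three parts correspond to the time, treatment + treatment × time, and residual submodels. The function name, the input layout, and the toy values are hypothetical, and the sketch reports only the share of the total sum of squares in each submodel, not the variance explained by the retained PCs.

```python
from collections import defaultdict
from statistics import mean


def submodel_variation_percentages(samples):
    """samples: list of (time, treatment, values), one value per gene.

    Returns the percentage of the total sum of squares carried by the
    time, treatment + interaction, and residual parts of an ANOVA split.
    """
    n_genes = len(samples[0][2])
    ss = {"time": 0.0, "treatment+interaction": 0.0, "residual": 0.0}
    for g in range(n_genes):
        grand = mean(v[g] for _, _, v in samples)
        by_time, by_cell = defaultdict(list), defaultdict(list)
        for t, trt, v in samples:
            by_time[t].append(v[g] - grand)
            by_cell[(t, trt)].append(v[g] - grand)
        time_mean = {t: mean(vals) for t, vals in by_time.items()}
        cell_mean = {c: mean(vals) for c, vals in by_cell.items()}
        for t, trt, v in samples:
            a = time_mean[t]                 # time effect
            b_ab = cell_mean[(t, trt)] - a   # treatment + interaction effect
            e = (v[g] - grand) - a - b_ab    # residual
            ss["time"] += a * a
            ss["treatment+interaction"] += b_ab * b_ab
            ss["residual"] += e * e
    total = sum(ss.values())
    return {name: 100.0 * value / total for name, value in ss.items()}


# Hypothetical toy data: 2 time points x 2 treatments x 2 replicates, 1 gene.
samples = [
    (0, "ctrl", [1.0]), (0, "ctrl", [1.2]),
    (0, "drug", [1.1]), (0, "drug", [0.9]),
    (1, "ctrl", [2.0]), (1, "ctrl", [2.2]),
    (1, "drug", [3.9]), (1, "drug", [4.1]),
]
pct = submodel_variation_percentages(samples)
```

Because the three parts are mutually orthogonal, the percentages add up to 100; adding structured noise to the replicates would show up as a larger residual share.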
Figure 1 shows the average sensitivity with the original and filtered data for each method and type of time course data. The details of this analysis are shown in the supplementary material, available at Biostatistics online, in terms of false positives, false negatives, sensitivity, specificity averages, and their corresponding confidence intervals. Performance analysis indicated that specificity was high and similar in all cases and that differences were revealed by the sensitivity indicator. We explain these differences in detail below.
maSigPro. Performance indicators showed that, in all scenarios, the selection of genes obtained by applying maSigPro to the ARSyN filtered data was equal to or better than that obtained by applying maSigPro to the original data. In scenarios where no systematic noise was introduced (Scenarios 1 and 2 of independent and longitudinal data studies), maSigPro was efficient with respect to both the original and the filtered data, and ARSyN did not affect the good performance of the statistical method. On the other hand, in scenarios with high structural noise, ARSyN clearly improved sensitivity, while specificity was unaffected.
Figure 1. Sensitivity plot. The height of the bars represents the average sensitivity obtained in 50 simulations (10 for EDGE) with the original and filtered data of (a) independent simulated data and (b) longitudinal simulated data.
timecourse. The analysis of the simulated independent data sets revealed that, in scenarios without structural noise, performance indicators were similar with and without ARSyN filtering. In Scenarios 4 and 5, a slight improvement in sensitivity was observed when ARSyN was applied, while Scenarios 3 and 6 clearly showed the higher sensitivity of the filtered data. In contrast, no significant performance differences between original and ARSyN data were observed when longitudinal data were analyzed by timecourse.
EDGE. The study of the independent data with the EDGE methodology showed that the number of false negatives was 100 in many cases. These were largely genes simulated with a pattern of parallel gene-expression profiles among series, which are hard to detect by this method. In general, we observed that the sensitivity of EDGE was lower than that of the other 2 methods. Preprocessing of data by ARSyN improved detection capacity in some scenarios, although sensitivity values continued to be low.
Considered as a whole, the simulation study revealed that ARSyN is an efficient preprocessing technique for improving the detection of differentially expressed genes in scenarios with high structural noise but has no effect when noise is low. We have also demonstrated that the combination of ARSyN and maSigPro is the analysis strategy with the best overall performance.
Comparison with other filtering techniques. ARSyN was compared to 2 other noise-removal methodologies: ComBat and SVA. Like ARSyN, ComBat outputs a filtered data set that can be further analyzed by TCM methods. However, ComBat cannot be applied in situations where the batch is not identified, which we consider a limitation with respect to ARSyN. When the batch is known (Scenarios 3–6), ComBat rendered a higher number of false positives than ARSyN, whereas the number of false negatives slightly decreased, except in Scenarios 5 and 6 of the independent data set (Supplementary Tables 4 and 5 available at Biostatistics online). In contrast, SVA performed worse on independent data but produced results similar to those of ComBat and ARSyN in the longitudinal study (Supplementary Table 6 available at Biostatistics online). However, a major disadvantage of SVA is that it does not give a filtered data matrix, so TCM methods cannot be directly applied. Altogether, these results point to ARSyN as a more robust and versatile solution for noise removal than other approaches.
3.2 Toxicogenomics data set
ASCA analysis of this data set decomposed data into 3 submodels: “time,” “treatment + treatment × time,” and “residuals.” After component selection by ARSyN, 1, 5, and 2 components, respectively, were retained for each submodel. This component selection explained 75% of time variation, 78% of treatment plus interaction variation, and 48% of residual variation.
Exploratory analysis of the first 2 PCs of the original data set revealed a considerable batch effect (Figure 2(a)) that was removed with ARSyN (Figure 2(b)). The origin of this structural bias was identified as a dye effect since the experiment had a dye-swap design and the dye used in each array was known. This effect can also be treated by simply centering genes with the corresponding dye average (Figure 2(c)) as in Conesa and others (2006), where maSigPro was applied to the analysis of this data set (note that this basic centering preprocessing is the baseline for comparison in this toxicogenomics example). ComBat filtering was also effective in removing the dye bias (Figure 2(d)). Interestingly, ComBat preprocessing resulted in PC plots very similar to those obtained with dye centering (Figure 2(c)). From this analysis, we concluded that ARSyN filtering and also other batch-removal approaches removed the dye bias from the data and revealed the differences between the high doses of bromobenzene and the remaining doses. Furthermore, ARSyN preprocessing resulted in an increase in the number of genes that obtained low p-values in maSigPro and EDGE analysis (Figure 3), which is consistent with a general removal of noise from the data. Gene selections obtained with the different methods and their comparisons are shown in Supplementary Figure 3, available at Biostatistics online.
Figure 2. PCA of (a) original data, (b) ARSyN filtered data, (c) centered data by dye, and (d) ComBat filtered data. Cy3 and Cy5 are green and red dyes. Experimental groups: untreated (UT), corn oil (CO), low (LO), medium (ME), and high (HI) doses of bromobenzene.
Figure 3. Distribution of p-values obtained by maSigPro (first step) and EDGE on the toxicogenomics data set before and after ARSyN filtering. ARSyN filtering increases the number of genes with low p-values, which is consistent with a decrease in noise levels.
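The basic dye-centering preprocessing used as the comparison baseline can be sketched as follows. This is one plausible reading of "centering genes with the corresponding dye average": for each gene, the mean over the arrays labeled with a given dye is subtracted from those arrays. The function name, the data layout, and the toy values are hypothetical.

```python
from statistics import mean


def center_by_dye(expression, dyes):
    """Center each gene within each dye group.

    expression : list of rows, one row per gene, one column per array
    dyes       : dye label of each array, e.g. "Cy3" or "Cy5"
    """
    centered = []
    for row in expression:
        dye_mean = {
            d: mean(x for x, label in zip(row, dyes) if label == d)
            for d in set(dyes)
        }
        centered.append([x - dye_mean[label] for x, label in zip(row, dyes)])
    return centered


# Hypothetical toy gene measured on 4 arrays with alternating dyes;
# the Cy5 arrays carry a constant offset of 2 units.
dyes = ["Cy3", "Cy5", "Cy3", "Cy5"]
expression = [[1.0, 3.0, 2.0, 4.0]]
centered = center_by_dye(expression, dyes)
```

After centering, the constant offset between the two dye groups is gone while the difference between arrays sharing a dye is preserved. Unlike ARSyN, this approach requires the dye of every array to be known.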
Finally, we investigated the gain in biological interpretability of the filtered data by analyzing the number and types of enriched gene ontology (GO) terms in the selected genes in comparison to those obtained from unfiltered data. In general, the number of enriched GO terms and the size of the term within the pool of selected genes were greater in ARSyN filtered data than in unfiltered data (Supplementary Table 7 available at Biostatistics online), indicating that the noise-removal procedure enhanced the detection of coordinated gene sets. Furthermore, GO functions revealed by the filtered data were related to processes of the cellular detoxification response (Heijne and others, 2003). For example, “glutathione transferase activity” (found in maSigPro–ARSyN and EDGE–ARSyN analysis) is the major cellular activity that targets bromobenzene for degradation, whereas “heme binding” (maSigPro–ARSyN results) refers to redox enzymes involved in this process. Similarly, “nitric oxide signal transduction” (enriched in timecourse–ARSyN) points to a detoxification mechanism associated with the response to xenobiotic compounds (Morán and others, 2010; Farina and others, 2011). These results show the biological relevance of the new genes uncovered by the filtering procedure. ComBat preprocessing did not add new relevant functional conclusions to the analysis of these data.
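The idea behind testing a GO term for overrepresentation can be illustrated with a small sketch. It assumes a one-sided hypergeometric test, which is a common choice for enrichment analysis but is not necessarily the exact test used by the FE tool cited in this paper. The function name and all counts except the 2665 genes of the toxicogenomics data set are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe  : genes on the array
    n_annotated : genes in the universe annotated with the GO term
    n_selected  : genes declared significant
    n_overlap   : selected genes that carry the GO term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 40 annotated genes among 2665, 200 genes selected,
# 10 of which carry the annotation (about 3 would be expected by chance).
p = enrichment_p_value(2665, 40, 200, 10)
```

A small p-value indicates that the term occurs among the selected genes more often than expected by chance; in practice the p-values of all tested terms would also be corrected for multiple testing.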
3.3 NSF potato stress data set
The ARSyN analysis for this data set resulted in a model with 1, 3, and 2 components for the submodels time, treatment + treatment × time, and residuals, respectively. This component selection explains 100% of time variation, 80% of treatment plus interaction variation, and 28% of residual variation. Gene selections obtained with the different methods and their comparisons are shown in Supplementary Figure 3, available at Biostatistics online.
When considering the functional analysis (Supplementary Table 8 available at Biostatistics online), the number of enriched GO terms obtained by maSigPro and timecourse analysis was again higher when data were preprocessed by ARSyN, as was the specificity of the identified functions, which included hormone “signalling cascades,” “diverse enzymatic binding activities,” and “defined metabolic functions.” Notably, EDGE analysis on these data did not yield a relevant number of significant results, regardless of the filtering option.
This paper describes the methodology ARSyN, which uses a model-based multivariate projection technique, ASCA, for the removal of systematic biases in microarray data. The rationale of the methodology is the extraction of the relevant shared behavior and the identification and removal of structured noise that cannot be associated with the experimental factors included in the design of the transcriptomics study. This structural noise is habitually referred to as the “batch” effect. It results from factors such as dye, lab, or experimentalist and can affect the data of the related arrays both globally and locally. ASCA uses ANOVA to identify signals associated with experimental factors and PCA to separate structured and random variation in these signals. By removing the nonstructured part of the experimental factor signals (the noise within the signal) and the structured variation of the ANOVA errors (the signal of the noise) from the original data set, we create a filtered data set that is enriched in the information of interest and retains only the random noise needed for inferential analysis. This procedure offers the advantage of not requiring previous knowledge of the nature of the “batch effect.” Any possible structural noise is identified in the signal of the residuals of the ASCA model.
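The filtering idea summarized above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes a two-factor design (time and treatment), takes the number of components per submodel as user-supplied arguments instead of applying a selection criterion, and uses a plain SVD for the PCA step. The names `pca_reconstruct`, `arsyn_like_filter`, `n_signal`, and `n_noise` are hypothetical.

```python
import numpy as np


def pca_reconstruct(M, n_comp):
    """Rebuild M from its first n_comp principal components (via SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp]


def arsyn_like_filter(X, time, treatment, n_signal=(1, 1), n_noise=1):
    """X: samples x genes. Returns a filtered matrix of the same shape."""
    X = np.asarray(X, dtype=float)
    time, treatment = np.asarray(time), np.asarray(treatment)
    grand = X.mean(axis=0)
    Xc = X - grand                               # center every gene
    Xa = np.zeros_like(Xc)                       # time submodel
    Xcell = np.zeros_like(Xc)                    # time x treatment cell means
    for t in np.unique(time):
        Xa[time == t] = Xc[time == t].mean(axis=0)
    for t, b in set(zip(time.tolist(), treatment.tolist())):
        mask = (time == t) & (treatment == b)
        Xcell[mask] = Xc[mask].mean(axis=0)
    Xbab = Xcell - Xa                            # treatment + interaction submodel
    E = Xc - Xcell                               # ANOVA residuals
    # Keep the structured part of the signal submodels ...
    signal = pca_reconstruct(Xa, n_signal[0]) + pca_reconstruct(Xbab, n_signal[1])
    # ... and drop the structured part of the residuals.
    random_noise = E - pca_reconstruct(E, n_noise)
    return grand + signal + random_noise


# Hypothetical toy data: 2 times x 2 treatments x 2 replicates, 3 genes,
# with a hidden batch effect that alternates between replicates.
time = np.array([0, 0, 0, 0, 1, 1, 1, 1])
treatment = np.array(["ctrl", "ctrl", "drug", "drug"] * 2)
batch = np.array([1, -1] * 4)
signal = np.outer(time, [1.0, 2.0, 0.0]) + np.outer((treatment == "drug") * time, [0.0, 1.0, 3.0])
X = signal + np.outer(batch, [0.5, -0.5, 1.0])
filtered = arsyn_like_filter(X, time, treatment)
```

In this noise-free toy case the batch pattern is the only structure in the residuals, so removing their first component recovers the simulated signal without the batch labels ever being passed to the function. With real data, the choice of how many components to keep in each submodel determines how much variation is treated as signal or as structured noise.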
The efficacy of this filter was analyzed in 2 simulation studies in which independent and longitudinal data, respectively, were mimicked. Because the proposed ARSyN method targets the systematic noise in gene expression data sets, different types (systematic and random) and magnitudes (high or low) of noise were introduced into the synthetic data. Additionally, we assessed whether or not the filter was generally valid, irrespective of the inference methodology used to identify differentially expressed genes. Therefore, we tested the filter with 3 available methods for the analysis of TCM data that follow very different statistical strategies: maSigPro applies polynomial regression, timecourse is based on empirical Bayes, and EDGE uses B-splines to model the dynamics of gene expression.
The results showed that ARSyN significantly improves gene selection when a high quantity of structural noise is present and has no effect when only random noise affects the expression signals. Although this pattern was observed with each of the 3 statistical methodologies employed, maSigPro was clearly the method on which ARSyN had the greatest impact and which yielded the best end results. Sensitivity improvement with timecourse and EDGE was not as pronounced as with maSigPro, and the amount of differential expression detected when these 2 methodologies were applied to ARSyN data never reached the sensitivity levels obtained by the maSigPro analysis. This result can be explained by the nature of the maSigPro method, a univariate gene-by-gene regression that assumes a normal distribution of the error. Given that the ARSyN filter exploits the co-expression of genes through the PCA on the estimates of the ANOVA parameters, the synergy with the inferential approach is likely to be maximal. In contrast, both timecourse and EDGE use empirical methods to determine the statistical significance of their statistics, which implies the consideration of possible structural noise in all data. In addition, timecourse employs shrinkage covariance estimates and therefore takes into account the relationships within expression values. In this way, these methods consider aspects that are also considered by the proposed filter, and therefore the effect obtained is expected to be of a lower magnitude than that observed with maSigPro. Even so, the ARSyN filter improves the sensitivity of timecourse and EDGE in some scenarios. We hypothesize that this is related to the more refined treatment of variation by ASCA, as it imposes an ANOVA model prior to component analysis. This decomposition allows for an experimental factor–focused analysis of covariance that is more efficient than the design-blind analysis of correlation structures that characterizes the timecourse and EDGE methods.
Finally, it should be mentioned that EDGE is based on B-splines models, which work well with series of more than 10 time points; the present study was restricted to short series of 3–5 time points, which may be the reason for the poor results obtained with this technique.
When applied to experimental data sets, preprocessing by ARSyN improved the significance of the statistical tests, the identification of the transcriptionally regulated biological processes, and the number of significant genes contained in selected functional categories. The better performance of the ARSyN data in the FE analysis cannot simply be the consequence of an increase in the number of genes declared significant, as this increase occurred with maSigPro but not with timecourse. We argue that ARSyN preprocessing, which modifies gene expression according to the correlation structure of the data set, helps to reveal the coordinated regulation of genes in the same functional class, thereby improving the detection of enriched functions. Additionally, ARSyN would eliminate noisy or poorly correlated genes that reduce the statistical power of FE analysis.
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Spanish MICINN Project (BIO2008-04368-E and DPI2008-06880-C03-03/DPI).
Conflict of Interest: None declared.
- Online ISSN 1468-4357
- Print ISSN 1465-4644
- Copyright © 2024 Oxford University Press
Discovering gene expression patterns in time course microarray experiments by ANOVA-SCA
Affiliation.
- 1 Departamento de Estadística e Investigación Operativa, Universidad de Alicante, Apartado 03080, Alicante, Spain.
- PMID: 17519250
- DOI: 10.1093/bioinformatics/btm251
Motivation: Designed microarray experiments are used to investigate the effects that controlled experimental factors have on gene expression and to learn about the transcriptional responses associated with external variables. In these datasets, signals of interest coexist with varying sources of unwanted noise in a framework of (co)relation among the measured variables and with the different levels of the studied factors. Discovering experimentally relevant transcriptional changes requires methodologies that take all these elements into account.
Results: In this work, we develop the application of the Analysis of variance-simultaneous component analysis (ANOVA-SCA) Smilde et al. Bioinformatics, (2005) to the analysis of multiple series time course microarray data as an example of multifactorial gene expression profiling experiments. We denoted this implementation as ASCA-genes. We show how the combination of ANOVA-modeling and a dimension reduction technique is effective in extracting targeted signals from data by-passing structural noise. The methodology is valuable for identifying main and secondary responses associated with the experimental factors and spotting relevant experimental conditions. We additionally propose a novel approach for gene selection in the context of the relation of individual transcriptional patterns to global gene expression signals. We demonstrate the methodology on both real and synthetic datasets.
Availability: ASCA-genes has been implemented in the statistical language R and is available at http://www.ivia.es/centrodegenomica/bioinformatics.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Publication types
- Research Support, Non-U.S. Gov't
- Analysis of Variance*
- Computational Biology / methods*
- Computer Simulation
- Data Interpretation, Statistical
- Gene Expression Profiling / methods*
- Models, Genetic
- Models, Statistical
- Oligonucleotide Array Sequence Analysis / methods*
- Principal Component Analysis
- Time Factors
- Transcription, Genetic
- Review Article
- Published: 01 August 2002
Design issues for cDNA microarray experiments
Yee Hwa Yang 1,2 & Terry Speed 1,3
Nature Reviews Genetics volume 3, pages 579–588 (2002)
Microarray experiments are widely used to quantify and compare gene expression on a large scale. It is therefore important that they are designed with care, as aspects of their design will affect the validity of their results and experimental efficiency.
Several scientific and practical issues (for example, the amount of material available) will affect the choice of experimental design.
A key choice in microarray design is whether to use direct or indirect comparisons; that is, whether to make comparisons within or between slides.
Dye-swap experiments allow the experimenter to minimize the systematic bias that comes from the systematic differences between green and red intensities.
Different experimental designs are appropriate in different experimental contexts; examples of single-factor and multifactorial designs are discussed.
Replication of microarray experiments is important as it reduces variability, and data obtained from replicated experiments can be analysed using formal statistical methods. Not all forms of replication are equal, and technical and biological replicates are compared.
The issue of sample size is problematic in microarray experiments because the variance of the relative expression levels across hybridizations varies greatly from gene to gene.
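The dye-swap point in the list above can be illustrated with a small numerical sketch. It assumes that the dye bias acts as a constant multiplicative gain on the red channel, so that it becomes an additive offset on the log-ratio scale; the intensities, the gain value and the function names are hypothetical and chosen only for illustration.

```python
import math


def log_ratio(red, green):
    """Base-2 log ratio of the red and green channel intensities."""
    return math.log2(red / green)


def dye_swap_average(m_forward, m_swapped):
    """Combine a slide with its dye-swapped pair.

    If both log ratios carry the same additive dye offset, the offset
    cancels in the difference, leaving the log ratio of interest.
    """
    return (m_forward - m_swapped) / 2.0


# Hypothetical gene: sample A is expressed 4x higher than sample B,
# and the red channel reads 1.5x too bright whatever is labelled with it.
dye_gain = 1.5
a, b = 800.0, 200.0

m_forward = log_ratio(dye_gain * a, b)  # slide 1: A in red, B in green
m_swapped = log_ratio(dye_gain * b, a)  # slide 2: B in red, A in green

estimate = dye_swap_average(m_forward, m_swapped)
print(f"forward: {m_forward:.3f}  swapped: {m_swapped:.3f}  combined: {estimate:.3f}")
```

Under these assumptions the combined value recovers log2(4) = 2 up to floating-point error, whereas each slide on its own is shifted by log2(1.5). If the dye bias depends on intensity or differs between slides, the cancellation is only approximate.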
Microarray experiments are used to quantify and compare gene expression on a large scale. As with all large-scale experiments, they can be costly in terms of equipment, consumables and time. Therefore, careful design is particularly important if the resulting experiment is to be maximally informative, given the effort and the resources. What then are the issues that need to be addressed when planning microarray experiments? Which features of an experiment have the most impact on the accuracy and precision of the resulting measurements? How do we balance the different components of experimental design to reach a decision? For example, should we replicate, and if so, how?
Acknowledgements
We thank S. Dudoit and N. Thorne for discussions and assistance during the course of this review. We also thank M. J. Callow from the Lawrence Berkeley National Laboratory and members of J. Ngai's lab — D. Lin, E. Diaz and J. Scolnick — for providing the data used in the figures. In addition, we are grateful to D. Bowtell for feedback on many design issues. This work was supported in part by the National Institutes of Health.
Author information
Authors and affiliations.
Department of Statistics and Program in Biostatistics, 367 Evans Hall, 3860, University of California, Berkeley, 94720-3860, California, USA
Yee Hwa Yang & Terry Speed
Department of Epidemiology and Biostatistics, Box # 0560, University of California, San Francisco, 94143-0560, California, USA
Yee Hwa Yang
Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research, Post Office, Royal Melbourne Hospital, Parkville, 3050, Victoria, Australia
Terry Speed
Corresponding author
Correspondence to Terry Speed.
Supplementary information
Online box: calculation of variances (pdf 62 kb), related links, further information.
A mixture of differently labelled target cDNA fragments that are hybridized together in the presence of a common probe or collection of probes.
The logarithm, usually to the base 2, of the ratio of the measured signal intensities in the two channels of a two-colour microarray experiment. If we denote these two signals by R (red channel) and G (green channel), then their log ratio is log2(R/G).
cDNA microarrays have paired hybridization intensity measurements that are taken from two wavelength bands after laser excitation at two wavelengths. These two sources of data are known as channels. By contrast, measurements of radiolabelled hybridization products are single channel, as are the Affymetrix microarrays.
The most common statistical measure of variability of a random quantity or random sample about its mean. Its scale is the square of the scale of the random quantity or sample. The square root of the variance is known as the standard deviation.
A design that involves mRNA samples labelled 1, 2, 3, ..., n, hybridized together in pairs (1,2), (2,3), ..., (n − 1, n), (n, 1).
A numerical summary of some aspect of an experiment, typically an estimate of a parameter.
A calculation that leads to the probability that a null hypothesis that is being tested will be rejected in favour of the alternative, under specified assumptions that imply that the alternative hypothesis is true.
The middle value in a set of numbers ordered in value from smallest to largest. If there are an even number of numbers, the median is the average of the middle two after ordering.
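Some of the definitions above translate directly into a few lines of code. The sketch below enumerates the hybridization pairs of a loop design and computes a median, a variance and a standard deviation with Python's standard `statistics` module. The function name `loop_design` and the example values are hypothetical, and the sample variance (n − 1 denominator) is an assumption, since the definition above does not fix the denominator.

```python
import math
import statistics


def loop_design(n):
    """Pairs (1,2), (2,3), ..., (n-1,n), (n,1) for samples labelled 1..n."""
    if n < 2:
        raise ValueError("a loop design needs at least two samples")
    return [(i, i % n + 1) for i in range(1, n + 1)]


pairs = loop_design(4)
print(pairs)

# Hypothetical log ratios for one gene across four hybridizations.
values = [3.0, 1.0, 4.0, 2.0]
mid = statistics.median(values)    # average of the middle two after ordering
var = statistics.variance(values)  # sample variance, in squared units
sd = statistics.stdev(values)      # square root of the sample variance
print(mid, var, sd)
```

With n = 2 the loop reduces to the pair (1,2) and its reverse (2,1), which corresponds to a dye-swap pair.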
About this article
Cite this article.
Yang, Y., Speed, T. Design issues for cDNA microarray experiments. Nat Rev Genet 3 , 579–588 (2002). https://doi.org/10.1038/nrg863
Issue Date : 01 August 2002
DOI : https://doi.org/10.1038/nrg863
MultiBaC: an R package to remove batch effects in multi-omic experiments
Manuel Ugidos, María José Nueda, José M Prats-Montalbán, Alberto Ferrer, Sonia Tarazona.
To whom correspondence should be addressed.
Received 2021 Nov 5; Revised 2022 Jan 19; Accepted 2022 Mar 1; Collection date 2022 May 1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License ( https://creativecommons.org/licenses/by-nc/4.0/ ), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Batch effects in omics datasets are usually a source of technical noise that masks the biological signal and hampers data analysis. Batch effect removal has been widely addressed for individual omics technologies. However, multi-omic datasets may combine data obtained in different batches where omics type and batch are often confounded. Moreover, systematic biases may be introduced without notice during data acquisition, which creates a hidden batch effect. Current methods fail to address batch effect correction in these cases.
In this article, we introduce the MultiBaC R package, a tool for batch effect removal in multi-omics and hidden batch effect scenarios. The package includes a diversity of graphical outputs for model validation and assessment of the batch effect correction.
Availability and implementation
The MultiBaC package is available on Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/MultiBaC.html) and GitHub (https://github.com/ConesaLab/MultiBaC.git). The data underlying this article are available in the Gene Expression Omnibus repository (accession numbers GSE11521, GSE1002, GSE56622 and GSE43747).
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
While omic platforms are widely accessible, data generation from a large number of samples and/or assays is still costly. For large data collections, sample acquisition may be distributed in time and space or complemented with data already available at public repositories. This results in datasets that are affected by a technical variability component associated with each acquisition event, i.e. the batch effect. Batch effects may represent the major source of variability in the combined omic dataset and compromise the detection of the underlying biological signal by standard methods (Kupfer et al., 2012).
Several Batch Effect Correction Algorithms (BECAs) have been proposed and are available as R packages (Leek et al., 2012; Risso et al., 2014; Ritchie et al., 2015). Usually, BECAs require adequate experimental designs where the batch is known and not confounded with the effects of interest (treatment, dose, time point, etc.). However, few BECAs provide data corrected for systematic noise coming from an unknown source. This may happen when laboratory practices introduce an unnoticed bias that affects a subset of samples. While this variability behaves as a batch effect, it is not identified as such and is therefore invisible to many traditional BECAs. Another scenario not yet covered by BECAs is multi-omic experiments. When multi-omic datasets are created by combining different omics modalities obtained asynchronously, batch effects are confounded with the ‘omic type effect’, which hampers their removal from the data.
Here, we introduce the MultiBaC R package (Fig. 1), a general tool for batch effect removal in omic data that successfully addresses these difficult cases. The MultiBaC R package integrates two different batch effect correction methods: ARSyN, a flexible approach for the correction of systematic biases in single omic datasets for both declared (batches) and hidden sources of technical noise (Nueda et al., 2012), and MultiBaC, the first batch effect correction algorithm for multi-omic data (Ugidos et al., 2020). The MultiBaC R package is available at Bioconductor.
Fig. 1. Overview of BECAs in the MultiBaC R package
2 Materials and methods
Supplementary Figure S1 depicts the general scheme of the ARSyNbac and MultiBaC algorithms. In short, ARSyNbac uses ANOVA to decompose the omics signal into the effects of the experimental variables and residual noise. If the source of unwanted variation is known, the method estimates the batch effect and removes it from the data. If the source of the batch effect is unknown, the residual noise is analyzed by PCA to detect any systematic component, which is subsequently removed from the data. It is also possible to correct the data for both types of noise simultaneously. MultiBaC operates when a multi-omic dataset is created by combining multi-omic data from different laboratories, provided that one omic type is shared across labs (the common omic). Partial Least Squares Regression (PLS) is used to model and predict the non-common omic data as a function of the common omic type, which allows the subsequent combination of complete multi-omic data structures and the removal of batch effects with ARSyNbac.
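The hidden-batch mode described above can be sketched in a few lines of plain Python. This is an illustrative simplification and not the package's implementation: it assumes a single experimental factor, fits the ANOVA model as per-group means, extracts only the first principal component of the residuals by power iteration, and removes it when it explains more than a fixed share of the residual variance. The function names, the 0.5 threshold and the simulated data are all hypothetical.

```python
import math
import random


def group_means(X, groups):
    """Column means of X (samples x features) within each group label."""
    sums, counts = {}, {}
    for row, g in zip(X, groups):
        acc = sums.setdefault(g, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[g] = counts.get(g, 0) + 1
    return {g: [s / counts[g] for s in acc] for g, acc in sums.items()}


def first_component(E, iters=200):
    """Leading loading vector and scores of E via power iteration on E^T E."""
    p = len(E[0])
    v = [float(j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(iters):
        scores = [sum(e * w for e, w in zip(row, v)) for row in E]
        v_new = [sum(s * row[j] for s, row in zip(scores, E)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:  # nothing to extract (e.g. E is all zeros)
            return [0.0] * p, [0.0] * len(E)
        v = [x / norm for x in v_new]
    scores = [sum(e * w for e, w in zip(row, v)) for row in E]
    return v, scores


def remove_hidden_batch(X, groups, min_explained=0.5):
    """Subtract the first residual component if it explains enough variance."""
    means = group_means(X, groups)
    # Residuals of the one-factor model: observation minus its group mean.
    E = [[x - m for x, m in zip(row, means[g])] for row, g in zip(X, groups)]
    total = sum(e * e for row in E for e in row)
    v, scores = first_component(E)
    explained = sum(s * s for s in scores) / total if total > 0 else 0.0
    if explained < min_explained:  # residuals look unstructured: keep data
        return [row[:] for row in X], explained
    corrected = [[x - s * w for x, w in zip(row, v)] for row, s in zip(X, scores)]
    return corrected, explained


# Simulated example: 2 conditions x 4 replicates, 5 features, and a hidden
# technical offset that affects every other sample.
rng = random.Random(0)
signal = {"ctrl": [5.0, 5.0, 5.0, 5.0, 5.0], "treat": [6.0, 4.0, 5.0, 7.0, 5.0]}
batch_shift = [2.0, -1.0, 1.5, 0.5, -2.0]
groups = ["ctrl"] * 4 + ["treat"] * 4
X = []
for i, g in enumerate(groups):
    shifted = i % 2 == 0
    X.append([mu + (b if shifted else 0.0) + rng.gauss(0.0, 0.05)
              for mu, b in zip(signal[g], batch_shift)])

corrected, explained = remove_hidden_batch(X, groups)
print(f"share of residual variance in the first component: {explained:.3f}")
```

Because the residuals sum to zero within each group, subtracting the structured component leaves the estimated group means unchanged and only shrinks the within-group scatter. A fuller treatment would choose the number of components from the data and would handle several factors and their interactions.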
3 Implementation
The MultiBaC R package organizes all the input and output data in a MultiAssayExperiment object, a type of Bioconductor container for multi-omic studies. The createMbac() function in the MultiBaC package creates a MultiAssayExperiment object for each batch from the original matrices and generates an mbac data structure, an S3 list class of MultiAssayExperiment objects to be used as input by the ARSyNbac() or MultiBaC() functions. The resulting batch effect corrected matrices are also added to the mbac structure, and several plots can be generated to validate the PCA or PLS models applied by MultiBaC (Fig. 2a–c) and to assess the batch effect magnitude (Fig. 2d) or the correction performance (Supplementary Fig. S2). A reduced version of the yeast multi-omic dataset described in Ugidos et al. (2020) is included in the package. More details about the package can be found in the Bioconductor vignette and Supplementary Section S4.
Fig. 2. Usage of the MultiBaC method on the complete yeast multi-omic dataset. (a) The Q2 plot shows the number of PLS components needed to reach a good predictive ability. (b) The explained variance plot shows the number of components retained by the ARSyN model. (c) The inner relation plot checks the requirement of linearity between the X and Y PLS components. (d) Distribution of the magnitude of batch effects
4 Results and discussion
Supplementary Section S3.1 summarizes and discusses the qualitative comparison of the methods included in the MultiBaC package with the most popular BECAs. Briefly, in the single-omic case, limma and ComBat cannot handle unknown batches. ARSyN, RUV and SVA can estimate such noise effects, but SVA does not provide a corrected dataset and instead returns surrogate variables to be included in differential expression models. ARSyNbac can simultaneously correct for both known and unknown sources of noise. To the best of our knowledge, MultiBaC is the only BECA for correcting multi-omic batch effects.
The performance of the ARSyN algorithm in noise reduction mode was validated on both real and simulated data in Nueda et al. (2012), but the algorithm was not provided in a publicly available R package. The known batch option is a straightforward adaptation of the original ARSyN methodology. A very simple implementation of these two ARSyN versions was first included in the NOISeq R package (Tarazona et al., 2015). Their functionality has now been improved in the ARSyNbac function, which adds simultaneous correction of unwanted effects from both known and unknown sources, performance plots and more flexible options for PCA models. Supplementary Section S3.2 shows that ARSyNbac outperforms the other popular BECAs listed in Supplementary Table S1.
The MultiBaC strategy has also been extensively validated in Ugidos et al. (2020). Figure 2 and Supplementary Figure S2 show how MultiBaC successfully removes batch effects when four yeast omics types (RNA-seq, GRO-seq, RIBO-seq and PAR-clip) (Ugidos et al., 2020), obtained in different labs, are combined. MultiBaC returns a harmonized multi-omics dataset ready to be used in downstream analyses aiming to infer regulatory patterns across omics layers.
This work was funded by the Generalitat Valenciana through PROMETEO grants program for excellence research groups [PROMETEO 2016/093] and by the Spanish MICINN [PID2020-119537RB-I00]. Funding for open access charge: Universitat Politècnica de València.
Conflict of Interest : none declared.
Supplementary Material
Contributor information.
Manuel Ugidos, Gene Expression and RNA Metabolism Laboratory, Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas, Valencia 46010, Spain; Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia 46022, Spain.
María José Nueda, Department of Mathematics, Universidad de Alicante, Alicante 03690, Spain.
José M Prats-Montalbán, Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia 46022, Spain.
Alberto Ferrer, Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia 46022, Spain.
Ana Conesa, Institute for Integrative Systems Biology, Consejo Superior de Investigaciones Científicas, Valencia 46980, Spain.
Sonia Tarazona, Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia 46022, Spain.
- Kupfer P. et al. (2012) Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis. BMC Med. Genomics, 5, 1–12.
- Leek J.T. et al. (2012) The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28, 882–883.
- Nueda M.J. et al. (2012) ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics, 13, 553–566.
- Risso D. et al. (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol., 32, 896–902.
- Ritchie M.E. et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. NAR, 43, e47.
- Tarazona S. et al. (2015) Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. NAR, 43, e140.
- Ugidos M. et al. (2020) MultiBaC: strategy to remove batch effects between different omic data types. Stat. Methods Med. Res., 29, 2851–2864.
Multiple series time- course (MSTC) microarray experiments are designed experimental setups in which gene expression is measured at various points of a given time interval on samples that correspond to different levels of other experimental factor(s), such as treatment, tissue or strain.
However, there is currently no established method for identifying differentially expressed genes in a time course study. Here we propose a significance method for analyzing time course microarray studies that can be applied to the typical types of comparisons and sampling schemes.
We have proposed a significance method to identify differentially expressed genes in time course microarray experiments and applied it to studies involving two types of sampling and both types of temporal differential expression.
We illustrate these issues with a discussion of short time-course experiments below. Time-course experiments. In time-course experiments, the design choices depend on the...
Multi-series time-course microarray experiments are useful approaches for exploring biological processes and in studying gene expression changes along time and in evaluating trend differences between the various experimental groups.