Bayesian
Analysis of Gene Expression Levels
Version 4.1, July 2009
Zhang Zhang, Colin Meiklejohn and
Jeffrey Townsend
Bayesian Analysis of Gene Expression Levels (BAGEL) is a program that allows statistical inferences to be made regarding differential gene expression between two or more samples measured on spotted (twochannel) microarrays. BAGEL makes these inferences from normalized ratio data, on a genebygene basis. The advantages of BAGEL include ease of use, straightforward interpretation of results, statistical robustness, flexibility in accepting different experimental designs, and that it is free. BAGEL was written by Jeffrey Townsend, who periodically updates and improves the program, and to whom bugs should be reported. BAGEL is available for Windows, Mac OS9, Mac OSX, and Linux. BAGEL can be downloaded from the Townsend Lab web site, http://www.yale.edu/townsend/software.html.
Please cite:
Townsend, J.P., and D.L. Hartl. 2002. Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple strains or treatments. Genome Biology 3 (12): research0071.10071.16.
A number of factors can influence the signal intensity of labeled DNA hybridizing to a microarray spot, such as hybridization efficiency or concentration of target sequences in the spot. Any such factors that will be shared by samples hybridizing to the same spot will be eliminated by considering the ratio of the two signal intensities. BAGEL explicitly takes this into account by using ratio measurements, not singlechannel signals, coming from twodye competitive hybridizations. BAGEL makes transitive comparisons across ratios, for example inferring the ratio of sample A to sample C across a set of hybridizations that directly compare sample A to sample B and sample B to sample C. Data that are appropriate for analysis by BAGEL must therefore have the following properties:
1) The data should be collected in such a way that pairs of samples share sources of variation that are nontrivial and that are not of interest to the researcher; and the relative magnitude of some metric between the two samples is the measurement of interest. The originally envisioned use for BAGEL, twochannel microarray data, is an obvious example of data with this structure, but in principle BAGEL could be used to analyze any other kind of data that fit these criteria. For this reason, tiled microarray platforms such as Affymetrix do not lend themselves easily to BAGEL analysis.
2) All genotypes, tissues, treatments etc. (hereafter referred to as ‘expression nodes’) to be analyzed must be connected to all other nodes through direct or indirect comparisons. For example, given four nodes, a set of microarray experiments that competitively hybridized node 1 vs. node 2, node 2 vs. node 3, and node 3 vs. node 4 would permit the estimation of relative expression levels of genes across all four nodes. On the other hand, a set of experiments that competitively hybridized node 1 vs. node 2 and node 3 vs. node 4 would permit the estimation of gene expression levels between nodes 1 and 2 and between 3 and 4, but no estimates could be made regarding comparisons between 1 and 3. Any experimental design incoporating a reference sample will necessarily fulfill this criterion, as all nodes are connected through their direct comparisons with the reference.
3) There must be a sufficient number of measurements (replicate experiments) to estimate the parameters. In principle, this means as many measurements as nodes, or half as many hybs as nodes (when estimating a single variance parameter, see below). In practice, requirement 2 will usually necessitate more than this many hybs, and of course, the greater the replication, the more precise the estimates of gene expression. Experience suggests that an experimental design providing at least three measurements for each node is a good target number for providing reasonable statistical power.
Formally, the statistical model employed by BAGEL assumes that the measured fluorescence intensity for one channel is a function of (1) the true quantity of the labeled mRNA species; (2) some number of multiplicatively and/or additively confounding factors that are specific to the spot in question but shared by the measured intensity from the other channel; and (3) some number of unbiased, randomly distributed error terms (for example, reverse transcription or labeling efficiency). If the error terms contribute additively, the observed ratios of gene expression between the ith and the jth samples (z_{ij}) should be similar to the ratio of two normal distributions, and can be approximated by the function
_{}
where m_{i} is the expression level and s^{2}_{i} is the variance of a gene in sample i. If the error terms contribute multiplicatively, then the ratio z_{ij}_{ }can be approximated by the ratio of two lognormal distributions using the function
_{}
BAGEL can implement both kinds of error models.
This approach requires the estimation of 2n –1 parameters (n1 expression levels and n variances) for each gene. One way to reduce the number of parameters is to assume that, for a given gene, all the nodes have the same error variance. This reduces the number of parameters to n, and consequently reduces the number of replicates required for statistical power. Similarly, one can assume that, for a given gene, all samples have a constant relationship with the expression level, i.e. that they have a common coefficient of variation (n = s_{i}/m_{i} for all i). This approach also requires estimating n free parameters.
BAGEL explores the likelihood function derived from either of the ratio formulas above for all nodes using a markov chain monte carlo (MCMC) approach in a Bayesian framework. This method starts with a random vector of parameters and then changes two of the parameters by small, random steps. At each step the likelihood of the data given the model above and the new parameter values is calculated. If the new parameters give a better fit to the data, then the new values are accepted. If the new parameters give a worse fit to the data, then the new values are accepted with a probability proportional to their likelihood. In this way the markov chain searches the parameter space, finding combinations of relative gene expression levels that produce the greatest likelihood, and samples from the chain are used to construct the Bayesian posterior probability of the parameters given the data.
Figures 13 show an example of a markov chain from BAGEL analysis of a single gene with four nodes (gene BcDNA:LD09936 from the example dataset discussed below). The relative expression levels for the four nodes are indicated on the y axis in the upper panel (note that they have not yet been normalized such that the node with the lowest expression is set to one) and the step in the chain is shown on the x axis. The lower panel shows the likelihood of the data given the parameters (relative expression levels) at that step in the chain. Figure 1 shows the first 200 steps and Figure 2 shows the first 1000 steps in the chain. Note that the relative expression levels quickly move away from their (random) initial values towards greater or lesser expression levels more consistent with the data, as indicated by the jumps in likelihood. This initial exploration of parameter space, while important for finding the peak of greatest likelihood, should not contribute to the data sampled from the chain, or it will not converge on the posterior probability of the Bayesian formulation of the functions shown above. Put another way, we don’t want BAGEL to make inferences from the data in regions of low likelihood simply as a result of the (random) initial parameter settings. For this reason, BAGEL runs the chain for a “burnin” period before beginning to sample parameter values for the posterior probability. The default length of this burnin is 20,000 steps in the chain; you can see from Figures 1 and 2 that the parameter values move rather quickly to a region of parameter space and stay there. Figure 3 shows the last 1000 steps sampled from the analysis, following a 20,000 step burnin. The relative expression levels for all four nodes have stabilized around the most likely values. One thing to note is that Figures 1 and 2 show the parameter values at each step of the chain, whereas Figure 3 shows every 20^{th} step in the chain (so although 1000 steps are shown, the data actually spans 20,000 steps). This is known as the period of sampling from the chain, and is important because it removes correlations between successive steps (the plateaus for a given node in Figures 1 and 2) which otherwise compromise the ability of the markov chain to explore the full range of potential parameter values.
BAGEL infers relative expression levels and statistical significance from the parameter values it samples from the chain. The relative expression level of a node for a given gene is the median value across the samples from the chain, normalized relative to the lowest median expression level. The 95% credible intervals are the values within which 95% of the samples from the chain are bounded, and Pvalues for the “hypothesis” that node A > node B are simply the proportions of samples from the chain where node A’s expression level was greater than node B’s.
BAGEL does not perform normalization of raw microarray data (for example, to account for systematic differences in signal intensity between the two fluorophores), and an appropriate normalization method should therefore be implemented prior to BAGEL analysis. Following normalization, the data must be formatted in a way that BAGEL can use. The appropriate format is a tabdelimited text file that contains (normalized) ratio data for all the relevant genes and hybridizations, as well as some header rows and columns (see below).
The details for loading and running BAGEL will differ depending on your platform. See the readme file that comes with the version of BAGEL you have downloaded for specific instructions.
Upon executing BAGEL, the following text will appear on your terminal:
B.A.G.E.L.
Acceptable files for Unix B.A.G.E.L. are tabdelimited text
files with three header rows. The second and third rows must containing unique
names for each experimental expression node and reference expression node,
followed by any number of data rows for each gene of interest:
[Your Notes] [Your
Notes] [Label1] [Label2] [Label3]...
[Your Notes] [Channel1] Exp1 Exp2 Exp3 ...
[Your Notes] [Channel2] Ref1 Ref2 Ref3 ...
ORF1
CommonName1 Ratio1 Ratio2 Ratio3 ...
...
... ... ... ...
Please type the exact name of a text file of microarray
ratio results to analyze:
This text shows you the tabdelimited text format in which BAGEL expects your input file. There are three header rows to a properly formatted BAGEL input file. Square brackets indicate information that may be in your input file for your own reference only. All unbracketed entries must be present. There is no need for the experiments (Label1, Label2, Label3...) to be in any particular order, and there is no inherent difference between experimental (Exp) and reference (Ref) samples. In fact, in any experimental design wisely incorporating dyeswaps, sample names will presumably appear in both row 2 and row 3. It is, however, essential that a sample name be exactly consistent across all columns, or else BAGEL will infer two different samples when there is in fact only one.
At the prompt following the example data format, type the name of the input file, including the directory path. For example, in UNIX, directory pathnames look something like
/Users/jeff/DOCUMENTS/RESEARCH/Software/BAGEL/Datafilename.unx
NOTE: Keep Datafilename and your experimental node names short. If the Datafilename is too long, BAGEL has to truncate it and use a far less intutitive name for your output file.
BAGEL then asks you to verify the number of hybs and the names of the expression nodes (sample•treatments) in your experimental design. Press RETURN to verify, or press ‘q’ to quit and correct your input file.
You are then presented with a menu of options:
Current MCMC settings:
(E)rror
Model: Additive errors, estimating/constraining Coefficient of Variation terms
(C)onstrained
Coefficient of Variation: True
(I)nitial
values:
Mu[M12] := 1.00 Coefficient of
Variation[M12] := 0.2000
Mu[M28] := 1.00 Coefficient of
Variation[M28] := 0.2000
Mu[M57] := 1.00 Coefficient of
Variation[M57] := 0.2000
Mu[M78] := 1.00 Coefficient of
Variation[M78] := 0.2000
(M)u step size: 0.50
(V)ariance/CV
step size: 0.500
(B)urn in, # generations: 20000
(P)eriod
of sampling from the Markov chain: 20
(G)enerations
to be sampled: 10000
(F)ull
output of the chain: False
(T)uning
depth maximum: 8
(E)rror Model: This option allows you to choose between additive and multiplicative errors, and whether you wish to estimate or constrain the variance or the CV of the gene expression levels, as was described above. All the models BAGEL uses work fairly well, and results are usually quite similar (see below), so unless you have reason to, it’s probably best to use the default settings (additive errors, constraining coefficients of variation).
(C)onstrained Variance/Coefficient of Variation: If TRUE, variances/coefficents of variation for all expression nodes are assumed to be the same. With this option, you must have as many measurements as expression nodes. If FALSE, variances for all expression nodes are separately estimated. In this case, you must have at least twice as many measurements as expression nodes (minus one). Unless your data is very highly replicated, using constrained variance is recommended. With underreplicated data, estimation of variances for each sample is very imprecise and can lead to misleading results. When a design is wellreplicated, to the extent it has been tested so far, it seems that estimating each variance independently changes BAGEL estimates of the expression levels very little.
(I)nitial values: The starting Mu and Sigma Squared parameter values for the Markov Chain. In some applications of the MCMC method, it is very important to try many different initial starting values to ensure that the chain does not get stuck in one region of the state space. This is not much of an issue with the BAGEL models. With moderately decent microarray data, it does not get stuck in local peaks.
(M)u step size: The step size is a very important parameter in the MCMC implementation, which BAGEL automatically tunes for you as long as (T)uning Depth maximum, below, is greater than one. Note that BAGEL uses information from genes previously analyzed in your dataset to help it guess the right step size.
(V)ariance/CV step size: See (M) above.
(B)urn in, # generations: The default burnin (20000 iterations) is rather excessive for most data sets. However, it is nice to feel confident that stationerity in the chain has been reached. Decreasing this parameter will make BAGEL run slightly faster. It is up to you to ensure that BAGEL is reaching stationerity. One way to do this is to run multiple chains and check that they are converging on the same results.
(P)eriod of sampling from the Markov chain: This is how many iterations are performed until the current state is sampled from the chain to construct posterior distributions. When the period is greater than one, this is referred to as “thinning the chain”. It subdues correlations that are present between subsequent states of the chain. Decreasing this period substantially decreases computation time, but compromises the independence of the samples and thereby the adequate mixing of the chain.
(G)enerations to be sampled: Ten thousand generations yields accuracy to about three digits. Increasing the number of generations (iterations) increases the number of digits of accuracy. The product P * G largely determines the amount of time necessary for a BAGEL run for a gene.
(F)ull output of the chain: FALSE. Keep it that way, unless your input file has only one or two genes and you really like BAGEL to talk about everything it’s doing. If you run with Full Output, BAGEL saves the posterior distributions for Mu for every sample. On a genomewide data set, it could rapidly fill your hard drive.
(T)uning depth maximum: How hard (in MCMC runs) you wish BAGEL to try to find an optimal step size for a gene. In the newest versions of BAGEL (>3.0), an optimal step size is discovered in a few tries, and almost always in six or seven, so there is little point in changing this from the default value (8 tuning iterations). A gene for which no optimal step size is found is marked FALSE under "Acceptable?" in the output file, but this just about never happens.
When you are done changing settings, or if (more likely) you have not changed them at all, press RETURN, and BAGEL will begin to work on your data set. BAGEL currently takes a long time to run, say, a minute per gene or more. Thus, it is frequently convenient to set it going on a computer you won’t need for the night, and leave it alone.
BAGEL also provides a commandline option to run the program without manual intervention, which is suitable for submitting jobs to the queues for processing. The command line is parsed as follows:
USAGE: BAGEL [options] InputDataFile
Where:
InputDataFile = Expression data
Options:
e ErrorModel 
valid options are 'AM', 'AC', 'EM', or 'EC' 
c 
estimate individual variances. If this flag is not given, the default (single variance) is used 
i Mu1,Mu2,... 
initial mu estimates for each expression node (comma delimited, no spaces) 
I Var1,Var2... 
initial sigmasquared value (if the c flag is not given) or a list of initial sigmasquared values for each each expression node (if the c flag is given). 
m MuStepSize 
step size for varying estimates 
v VCVStepSize 
step size for varying variances 
b BurnIn 
# of generations of burnin prior to sampling 
p SamplingPeriod 
period of sampling from the Markov chain 
g ChainSampleSize 
# of generations to be sampled 
f FullOutput 
if this flag is given, the full output of the MCMC chain will be written (c.f. the warning in the BAGEL tutorial) 
t MaxTuningDepth 
how hard BAGEL should try to tune the step size 
q 
Dry run: parse input file and command line, but abort before starting the BAGEL run. 


(the following parameters are not covered in the previous BAGEL tutorial) 

r RandSeed 
Seed for random number generator. If not given, the random number generator will be seeded from the system clock (so that multiple BAGEL runs will perform independent samplings of the parameter space). 
o ResultsFile 
Filename for the BAGEL results file. If not given, a default name will be generated based on the name of the input file (c.f. the BAGEL tutorial). 
The options may be given in any order. If an option is repeated, the last value of that option will be used. If an option is omitted, the default value (as documented in the BAGEL tutorial) will be used.
The commandline BAGEL was provided by the courtesy of Dr. Mark Voorhies.
The microarray data used here is taken from Ranz et. al. 2003. These experiments analyzed mRNA levels for ~5000 genes in adult males and females of Drosophila melanogaster and a closely related species, Drosophila simulans. This dataset therefore has four nodes, and the experimental design and level of replication is shown in Figure 4. For simplicity of illustration, four genes from the full dataset were selected.
1) Execute BAGEL
If you doubleclick on the
application, a window comes up with the following text in it:
B.A.G.E.L.
Acceptable files for
B.A.G.E.L. are tabdelimited text files
with three header rows. The
second and third rows must containing unique names
for each experimental
expression node and reference expression node,
followed by any number of data
rows for each gene of interest:
[Your
Notes] [Your
Notes] [Label1] [Label2] [Label3] ...
[Your
Notes] [Channel1] Exp1 Exp2 Exp3 ...
[Your
Notes] [Channel2] Ref1 Ref2 Ref3 ...
ORF1 CommonName1 Ratio1 Ratio2 Ratio3 ...
... ... ... ... ...
This text is meant to give
you the tabdelimited text format by which BAGEL expects your input file to be
formatted. There are three header rows to a properly formatted BAGEL input
file. Square brackets indicate information that may be in your input file for
your own reference only. All unbracketed entries must
be present. There is no need for the experiments (Label1, Label2, Label3...) to
be in any particular order, and there is no inherent difference between
experimental (Exp) and reference (Ref) samples. In fact, in any experimental
design wisely incorporating dyeswaps, sample names will presumably appear in
both row 2 and row 3. It is, however, essential that a
sample name be exactly consistent across all columns, or else BAGEL will infer
two different samples when there is in fact only one.
2) Input file name and verify file format
Verifying Input File Dros_spp
Dros_spp DYE 1A 1B 17B 27A
6A 6B 18A 18B
28B 19A 21A
21B 26A
26B 28A
20B 22A
22B 29A
29B 24B
dataset
Initializing ExpressionNodeNameList...
Assigning names to expression nodes.....................
Number of Hybs: 21
Press RETURN to verify or q to quit:
gene name
Assigning more names to expression
nodes.....................
Please verify that 4 expression nodes are desired,
that all desired nodes are are listed
below, and
that each of the following are unique expression nodes.
CS
CSf
Sim
Simf
Press RETURN to verify or q to quit:
3) Choose parameter settings
Assigning experimental node names to hyb
list.....................
Assigning reference node names to hyb
list.....................
File Dros_spp header rows
verified.
Current MCMC settings:
(E)rror
Model: Additive errors, estimating/constraining Coefficient of Variation terms
(C)onstrained
Coefficient of Variation: True
(I)nitial
values:
Mu[CS] := 1.00 Coefficient of Variation[CS] := 0.2000
Mu[CSf] := 1.00 Coefficient of Variation[CSf]
:= 0.2000
Mu[Sim] := 1.00 Coefficient of Variation[Sim]
:= 0.2000
Mu[Simf] := 1.00
Coefficient of Variation[Simf] := 0.2000
(M)u step size: 0.50
(S)igma /
mu step size: 0.500
(B)urn in, # generations: 20000
(P)eriod
of sampling from the Markov chain: 20
(G)enerations
to be sampled: 10000
(F)ull
output of the chain: False
(T)uning
depth maximum: 8
Enter a letter to change a parameter, q to quit, or RETURN
to go on:
4) BAGEL is off and running
PGRPSC1b(GH07464)
Constructing Comparison Matrix:
CS CSf Sim Simf
CS 1.29 1.48 0.67 0.53 0.15 0.63 0.35
CSf 1.05 1.38
0.35 0.40 0.46
Sim 7.27 1.88
1.29 0.55 1.14
Simf 3.33 2.19 1.06 2.04
V/CV A.R.: 0.14...
Here BAGEL takes all of the available data for the first gene (PGRPSC1b) and starts to run the markov chain. The line at the bottom (V/CV A.R. . . ) shows BAGEL’s attempts to tune the jump size in the chain to produce an appropriate acceptance rate (between 0.15 and 0.5). Once the chain has run successfully, BAGEL shows you the results:
Mu Acceptance ratio: 0.40 True
Coefficient of Variation Acceptance ratio: 0.40 True
Mu SS: 0.500 V/CV SS: 0.1498
PGRPSC1b(GH07464), CS: 0.32 1.00 0.34
PGRPSC1b(GH07464), CSf: 0.35 1.17
0.35
PGRPSC1b(GH07464), Sim: 0.39 2.37
0.38
PGRPSC1b(GH07464), Simf: 0.40 2.85 0.38
Density of Best Likelihood: 6.496307351635393e06
Logmean MuStepSize: 0.50
Logmean VCVStepSize: 0.15
Genes Examined: 1
The D. simulans female expression level for PGRPSC1b, for instance, is estimatdd to be 2.85, with a 95% credible interval between 2.45 and 3.23. All expression levels are estimated in comparison to the sample with the lowest expression level (here, male D. melanogaster).
BAGEL then moves on to the next gene . . .
BcDNA:LD09936(LD09936)
Constructing Comparison Matrix:
CS CSf Sim Simf
CS 1.01 8.9120.68 0.42 0.47 0.42 0.51
CSf 0.08 0.07
2.08 1.55 1.39
Sim 1.94 1.35
36.0649.9537.74
Simf 0.67 0.51 0.02 0.01
Mu Acceptance ratio: 0.25 True
Coefficient of Variation Acceptance ratio: 0.15 True
Mu SS: 0.500 V/CV SS: 0.1498
BcDNA:LD09936(LD09936), CS: 1.06 11.35 0.96
BcDNA:LD09936(LD09936), CSf: 0.94 1.78
1.24
BcDNA:LD09936(LD09936), Sim: 1.53 22.10 1.16
BcDNA:LD09936(LD09936), Simf: 0.80 1.00 1.01
Density of Best Likelihood: 4.584386846285838e11
Logmean MuStepSize: 0.50
Logmean VCVStepSize: 0.15
Genes Examined: 2
CG12200(LD30246)
Constructing Comparison Matrix:
CS CSf Sim Simf
CS 0.11 0.15 0.87 0.99
1.04
CSf 5.76 4.34
9.42 7.04 5.92
Sim 0.85 1.12
1.31 2.06
Simf 0.12 0.19 0.55 0.80
V/CV A.R.: 0.07...
Mu Acceptance ratio: 0.20 True
Coefficient of Variation Acceptance ratio: 0.32 True
Mu SS: 0.500 V/CV SS: 0.0299
CG12200(LD30246), CS: 0.20 1.28 0.21
CG12200(LD30246), CSf: 0.48 7.13
0.45
CG12200(LD30246), Sim: 0.28 1.43
0.33
CG12200(LD30246), Simf: 0.19 1.00 0.19
Density of Best Likelihood: 4.516871022806865e01
Logmean MuStepSize: 0.50
Logmean VCVStepSize: 0.09
Genes Examined: 3
qtc(SD06355)
Constructing Comparison Matrix:
CS CSf Sim Simf
CS 0.93 3.64 2.06 0.90 1.13 0.82 0.96
CSf 0.30 0.39
0.85 0.88 0.97
Sim 0.89 1.09
1.83 2.61
Simf 1.05 0.85 0.52 0.28
Mu Acceptance ratio: 0.22 True
Coefficient of Variation Acceptance ratio: 0.18 True
Mu SS: 0.500 V/CV SS: 0.0875
qtc(SD06355), CS:
0.19 2.59 0.18
qtc(SD06355), CSf: 0.18
1.00 0.18
qtc(SD06355), Sim: 0.19
2.65 0.19
qtc(SD06355), Simf: 0.18
1.11 0.17
Density of Best Likelihood: 1.481589125998954e+03
Logmean MuStepSize: 0.50
Logmean VCVStepSize: 0.09
Genes Examined: 4
Press RETURN when ready.
And
it’s done.
1) BAGEL results
BAGEL results output is a tabdelimited text file with estimates for each expression node, additions for 95% upper bounds, and subtractions for 95% lowerbounds. These are formatted such that creation of an EXCEL column or bar graph should be very easy. Other columns let you know of the Mu and Variance/CV step acceptance rate as well as an “Acceptable?” column which discloses whether BAGEL found acceptable acceptance rates (between 0.15 and 0.5) for both parameters.
The results output from the example datafile looks like this:
gene name
Unique ID Common
Name CS CSf Sim Simf ...
PGRPSC1b GH07464 1 1.17 2.37 2.85 ...
BcDNA:LD09936 LD09936 11.35 1.78 22.10 1 ...
CG12200 LD30246 1.28 7.13 1.43 1 ...
qtc SD06355 2.59 1 2.65 1.11 ...
...
()97.5%[CS] ()97.5%[CSf] ()97.5%[Sim] ()97.5%[Simf]
0.32146 0.34871 0.38685 0.39713 ...
1.055 0.94196 1.52557 0.79782 ...
0.20477 0.48235 0.28134 0.18666 ...
0.18859 0.17951 0.18848 0.17595 ...
...
2) BAGEL Pvalues
Pvalues for whether expression level is greater in one sample than another are saved as separate files according to expression node names. You should be aware that these Pvalues are not "corrected" in any way for multiple tests; you should have the appropriate scientific skepticism and look carefully for corroborating biologically consistent evidence. Furthermore, the Pvalue is only as precise as the number of samples taken from the chain. If 10,000 samples were taken, then a Pvalue of zero really means that in none of the samples taken from the chain did the comparison hold, and so the Pvalue is really P < 0.0001. The Pvalue text files are on the same drive in the same folder as your input file. The output filename will be the same as the original Datafilename, but will have the characters “.pvalue” appended.
The Pvalue from the example datafile looks like this:
Unique ID Common
Name P(CS>CSf) P(CS>Sim) ...
PGRPSC1b GH07464 0.2543 0.0001 ...
BcDNA:LD09936 LD09936 1 0 ...
CG12200 LD30246 0 0.1591 ...
qtc SD06355 1 0.3336 ...
...
If BAGEL is halted midrun, a number of files with ".BAM" suffixes may be found in the BAGEL folder. These files temporarily store the sampled Mu values and may be summarily deleted.
Figure 5 shows the results of BAGEL analysis of the four genes under different error models. Each error model was run twice in order to assess variation between independent runs within a model. With the level of replication shown in Figure 4, changing between additive and multiplicative error models, and constraining variances or CVs makes very little difference to the results. For three of the genes, choosing unconstrained variances also has little effect on the results, aside from increasing the width of the 95% credible intervals. However, for BcDNA:LD09936, choosing an unconstrained variance model does have a significant impact on the relative expression levels inferred by the markov chain. This result, given the well replicated nature of these experiments, underscores the need to be cautious when choosing an unconstrained variance/CV model.
Please report bugs or suggestions for changes to Jeffrey Townsend (Jeffrey.Townsend@yale.edu).
Copyright © 2003, 2004. The Regents of the
Permission
to use, copy, modify, and distribute this software and its documentation for
educational, research, and notforprofit purposes, without fee and without a
signed licensing agreement, is hereby granted, provided that the above
copyright notice, this paragraph and the following two paragraphs appear in all
copies, modifications, and distributions. Contact The Office of Technology
Licensing, UC Berkeley,
IN NO
EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF
THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
REGENTS SPECIFICALLY DISCLAIMS ANY
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED
"AS IS". REGENTS HAS NO OBLIGATION TO
PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
This tutorial was compiled by Zhang Zhang. Its contents were constructed from the original tutorial by Colin Meiklejohn, the original publication describing BAGEL (Townsend and Hartl 2002), a subsequent report elaborating the error models (Townsend 2003), Mark Voorhies’ commandline help file, and Jeffrey Townsend’s readme file that accompanies the BAGEL download. Questions regarding the use of BAGEL and this tutorial can be addressed to Jeffrey Townsend (Jeffrey.Townsend@yale.edu).
1) Original publication describing BAGEL:
Townsend
JP, Hartl DL, 2002. Bayesian analysis of
gene expression levels: statistical quantification of
relative mRNA level across multiple strains or treatments. Genome
Biol. 3(12): RESEARCH0071.
2) Extension of the original additive model
to include multiplicative errors and constrained CVs, as well as power tests of
the models in BAGEL are described in:
Townsend
JP, 2004. Resolution of large and small differences in gene expression using
models for the Bayesian analysis of gene expression levels and spotted DNA
microarrays. BMC
Bioinformatics 5: 54.
3) Research reports
that have used BAGEL and might provide ideas about postBAGEL analyses of
microarray data include:
AubinHorth N, Letcher BH, Hofmann HA,
2005. Interaction of
rearing environment and reproductive tatic on gene
expression profiles in Atlantic salmon. Journal of Heredity 96(2): 118.
Lemos B, Meiklejohn CD, Hartl DL, 2004. Regulatory evolution across the protein interaction network. Nature Genetics
36(10): 10591060.
Ranz JM, Namgyal K, Gibson G, Hartl DL,
2004. Anomalies in the
expression profile of interspecific hybrids of Drosophila melanogaster and Drosophila simulans. Genome
Research 14: 373379.
Grozinger CM, Sharabash NM, Whitfield CW, Robinson, GE, 2003. Pheromonemediated gene expression in the honey bee brain. PNAS 100 (suppl 2):1451914525.
Whitfield CW, Cziko
A, Robinson, GE, 2003. Gene expression
profiles in the brain predict behavior in individual bees. Science
302:296299.
Silverman N, Zhou
R, Erlich RL, Hunter M, Bernstein E, Schneider D, Maniatis T, 2003. Immune
activation of NFkappaB and JNK requires Drosophila
TAK1. J. Biol. Chem.
278(49):4892848934.
Ranz JM, CastilloDavis
CI, Meiklejohn CD, Hartl
DL, 2003.
Sexdependent gene expression and evolution of the Drosophila transcriptome. Science 300(5626):17425.
Meiklejohn CD, Parsch J, Ranz
JM, Hartl DL, 2003. Rapid evolution
of malebiased gene expression in Drosophila. PNAS
100(17):98949899.
Townsend
JP, Cavalieri D, Hartl DL,
2003. Population genetic variation in genomewide gene expression. Mol. Biol. Evol. 20(6):95563.
4) Introductory
reviews on the design of Bayesian multifactorial
microarray experiments:
Townsend, J.P., 2003. Multifactorial
experimental design and the transitivity of ratios with spotted DNA microarrays.
BMC Genomics 4: 41.
Townsend, J.P., and J.W. Taylor, 2005. Designing experiments using spotted microarrays to detect gene regulation differences within and among species. In “Methods in molecular evolution: producing the biochemical data”, Zimmer, E., ed., Academic Press: Methods in Enzymology 224.