MAGNAMWAR is a software package for bacterial genome wide association (GWA). Relative to standard approaches for GWA, e.g. in humans, bacterial genomes and phenotyping experiments have unique characteristics that suggest the use of different variant calling and statistical approaches may improve the association analysis (reviewed in Power, et al, 2017; Pubmed ID 27840430). MAGNAMWAR enables GWA based on bacterial gene presence or gene variant, permits the use of different statistical modeling approaches, and incorporates population structure into models based on user-defined parameters. Genes are clustered into orthologous groups (OGs) using the OrthoMCL gene clustering software, and can be statistically associated with raw or aggregate (e.g. mean) phenotype data. MAGNAMWAR is especially useful for performing associations when the phenotypes of phylogenetically disparate taxa are analyzed, and the calling of fine-scale variants (e.g. SNPs) is challenging or inappropriate. On the other hand, OrthoMCL gene clustering software may lack resolution when comparing phenotypes between strains from the same bacterial species, and OrthoMCL becomes computationally prohibitive as the number of sampled taxa increases above several hundred isolates. For such implementations, we recommend users consider other existing software reviewed in Power, et al. 2017 that are designed for processing hundreds or thousands of samples; or for performing SNP-based comparisons where it may be more appropriate to consider factors such as linkage. Users may be especially interested in the bacterial GWA software packages SEER (Pubmed ID 27633831), SCOARY (Pubmed ID 27887642), and PhenoLink (Pubmed ID 22559291), which have different strengths relative to MAGNAMWAR. Additionally, while MAGNAMWAR can be used to associate shotgun metagenomic sequencing data with corresponding sample phenotypes, nonsaturating sequencing of a sample could lead to false positive or false negative results.
This vignette describes a recommended workflow to take full advantage of MAGNAMWAR. The data are from a study on bacterial determinants of Drosophila melanogaster triglyceride content, and are representative of any number of datasets that associate one phenotype with one bacterial species. The bacterial genotypes were called based on individually cultured and sequenced bacterial species, obviating potential complications of non-saturating sequencing depth. The data included in the package and used in documentation examples are highly subsetted (2 orders of magnitude smaller) from the original dataset to increase speed of the example analyses. Most analyses can be run on a standard laptop computer, although we recommend a desktop computer with at least 16GB RAM for large (>500 phenotype measures) datasets. The functions are presented in their recommended order of operations using the case study fruit fly triglyceride data included in the package.
The following table outlines the specific input and outputs per function in the order each function should be used. Only the essential functions are listed; optional functions are discussed in their relevant sections.
Function | Purpose | Input | Output | Case Study Input | Case Study Output |
---|---|---|---|---|---|
FormatMCLFastas | prep amino acid fasta files for OrthoMCL analysis | 4-letter abbreviated amino acid fasta files of every host-mono-associated organism | concatenated fasta file of all inputs, removes any duplicate ids | fastas in extdata/fasta_dir/ | MCLformatted_all.fasta |
FormatAfterOrtho | format output of OrthoMCL clusters to be used with analyses | groups file from OrthoMCL | Parsed data contained in a list of 2 objects (presence absence matrix, protein ids) | extdata/groups_ example_r.txt | after_ortho_format_grps |
AnalyzeOrthoMCL | main analysis of data | FormatAfterOrtho output; phenotypic data | a matrix with 7 variables (cluster group, p-value, corrected p-value…) | after_ortho_format_grps; pheno_data | mcl_mtrx_grps |
JoinRepSeq | appends representative sequences with AnalyzeOrthoMCL matrix | FormatAfterOrtho output; fasta files; output matrix of AnalyzeOrthoMCL | a data frame of the joined matrix | after_ortho_format_grps; extdata/fasta_dir/; mcl_mtrx_grps | joined_mtrx_grps |
The first step is to assign each gene to a cluster of orthologous
groups. This pipeline uses OrthoMCL for clustering, either through a
local install (requires extensive RAM; see supplementary data) or
web-based executable (http://orthomcl.org/orthomcl/;
max 100,000 sequences). OrthoMCL software requires specific fasta
sequence header formats and that there are no duplicate protein ids. The
fasta header format contains 2 pipe-separated pieces of information: a
unique 3-4 alphanumeric taxon designation and a unique protein ID
classifier, e.g. >apoc|WP_000129691. Two functions aid in file
formatting: FormatMCLFastas
, which formats genbank files,
and RASTtoGBK
, which formats files from RAST.
FormatMCLFastas
converts amino-acid fasta files to an
OrthoMCL compliant format and combines them into a concatenated file
called “MCLformatted_all.fasta”. All amino acid fasta files must first
be placed in a user-specified directory and given a 3-4 letter
alphanumeric name, beginning with a non-numeric character
(e.g. aac1.fasta). Example fasta files are included in the MAGNAMWAR
package. The output “MCLformatted_all.fasta” file is the input for the
OrthoMCL clustering software.
For fasta files that are not in the NCBI format, they should be
converted in a separate folder to an NCBI-compatible format before
running FormatMCLFastas
. For example, files in the RAST
format have the initial header >fig|unique_identifer; not
>ref|xxxx|xxxx|unique_identifier|annotation. To convert these or any
other files to NCBI-compatible format, use the RASTtoGBK
function to merge the annotation using the unique identifier as a
lookup, specifying the name of the fasta file (with a 3-4 letter
alphanumeric name beginning with a non-numeric character), path to the
reference file containing the annotation, and the output folder. If the
reference file bearing the annotation is from RAST the lookup file can
be downloaded as the ‘spreadsheet (tab separated text format)’. If the
reference file is from a different source, format it so that the unique
identifier is in the 2nd column and the reference annotation is in the
8th column.
lfrc_fasta <- system.file('extdata', 'RASTtoGBK//lfrc.fasta', package='MAGNAMWAR')
lfrc_reference <- system.file('extdata', 'RASTtoGBK//lfrc_lookup.csv', package='MAGNAMWAR')
lfrc_path <- system.file('extdata', 'RASTtoGBK//lfrc_out.fasta', package='MAGNAMWAR')
RASTtoGBK(lfrc_fasta,lfrc_reference,lfrc_path)
After formatting fasta files, run OrthoMCL clustering software either locally or online. OrthoMCL documentation describes how local installations can vary the inflation factor parameter to adjust the resolution of gene clustering.
More information about OrthoMCL clustering software can be found at: http://www.orthomcl.org/.
The FormatAfterOrtho
function reformats the direct
output from OrthoMCL for the MAGNAMWAR pipeline. To use the command,
call FormatAfterOrtho
, specifying the location of the
OrthoMCL clusters file (online: OrthologGroups file; local: output of
orthomclMclToGroups command) and whether the software was run online
(“ortho”; default) or locally (“groups”). The following sample data were
derived using local clustering.
# file_groups is the file path to your output file from OrthoMCL
parsed_data <- FormatAfterOrtho(file_groups, "groups")
The new stored variable is a list of matrices. The first matrix is a
presence/absence matrix of taxa that bear each cluster of orthologous
groups (OGs). The second matrix lists the specific protein ids within
each cluster group. The data in each can be accessed by calling
parsed_data[[1]]
or parsed_data[[2]]
,
respectively, although this is not necessary for subsequent pipeline
steps. Because it contains the specific protein ids for each OG, this
list is a key input for most subsequent functions.
a subset of resulting variable
parsed_data
:
parsed_data[[1]][,1:13]
## aace apap apas atro bsub dmos ecol efao efav ehor geur gfra ghan
## ex_r00000 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00001 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00002 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00003 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "0" "1"
## ex_r00004 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00005 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00006 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## ex_r00007 "1" "1" "1" "1" "1" "0" "1" "1" "1" "1" "1" "1" "1"
## ex_r00008 "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "1" "0" "0"
## ex_r00009 "1" "1" "0" "0" "1" "0" "1" "0" "0" "1" "0" "0" "1"
## ex_r00010 "0" "1" "0" "1" "1" "0" "1" "1" "1" "1" "1" "1" "1"
parsed_data[[2]][,1:2]
## ex_r00000 ex_r00001
## V2 "aace|ZP_08699022.1" "bsub|NP_389730.2"
## V3 "apap|940282.4.peg.2315" "bsub|NP_390261.1"
## V4 "apas|ZP_16282535.1" "aace|ZP_08696506.1"
## V5 "atro|ZP_08644184.1" "apap|940282.4.peg.1501"
## V6 "bsub|NP_391359.1" "apas|ZP_16281966.1"
## V7 "dmos|ZP_08468863.1" "atro|ZP_08644394.1"
## V8 "ecol|NP_415408.1" "dmos|ZP_08469662.1"
## V9 "efao|YP_005708185.1" "ecol|NP_414920.1"
## V10 "efav|NP_815059.1" "efao|YP_005708912.1"
## V11 "ehor|ZP_08496675.1" "efav|NP_816072.1"
## V12 "geur|ZP_08902778.1" "ehor|ZP_08497259.1"
## V13 "gfra|ZP_11376413.1" "geur|ZP_08902509.1"
## V14 "ghan|ZP_06835021.1" "gfra|ZP_11376782.1"
## V15 "gobo|ZP_08896945.1" "ghan|ZP_06833288.1"
## V16 "gxyl|YP_004867578.1" "gobo|ZP_08898428.1"
## V17 "lani|ZP_08549469.1" "gxyl|YP_004868985.1"
## V18 "lbre|ZP_03940694.1" "lani|ZP_08548797.1"
## V19 "lbuc|YP_004398648.1" "lbuc|YP_004398815.1"
## V20 "lcas|YP_006751102.1" "lcas|YP_006752056.1"
## V21 "lfal|ZP_08312985.1" "lfal|ZP_08312677.1"
## V22 "lfer|ZP_03944435.1" "lfer|ZP_03944015.1"
## V23 "lfru|ZP_08652182.1" "lfru|ZP_08652638.1"
## V24 "llac|1358.26.peg.2055" "llac|1358.26.peg.706"
## V25 "lmai|ZP_09447100.1" "lmai|ZP_09447014.1"
## V26 "lmal|ZP_09441330.1" "lmal|ZP_09441246.1"
## V27 "lpla|YP_004888740.1" "lpla|YP_004888561.1"
## V28 "lrha|YP_003170666.1" "lrha|YP_003171609.1"
## V29 "lver|ZP_09442552.1" "lver|ZP_09443795.1"
## V30 "pbur|ZP_16361594.1" "pbur|ZP_16363149.1"
## V31 "pput|YP_001266156.1" "pput|YP_001270272.1"
## V32 "smar|AMF-0004114" "smar|AMF-0001686"
## V33 "spar|YP_006309044.1" "spar|YP_006310046.1"
## V34 "swit|YP_001264509.1" "swit|YP_001264948.1"
## V35 "efao|YP_005709159.1" "efao|YP_005707743.1"
## V36 "efav|NP_816368.1" "efav|NP_814698.1"
## V37 "lfer|ZP_03944497.1" "llac|1358.26.peg.28"
## V38 "lver|ZP_09443619.1" ""
The AnalyzeOrthoMCL
function performs the statistical
tests to compare the phenotypes of taxa bearing or lacking each OG. It
requires the R object output of the FormatAfterOrtho
function and a phenotype matrix containing the variables for the
statistical tests. Seven different tests are supported, each deriving
the significance of OG presence/absence on a response variable, and
specified by the following codes:
To run AnalyzeOrthoMCL
, the following parameters are
required:
FormatAfterOrtho
In order to create a data frame for your phenotypic data use the
following command where file_path
is the path to your
phenotypic data file: (the file_path
used in this case
study is not available because pheno_data
exists as an .rda
file)*
pheno_data <- read.table(file_path, header = TRUE, sep = ',')
A subset of the phenotypic file for TAG content
pheno_data
is shown below:
## Treatment RespVar Vial Experiment
## 1 apoc 2.821429 A A
## 2 apoc 5.500000 B A
## 3 apoc 8.464286 C A
## 4 jacg 5.392857 A A
## 5 jacg 2.821429 B A
To run AnalyzeOrthoMCL
use the headers within
pheno_data
to specify variables:
It is not necessary to specify the OG presence/absence variable,
which is automatically populated from the FormatAfterOrtho
output. All other variables must be specified using column names in the
phenotype matrix.
We will thus populate AnalyzeOrthoMCL
using the headers
within pheno_data
to specify variables:
For example, to test the effect of gene presence/absence on fat content using a mixed model with nested random effects, the following command should be used:
mcl_matrix <- AnalyzeOrthoMCL(parsed_data,
pheno_data,
model = 'lmeR2nest',
species_name = 'Treatment',
resp = 'RespVar',
rndm1 = 'Experiment',
rndm2 = 'Vial')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
An optional but highly recommended parameter of
AnalyzeOrthoMCL
is the princ_coord
parameter,
which allows the user to incorporate population structure. The options
for this parameter are 1, 2, 3, or a decimal. Numbers 1-3 specify how
many principal coordinates to include as fixed effects in the model and
a decimal specifies to use as many principal coordinates as needed for
that decimal percentage of the variance to be accounted for. When a user
specifies one of these options, the software automatically calculates a
principal coordinates matrix from the FormatAfterOrtho
presence absence matrix, and incorporated the specified number of
principal coordinates into the model See below in the QQ Plot section to
see an example of plotted analyses using principal coordinates in the
model, and how incorporating population structure into the models can
improve the statistical distribution of the results.
The resulting output matrix contains 7 columns:
mcl_mtrx[,1:3]
Showing the first three columns of
mcl_matrix
.
## OG pval1 corrected_pval1
## [1,] "ex_r00000" "0.0680152958575646" "0.748168254433211"
## [2,] "ex_r00001" "0.0680152958575646" "0.748168254433211"
## [3,] "ex_r00002" "0.000169433281139719" "0.00186376609253691"
## [4,] "ex_r00003" "0.217474910965987" "1"
## [5,] "ex_r00004" "0.448263443170021" "1"
## [6,] "ex_r00005" "0.0899794360125483" "0.989773796138032"
## [7,] "ex_r00007" "0.999790587202527" "1"
## [8,] "ex_r00008" "0.0066972584318008" "0.0736698427498088"
## [9,] "ex_r00009" "0.186956618369625" "1"
## [10,] "ex_r00010" "0.00824936635032048" "0.0907430298535252"
AnalyzeOrthoMCL
also provides for analysis using a
survival model. A survival analysis is a method for calculating the
difference between two treatments on time-series data, which may not
always be normally distributed and is most likely to be relevant to
host, but perhaps not bacterial, phenotypes. Users should provide a data
frame of their phenotypic data, similar to the data frame described
above for the other models. For more information see the
survival
package, which includes several useful guides and
vignettes (https://CRAN.R-project.org/package=survival).
Briefly, for each individual in a population that reaches a benchmark event, such as death, the time of death is recorded in the ‘time’ column, and a ‘1’ is recorded in the ‘event’ column, indicating the individual died at time X. If an individual left an experiment prematurely (e.g. moved away from a location where a study was conducted, a fly escaped from a vial before death), the event column is labelled ‘0’ to indicate it survived until that point in the experiment. Other metadata, including the sample name, are included on the same row as these 2 data points under other columns in a matrix.
## EXP VIAL BACLO TRT t1 t2 event
## 1 1 01A-6 1 lbrc 0 26.33333 1
## 2 1 01A-7 1 lbrc 0 19.00000 1
## 3 1 01A-8 1 lbrc 0 26.73077 1
Because each individual is specified individually, survival analyses
are often conducted on large datasets with thousands of measurements. To
expedite the use of survival analyses in iterative BGWA testing, our
survival analysis can be multi-threaded to run on multiple cores, using
the multi
option. The survmulticensor
option
is included to break up the tests for further parallelization. This
function allows the user to optionally input a startnum
and
stopnum
signifying which and how many tests to run. The
output of those certain tests can then be written into the
output_dir
where the SurvAppendMatrix
function
can be used to concatenate all small tests together.
JoinRepSeq
randomly selects a representative protein
annotation and amino acid sequence from each OG and appends it to the
AnalyzeOrthoMCL
output matrix. The purpose is to identify
OG function. The result is a data frame bearing 4 additional
variables:
Four inputs are specified for JoinRepSeq
:
FormatAfterOrtho
AnalyzeOrthoMCL
dir <- system.file('extdata', 'fasta_dir', package='MAGNAMWAR')
dir <- paste(dir,'/',sep='')
joined_matrix <- JoinRepSeq(after_ortho_format_grps, dir, mcl_mtrx_grps, fastaformat = 'old')
##
## picking and writing representative sequence for PDG:
## 9.09%__18.18%__27.27%__36.36%__45.45%__54.55%__63.64%__72.73%__81.82%__90.91%__
joined_matrix[1,8:10]
An example of three of the
appended columns are shown below:
## rep_taxon rep_id rep_annot
## 1 efao YP_005708185.1 thioredoxin-disulfide reductase
MAGNAMWAR statistical analysis can be exported into either tab-separated matrices or into graphical elements.
Allows the user to output all protein sequences in a specified OG in fasta format.
When survival models are used in the MGWA a single .csv file is
printed for each test. SurvAppendMatrix
joins each
individual file into one complete matrix.
Writes the matrix result from AnalyzeOrthoMCL
into a
tab-separated file.
MAGNAMWAR also offers several options for graphical outputs. The several ways to visualize data are explained in detail below.
PDGPlot
visualizes the phenotypic effect of bearing a
OG. The phenotype matrix is presented as a bar chart, and gene
presence/absence is represented by different bar shading.
Calling PDGPlot
requires the same
pheno_data
variable as AnalyzeOrthoMCL
and
similarly takes advantage of the user specified column names from the
pheno_data
data frame. It also requires the
mcl_matrix
object and specification of which OG to
highlight.
For example, to visualize the means and standard deviations of the taxa which are present in OG “ex_r00002”, the OG with the lowest corrected p-value (“0.00186”). Green and gray bars represent taxa that do or do not contain a gene in the OG, respectively.
PDGPlot(pheno_data, mcl_matrix, OG = 'ex_r00002', species_colname = 'Treatment', data_colname = 'RespVar', ylab = "TAG Content")
The different taxa can be ordered alphabetically, as above, or by
specifying the order with either the tree
parameter (for
phylogenetic sorting) or the order
parameter, which takes a
vector to determine order. For example, the taxa can be ordered by
phylogeny by calling a phylogenetic tree file with the taxon names as
leaves (note any taxa in pheno_data
that are not in the
tree file will be omitted).
tree <- system.file('extdata', 'muscle_tree2.dnd', package='MAGNAMWAR')
PDGPlot(pheno_data, mcl_matrix, 'ex_r00002', 'Treatment', 'RespVar', ylab = "TAG Content", tree = tree)
PhyDataError
is similar to PDGPlot and adds
visualization of phylogenetic tree with the phenotypic means and
standard deviation.
Calling PhyDataError
requires the following
parameters:
pheno_data
variable as
AnalyzeOrthoMCL
(and its column names)mcl_matrix
objectPhyDataError(tree, pheno_data, mcl_matrix, species_colname = 'Treatment', data_colname = 'RespVar', OG='ex_r00002', xlabel='TAG Content')
PDGvOG
produces a histogram of the number of OGs in each
phylogenetic distribution group (PDG). A PDG is the set of OGs present
and absent in the exact same set of bacterial taxa. The main purpose of
the graph is to determine the fraction of OGs that are present in unique
or shared sets of taxa, since the phenotypic effect of any OGs in the
same PDG cannot be discriminated from each other.
To run PDGvOG
, provide the output of
FormatAfterOrtho
. The num
parameter (default
40) is used to specify the amount of OGs per PDG that should be included
on the x axis.
For example, in the graph below there are 11 PDGs only contain one OG, meaning that 11 PDGs have one group that has a unique distribution of taxa present and absent.
Because this data is an extreme subset, it isn’t very informative. A full data set is shown below. This shows us that in this particular OrthoMCL clustering data, 4822 PDGs exist with only 1 unique OG in the PDG, and around 500 PDGs exist with 2 OGs that share the same presence and absence of taxa and so on.
A simple quartile-quartile plot function that is generated using the
matrix of AnalyzeOrthoMCL
. To run QQPlotter
,
provide the output matrix of AnalyzeOrthoMcl
.
Using this function, we provide a visualization of the use of the
princ_coord
parameter in AnalyzeOrthoMCL
.
Notice how each additional principal coordinate reduces statistical
inflation. In practice it is usually computationally inexpensive to run
several iterations of the analysis with a different number of principal
coordinates to test how their incorporation improves the statistical
fit.
ManhatGrp
produces a manhattan plot for visual analysis
of the output of AnalyzeOrthoMCL
. In a traditional genome
wide association study, a Manhattan plot is a visualization tool to
identify significant p-values and potential linkage disequilibrium
blocks. A traditional Manhattan plot sorts p-value by the SNPs along the
23 chromosomes. In our Manhattan plot, we sort by taxa instead of by
chromosome number as shown below. The lines that emerge in our plot show
the clustered proteins across taxa (which because they are in the same
OG would have the same p-value).
To run ManhatGrp
, provide the output of
FormatAfterOrtho
and the output of
AnalyzeOrthoMCL
as shown below.
## ordering by protein id within taxa...
## creating manhattan plot...
## finished.
Again because this data is an extreme subset, it isn’t very informative. Another full data set example is shown below.