Over the years, we have invited faculty, postdocs, graduate students, and others at the University of Maryland and elsewhere to use The Lattice Project for their research projects.
Working with various researchers, helping them organize and submit their jobs, and listening to their feedback has helped us continually improve the system, and has shown us where more work is needed. Taken together, the projects we have supported are extremely diverse. The Lattice Project is cited in a number of publications that accompany these studies.
Here we provide general information about a few projects that have made use of BOINC through The Lattice Project. For more information, visit our research page on the Lattice web site.
Phylogenetic Analysis - GARLI
The Cummings Laboratory and others are using GARLI to infer phylogenetic trees from nucleotide or amino acid data. Various nucleotide, codon and amino acid models are implemented in GARLI for maximum likelihood (ML) estimates. Multiple searches for the ML tree and calculation of bootstrap support values are parallelized by The Lattice Project at the level of individual heuristic searches; i.e., every computing node carries out at least one complete heuristic search. This parallelization is particularly useful for large quantities of relatively short calculations, as is typical for nucleotide model bootstrap analyses with large numbers of replicates.
The Leptree project investigates evolutionary relationships within the insect order Lepidoptera (moths and butterflies), in particular among higher taxa such as families, superfamilies and infraorders. This molecular "backbone phylogeny" is based on the analysis of up to 26 protein-coding nuclear genes (~19 kb) for several hundred taxa. The chief method of analysis used in this study is a nucleotide model ML search in GARLI. The most commonly applied model is the generalized time-reversible model with a gamma distribution of rates and a proportion of invariant sites (GTR+G+I). The Leptree project relies heavily on the computational resources provided by The Lattice Project, because the sheer number of heuristic searches makes the analyses infeasible to run on an individual desktop machine. The bulk of these heuristic searches consists of bootstrap replicates (up to 2,000 per analysis), but in addition, because the search is heuristic, multiple searches (up to 500) are required for confidence in having found the ML tree. For the Leptree project, many analyses of these types are carried out, e.g., for individual and combined genes, synonymous and non-synonymous data partitions, and with and without topological constraints for subsequent hypothesis testing.
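The search-level parallelization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WorkUnit` class, the `make_work_units` function, and the number of searches packed into each unit are hypothetical, and do not describe GARLI, BOINC, or the scheduler actually used by The Lattice Project. The one property taken from the text is that every unit contains at least one complete heuristic search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    """One independent job sent to a computing node (hypothetical structure)."""
    unit_id: int
    kind: str        # e.g. "bootstrap" or "best_tree" (labels are assumptions)
    n_searches: int  # number of complete heuristic searches in this unit


def make_work_units(total_searches, searches_per_unit, kind):
    """Split a batch of independent searches into work units.

    Every unit holds at least one complete search; the last unit
    receives whatever remainder is left over.
    """
    if total_searches < 0 or searches_per_unit < 1:
        raise ValueError("need total_searches >= 0 and searches_per_unit >= 1")
    units = []
    remaining = total_searches
    while remaining > 0:
        n = min(searches_per_unit, remaining)
        units.append(WorkUnit(unit_id=len(units), kind=kind, n_searches=n))
        remaining -= n
    return units


# Example: 2,000 bootstrap replicates, three searches per unit (an assumed value).
bootstrap_units = make_work_units(2000, 3, "bootstrap")
```

Because the searches are independent, the units can finish in any order and the resulting trees can be collected afterwards. Packing several short searches into one unit is one plausible way to reduce per-job overhead when individual calculations are brief.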
Protein Sequence Comparison - HMMPfam
hmmpfam is part of the HMMER package. The HMMER package uses profile
hidden Markov models (HMMs) to characterize regions of similar
amino-acid sequence in protein families, groups of proteins with similar
function found in related organisms. The hmmpfam program searches the
sequences of proteins with unknown function against a carefully
curated set of HMMs, called Pfam, from well-understood protein
families. Protein sequences are assigned to one or more protein families
on the basis of a statistically significant match to a Pfam HMM.
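The assignment step described above, in which a sequence is placed in every family whose HMM it matches significantly, can be sketched as a simple filter. This is a minimal illustration, not hmmpfam's own code or output format: the `assign_families` function, the tuple layout of the hits, the placeholder identifiers, and the E-value cutoff of 1e-3 are all assumptions chosen for the example.

```python
from collections import defaultdict


def assign_families(hits, max_evalue=1e-3):
    """Group significant hits by sequence.

    hits: iterable of (sequence_id, family_id, evalue) tuples, a
          hypothetical layout standing in for parsed search results.
    max_evalue: assumed significance cutoff; smaller E-values are
          stronger matches.
    Returns a dict mapping each sequence to the list of families it
    matched at or below the cutoff. Sequences with no significant hit
    are absent from the result.
    """
    assignments = defaultdict(list)
    for sequence_id, family_id, evalue in hits:
        if evalue <= max_evalue:
            assignments[sequence_id].append(family_id)
    return dict(assignments)


# Placeholder identifiers and scores, invented purely for illustration.
example_hits = [
    ("seq1", "FAMILY_A", 1e-30),
    ("seq1", "FAMILY_B", 2e-5),
    ("seq2", "FAMILY_A", 0.5),    # not significant under the assumed cutoff
    ("seq3", "FAMILY_C", 1e-3),
]
families = assign_families(example_hits)
```

A sequence may appear under more than one family, which matches the statement that sequences are assigned to one or more families.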
HMMPfam and RMIDb:
The Edwards lab provides the Rapid Microorganism Identification Database
(RMIDb), a freely available web resource and database
for the identification of bacteria and viruses using mass spectrometry.
The RMIDb searches protein sequences from all of the major protein
sequence repositories, plus computational protein sequence predictions
from sequenced bacterial genomes, for mass matches with experimental
masses from mass spectra. Protein sequences are carefully categorized
according to strain, species, and other taxonomic groupings, and
according to protein function, cellular location, and biological process
using the Pfam assignments computed by hmmpfam and their associated gene
ontology (GO) classifications. The functional classification of protein
sequences must be recomputed using hmmpfam because each of the sources
of protein sequence uses different, sometimes conflicting, criteria for
Pfam assignment, or provides no assignment at all. Functional
classification of protein sequences makes it possible to analyze only
the proteins most likely to be observed for mass matches, which
decreases search time and increases the statistical significance of
HMMPfam for RMIDb on BOINC:
The Edwards laboratory is using the HMMPfam service to compute Pfam
assignments for all bacterial, plasmid, and virus protein sequences from
Swiss-Prot, TrEMBL, GenBank, RefSeq, and TIGR's CMR, plus an inclusive
set of all plausible Glimmer predictions from RefSeq bacterial genomes.
These protein sequences, and their Pfam assignments, are used in RMIDb.
The HMMPfam service is also being used as a model for 'data-heavy'
bioinformatics applications on the Lattice grid infrastructure, a
collaboration between the Cummings and Edwards laboratories.
Conservation Reserve Network Design - MARXAN
MARXAN is a decision support system for the design of conservation reserve networks. It is useful for selecting, from a large number of potential sites, a reserve system that satisfies a number of ecological, social and economic criteria. For example, certain species or conservation features must be well protected within the reserve system, or the reserve system must not include more than a specified number of sites. The user translates their criteria into representation targets for the conservation features to be protected (e.g., number of populations of each species or percentage of each habitat type to be included in the reserve system), and optionally a cost threshold or desired level of site compactness. MARXAN will produce reserve network solutions that meet these design constraints while simultaneously minimizing the cost of the design (i.e. number of sites required to meet all representation targets).
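To make the idea of representation targets and design cost concrete, the following Python sketch evaluates one candidate reserve system against a set of targets. It is an illustration under assumptions only: it does not reproduce MARXAN's search procedure, objective function, or input files, and the `evaluate_reserve` function, the dictionary layouts, and the toy site and feature names are hypothetical.

```python
def evaluate_reserve(selected_sites, site_features, site_costs, targets):
    """Score one candidate reserve system.

    selected_sites: set of site identifiers in the candidate reserve.
    site_features:  {site: {feature: amount}} (assumed layout).
    site_costs:     {site: cost}; a cost of 1 per site makes the total
                    cost equal to the number of sites.
    targets:        {feature: required amount}.
    Returns (total_cost, shortfalls), where shortfalls maps each unmet
    feature to the amount still missing. An empty dict means every
    representation target is met.
    """
    totals = {feature: 0.0 for feature in targets}
    for site in selected_sites:
        for feature, amount in site_features.get(site, {}).items():
            if feature in totals:
                totals[feature] += amount
    shortfalls = {
        feature: targets[feature] - totals[feature]
        for feature in targets
        if totals[feature] < targets[feature]
    }
    total_cost = sum(site_costs[site] for site in selected_sites)
    return total_cost, shortfalls


# Toy data, invented for illustration.
site_features = {
    "A": {"species_1": 2},
    "B": {"species_1": 1, "habitat_x": 3},
    "C": {"habitat_x": 2},
}
site_costs = {"A": 1, "B": 1, "C": 1}
targets = {"species_1": 3, "habitat_x": 4}

cost_ab, missing_ab = evaluate_reserve({"A", "B"}, site_features, site_costs, targets)
cost_abc, missing_abc = evaluate_reserve({"A", "B", "C"}, site_features, site_costs, targets)
```

In this toy case the two-site system is cheaper but leaves the habitat target unmet, while the three-site system meets both targets. A reserve design tool has to search among many such candidates for low-cost systems that leave no shortfall.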
Biased Data and the Selection of Conservation Reserve Networks:
Joanna Grand, Maile Neel, Michael Cummings (University of Maryland), Taylor Ricketts (World Wildlife Fund), and Tony Rebelo (South African National Biodiversity Institute) are collaborating on a project that uses MARXAN to quantify the impacts of basing the selection of conservation reserve networks on incomplete and biased species distribution data. Most species distribution data are biased in some way (e.g., higher sampling intensity closer to roads or within current reserves); however, they are commonly used to select sites for inclusion in reserve networks because they are considered to be the best data available. The ability of reserve networks to adequately protect biodiversity when sites are selected based on incomplete and biased data is poorly understood.
The first set of analyses compared the efficiency and effectiveness of MARXAN reserve network solutions generated from biased and complete species data. We used data from a virtually exhaustive survey of the Proteaceae family of flowering plants in the Cape Floristic Region of South Africa as our baseline for “complete” data. To produce a sufficient range of solutions for comparison with the complete data solution, we simulated 1000 biased and random incomplete datasets from the full Proteaceae dataset. We then ran MARXAN 1000 times for each dataset. This study design required 1.2002 x 10^7 separate MARXAN runs, which we were able to complete in only a few weeks by running them asynchronously in parallel on the Lattice grid system.
Currently, we are investigating how well reserve networks protect species when their design is based on detailed species distribution data, which are often incomplete and biased, versus coarser environmental data, which are easier to acquire and unaffected by the issue of sampling bias. We will compare MARXAN solutions generated with complete, biased, and random species data to those generated with environmental data (vegetation classes) and with combinations of both data types. This analysis will require over 7.6 x 10^7 separate MARXAN runs and will again rely on the Lattice grid system to make this enormous amount of processing feasible.