From which species are the sequences in BFGR taken?
The sequences in BFGR are derived from several different groups of species. BFGR uses the official gene sets from the five plant species that have publicly available genome sequences: Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Sorghum bicolor and Vitis vinifera. These species are all included in BFGR because their complete genomes and complete gene sets can be used for comparative analyses with biofuel species.
Filtered transcript assembly data has been used for eighteen species within BFGR: Brachypodium distachyon, Cenchrus ciliaris, Helianthus annuus, Hordeum vulgare, Leymus cinereus x Leymus triticoides, Medicago sativa, Medicago truncatula, Panicum virgatum, Picea glauca, Picea sitchensis, Pinus taeda, Populus tremula x Populus tremuloides, Secale cereale, Sorghum propinquum, Triticum aestivum, Triticum turgidum subsp. durum, Triticum monococcum and Zea mays.
Seven additional species are included in our species overview pages: Eleusine coracana, Miscanthus x giganteus, Oryza granulata, Pennisetum glaucum, Saccharum officinarum, Setaria italica and Sorghum halepense. Most of these species do not have PUT transcript assemblies. The PUT transcript assemblies from Saccharum officinarum will be included in the next major release of BFGR.
What are the sources of the sequences that are annotated within BFGR?
For the fully sequenced genomes, the official gene sets have been used within BFGR. For Oryza sativa, the MSU release 6 gene models and pseudomolecules have been used. The TAIR8 gene models and pseudomolecules have been used for Arabidopsis thaliana. The JGI gene models and pseudomolecules have been analyzed for both Populus trichocarpa (Poptr1_1) and Sorghum bicolor (Sorbi1_4). The Genoscope Grapevine Genome Project gene models and pseudomolecules have been used for Vitis vinifera. All the PUT (PlantGDB-assembled Unique Transcript) sequences were obtained from the PlantGDB database.
PUT transcript assemblies were filtered to remove uninformative sequences. Any PUT sequence that contained more than 10 N's or that was shorter than 250 nucleotides was excluded. This filtering is needed because the quality of the EST and mRNA sequences from which the PUTs are derived varies greatly. The presence of more than 10 N's in a PUT is taken as an indication that the PUT was assembled from at least one poor-quality sequence. Empirical testing has shown that PUTs shorter than 250 nucleotides are much less likely to contain a conserved coding sequence, so those sequences are also excluded from analysis.
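The two exclusion rules above can be expressed as a short filter. This is only a minimal sketch of the stated thresholds, not the code used to build BFGR; the function names `keep_put` and `filter_puts` and the (identifier, sequence) input layout are assumptions made for illustration.

```python
def keep_put(sequence, max_n=10, min_length=250):
    """Return True if a PUT sequence passes both filters described above."""
    seq = sequence.upper()
    # A PUT is excluded if it has more than 10 N's or is shorter than 250 nt.
    return seq.count("N") <= max_n and len(seq) >= min_length


def filter_puts(records):
    """Keep only the (identifier, sequence) pairs whose sequence passes keep_put."""
    return [(name, seq) for name, seq in records if keep_put(seq)]
```

A sequence with exactly 10 N's and exactly 250 nucleotides is retained under this reading, because the text excludes only sequences with more than 10 N's or fewer than 250 nucleotides.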
How are functional annotations assigned to sequences in BFGR?
For Arabidopsis thaliana and Oryza sativa genes, the functional annotations for the official gene sets that are available from TAIR and the MSU Rice Genome Annotation Project have been used.
Functional annotations for the official gene sets for Populus trichocarpa, Sorghum bicolor and Vitis vinifera are not readily available. For the genes from those species and for the transcript assembly PUTs, functional annotation assignments were made using a two-step process. All sequences were aligned against a database of combined UniRef50 sequences and all UniRef100 sequences from members of Embryophyta (higher plants). All sequences were also aligned to Pfam domains (pfam_23). Beginning with the best alignment, the top 15 UniRef alignments with e-values less than 1e-10 were examined. If the UniRef sequence with the best alignment had a functional annotation that could be parsed down to a sensible description, that parsed annotation was assigned as the functional annotation for the query sequence. If the functional annotation from the best alignment was not usable, then the functional annotation from the next significant UniRef alignment was checked. If there were no UniRef alignments or if none of the UniRef alignments had a usable functional annotation, then the best significant (e-value < 1e-10) Pfam domain alignment was used as the functional annotation for the query sequence.
With the genes from Populus trichocarpa, Sorghum bicolor and Vitis vinifera, if a gene had significant sequence similarity with a UniRef sequence but it was not possible to automatically extract a usable description from any UniRef sequence and the gene did not have sequence similarity with a Pfam sequence, the functional annotation was assigned as "Conserved gene of unknown function". For these species, if no UniRef or Pfam sequence similarity was observed, the functional annotation was assigned as "Gene of unknown function". For the PUT transcript assemblies, the corresponding terms were "Conserved expressed gene of unknown function" and "Expressed gene of unknown function".
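The decision cascade described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the BFGR implementation: the function name `assign_annotation`, the (description, e-value) hit layout, and the `parse_description` callable (standing in for whatever rules decide that a UniRef description is usable) are all hypothetical. Hits are assumed to be sorted with the best alignment first.

```python
E_VALUE_CUTOFF = 1e-10   # significance threshold stated in the text
MAX_UNIREF_HITS = 15     # only the top 15 significant UniRef alignments are examined


def assign_annotation(uniref_hits, pfam_hits, parse_description, is_put=False):
    """Pick a functional annotation for one query sequence.

    uniref_hits, pfam_hits: lists of (description, e_value), best hit first.
    parse_description: hypothetical callable that returns a cleaned-up
        description, or None if the description is not usable.
    is_put: True for PUT transcript assemblies, False for gene models.
    """
    significant_uniref = [h for h in uniref_hits if h[1] < E_VALUE_CUTOFF]
    significant_uniref = significant_uniref[:MAX_UNIREF_HITS]

    # Step 1: first usable UniRef description, walking down from the best hit.
    for description, _ in significant_uniref:
        parsed = parse_description(description)
        if parsed:
            return parsed

    # Step 2: fall back to the best significant Pfam domain alignment.
    significant_pfam = [h for h in pfam_hits if h[1] < E_VALUE_CUTOFF]
    if significant_pfam:
        return significant_pfam[0][0]

    # Step 3: no usable description anywhere; choose an "unknown function" label.
    expressed = "expressed " if is_put else ""
    if significant_uniref:
        return f"Conserved {expressed}gene of unknown function"
    label = f"{expressed}gene of unknown function"
    return label[0].upper() + label[1:]
```

Passing the parsing rule in as a callable keeps the sketch independent of how descriptions are actually cleaned, which the text does not specify.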
How accurate are the ESTScan protein predictions?
While ESTScan has been used in an attempt to determine the correct peptide sequence from each PUT sequence, we know that some of the predicted protein sequences are incorrect. About 5% of PUT sequences have predicted protein sequences that do not produce significant BLASTP alignments to model genome proteomes but do have significant BLASTX alignments of their nucleotide sequence to the model genome proteomes. This result indicates that for about 5% of the PUT sequences, the ESTScan-predicted protein sequence was incorrect. Although ESTScan is designed to account for frameshift errors, poor-quality sequences and short sequences, it is likely that a PUT containing too many of these errors degrades ESTScan's performance.
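The comparison behind the 5% estimate amounts to a set difference: PUTs whose nucleotide sequence has a significant BLASTX alignment but whose predicted protein has no significant BLASTP alignment. The sketch below assumes the significant hits have already been reduced to sets of PUT identifiers; the function names are hypothetical and nothing here reproduces the actual BFGR analysis.

```python
def suspect_predictions(blastp_hits, blastx_hits):
    """Return PUT identifiers with a BLASTX hit but no BLASTP hit.

    Both arguments are sets of PUT identifiers that produced a significant
    alignment to the model genome proteomes.
    """
    return set(blastx_hits) - set(blastp_hits)


def suspect_fraction(all_put_ids, blastp_hits, blastx_hits):
    """Fraction of all PUTs whose predicted protein is probably incorrect."""
    all_ids = set(all_put_ids)
    if not all_ids:
        return 0.0
    suspects = suspect_predictions(blastp_hits, blastx_hits) & all_ids
    return len(suspects) / len(all_ids)
```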
How are SSRs identified?
SSRs identified here are repeats of at least 10 mononucleotides, 6 dinucleotides, 5 trinucleotides, 5 tetranucleotides, 5 pentanucleotides, or 5 hexanucleotides. SSR analysis was performed on all PUT transcript assembly sequences from PlantGDB. We also used Primer3 to design primers that amplify these SSRs. The pipeline used to identify SSRs is as follows:
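The BFGR pipeline itself is not shown in this text. Purely as an illustration of the repeat thresholds listed above, the following sketch finds qualifying SSRs with regular expressions. It is an assumption-laden stand-in rather than the tool BFGR used: the function name `find_ssrs`, the restriction to A, C, G and T, and the rule that skips units which are themselves repeats of a shorter motif are all choices made here. Primer design is not covered.

```python
import re

# Minimum number of repeat units per unit length, as listed in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence):
    """Return a sorted list of (start, unit, repeat_count) for perfect SSRs."""
    seq = sequence.upper()
    found = []
    for unit_len, min_reps in MIN_REPEATS.items():
        # One captured unit followed by at least (min_reps - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_reps - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units such as "AAAA" or "ATAT" that are really repeats of
            # a shorter motif; those are reported under the shorter unit length.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            found.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(found)
```

Start positions are 0-based offsets into the input sequence.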

How often will BFGR be updated?
BFGR will be updated approximately twice a year, provided that new sequence data sets are available. Species that are not currently handled by BFGR but for which new sequences become available will be added to BFGR during major releases. If there is a species with sequences that you believe should be included in BFGR, please contact us.
How are SNPs identified?
Single nucleotide polymorphisms were predicted within PUT transcript assemblies. Multiple sequence alignments from PlantGDB PUT assemblies were examined for positions that had a minimum of 4 overlapping transcript sequences and at least 2 bases at that position that differed from the consensus base call.
Minimum depth: The minimum number of sequences in the multiple sequence alignment at the SNP position (minimum 4).
Minimum SNP prevalence percentage: The percentage of sequences that have the alternate base (default 25%).
*SNPs are not available for Triticum aestivum because the multiple sequence alignments for T. aestivum PUTs have not been made available by PlantGDB. For Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Sorghum bicolor and Vitis vinifera, official gene models for those species are used within BFGR, and transcript assemblies for those species were not analyzed.
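The SNP criteria above can be illustrated with a column-by-column scan of a multiple sequence alignment. This is a hedged sketch, not the BFGR code. It assumes that the consensus is the most frequent base in a column, that the alternate base is the most frequent non-consensus base, that both the count test (at least 2) and the prevalence test (default 25%) apply to that alternate base, and that aligned reads are equal-length strings in which any character other than A, C, G or T means no usable base at that position. The function name `call_snps` and its parameter names are hypothetical.

```python
from collections import Counter


def call_snps(aligned_reads, min_depth=4, min_alt_count=2, min_prevalence=25.0):
    """Return (position, consensus, alternate, depth, prevalence_percent) tuples."""
    snps = []
    length = len(aligned_reads[0]) if aligned_reads else 0
    for pos in range(length):
        # Count only unambiguous bases; gaps, padding and N's do not add depth.
        counts = Counter(
            read[pos].upper() for read in aligned_reads
            if read[pos].upper() in ("A", "C", "G", "T")
        )
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        ranked = counts.most_common()
        if len(ranked) < 2:
            continue  # every read agrees at this position
        (consensus, _), (alt, alt_count) = ranked[0], ranked[1]
        prevalence = 100.0 * alt_count / depth
        if alt_count >= min_alt_count and prevalence >= min_prevalence:
            snps.append((pos, consensus, alt, depth, prevalence))
    return snps
```

Raising `min_depth` or `min_prevalence` makes the call more conservative, which corresponds to the two adjustable parameters described above.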
How often are sequence summary and species overview pages updated?
The species overview and sequence summary pages will be updated every week.
Where can I download model genome and PUT sequences?
All the sequences that are used in BFGR can be downloaded from our ftp site.
How do I cite the Biofuel Feedstock Genomics Resource?
As soon as we have published a manuscript describing the resources that can be found in BFGR, we will post that information here. Until then, this website can simply be cited using its URL, http://bfgr.plantbiology.msu.edu/.
Funding provided by DOE grant DE-FG02-08ER64631 and USDA grant CSREES 2008-04232.