Chado Data Storage
Chado Canonical Gene
gene
 |
 |- part_of mRNA
     |
     |---- part_of exon
     |
     |---- derives_from polypeptide
Chado Pseudogene
pseudogene
 |
 |- part_of pseudogenic_transcript
     |
     |---- part_of pseudogenic_exon
     |
     |---- derives_from polypeptide
Gene Model
Qualifier Storage
Note
- stored as a FeatureProp with CvTerm = comment
codon_start
This is loaded as phase in the FeatureLoc table.
phase = 0 => codon_start = 1;
phase = 1 => codon_start = 2;
phase = 2 => codon_start = 3
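The mapping above is an offset of one in each direction. A minimal Python sketch of the conversion follows; the function names are hypothetical and are not taken from the GeneDB loader.

```python
def codon_start_to_phase(codon_start: int) -> int:
    """Convert a /codon_start value (1, 2 or 3) to a phase (0, 1 or 2)."""
    if codon_start not in (1, 2, 3):
        raise ValueError(f"codon_start must be 1, 2 or 3, got {codon_start!r}")
    return codon_start - 1


def phase_to_codon_start(phase: int) -> int:
    """Convert a stored phase (0, 1 or 2) back to a /codon_start value."""
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1 or 2, got {phase!r}")
    return phase + 1
```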
Similarity
e.g.: /similarity="fasta; SWALL:O85168 (EMBL:AF047828);Pseudomonas syringae; syringomycin synthetase; syrE;length 9376 aa; id=31.93%; ungapped id=35.04%;E()=1.5e-105; ; 6198 aa overlap; query 36-6020 aa; subject 2593-8452 aa"
analysis (fasta)
 |
 analysisfeature (raw score (null), evalue, id)
 |
 Matchfeature
 |
 |-- featureprop: ungapped id (35.04)
 |-- featureprop: overlap (6198)
 |
 |-- featureloc (query 36-6020), srcfeature_id=queryFeatureId, rank=1
 |    |
 |    feature (polypeptide if protein match, transcript if nucleotide match)
 |
 |-- featureloc (subject 2593-8452), srcfeature_id=product FeatureId, rank=0
      |
      feature (dbxref=O85168, seqlen=9376)
      |
      |-- featuredbxref (AF04782)
      |-- featureprop: product (syringomycin s)
      |-- featureprop: organism (Pseudomonas syringae)
      |-- featureprop: gene (syrE)
N.B. For now the match feature is entered with CvTerm = 'region'. The Cv 'genedb_misc' is used for CvTerms such as 'ungapped id' found in /similarity.
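To show which pieces of a /similarity value end up in which part of the structure, here is a minimal Python sketch that splits the example string into named fields. The function name, the dictionary keys and the positional layout (program, database reference, organism, product, gene, then the numeric fields) are assumptions inferred from the single example above, not a description of the GeneDB loader.

```python
import re

# Patterns for the numeric fields; the keys are illustrative names only.
NUMERIC_PATTERNS = {
    "length": r"length\s*(\d+) aa",
    "id": r"^id=([\d.]+)%",
    "ungapped_id": r"ungapped id=([\d.]+)%",
    "evalue": r"E\(\)=(\S+)",
    "overlap": r"(\d+) aa overlap",
    "query": r"query (\d+-\d+) aa",
    "subject": r"subject (\d+-\d+) aa",
}


def parse_similarity(value: str) -> dict:
    """Split a /similarity qualifier value into named fields (sketch only)."""
    fields = [field.strip() for field in value.split(";")]
    if len(fields) < 5:
        raise ValueError("expected at least five ';'-separated fields")
    result = {
        "program": fields[0],   # -> analysis
        "organism": fields[2],  # -> featureprop on the subject feature
        "product": fields[3],   # -> featureprop on the subject feature
        "gene": fields[4],      # -> featureprop on the subject feature
    }
    # "SWALL:O85168 (EMBL:AF047828)" -> primary and secondary references
    ref = re.match(r"(\S+)\s*(?:\((\S+)\))?", fields[1])
    if ref:
        result["primary_dbxref"] = ref.group(1)
        result["secondary_dbxref"] = ref.group(2)
    for field in fields[5:]:
        for key, pattern in NUMERIC_PATTERNS.items():
            match = re.search(pattern, field)
            if match:
                result[key] = match.group(1)
    return result
```

All values are returned as strings; converting them to numbers or coordinates is left to the caller.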
Controlled Vocabulary Qualifiers
These qualifiers are all stored as FeatureCvTerms.
GO
/GO="aspect=;GOid=;term=;qualifier=;evidence=;db_xref=;with=;date="
GO annotation can be attached at different levels of the hierarchy. The GeneDB loader attaches it to the polypeptide by default, as that seems to be the most typical case.
Each GO entry has a CvTerm and a DbXRef associated with it. The GO term should be looked up by its DbXRef, i.e. GO:123456, to get the corresponding CvTerm. A FeatureCvTerm links this CvTerm to the Feature. The FeatureCvTerm may already exist, so it needs to be looked up before a new one is created.
The qualifier NOT is treated specially, as a field in FeatureCvTerm, because it reverses the meaning of the assignment rather than adding more detail as most qualifiers do. The FeatureCvTerm may have a number of associated FeatureCvTermProps. These provide general storage for the GO evidence code, extra qualifiers and the date of the assignment. (A hack for the evidence code would be possible, using a CvTerm to represent key and evidence code, but it wouldn't work for the date.)
One or more FeatureCvTermDbXRefs can be associated with the FeatureCvTerm; these correspond to the WITH/FROM column in GO. The dbxref value in this case corresponds to publications, so the primary Pub is linked through FeatureCvTerm.pub_id. One or more FeatureCvTermPubs can also be associated with the FeatureCvTerm; these correspond to any IDs after the pipe symbol in the publication column.
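The routing described above can be sketched in Python as follows. This is an illustration only: the function names and result keys are hypothetical, and it assumes that the db_xref key of the qualifier carries the publication IDs and that multiple values in qualifier, db_xref and with are separated by a pipe symbol. The text only confirms the pipe for the publication column.

```python
def parse_go_qualifier(value: str) -> dict:
    """Split the 'key=value;key=value' pairs of a /GO qualifier into a dict."""
    pairs = {}
    for part in value.split(";"):
        key, sep, val = part.partition("=")
        if sep:
            pairs[key.strip()] = val.strip()
    return pairs


def split_ids(value: str) -> list:
    """Split a pipe-separated value, dropping empty items (assumed separator)."""
    return [item.strip() for item in value.split("|") if item.strip()]


def plan_go_storage(pairs: dict) -> dict:
    """Arrange parsed /GO pairs by where the text says they are stored."""
    qualifiers = split_ids(pairs.get("qualifier", ""))
    pubs = split_ids(pairs.get("db_xref", ""))
    # Evidence code, date and qualifiers other than NOT become props.
    props = [("qualifier", q) for q in qualifiers if q != "NOT"]
    for key in ("evidence", "date"):
        if pairs.get(key):
            props.append((key, pairs[key]))
    return {
        "cvterm_dbxref": pairs.get("GOid"),        # used to look up the CvTerm
        "is_not": "NOT" in qualifiers,             # field on the FeatureCvTerm
        "pub": pubs[0] if pubs else None,          # FeatureCvTerm.pub_id
        "feature_cvterm_pubs": pubs[1:],           # IDs after the pipe
        "feature_cvterm_dbxrefs": split_ids(pairs.get("with", "")),
        "feature_cvterm_props": props,
    }
```

A real loader would additionally check whether a matching FeatureCvTerm already exists before creating one; that lookup is omitted here.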
controlled_curation
/controlled_curation="term=;cv=; qualifier=;evidence=;db_xref=;residue=; attribution=;date="
Storage is similar to GO. The db_xref is stored either as a FeatureCvTermDbXref or a FeatureCvTermPub:
- If the value is a PubMed ID such as PMID:12345, it is stored in the pub table. A dummy dbxref is created with accession = 12345, and a pub_dbxref is created to link the pub to the dbxref.
- If the value refers to another database, such as UNIPROT:23456, it is stored in the dbxref table with accession = 23456, and an entry is also created in the feature_cvterm_dbxref table to link the FeatureCvTerm to the dbxref.
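The two cases above amount to a branch on the database prefix. A minimal Python sketch, assuming the value always has the form DB:accession; the function name and the shape of the returned rows are hypothetical, and the feature_cvterm_pub row in the PMID case is inferred from the statement that the db_xref is stored as a FeatureCvTermPub.

```python
def plan_db_xref_rows(value: str) -> list:
    """List the (table, payload) rows implied by one db_xref value."""
    db_name, sep, accession = value.partition(":")
    if not sep or not accession:
        raise ValueError(f"expected DB:accession, got {value!r}")
    dbxref = {"db": db_name, "accession": accession}
    if db_name == "PMID":
        # Publication: pub row, dummy dbxref, and the link between the two.
        return [
            ("pub", value),
            ("dbxref", dbxref),
            ("pub_dbxref", (value, dbxref)),
            ("feature_cvterm_pub", value),
        ]
    # Any other database: a dbxref linked directly to the FeatureCvTerm.
    return [
        ("dbxref", dbxref),
        ("feature_cvterm_dbxref", dbxref),
    ]
```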
product
This is stored as a FeatureCvTerm with a CvTerm from the 'genedb_products' Cv.
class (Riley classification)
e.g. /class=6.2.2
These are linked to the Feature as below:
Feature
   |
FeatureCvTerm--name='anti sigma factor'
   |
CvTerm
   |
   -----------
   |         |
RILEY--Cv  DbXRef--accession=6.2.2
             |
           Db-name=RILEY
Dbxref
- stored as a FeatureDbXRef
EC_number
- stored as a FeatureProp
literature
- stored as FeaturePub
Search and Results files
The following are stored as FeatureProps:
/blast_file
/blastn_file
/blastp+go_file
/blastp_file
/blastx_file
/fasta_file
/fastx_file
/tblastn_file
/tblastx_file
/clustalx_file
/sigcleave_file
/pepstats_file
Synonyms
The following qualifiers are loaded in the Synonym table:
/reserved_name
/synonym
/primary_name
/protein_name
/systematic_id
/temporary_systematic_id
These are linked to the Feature via FeatureSynonyms. The synonym types belong to the 'genedb_synonym_type' Cv. FeatureSynonym.is_current is used to mark previous/obsolete synonyms.
colour
Presumably a FeatureProp (at least for now). Two additional qualifiers are being proposed: status (containing information about functional annotation and whether the annotation is manual or automatic) and evidence. These are likely to be FeatureProps.
ortholog/paralog/cluster
Orthologue/paralogue clusters are stored in a similar way to /similarity.
As input we have:
a) manually curated orthologues which simply list other genes'
systematic ids and the relationship type
b) auto-generated clusters of genes which also have associated data like clustering method, cut-off/score etc.
However, the features are linked to each other by feature_relationships (rather than by featurelocs, which are used with /similarity). The feature_relationships are given the type_id 'orthologous_to' or 'paralogous_to'. For manually curated ortholog/paralog data the analysisfeature and analysis are not required and are not added.
analysis
   |
   + analysisfeature
   |
feature (type_id == protein_match)
   |
   +------------+------------+
   |            |            |
   |            |            |
feature1     feature2     feature3
 gene         gene         gene
The bottom links are feature_relationships of SO type orthologous_to.
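The two kinds of input can be turned into feature_relationship rows along these lines. This Python sketch is illustrative: the function names are hypothetical, rows are represented as plain (subject, type, object) tuples, and the direction of each relationship (gene as subject, the other gene or the cluster feature as object) is an assumption, since the text does not state it.

```python
RELATIONSHIP_TYPES = {"orthologous_to", "paralogous_to"}


def check_type(rel_type: str) -> None:
    """Reject relationship types other than the two named in the text."""
    if rel_type not in RELATIONSHIP_TYPES:
        raise ValueError(f"unsupported relationship type: {rel_type!r}")


def manual_relationships(gene_id: str, curated: list) -> list:
    """Rows for manually curated data: (systematic_id, type) pairs per gene.

    No analysis or analysisfeature is involved in this case.
    """
    rows = []
    for other_id, rel_type in curated:
        check_type(rel_type)
        rows.append((gene_id, rel_type, other_id))
    return rows


def cluster_relationships(cluster_feature_id: str, gene_ids: list,
                          rel_type: str = "orthologous_to") -> list:
    """Rows linking each gene to the protein_match feature of its cluster."""
    check_type(rel_type)
    return [(gene_id, rel_type, cluster_feature_id) for gene_id in gene_ids]
```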