Chado Data Storage

Chado Canonical Gene

gene
|
|- part_of mRNA
     |
     |---- part_of exon
     |
     |---- derives_from polypeptide

Chado Pseudogene

pseudogene
|
|- part_of pseudogenic_transcript
     |
     |---- part_of pseudogenic_exon
     |
     |---- derives_from polypeptide

Gene Model

Image:Chado_gene_model.gif

Qualifier Storage

Note

- stored as a FeatureProp with CvTerm = comment

codon_start

This is loaded as phase in the FeatureLoc table.

phase = 0 => codon_start = 1;
phase = 1 => codon_start = 2;
phase = 2 => codon_start = 3
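
The mapping above is a simple offset of one. As a minimal sketch (the function names are illustrative and not part of any loader), the conversion in both directions could look like this:

```python
def codon_start_to_phase(codon_start: int) -> int:
    """Convert a /codon_start value (1, 2 or 3) to a FeatureLoc phase (0, 1 or 2)."""
    if codon_start not in (1, 2, 3):
        raise ValueError(f"codon_start must be 1, 2 or 3, got {codon_start!r}")
    return codon_start - 1


def phase_to_codon_start(phase: int) -> int:
    """Convert a FeatureLoc phase (0, 1 or 2) back to a /codon_start value."""
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1 or 2, got {phase!r}")
    return phase + 1
```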

Similarity

e.g.: /similarity="fasta; SWALL:O85168 (EMBL:AF047828);Pseudomonas syringae; syringomycin synthetase; syrE;length 9376 aa; id=31.93%; ungapped id=35.04%;E()=1.5e-105; ; 6198 aa overlap; query 36-6020 aa; subject 2593-8452 aa"

                             analysis ------ fasta
                                 |
                                 |
                                 |
                          analysisfeature ---- raw score (null), evalue, id
                                 |
                                 |
                                 |          |---featureprop---ungapped id (35.04)
                                 |          |
                            Matchfeature ---| 
                             /   |          |
                            /    |          |---featureprop---overlap (6198)
rank=0                     /     |
(srcfeature_id=product    /      |
FeatureId)               /       |
(subject 2593-8452)featureloc  featureloc (query 36-6020) srcfeature_id=queryFeatureId rank=1
                |                |
                |                |
  featuredbxref |                |
     (AF047828) |                |
            \   |                |
             \  |                |
              \ |                |
               \|                |
(dbxref=O85168) |                |
(seqlen=9376)feature          feature (polypeptide if protein match |   transcript if nucleotide match)
               /|\          
              / | \   
             /  |  \    
            /   |   \
           /    |    \
          /     |     \       
         / featureprop \       
 featureprop    |     featureprop
     |          |           |  
     |       product        |  
     | (syringomycin s)     |
     |                      |
     |                      | 
 organism                  gene(syrE) 
(Pseudomonas syringae)

N.B. For now the match feature is entered with CvTerm = 'region'. The Cv 'genedb_misc' is used for CvTerms such as 'ungapped id' found in /similarity.
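
Before the rows in the diagram can be created, the /similarity value has to be split into its parts. The following is a rough sketch only: the field layout is inferred from the single example above (first five fields positional, the rest recognised by pattern), and the function name and dictionary keys are made up for illustration.

```python
import re


def parse_similarity(value: str) -> dict:
    """Split a /similarity value into named parts (layout assumed from one example)."""
    fields = [f.strip() for f in value.split(";")]
    if len(fields) < 5:
        raise ValueError("expected at least five ';'-separated fields")

    result = {
        "algorithm": fields[0],   # analysis, e.g. fasta
        "organism": fields[2],    # featureprop on the subject feature
        "product": fields[3],     # featureprop on the subject feature
        "gene": fields[4],        # featureprop on the subject feature
    }

    # Primary accession, optionally followed by a secondary one in parentheses.
    m = re.fullmatch(r"(\S+)(?:\s+\((\S+)\))?", fields[1])
    if m:
        result["primary_dbxref"] = m.group(1)
        result["secondary_dbxref"] = m.group(2)

    patterns = {
        "length": (r"length\s+(\d+)\s+aa", int),
        "id": (r"id=([\d.]+)%", float),
        "ungapped_id": (r"ungapped id=([\d.]+)%", float),
        "evalue": (r"E\(\)=(\S+)", float),
        "overlap": (r"(\d+)\s+aa overlap", int),
    }
    for text in fields[5:]:
        for key, (pattern, convert) in patterns.items():
            m = re.fullmatch(pattern, text)
            if m:
                result[key] = convert(m.group(1))
        # Query and subject ranges become the two featureloc rows.
        m = re.fullmatch(r"(query|subject)\s+(\d+)-(\d+)\s+aa", text)
        if m:
            result[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return result
```

Empty fields (such as the one after the E() value in the example) are simply skipped because no pattern matches them.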

Controlled Vocabulary Qualifiers

These qualifiers are all FeatureCvTerm's.

GO


/GO="aspect=;GOid=;term=;qualifier=;evidence=;db_xref=;with=;date="

GO annotation can be attached at different levels of the hierarchy. The GeneDB loader attaches it to the polypeptide by default, as that seems to be the most typical case.

Each GO entry has a CvTerm and a DbXRef associated with it. The GO term should be looked up by its DbXRef, e.g. GO:123456, to get the corresponding CvTerm. A FeatureCvTerm links this CvTerm to the Feature. The FeatureCvTerm may already exist, so it needs to be looked up first.

The qualifier NOT is treated specially, as a field in FeatureCvTerm, because it reverses the meaning of the assignment rather than adding more detail as most qualifiers do.

The FeatureCvTerm may have a number of associated FeatureCvTermProp's. These provide general storage for the GO evidence code, extra qualifiers and the date of the assignment. (A hack for the evidence code would be possible, using a CvTerm to represent key and evidence code, but it wouldn't work for the date.)

One or more FeatureCvTermDbXRef's can be associated with the FeatureCvTerm; these correspond to the WITH/FROM column in GO. The db_xref value in this case corresponds to publications, so the primary Pub is linked via FeatureCvTerm.pub_id. One or more FeatureCvTermPub's can also be associated with the FeatureCvTerm; these correspond to any IDs after the pipe symbol in the publication column.
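
The storage rules for a /GO qualifier can be sketched as a function that turns the qualifier value into a plan of what goes where. This is an illustration under assumptions, not the GeneDB loader: the class and function names are hypothetical, and it assumes that multiple qualifiers, publications and WITH/FROM entries are each separated by a pipe symbol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GoStoragePlan:
    go_accession: str                  # DbXRef used to look up the GO CvTerm
    is_not: bool                       # the NOT field on FeatureCvTerm
    primary_pub: Optional[str]         # linked via FeatureCvTerm.pub_id
    extra_pubs: List[str] = field(default_factory=list)         # FeatureCvTermPub's
    with_from: List[str] = field(default_factory=list)          # FeatureCvTermDbXRef's
    props: List[Tuple[str, str]] = field(default_factory=list)  # FeatureCvTermProp's


def split_pipe(text: str) -> List[str]:
    """Split on '|' (assumed separator) and drop empty entries."""
    return [item.strip() for item in text.split("|") if item.strip()]


def plan_go_storage(value: str) -> GoStoragePlan:
    # Split the "key=value;key=value" pairs of the qualifier.
    pairs = {}
    for part in value.split(";"):
        key, _, val = part.partition("=")
        if key.strip():
            pairs[key.strip()] = val.strip()

    qualifiers = split_pipe(pairs.get("qualifier", ""))
    pubs = split_pipe(pairs.get("db_xref", ""))

    # Evidence code, non-NOT qualifiers and date all become props.
    props = []
    if pairs.get("evidence"):
        props.append(("evidence", pairs["evidence"]))
    for q in qualifiers:
        if q.upper() != "NOT":
            props.append(("qualifier", q))
    if pairs.get("date"):
        props.append(("date", pairs["date"]))

    return GoStoragePlan(
        go_accession=pairs["GOid"],
        is_not=any(q.upper() == "NOT" for q in qualifiers),
        primary_pub=pubs[0] if pubs else None,
        extra_pubs=pubs[1:],
        with_from=split_pipe(pairs.get("with", "")),
        props=props,
    )
```

A real loader would then look up the CvTerm by `go_accession`, find or create the FeatureCvTerm, and write the remaining rows from the plan.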

controlled_curation

/controlled_curation="term=;cv=; qualifier=;evidence=;db_xref=;residue=; attribution=;date="

Storage is similar to GO. The db_xref is stored either as a FeatureCvTermDbXRef or a FeatureCvTermPub:

  1. If the value is a PubMed ID such as PMID:12345, it is stored in the pub table. A dummy dbxref is created with accession = 12345, and a pub_dbxref is created to link the pub with the dbxref.
  2. If the value refers to another database, such as UNIPROT:23456, it is stored in the dbxref table with accession = 23456, and an entry is also created in the feature_cvterm_dbxref table to link the feature_cvterm and the dbxref.
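
The two cases can be expressed as a small routing function. This is a sketch only: the function name is hypothetical, and the returned (table, value) pairs merely list which tables would receive a row, not the actual column values.

```python
from typing import List, Tuple


def plan_db_xref_rows(value: str) -> List[Tuple[str, str]]:
    """Return (table, value) pairs describing where a db_xref such as PMID:12345 goes."""
    db_name, sep, accession = value.partition(":")
    if not sep or not db_name or not accession:
        raise ValueError(f"expected DB:ACCESSION, got {value!r}")

    if db_name == "PMID":
        # Case 1: a publication, with a dummy dbxref linked through pub_dbxref.
        return [
            ("pub", value),
            ("dbxref", accession),
            ("pub_dbxref", value),
            ("feature_cvterm_pub", value),
        ]

    # Case 2: any other database, linked directly to the feature_cvterm.
    return [
        ("dbxref", accession),
        ("feature_cvterm_dbxref", value),
    ]
```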

product

This is stored as a FeatureCvTerm with a CvTerm from the 'genedb_products' Cv.

class (Riley classification)

e.g. /class=6.2.2

These are linked to the Feature as below:

              Feature
                 |
           FeatureCvTerm--name='anti sigma factor' 
                 |
              CvTerm
                 |
             -----------
             |         |
      RILEY--Cv       DbXRef--accession=6.2.2
                       |
                       Db-name=RILEY

Dbxref

- stored as a FeatureDbXRef

EC_number

- stored as a FeatureProp

literature

- stored as FeaturePub

Search and Results files

The following are stored as FeatureProp's:

/blast_file

/blastn_file

/blastp+go_file

/blastp_file

/blastx_file

/fasta_file

/fastx_file

/tblastn_file

/tblastx_file

/clustalx_file

/sigcleave_file

/pepstats_file

Synonyms

The following qualifiers are loaded in the Synonym table:

/reserved_name

/synonym

/primary_name

/protein_name

/systematic_id

/temporary_systematic_id

These are linked to the Feature via FeatureSynonym's. The types of these Synonym's are CvTerms in the 'genedb_synonym_type' Cv. FeatureSynonym.is_current is used to flag previous/obsolete synonyms.
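
The simpler qualifier mappings described on this page can be collected into one lookup. The sketch below is only a summary aid: the function name and the lower-cased keys are assumptions for illustration, and the keys follow the headings used on this page rather than any loader's internal names.

```python
from typing import Optional

# Search and results files, all stored as FeatureProp's.
FILE_QUALIFIERS = frozenset({
    "blast_file", "blastn_file", "blastp+go_file", "blastp_file",
    "blastx_file", "fasta_file", "fastx_file", "tblastn_file",
    "tblastx_file", "clustalx_file", "sigcleave_file", "pepstats_file",
})

# Qualifiers loaded into the Synonym table and linked via FeatureSynonym's.
SYNONYM_QUALIFIERS = frozenset({
    "reserved_name", "synonym", "primary_name", "protein_name",
    "systematic_id", "temporary_systematic_id",
})

OTHER_QUALIFIERS = {
    "note": "FeatureProp",
    "codon_start": "FeatureLoc.phase",
    "go": "FeatureCvTerm",
    "controlled_curation": "FeatureCvTerm",
    "product": "FeatureCvTerm",
    "class": "FeatureCvTerm",
    "dbxref": "FeatureDbXRef",
    "ec_number": "FeatureProp",
    "literature": "FeaturePub",
}


def storage_for(qualifier: str) -> Optional[str]:
    """Return where a qualifier is stored, or None if it is not covered here."""
    name = qualifier.lstrip("/").lower()
    if name in FILE_QUALIFIERS:
        return "FeatureProp"
    if name in SYNONYM_QUALIFIERS:
        return "FeatureSynonym"
    return OTHER_QUALIFIERS.get(name)
```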

colour

Presumably a FeatureProp (at least for now). Two additional qualifiers are being proposed: status (containing information about functional annotation and whether the annotation is manual or automatic) and evidence. These are likely to be FeatureProp's.

ortholog/paralog/cluster

Orthologue/paralogue clusters are stored in a similar way to /similarity. As input we have:
a) manually curated orthologues, which simply list other genes' systematic ids and the relationship type
b) auto-generated clusters of genes, which also have associated data such as clustering method, cut-off/score etc.
However, the features are linked to each other by feature_relationship's (rather than the featureloc's used with /similarity). The feature_relationship's are given the type_id = 'orthologous_to' or 'paralogous_to'. For manually curated ortholog/paralog data the analysisfeature and analysis are not required and are not added.

analysis
 |
 + analysisfeature
    |
    feature (type_id == protein_match)
      |
 +----+-------+------------+
 |            |            |
 |            |            |
feature1    feature2    feature3
gene        gene        gene

The bottom links are feature_relationships of SO type orthologous_to.
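
As a rough sketch of the structure in the diagram, the function below lists the rows a cluster would need. Several things here are assumptions rather than documented behaviour: the function name, the dictionary layout, the direction of the relationship (gene as subject, match feature as object), and that the intermediate protein_match feature is kept for manually curated data too.

```python
from typing import Dict, List, Optional


def plan_cluster_rows(
    cluster_name: str,
    gene_ids: List[str],
    relationship: str = "orthologous_to",
    analysis: Optional[Dict[str, str]] = None,
) -> Dict[str, list]:
    """List the rows needed to store one orthologue/paralogue cluster."""
    if relationship not in ("orthologous_to", "paralogous_to"):
        raise ValueError(f"unsupported relationship type: {relationship!r}")

    rows: Dict[str, list] = {
        # The central match feature that every gene is linked to.
        "feature": [{"uniquename": cluster_name, "type": "protein_match"}],
        # One feature_relationship per gene (direction is an assumption).
        "feature_relationship": [
            {"subject": gene_id, "object": cluster_name, "type": relationship}
            for gene_id in gene_ids
        ],
        "analysis": [],
        "analysisfeature": [],
    }

    # Only auto-generated clusters carry analysis data (method, score etc.).
    if analysis is not None:
        rows["analysis"].append(dict(analysis))
        rows["analysisfeature"].append(
            {"feature": cluster_name, "analysis": analysis["name"]}
        )
    return rows
```

Passing `analysis=None` models the manually curated case, where no analysis or analysisfeature rows are added.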