Data Analysis and Curation

MAJ-DAAC (Data analysis and curation)

Sequence alignment is considered the first step in establishing sequence-structure-function relationships. Both nucleotide and protein alignments provide a basis for identifying conserved regions that reflect structural, functional or evolutionary relationships. Moreover, sequence alignment is widely used in target identification and validation through homology modeling in the early steps of the drug discovery process. Finally, it should be noted that relying solely on sequence alignment might not yield significant results, because of the redundant genetic and structural information inherently found in the sequence.

  • Sequence similarity searching using BLAST, FASTA and Smith-Waterman.
  • Similarity search using sequence translation, including full-genome search based on a nucleotide or amino acid sequence query.
  • Sequence alignment and subsequent analysis, including % similarity, % sequence identity, the number of gaps, etc. (for nucleotide and amino acid sequences).
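
As an illustration of the alignment statistics listed above, the following minimal Python sketch computes the number of matches, the number of gapped columns and the % sequence identity for two sequences that have already been aligned. The function name `alignment_stats` and the `-` gap character are assumptions made for this example, and % identity is taken here over the full alignment length; other denominators (for example, the shorter sequence or ungapped columns only) are also in common use. Computing % similarity would additionally require a substitution matrix and is not shown.

```python
def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> dict:
    """Summarize a pairwise alignment given as two equal-length gapped strings.

    Percent identity is computed over all alignment columns, which is only
    one of several conventions (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_a)
    if columns == 0:
        raise ValueError("alignment is empty")

    matches = 0
    gaps = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            gaps += 1          # column with a gap in either sequence
        elif x == y:
            matches += 1       # identical residues in both sequences

    return {
        "columns": columns,
        "matches": matches,
        "gaps": gaps,
        "percent_identity": 100.0 * matches / columns,
    }


if __name__ == "__main__":
    stats = alignment_stats("ACGT-ACGT", "ACGTTACCT")
    print(stats)
```

The same function works for nucleotide and amino acid alignments, since it only compares characters column by column.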

Proteins contain a myriad of residues, some of which are essential for proper structure and function, while others can be readily replaced. Identifying and predicting these functionally important residues from the protein sequence or from structural similarity therefore plays an important role in correctly identifying conserved, recurring motifs, which in turn make structural prediction and protein design possible.

  • Protein structure prediction:
    • Prediction of protein domain composition and domain architecture
    • Homology modeling
    • Secondary structure prediction
    • Structure-structure alignment

Genome annotation is the process of labeling a string of four-letter code in order to draw inferences about normal cellular activities and how they are altered in different diseases. At Majecules, we adopt a hierarchical workflow for genome annotation, starting from the identification of protein-coding regions in genomic DNA sequences and proceeding to the organization of the sequence into regions of interest, including but not limited to gene regulatory regions, promoters, SNPs, CpG sites, DMRs and protein binding sites. Finally, these regions of interest are clustered into groups with shared structural or functional properties. This includes metabolic pathway analysis and regulation, comparative genomics, identification of functionally important motifs, clustering of functionally related genes and proteome analysis.

  1. Structural Annotation:
    1. ORF localization
    2. Gene structure
    3. Protein-coding regions
    4. Regulatory sequences
  2. Functional Annotation:
    1. Gene expression
    2. Biochemical function
    3. Protein-product interaction analysis
  3. Database Search:
    1. General database search (NCBI-NR, EST, UniGene, GEO, etc.)
    2. Specialized database search (Pfam, SMART, CDD, Sparcle, etc.)
    3. Genome-oriented database search (COGs, Assembly, SRA, etc.)
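
To make the ORF localization step of structural annotation more concrete, the sketch below scans all six reading frames of a DNA sequence for open reading frames that begin with ATG and end at the first in-frame stop codon. It is a deliberately simplified illustration, not a description of the Majecules pipeline: the names `find_orfs`, `reverse_complement` and `min_codons` are hypothetical, only the standard stop codons (TAA, TAG, TGA) are assumed, and real gene finders also model introns, alternative start codons and coding statistics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumed)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> list:
    """Find simple ATG...stop ORFs in all six reading frames.

    Returns tuples (strand, start, end, orf_sequence). Coordinates are
    0-based, half-open, and refer to the strand that was scanned, so
    minus-strand positions are relative to the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in frame
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3    # codons before the stop
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

In practice `min_codons` would be set much higher (for example, on the order of 100 codons) to suppress short spurious ORFs; the small default here only keeps the toy example visible.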