This tutorial is a step-by-step guide to searching for motifs in the SET domain, which I have taught to epigenetics students.
“For example, a protein called Clr4 from S. pombe contains the SET domain. How could you find mammalian homologs of Clr4? Let’s assume that you find 8 proteins in the human database containing the SET domain. How close are they? Can we draw a tree out of them? Can we align all these protein sequences together, compare their similarity, and find the most conserved motif (like GXGNA) shared by all these proteins? If I would like to know where this motif is located in the 3D structure, can we look at it in the published protein structure database?”
Here are some things to get started.
This is a cleaner example if you use Swiss-Prot rather than RefSeq. That way you don’t have to deal with the different human isoforms. The Swiss-Prot protein sequence for Schizosaccharomyces pombe Clr4 is O60016.2. You can run a BLAST search against human Swiss-Prot records to find potential homologs.
Set the Expect threshold to something stringent, say 1e-6.
The results show some protein matches to the chromodomain at the N-terminal end of Clr4 and several matches to the SET domain region in the C-terminal half of the protein.
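If you would rather script the search than fill in the web form, here is a minimal sketch that only builds the submission URL and does not send it. The parameter names follow my understanding of the NCBI BLAST URL API, and the organism filter is my assumption about how to restrict Swiss-Prot to human records, so check both against the NCBI documentation before relying on them. The function name is hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(accession, evalue=1e-6):
    """Build (but do not send) a blastp submission URL for one accession."""
    params = {
        "CMD": "Put",              # submit a new search
        "PROGRAM": "blastp",       # protein query against protein database
        "DATABASE": "swissprot",   # Swiss-Prot records only
        "QUERY": accession,        # e.g. the Clr4 accession from the text
        "EXPECT": str(evalue),     # stringent Expect threshold
        # Assumption: an Entrez organism filter limits hits to human records.
        "ENTREZ_QUERY": "Homo sapiens[Organism]",
    }
    return BLAST_URL + "?" + urlencode(params)


url = build_blast_submission("O60016.2")
print(url)
```

Printing the URL first lets you inspect exactly what would be submitted before you send anything to the NCBI servers.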
I can select the first 11 of these and send them to COBALT to generate a multiple sequence alignment.
The colored region in the multiple alignment is the region containing the Pre-SET and SET domains. The red residues are identical across the alignment. I’ve colored the alignment by identity in the screen shot below.
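To make "identical across the alignment" concrete, here is a small sketch that lists the columns where every sequence carries the same residue. This is not how COBALT colors its output, only an illustration of the idea. The three aligned strings are made up for the example and are not taken from the real alignment.

```python
def identical_columns(aligned):
    """Return (index, residue) for columns where all sequences agree.

    Columns containing a gap character ('-') are never reported.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    hits = []
    for i in range(length):
        column = {seq[i] for seq in aligned}
        if len(column) == 1 and "-" not in column:
            hits.append((i, aligned[0][i]))
    return hits


# Toy alignment (invented for illustration only).
toy_alignment = [
    "GWGVRA-LKF",
    "GWGLRT-MEF",
    "GWGVKSDLQF",
]

conserved = identical_columns(toy_alignment)
print(conserved)  # [(0, 'G'), (1, 'W'), (2, 'G'), (9, 'F')]
```

Runs of adjacent identical columns, like the first three here, are the natural starting points for writing down a motif.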
I can easily define motifs of various lengths and complexities that are present in this domain; for example GWG or an extended version of that, G-W-G-(X)11-F-(X)3-Y-X-G. There is also D-(X)4-G-N-(X)5-N-H-X-C-P-N. You could write more sophisticated patterns that incorporate columns with conservative substitutions. The ProSite resource at ExPASy provides patterns similar to the ones above as signatures for certain domains, for example DEAD/DEAH box helicases (http://prosite.expasy.org/PDOC00039). The SET domain is rather more complex. In this case ProSite offers a logo showing the pattern of conservation (http://prosite.expasy.org/cgi-bin/prosite/sequence_logo.cgi?ac=PS50280). (Of course ProSite also offers a conserved domain profile or scoring matrix similar to the one in NCBI’s conserved domain database, as I describe below.)
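Patterns written this way translate directly into regular expressions, which makes it easy to scan any sequence for them. The sketch below handles only the notation used in this post (single letters, X for any residue, and (X)n for a run of n residues); it is not a parser for official ProSite syntax. The function names and the test sequence are made up for illustration.

```python
import re


def motif_to_regex(motif):
    """Convert a motif like 'G-W-G-(X)11-F' into a regular expression."""
    parts = []
    for token in motif.split("-"):
        repeat = re.fullmatch(r"\(X\)(\d+)", token)
        if repeat:
            parts.append(".{%s}" % repeat.group(1))  # run of any n residues
        elif token == "X":
            parts.append(".")                         # any single residue
        elif token.isalpha() and token.isupper():
            parts.append(token)                       # literal residue(s)
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    pattern = re.compile(motif_to_regex(motif))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Invented sequence that contains the extended GWG motif at position 2.
toy_sequence = "MK" + "GWG" + "A" * 11 + "F" + "LLL" + "Y" + "S" + "G" + "KK"

print(motif_to_regex("G-W-G-(X)11-F-(X)3-Y-X-G"))
print(find_motif("G-W-G-(X)11-F-(X)3-Y-X-G", toy_sequence))
```

To allow conservative substitutions in a column, you could extend the converter to accept a character class such as [ILV] in place of a single letter.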
Notice that the above multiple alignment only includes the human proteins and not the fission yeast one. I can use the “Edit and Resubmit” link at the top of the COBALT results to add O60016.2 to the alignment.
This shows the same basic pattern of conservation within the SET domain even with the fission yeast protein included. I can show a similar output from a conserved domain search with any of these proteins that contain the SET domain, for example SUV91_HUMAN (O43463.1).
I’ve checked the box to include the consensus sequence and colored by identity in the alignment from the conserved domain output below.
Some identities are now missing with the inclusion of more diverse members. I can display a structure (as long as I have Cn3D installed) for the SET domain from the CDD record by expanding the “Structure” section and clicking the “Structure View” button.
The amino acids in the structure (PDB: 3OOI, MMDB: 87491) of the Human Histone-lysine N-methyltransferase Nsd1 SET Domain in Complex With S-adenosyl-L-methionine are colored by conservation. Notice that many of the most conserved residues, the ones involved in the motifs we found, are the ones that form the pocket containing the S-adenosyl methionine near the probable active site.
Reused with permission from Peter Cooper, Ph.D., The National Center for Biotechnology Information.
Sarah Jeong, Research & Instruction Librarian for Science, Wake Forest University