Sequence landmark patterns
Homologous proteins are usually recognisable by a high level of amino acid identity or similarity. The residues most strongly conserved during evolution tend to be those that are essential for a protein to maintain its function. The overall folding of the protein ensures that these essential motifs are correctly located within the protein structure (figure 1a). This strongly constrains the positions, orientations and lengths of all the structural elements (α-helices, β-strands and loops) connecting the motifs. Consequently, the number of amino acid residues between the motifs (figure 1b) can itself be a significant structural parameter.
Figure 1. Defining Landmark Patterns. (a) A folded protein structure is represented by the black dotted line. Short sequence motifs, conserved within the protein family, are shown in red and labelled m1, m2, etc. Some motifs (m2, m3 and m4, for example) may be involved in an active site that interacts with a substrate (pink). (b) The motif positions in the amino acid sequence. The landmark pattern for an individual sequence is defined as the numbers of residues N1, N2, N3, N4, N5 between successive landmarks: NH2-terminus to m1, m1 to m2, etc. The pattern relates to the three-dimensional protein structure.
Short groups of amino acid residues, as defined above, can constitute specific protein family signatures, and on this basis many computational approaches have been proposed to identify and classify related proteins using unaligned sequences; see, for example, (1,2). (Note that in these publications the term 'patterns' usually corresponds to 'motifs' here.) The approach presented here searches each sequence in a set of unaligned sequences for motifs m1, m2, etc., and also counts the number of amino acid residues between successive motifs (figure 1b). The motif-to-motif counts for each protein sequence in a data set are tabulated and classed according to the numerical patterns that are highly reproducible within specific subfamilies. This approach allows sequences from protein superfamilies such as kinesins, myosins, actin-related proteins and Gα subunits of trimeric G proteins to be sorted into subfamilies (3) consistent with phylogenetic trees obtained by multiple sequence alignments and various methods of statistical analysis (4,5).
Method
The presence of conserved sequence motifs a few amino acids in length is checked by sequence alignments of a few members of each protein family using ClustalW. Each motif must be long enough to have a low probability of occurring by chance elsewhere in the protein sequence. For example, at the simplest level, if a specific amino acid residue has a 1 in 20 probability of occurring at a given position in a sequence, then the probability of three specified amino acids occurring one after the other is (1/20)^3, i.e. 1 in 8000. In practice, three residues is a minimum motif length; longer motifs, with alternative amino acids allowed at some positions and wildcards at others, are often needed to cover a complete protein family or subfamily.
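The chance estimate above extends naturally to motifs with alternatives and wildcards. The following Python sketch is not part of Kwik.pl. It assumes, as in the text, that all 20 amino acids are equally likely and independent at each position, and that a motif uses only single letters, '.' wildcards and '[...]' option sets, as in the motif file of the Example section. The function names are hypothetical.

```python
import re
from fractions import Fraction

# One token per motif position: an option set, a single residue, or a wildcard.
TOKEN = re.compile(r"\[[A-Z]+\]|[A-Z]|\.")

def chance_probability(motif):
    """Probability that a random window matches the motif (uniform residues assumed)."""
    p = Fraction(1)
    for token in TOKEN.findall(motif):
        if token == ".":
            options = 20                      # wildcard: any residue
        elif token.startswith("["):
            options = len(set(token[1:-1]))   # alternatives at this position
        else:
            options = 1                       # one specified residue
        p *= Fraction(options, 20)
    return p

def expected_chance_hits(motif, sequence_length):
    """Expected number of chance matches in a random sequence of the given length."""
    width = len(TOKEN.findall(motif))
    windows = max(sequence_length - width + 1, 0)
    return windows * chance_probability(motif)

print(chance_probability("DNG"))                  # 1/8000
print(chance_probability("D.G[HST]"))             # 3/8000
print(float(expected_chance_hits("D.G[HST]", 375)))  # 0.1395
```

Under these simplified assumptions, a four-position motif with one wildcard and one three-way option is less specific than three fixed residues, which illustrates why motifs with options usually need to be longer.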
FASTA-format protein sequence files are analysed using versions of a Perl script (Kwik.pl), which searches each sequence in a database file for a set of amino acid motifs specified in a separate 'motif' file. See 'Example' below.
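The search-and-count step can be sketched in Python as follows. This is not the Kwik.pl source, which is not reproduced here. It assumes that each motif is searched for in order, starting where the previous motif ended, that the first match is taken, and that each count is the number of residues between the end of one motif and the start of the next. The function name is hypothetical; the sequence and motifs are those listed in the Example section.

```python
import re

def landmark_pattern(sequence, motifs):
    """Return the residue counts preceding each motif, or None if a motif is not found.

    Assumption: motifs are located in order, each search starting at the end
    of the previous match, and the leftmost match is used.
    """
    counts = []
    position = 0
    for motif in motifs:
        match = re.compile(motif).search(sequence, position)
        if match is None:
            return None                       # e.g. "problem : check sequence"
        counts.append(match.start() - position)
        position = match.end()
    return counts

# Glactin sequence from the example 'actin.txt' listing
glactin = (
    "MTDDNPAIVIDNGSGMCKAGFAGDDAPRAVFPTVVGRPKRETVLVGSTHKEEYIGDEALA"
    "KRGVLKLSYPIEHGQIKDWDMMEKVWHHCYFNELRAQPSDHAVLLTEAPKNPKANREKIC"
    "QIMFETFAVPAFYVQVQAVLALYSSGRTTGIVIDTGDGVTHTVPVYEGYSLPHAVLRSEI"
    "AGKELTDFCQINLQENGASFTTSAEFEIVRDIKEKLCFVALDYESVLAASMESANYTKTY"
    "ELPDGVVITVNQARFKTPELLFRPELNNSDMDGIHQLCYKTIQKCDIDIRSELYSNVVLS"
    "GGSSMFAGLPERLEKELLDLIPAGKRVRISSPEDRKYSAWVGGSVLGSLATFESMWVSSQ"
    "EYQENGASIANRKCM"
)

# The five motifs from the example 'actin.motifs' file
motifs = [
    "D.G[HST]",
    "[ILMV][FILM][ISTV][CDE].[PS]",
    "[AGNS].[IMV].[DE].G",
    "[AST]G[AG][FGNST][APST]",
    "[ACEGLMNST].[ACFWY].G",
]

print(len(glactin), landmark_pattern(glactin, motifs))
# 375 [10, 89, 40, 143, 33]
```

For this sequence the sketch gives the same length and counts as the Glactin row of the example output, which suggests, but does not prove, that the counting convention assumed here matches the one used by the program.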
References
1. Jonassen, I., Collins, J. F. and Higgins, D. G. (1995) Finding flexible patterns in unaligned protein sequences. Prot. Sci. 4, 1587-1595.
2. Ye, K., Kosters, W. A. and IJzerman, A. P. (2007) An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinform. 23, 687-693.
3. Wade, R. H. (2002) Sequence landmark patterns identify and characterize protein families. Structure 10, 1329-1336.
4. Cope, M. J. T. V., Whisstock, J., Rayment, I. and Kendrick-Jones, J. (1996) Conservation within the myosin motor domain: implications for structure and function. Structure 4, 969-987.
5. Kim, A. J. and Endow, S. A. (2000) A kinesin family tree. J. Cell Sci. 113, 3681-3682.
Program and copyright
The Kwik.pl program is copyrighted and was developed by Richard Wade in the Institut de Biologie Structurale - Jean-Pierre Ebel (CEA - CNRS - UJF) at Grenoble, France. The license is free for academic sites and non-commercial use. The use of the program is subject to the following conditions: The use of the program is restricted to the individual, laboratory or organization to which it is supplied. This individual, laboratory or organization can make unlimited copies of the program for backup purposes or for running the program on more than one computer system at the host institution. Neither the program nor any part of it may be sold, nor any copies distributed to third parties without the express permission of the authors at the Institut de Biologie Structurale. No software package can be considered to be bug free. The author of the program accepts no responsibility whatsoever for damages resulting, directly or indirectly, from the downloading and the use of the program and makes no warranty, either express or implied, including but not limited to, any implied warranty of fitness for a particular purpose. The program is provided as it is and its users shall assume any loss, risk and damage when using it. If any results obtained with the program are published, whatever the means of publication, particularly in the scientific literature, the program should be referenced.
– If you accept these conditions, download the program (please note that the program will not run in RTF format: convert it to plain text with line breaks)
Example
Kwik.pl runs as follows:
Kwik.pl
infile = ’actin.txt’
motif file = ’actin.motifs’
output to screen
NAME         LEN  mot1  mot2  mot3  mot4  mot5
Glactin:gi1  375    10    89    40   143    33
DmArp53d:gi  376    16    88    40   139    33
Tgactin:gi1  376    11    89    40   143    33
Pfactin:gi5  376    11    89    40   143    33
Ehactin:gi1  376    11    89    40   143    33
Spactin:gi1  375    10    89    40   143    33
Scactin:gi1  375    10    89    40   143    33
Atactin:gi6  378    13    89    40   143    33
HsArp3b:gi9  418    10    96    48   150    49
MmArp3?:gi1  418    10    96    48   150    49
HsArp3a:gi5  418    10    96    48   150    49
ScArp4:gi63  problem : check sequence
Sp_P23A10.0  problem : check sequence
Sp_C23D3.09  431    15    85    40   192    35
etc
'actin.txt' is a sequence list in Pearson/FASTA format whose last line is a single '>':
>ORYCU_ACTS gi|62287932|sp|P68135| rabbit 1J6Z sk muscle (_actin1)
MCDEDETTALVCDNGSGLVKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEAQSKRGILTLK
YPIEHGIITNWDDMEKIWHHTFYNELRVAPEEHPTLLTEAPLNPKANREKMTQIMFETFNVPAMYVAIQA
VLSLYASGRTTGIVLDSGDGVTHNVPIYEGYALPHAIMRLDLAGRDLTDYLMKILTERGYSFVTTAEREI
VRDIKEKLCYVALDFENEMATAASSSSLEKSYELPDGQVITIGNERFRCPETLFQPSFIGMESAGIHETT
YNSIMKCDIDIRKDLYANNVMSGGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILAS
LSTFQQMWITKQEYDEAGPSIVHRKCF
>Glactin:gi1703155
MTDDNPAIVIDNGSGMCKAGFAGDDAPRAVFPTVVGRPKRETVLVGSTHKEEYIGDEALA
KRGVLKLSYPIEHGQIKDWDMMEKVWHHCYFNELRAQPSDHAVLLTEAPKNPKANREKIC
QIMFETFAVPAFYVQVQAVLALYSSGRTTGIVIDTGDGVTHTVPVYEGYSLPHAVLRSEI
AGKELTDFCQINLQENGASFTTSAEFEIVRDIKEKLCFVALDYESVLAASMESANYTKTY
ELPDGVVITVNQARFKTPELLFRPELNNSDMDGIHQLCYKTIQKCDIDIRSELYSNVVLS
GGSSMFAGLPERLEKELLDLIPAGKRVRISSPEDRKYSAWVGGSVLGSLATFESMWVSSQ
EYQENGASIANRKCM
>DmArp53d:gi7302881
MSSEVDSNSHHAAVVIDNGSGVCKAGFSPEDTPRAVFPSIVGRPRHLNVLLDSVIGDSVI
GEAAARKRGILTLKYPIEHGMVKNWDEMEMVWQHTYELLRADPMDLPALLTEAPLNPKKN
REKMTEIMFEHFQVPAFYVAVQAVLSLYATGRTVGIVVDSGDGVTHTVPIYEGFALPHAC
VRVDLAGRDLTDYLCKLLLERGVTMGTSAEREIVREIKEKLCYVSMNYAKEMDLHGKVET
YELPDGQKIVLGCERFRCPEALFQPSLLGQEVMGIHEATHHSITNCDMDLRKDMYANIVL
SGGTTMFRNIEHRFLQDLTEMAPPSIRIKVNASPDRRFSVWTGGSVLASLTSFQNMWIDS
LEYEEVGSAIVHRKCF
etc
>
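A file laid out like this can be read with a few lines of Python. The sketch below is an assumption about how such input might be parsed, not a description of what Kwik.pl does internally. It treats the bare '>' on the last line as an end-of-file marker and also accepts files without it. The function name is hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-style lines into a list of (header, sequence) pairs.

    A bare '>' line closes the preceding record and starts no new one.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip() or None   # bare '>' gives no new header
            chunks = []
        elif header is not None:
            chunks.append(line)                 # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>Glactin:gi1703155
MTDDNPAIVIDNGSGMCKAG
FAGDDAPRAV
>DmArp53d:gi7302881
MSSEVDSNSHHAAVVIDNGS
>
"""
for name, seq in read_fasta(example.splitlines()):
    print(name, len(seq))
# Glactin:gi1703155 30
# DmArp53d:gi7302881 20
```

The example sequences are truncated fragments of the entries listed above, used only to keep the sketch short.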
'actin.motifs' is a list of short sequence motifs, as below:
####################
# ’generic’ cf Bork PNAS 1992,89:7290
# valid from actin to arp3
# 20 ; phosphate 1
mot1
D.G[HST]
# 115 ; near connect 1
mot2
[ILMV][FILM][ISTV][CDE].[PS]
# 160 ; phosphate 2
mot3
[AGNS].[IMV].[DE].G
# 310 ; adenosine
mot4
[AST]G[AG][FGNST][APST]
# 360 ; connect 2
mot5
[ACEGLMNST].[ACFWY].G
####################
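The motif file above appears to follow a simple layout: lines beginning with '#' are comments, and each motif is given as a name line followed by a pattern line. The Python sketch below reads a file under that assumed layout. The layout is inferred from the listing rather than from any documentation of Kwik.pl, the function name is hypothetical, and the patterns are treated as ordinary regular expressions.

```python
import re

def read_motifs(lines):
    """Return a list of (name, pattern) pairs from a motif file.

    Assumed layout: '#' lines are comments; the remaining lines alternate
    between a motif name and its pattern.
    """
    entries = [ln.strip() for ln in lines
               if ln.strip() and not ln.lstrip().startswith("#")]
    if len(entries) % 2 != 0:
        raise ValueError("expected alternating name and pattern lines")
    motifs = []
    for name, pattern in zip(entries[0::2], entries[1::2]):
        re.compile(pattern)          # fail early on a malformed pattern
        motifs.append((name, pattern))
    return motifs

motif_text = """####################
# valid from actin to arp3
# 20 ; phosphate 1
mot1
D.G[HST]
# 115 ; near connect 1
mot2
[ILMV][FILM][ISTV][CDE].[PS]
# 160 ; phosphate 2
mot3
[AGNS].[IMV].[DE].G
####################
"""
for name, pattern in read_motifs(motif_text.splitlines()):
    print(name, pattern)
# mot1 D.G[HST]
# mot2 [ILMV][FILM][ISTV][CDE].[PS]
# mot3 [AGNS].[IMV].[DE].G
```

The bracket and dot notation used in the file is valid regular-expression syntax as written, so no translation step is needed in this sketch.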