Landmark patterns

Sequence landmark patterns

Homologous proteins are usually recognisable by a high level of amino acid identity or similarity. The residues most strongly conserved during evolution tend to be those that are essential for the protein to maintain its function. The overall fold of the protein ensures that these essential motifs are correctly located within the protein structure (figure 1a). This strongly constrains the positions, orientations and lengths of all the structural elements (α-helices, β-strands and loops) connecting the motifs. Consequently, the number of amino acid residues between the motifs (figure 1b) can itself be a significant structural parameter.

Figure 1. Defining Landmark Patterns. (a) A folded protein structure is represented by the black dotted line. Short sequence motifs, conserved within the protein family, are in red with labels m1, m2, etc. Some motifs (m2, m3 and m4, for example) may be involved in an active site that interacts with a substrate (pink). (b) The motif positions in the amino acid sequence. The landmark pattern for an individual sequence is defined as the numbers of residues, N1, N2, N3, N4, N5, between successive motifs: NH2-terminus to m1, m1 to m2, etc. The pattern relates to the three-dimensional protein structure.

Short groups of amino acid residues, as defined above, can constitute specific protein family signatures, and on this basis many computational approaches have been proposed to identify and classify related proteins using unaligned sequences; see, for example, (1,2). (Note that in these publications the term 'patterns' usually corresponds to 'motifs' here.) The approach presented here searches each sequence in a set of unaligned sequences for motifs m1, m2, etc., and also counts the number of amino acid residues between successive motifs (figure 1b). The motif-to-motif scores for each protein sequence in a data set are tabulated and classed according to the numerical patterns that are highly reproducible within specific subfamilies. This approach allows sequences from protein superfamilies such as kinesins, myosins, actin-related proteins and the Gα subunits of trimeric G proteins to be sorted into subfamilies (3) consistent with phylogenetic trees obtained by multiple sequence alignments and various methods of statistical analysis (4,5).

Method

The presence of conserved sequence motifs a few amino acids in length is checked by aligning the sequences of a few members of each protein family using ClustalW. Each motif must be long enough to have a low probability of occurring by chance elsewhere in the protein sequence. For example, at the simplest level, if a specific amino acid residue has a 1 in 20 probability of occurring at a given position in a sequence, then the probability of three specified amino acids occurring one after the other is (1/20)^3, i.e. 1 in 8000. In practice this is a minimum motif length, and longer motifs with amino acid options at some positions and wildcards at others are often needed to cover a complete protein family or subfamily.
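
The chance estimate above can be extended to motifs that contain options and wildcards. The Python sketch below is an illustration only and is not part of Kwik.pl. It assumes that all 20 amino acids are equally likely and independent at every position, which real sequences only approximate, and the function names are hypothetical.

```python
import re
from fractions import Fraction

def split_positions(motif):
    """Split a motif such as 'D.G[HST]' into one token per position."""
    return re.findall(r"\[[A-Z]+\]|\.|[A-Z]", motif)

def motif_chance(motif, alphabet_size=20):
    """Probability that the motif matches at one given position,
    assuming uniform, independent residue frequencies."""
    p = Fraction(1)
    for token in split_positions(motif):
        if token == ".":
            continue  # wildcard: every residue matches
        options = len(token) - 2 if token.startswith("[") else 1
        p *= Fraction(options, alphabet_size)
    return p

def expected_hits(motif, seq_len, alphabet_size=20):
    """Expected number of chance matches in a sequence of seq_len residues."""
    start_sites = max(seq_len - len(split_positions(motif)) + 1, 0)
    return start_sites * motif_chance(motif, alphabet_size)

print(motif_chance("DNG"))                     # three fixed residues: 1/8000
print(motif_chance("D.G[HST]"))                # options and a wildcard: 3/8000
print(float(expected_hits("D.G[HST]", 375)))   # chance hits in 375 residues
```

Under these assumptions, a short motif with options is expected to occur by chance far more often than a fully specified one, which is why longer motifs are usually needed.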

FASTA format protein sequence files are analysed using versions of a Perl script (Kwik.pl) that searches each sequence in a database file for a set of amino acid motifs specified in a separate 'motif' file. See 'Example' below.
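
The search-and-count step can be sketched in a few lines of Python. This is not the Kwik.pl source: it assumes that each motif is matched as a regular expression, that the search for each motif starts just after the previous motif, and that the first match is taken. The function name is hypothetical. The sequence and motifs are the Glactin entry and the actin motifs from the Example section.

```python
import re

def landmark_pattern(sequence, motifs):
    """Return the residue counts N1, N2, ... preceding each successive motif,
    or None if a motif is not found after the previous one."""
    gaps = []
    cursor = 0  # index of the first residue not yet used by a motif
    for motif in motifs:
        match = re.compile(motif).search(sequence, cursor)
        if match is None:
            return None  # comparable to a 'problem : check sequence' row
        gaps.append(match.start() - cursor)
        cursor = match.end()
    return gaps

ACTIN_MOTIFS = [
    "D.G[HST]",
    "[ILMV][FILM][ISTV][CDE].[PS]",
    "[AGNS].[IMV].[DE].G",
    "[AST]G[AG][FGNST][APST]",
    "[ACEGLMNST].[ACFWY].G",
]

GLACTIN = (
    "MTDDNPAIVIDNGSGMCKAGFAGDDAPRAVFPTVVGRPKRETVLVGSTHKEEYIGDEALA"
    "KRGVLKLSYPIEHGQIKDWDMMEKVWHHCYFNELRAQPSDHAVLLTEAPKNPKANREKIC"
    "QIMFETFAVPAFYVQVQAVLALYSSGRTTGIVIDTGDGVTHTVPVYEGYSLPHAVLRSEI"
    "AGKELTDFCQINLQENGASFTTSAEFEIVRDIKEKLCFVALDYESVLAASMESANYTKTY"
    "ELPDGVVITVNQARFKTPELLFRPELNNSDMDGIHQLCYKTIQKCDIDIRSELYSNVVLS"
    "GGSSMFAGLPERLEKELLDLIPAGKRVRISSPEDRKYSAWVGGSVLGSLATFESMWVSSQ"
    "EYQENGASIANRKCM"
)

print(len(GLACTIN), landmark_pattern(GLACTIN, ACTIN_MOTIFS))
# 375 [10, 89, 40, 143, 33]
```

With these assumptions the sketch gives the same length and counts as the Glactin row of the example output, which suggests, but does not prove, that Kwik.pl counts residues in a similar way.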

References

1. Jonassen, I., Collins, J. F. and Higgins, D. G. (1995) Finding flexible patterns in unaligned protein sequences. Prot. Sci. 4, 1587-1595.

2. Ye, K., Kosters, W. A. and IJzerman, A. P. (2007) An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinform. 23, 687-693.

3. Wade, R. H. (2002) Sequence landmark patterns identify and characterize protein families. Structure 10, 1329-1336.

4. Cope, M. J. T. V., Whisstock, J., Rayment, I. and Kendrick-Jones, J. (1996) Conservation within the myosin motor domain: implications for structure and function. Structure 4, 969-987.

5. Kim, A. J. and Endow, S. A. (2000) A kinesin family tree. J. Cell Sci. 113, 3681-3682.

Program and copyright

The Kwik.pl program is copyrighted and was developed by Richard Wade in the Institut de Biologie Structurale - Jean-Pierre Ebel (CEA - CNRS - UJF) at Grenoble, France. The license is free for academic sites and non-commercial use.

The use of the program is subject to the following conditions. The use of the program is restricted to the individual, laboratory or organization to which it is supplied. This individual, laboratory or organization can make unlimited copies of the program for backup purposes or for running the program on more than one computer system at the host institution. Neither the program nor any part of it may be sold, nor any copies distributed to third parties, without the express permission of the authors at the Institut de Biologie Structurale.

No software package can be considered to be bug free. The author of the program accepts no responsibility whatsoever for damages resulting, directly or indirectly, from the downloading and the use of the program and makes no warranty, either express or implied, including but not limited to any implied warranty of fitness for a particular purpose. The program is provided as is and its users shall assume any loss, risk and damage when using it.

If any results obtained with the program are published, whatever the means of publication, particularly in the scientific literature, the program should be referenced.

– If you accept these conditions, download the program (please note that the program will not run in RTF format: convert it to plain text with line breaks)
– If you do not accept these conditions, click here.

Example

Kwik.pl runs as follows:

Kwik.pl
infile = 'actin.txt'
motif file = 'actin.motifs'
output to screen

NAME          LEN   mot1  mot2  mot3  mot4  mot5
Glactin:gi1   375   10    89    40    143   33
DmArp53d:gi   376   16    88    40    139   33
Tgactin:gi1   376   11    89    40    143   33
Pfactin:gi5   376   11    89    40    143   33
Ehactin:gi1   376   11    89    40    143   33
Spactin:gi1   375   10    89    40    143   33
Scactin:gi1   375   10    89    40    143   33
Atactin:gi6   378   13    89    40    143   33
HsArp3b:gi9   418   10    96    48    150   49
MmArp3?:gi1   418   10    96    48    150   49
HsArp3a:gi5   418   10    96    48    150   49
ScArp4:gi63   problem : check sequence
Sp_P23A10.0   problem : check sequence
Sp_C23D3.09   431   15    85    40    192   35
etc
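
Rows of this output can be classed by their numerical pattern. The Python sketch below is illustrative and is not part of Kwik.pl. It groups rows whose counts are identical, and by default it ignores the first count (NH2-terminus to mot1), which varies among the actin rows above. Ignoring that column is an assumption made here, not a rule stated by the method, and the function name is hypothetical.

```python
from collections import defaultdict

def group_by_pattern(table_text, skip_first_gap=True):
    """Group sequence names by their landmark pattern.
    Header, 'problem' and other non-numeric rows are skipped."""
    groups = defaultdict(list)
    for line in table_text.splitlines():
        fields = line.split()
        # A data row is: name, length, then one integer per motif.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[1:]):
            continue
        name = fields[0]
        gaps = tuple(int(f) for f in fields[2:])
        key = gaps[1:] if skip_first_gap else gaps
        groups[key].append(name)
    return dict(groups)

OUTPUT = """\
NAME LEN mot1 mot2 mot3 mot4 mot5
Glactin:gi1 375 10 89 40 143 33
DmArp53d:gi 376 16 88 40 139 33
Tgactin:gi1 376 11 89 40 143 33
Pfactin:gi5 376 11 89 40 143 33
Ehactin:gi1 376 11 89 40 143 33
Spactin:gi1 375 10 89 40 143 33
Scactin:gi1 375 10 89 40 143 33
Atactin:gi6 378 13 89 40 143 33
HsArp3b:gi9 418 10 96 48 150 49
MmArp3?:gi1 418 10 96 48 150 49
HsArp3a:gi5 418 10 96 48 150 49
ScArp4:gi63 problem : check sequence
Sp_P23A10.0 problem : check sequence
Sp_C23D3.09 431 15 85 40 192 35
"""

for pattern, names in group_by_pattern(OUTPUT).items():
    print(pattern, names)
```

On these rows the seven actin entries share the pattern 89 40 143 33 and the three Arp3 entries share 96 48 150 49, while DmArp53d and Sp_C23D3.09 each stand alone.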

'actin.txt' is a sequence list in Pearson (FASTA) format, with '>' as its last line:

>ORYCU_ACTS gi|62287932|sp|P68135| rabbit 1J6Z sk muscle (_actin1)
MCDEDETTALVCDNGSGLVKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEAQSKRGILTLK
YPIEHGIITNWDDMEKIWHHTFYNELRVAPEEHPTLLTEAPLNPKANREKMTQIMFETFNVPAMYVAIQA
VLSLYASGRTTGIVLDSGDGVTHNVPIYEGYALPHAIMRLDLAGRDLTDYLMKILTERGYSFVTTAEREI
VRDIKEKLCYVALDFENEMATAASSSSLEKSYELPDGQVITIGNERFRCPETLFQPSFIGMESAGIHETT
YNSIMKCDIDIRKDLYANNVMSGGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILAS
LSTFQQMWITKQEYDEAGPSIVHRKCF
>Glactin:gi1703155
MTDDNPAIVIDNGSGMCKAGFAGDDAPRAVFPTVVGRPKRETVLVGSTHKEEYIGDEALA
KRGVLKLSYPIEHGQIKDWDMMEKVWHHCYFNELRAQPSDHAVLLTEAPKNPKANREKIC
QIMFETFAVPAFYVQVQAVLALYSSGRTTGIVIDTGDGVTHTVPVYEGYSLPHAVLRSEI
AGKELTDFCQINLQENGASFTTSAEFEIVRDIKEKLCFVALDYESVLAASMESANYTKTY
ELPDGVVITVNQARFKTPELLFRPELNNSDMDGIHQLCYKTIQKCDIDIRSELYSNVVLS
GGSSMFAGLPERLEKELLDLIPAGKRVRISSPEDRKYSAWVGGSVLGSLATFESMWVSSQ
EYQENGASIANRKCM
>DmArp53d:gi7302881
MSSEVDSNSHHAAVVIDNGSGVCKAGFSPEDTPRAVFPSIVGRPRHLNVLLDSVIGDSVI
GEAAARKRGILTLKYPIEHGMVKNWDEMEMVWQHTYELLRADPMDLPALLTEAPLNPKKN
REKMTEIMFEHFQVPAFYVAVQAVLSLYATGRTVGIVVDSGDGVTHTVPIYEGFALPHAC
VRVDLAGRDLTDYLCKLLLERGVTMGTSAEREIVREIKEKLCYVSMNYAKEMDLHGKVET
YELPDGQKIVLGCERFRCPEALFQPSLLGQEVMGIHEATHHSITNCDMDLRKDMYANIVL
SGGTTMFRNIEHRFLQDLTEMAPPSIRIKVNASPDRRFSVWTGGSVLASLTSFQNMWIDS
LEYEEVGSAIVHRKCF
etc
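
A file laid out like this can be read with a short Python routine. The sketch below is not taken from Kwik.pl. It assumes that the sequence name is the first word after '>', that a sequence may span several lines, and that a final lone '>' only marks the end of the file. The function name and the two short records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if name:
                yield name, "".join(chunks)
            header = line[1:].split()
            # A lone '>' has no name: it closes the previous record only.
            name = header[0] if header else None
            chunks = []
        elif name:
            chunks.append(line.upper())
    if name:
        yield name, "".join(chunks)

EXAMPLE = """\
>seqA some description
MTDDN
PAIVI
>seqB
MSSEV
>
"""

for name, sequence in read_fasta(EXAMPLE.splitlines()):
    print(name, len(sequence), sequence)
```

Joining the lines of each record before searching matters, because a motif can straddle a line break in the file.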

'actin.motifs' is a list of short sequence motifs, as below:

####################
# 'generic' cf Bork PNAS 1992,89:7290
# valid from actin to arp3
# 20; phosphate 1
mot1
D.G[HST]
# 115; near connect 1
mot2
[ILMV][FILM][ISTV][CDE].[PS]
# 160; phosphate 2
mot3
[AGNS].[IMV].[DE].G
# 310; adenosine
mot4
[AST]G[AG][FGNST][APST]
# 360; connect 2
mot5
[ACEGLMNST].[ACFWY].G
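
The layout of this file can be read as follows: lines beginning with '#' are comments, and the remaining lines alternate between a motif name and its pattern, where '.' stands for any residue and square brackets list the residues allowed at one position. This reading is inferred from the example, not from a format specification. The Python sketch below parses a file on that assumption; the function name is hypothetical.

```python
import re

def read_motifs(lines):
    """Return [(name, pattern), ...] from the lines of a motif file.
    Assumes '#' comment lines and alternating name / pattern lines."""
    entries = [line.strip() for line in lines]
    entries = [e for e in entries if e and not e.startswith("#")]
    if len(entries) % 2:
        raise ValueError("motif file should contain name/pattern pairs")
    motifs = list(zip(entries[0::2], entries[1::2]))
    for name, pattern in motifs:
        re.compile(pattern)  # raises re.error if the pattern is malformed
    return motifs

MOTIF_FILE = """\
####################
# 'generic' cf Bork PNAS 1992,89:7290
# valid from actin to arp3
# 20; phosphate 1
mot1
D.G[HST]
# 115; near connect 1
mot2
[ILMV][FILM][ISTV][CDE].[PS]
# 160; phosphate 2
mot3
[AGNS].[IMV].[DE].G
# 310; adenosine
mot4
[AST]G[AG][FGNST][APST]
# 360; connect 2
mot5
[ACEGLMNST].[ACFWY].G
"""

for name, pattern in read_motifs(MOTIF_FILE.splitlines()):
    print(name, pattern)
```

The patterns use only syntax that Perl and Python regular expressions share, so they can be passed to either engine unchanged.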