rstoolbox.analysis.sequence_similarity

rstoolbox.analysis.sequence_similarity(df, seqID, key_residues=None, matrix='BLOSUM62')

Evaluate the sequence similarity between each decoy and the reference_sequence for a given seqID.

Sequence similarity is understood in the context of substitution matrices. Thus, apart from identities, similarities can also be evaluated.

It will return the input data container with several new columns:

New Column                   Data Content
<matrix>_<seqID>_raw         Score obtained by applying <matrix>
<matrix>_<seqID>_perc        Raw score divided by the score of the reference_sequence against itself
<matrix>_<seqID>_identity    Total identity matches
<matrix>_<seqID>_positive    Total positive matches according to <matrix>
<matrix>_<seqID>_negative    Total negative matches according to <matrix>
<matrix>_<seqID>_ali         Representation of aligned residues
<matrix>_<seqID>_per_res     Per-position score obtained by applying <matrix>

The matrix name in each new column is written in lowercase.
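As a rough illustration of what these columns contain, the sketch below scores a decoy against a reference using a tiny, made-up substitution table. The names `TOY_MATRIX`, `pair_score` and `toy_similarity` are hypothetical and are not part of rstoolbox, and the scores are not real BLOSUM62 values. Three conventions are assumptions inferred from the example output further down rather than from the library's code: positive matches count every position with a score above zero (identities included), negative matches count every remaining position (zeros included), and the alignment string shows the residue on identity, `+` on a positive non-identical match, and `.` otherwise.

```python
# Toy substitution scores (hypothetical values, NOT real BLOSUM62 entries).
TOY_MATRIX = {
    ("A", "A"): 4, ("K", "K"): 5, ("R", "R"): 5, ("E", "E"): 5,
    ("K", "R"): 2, ("A", "E"): -1, ("A", "K"): -1, ("A", "R"): -1,
    ("E", "K"): 1, ("E", "R"): 0,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a substitution score, treating the matrix as symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def toy_similarity(reference, decoy, matrix=TOY_MATRIX):
    """Score an ungapped, equal-length decoy against a reference sequence."""
    if len(reference) != len(decoy):
        raise ValueError("reference and decoy must have the same length")

    # Per-position score of the decoy residue against the reference residue.
    per_res = [pair_score(r, d, matrix) for r, d in zip(reference, decoy)]
    # Score of the reference against itself, used to normalise the raw score.
    self_score = sum(pair_score(r, r, matrix) for r in reference)
    # Residue letter on identity, '+' on a positive mismatch, '.' otherwise.
    ali = "".join(
        d if r == d else ("+" if s > 0 else ".")
        for r, d, s in zip(reference, decoy, per_res)
    )
    return {
        "raw": sum(per_res),
        "perc": sum(per_res) / self_score,
        "identity": sum(r == d for r, d in zip(reference, decoy)),
        "positive": sum(s > 0 for s in per_res),
        "negative": sum(s <= 0 for s in per_res),
        "ali": ali,
        "per_res": per_res,
    }


result = toy_similarity("AKRE", "ARAE")
print(result)
```

With these conventions, positive and negative matches always add up to the sequence length, which agrees with the example output below (for instance, 68 + 48 = 116 residues in the first row).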

Tip

If key_residues are provided, only those positions are scored, but nothing in the column names will indicate a partial evaluation. Keep this in mind in any downstream analysis.

Running this function multiple times (with different key_residues selections, for example) adds suffixes to the previously mentioned columns following pandas’ merge naming logic (_x, _y, _z, …).
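One way to keep partial evaluations distinguishable is to rename the new columns right after each run. The sketch below builds such a rename mapping in plain Python. `tag_similarity_columns` and the `core` tag are hypothetical names chosen for illustration, not part of rstoolbox; the sketch assumes the column prefix is the lowercase matrix name followed by the seqID, as described above.

```python
def tag_similarity_columns(columns, matrix, seq_id, tag):
    """Map every '<matrix>_<seqID>_*' column name to the same name plus '_<tag>'."""
    prefix = "{}_{}_".format(matrix.lower(), seq_id)
    return {
        name: "{}_{}".format(name, tag)
        for name in columns
        if name.startswith(prefix)
    }


# Column names as they might look after one call with a key_residues selection.
columns = ["score", "sequence_B", "blosum62_B_raw", "blosum62_B_perc"]
mapping = tag_similarity_columns(columns, "BLOSUM62", "B", "core")
print(mapping)
```

The resulting dictionary can then be passed to pandas' `DataFrame.rename(columns=mapping)`, assuming the returned container supports it as a regular DataFrame does, so that a later call with a different selection does not fall back on the automatic suffixes.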

Parameters:
  • df (Union[DesignFrame, DataFrame]) – Data container.
  • seqID (str) – Identifier of the sequence of interest.
  • key_residues (Union[int, list of int, str, Selection]) – Residues of interest.
  • matrix (str) – Identifier of the matrix used to evaluate similarity. Default is BLOSUM62.
Returns:

DesignFrame.

Raises:
AttributeError: if the data passed is not a DesignFrame or cannot be cast to one.
KeyError: if there is no sequence information for chain seqID of the decoys.
AttributeError: if there is no reference_sequence for chain seqID of the decoys.

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import sequence_similarity
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': 'B'})
   ...: df.add_reference_sequence('B', df.get_sequence('B').values[0])
   ...: df = sequence_similarity(df.iloc[1:], 'B')
   ...: df.head()
   ...: 
Out[1]: 
     score                                                                                                            sequence_B  blosum62_B_raw  blosum62_B_identity  blosum62_B_positive  blosum62_B_negative                                                                                                        blosum62_B_ali                                                                                                                                                                                                                                                                                                                                          blosum62_B_per_res  blosum62_B_perc
0 -214.362  PKPEEAMREAYKLIKKYMLKAQKEAQEEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  183             41                   68                   48                   .+PEEA...A++L.+..M.K..+E.+.EWE..+R....+EE+DM.PE+MIA.ALRAIGEIFNA.+...L++++.+K.P+...E+.+E.+K....+.......A....++.+E+.++  [-1, 2, 7, 5, 5, 4, -1, 0, 0, 4, 2, 2, 4, -1, 1, -3, -2, 5, -2, 5, 0, -2, 1, 5, -2, 2, 0, 5, 11, 5, -3, -1, 2, 5, 0, 0, -2, -2, 2, 5, 5, 2, 6, 5, 0, 7, 5, 2, 5, 4, 4, -1, 4, 4, 5, 4, 4, 6, 5, 4, 6, 6, 4, -2, 2, -2, -1, -3, 4, 1, 2, 2, 1, -3, 2, 5, -2, 7, 1, 0, -2, -3, 5, 1, 0, 1, 5, -1, 2, 5, 0, -1, -3, -3, 1, -1, -2, -1, -2, 0, ...]             0.288189       
1 -203.582  TKPEEMAREAYKRMLKALKQGEEEMKRMYEQMKKGVDSKEERDMEPEKMIAIALRAIGELFNAWMKALRHMKELRKLGTSGPKEEEKHWRWIFELHRWAGEEIQRAAEIQERKARW  154             39                   65                   51                   T+PEE....A++....A+++G.EE.+R.+E..K+....+EERDM.PE+MIA.ALRAIGE+FNA..+....M++.RK...+G.++.++..+..+++..+.G.......+....K.R.  [5, 2, 7, 5, 5, -1, -1, 0, 0, 4, 2, 2, -2, -1, -3, -3, 4, 2, 2, 1, 6, -3, 5, 5, -2, 1, 5, -2, 2, 5, -2, -1, 5, 2, 0, -2, -1, 0, 2, 5, 5, 5, 6, 5, -3, 7, 5, 2, 5, 4, 4, -1, 4, 4, 5, 4, 4, 6, 5, 2, 6, 6, 4, -3, -1, 1, -1, -2, -2, 0, 5, 1, 1, -3, 5, 5, -3, -2, 0, 1, 6, -1, 1, 1, -2, 1, 1, 0, -3, 2, -3, -1, 1, 1, 2, -2, -2, 2, -3, 6, ...]            0.242520       
2 -213.779  TKPEEWARWAYKEHLKMAEKHRKEMEIEWEELKRRDGKEEEKDMWPERMIAMALRAIGELFNHHMYAEMRAKEEKKKPEAKTEEARRARREIMKYHHEAGRLIEEAMRRLMERHKK  178             42                   63                   53                   T+PEE....A++.......K..+E.E.EWE..KR.....EE+DM.PERMIA.ALRAIGE+FN......+..++E+K.P.A..E+.+..++E..K..+..G.+....+++..E+.+K  [5, 2, 7, 5, 5, -3, -1, 0, -3, 4, 2, 2, -3, -2, -3, -3, -1, -1, 0, 5, -2, -3, 1, 5, -2, 5, -3, 5, 11, 5, -3, -2, 5, 5, -1, 0, -2, -2, 0, 5, 5, 2, 6, 5, -2, 7, 5, 5, 5, 4, 4, -1, 4, 4, 5, 4, 4, 6, 5, 2, 6, 6, -2, 0, -1, -1, -1, 0, 2, 0, -1, 1, 1, 5, 2, 5, 0, 7, 0, 4, -2, -1, 5, 1, -1, 2, 0, -1, 2, 2, 5, -1, -1, 5, -1, -2, 2, -2, -3, 6, ...]       0.280315       
3 -213.972  KKWEEMMREAERQGKEYAQKAWKEALLEWKWMRKRPVTEEMKDMAPEWMIAAALRAIGEHFNIYWQQKLEHEKLRKIPNVPEEELEKGKEELKRIEEEAARMAEKYMQELRKKMES  208             47                   67                   49                   .+.EE....A.R..+...+K.W+E...EW+W.++.....E.+DM.PE.MIAAALRAIGE.FN..WQ.+LE.EK.RK.PN..EE++++.K+E..+I......MA..++++.R+K...  [-1, 2, -4, 5, 5, -1, -1, 0, 0, 4, -3, 5, -2, 0, 1, -3, -2, -1, 1, 5, 0, 11, 1, 5, -2, -3, -2, 5, 11, 1, 11, -1, 2, 2, -1, -1, -2, -2, 0, 5, -2, 2, 6, 5, -1, 7, 5, -3, 5, 4, 4, 4, 4, 4, 5, 4, 4, 6, 5, -3, 6, 6, -1, -2, 11, 5, 0, 2, 4, 5, -2, 5, 5, -3, 5, 5, -3, 7, 6, 0, -2, 5, 5, 1, 2, 1, 1, -2, 5, 1, 5, -1, -3, 2, 4, -1, -2, -2, -3, 0, ...]     0.327559       
4 -195.138  PRPEEMARFAKEEMHKHEEKAYREFLLEYELAIRKNPTEEPKDMQPEWAIAAALRAIGEIFNQWMYHLLEIRKENGSSHTRYEEREKYRKLAKRLHEEAAKEIWKFMHEAMRRFES  101             35                   52                   64                   .RPEE....A.........K.+.E...E+E...R.+...E.+DM.PE..IAAALRAIGEIFN......LE+.KE..+.+...E+.++.+K.A.++..........++.+...+...  [-1, 5, 7, 5, 5, -1, -1, 0, -3, 4, -3, 0, -3, -1, 0, -3, -2, -2, 0, 5, 0, 2, 0, 5, -1, -3, -2, 5, 2, 5, -2, -3, -3, 5, -1, 1, -1, -2, 0, 5, -1, 2, 6, 5, -2, 7, 5, -3, -1, 4, 4, 4, 4, 4, 5, 4, 4, 6, 5, 4, 6, 6, -1, -3, -1, -1, -2, -2, 4, 5, 1, 0, 5, 5, 0, -2, 1, -1, 1, 0, -2, -2, 5, 1, -1, 1, 1, -1, 2, 5, -3, 4, -3, 2, 2, -2, -2, -2, -3, 0, ...]  0.159055