rstoolbox.analysis.binary_similarity

rstoolbox.analysis.binary_similarity(df, seqID, key_residues=None, matrix='IDENTITY')

Binary profile for each design sequence against the reference_sequence.

Returns a DesignFrame with a new column that maps binary identity (0/1) against the reference_sequence. If a matrix other than IDENTITY is provided, every position with a positive score in that matrix is set to 1.
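
The per-position logic described above can be illustrated with a minimal, library-independent sketch. It is not the rstoolbox implementation: `binary_profile`, `toy_score`, and the score values in `TOY_SCORES` are hypothetical names and numbers made up for illustration, and the sketch assumes that both sequences have the same length.

```python
def binary_profile(sequence, reference, score=None):
    """Compare two equal-length sequences position by position.

    Returns a string of '0'/'1' characters. With score=None, a position is
    '1' only when both residues are identical. Otherwise, score is a
    hypothetical callable (a, b) -> number standing in for a substitution
    matrix lookup, and a position is '1' when the score is positive.
    """
    if len(sequence) != len(reference):
        raise ValueError("sequences must have the same length")
    if score is None:
        return "".join("1" if a == b else "0"
                       for a, b in zip(sequence, reference))
    return "".join("1" if score(a, b) > 0 else "0"
                   for a, b in zip(sequence, reference))


# Toy similarity scores (assumption, not a real substitution matrix):
# a few conservative substitutions get a positive value.
TOY_SCORES = {("K", "R"): 2, ("R", "K"): 2, ("D", "E"): 2, ("E", "D"): 2}


def toy_score(a, b):
    """Positive for identical or listed pairs, negative otherwise."""
    if a == b:
        return 4
    return TOY_SCORES.get((a, b), -1)


reference = "PKPEEA"
design = "TRPEDA"

identity_bits = binary_profile(design, reference)               # identity only
similarity_bits = binary_profile(design, reference, toy_score)  # positive scores

print(identity_bits)    # 001101
print(similarity_bits)  # 011111
```

With the toy scores, the K/R and E/D substitutions count as matches, so the similarity profile contains more 1s than the identity profile for the same pair of sequences.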

New Column               Data Content
<matrix>_<seqID>_binary  Binary representation of the match with the reference_sequence.

Parameters:
  • df (Union[DesignFrame, DataFrame]) – Data container.
  • seqID (str) – Identifier of the sequence of interest.
  • key_residues (Union[int, list() of int, str, Selection]) – Residues of interest.
  • matrix (str) – Identifier of the matrix used to evaluate similarity. Default is IDENTITY.
Returns:

DesignFrame.

Raises:
AttributeError: if the data passed is not a DesignFrame or cannot be cast to one.
KeyError: if there is no sequence information for chain seqID of the decoys.
AttributeError: if there is no reference_sequence for chain seqID of the decoys.

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import binary_similarity
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': 'B'})
   ...: df.add_reference_sequence('B', df.get_sequence('B').values[0])
   ...: df = binary_similarity(df.iloc[1:], 'B')
   ...: df.head()
   ...: 
Out[1]: 
     score                                                                                                            sequence_B                                                                                                     identity_B_binary
0 -214.362  PKPEEAMREAYKLIKKYMLKAQKEAQEEWERMRRTDGTKEEKDMFPEKMIAQALRAIGEIFNAYYWAFLKLQEFKKYPSVRWEEQEEARKRLKIMMKIGAEWAREIAREMKERIKR  00111100010010000101000100011100010000011011011011101111111111100000100000010100001000100100000000000010000000010000
1 -203.582  TKPEEMAREAYKRMLKALKQGEEEMKRMYEQMKKGVDSKEERDMEPEKMIAIALRAIGELFNAWMKALRHMKELRKLGTSGPKEEEKHWRWIFELHRWAGEEIQRAAEIQERKARW  10111000010000001000101100100100100000011111011011101111111011100000001000110000100000000000000000010000000000001010
2 -213.779  TKPEEWARWAYKEHLKMAEKHRKEMEIEWEELKRRDGKEEEKDMWPERMIAMALRAIGELFNHHMYAEMRAKEEKKKPEAKTEEARRARREIMKYHHEAGRLIEEAMRRLMERHKK  10111000010000000001000101011100110000011011011111101111111011000000000001010101001000000010010000010000000000010001
3 -213.972  KKWEEMMREAERQGKEYAQKAWKEALLEWKWMRKRPVTEEMKDMAPEWMIAAALRAIGEHFNIYWQQKLEHEKLRKIPNVPEEELEKGKEELKRIEEEAARMAEKYMQELRKKMES  00011000010100000001010100011010000000010011011011111111111011001100110110110110011000001010001000000110000000101000
4 -195.138  PRPEEMARFAKEEMHKHEEKAYREFLLEYELAIRKNPTEEPKDMQPEWAIAAALRAIGEIFNQWMYHLLEIRKENGSSHTRYEEREKYRKLAKRLHEEAAKEIWKFMHEAMRRFES  01111000010000000001000100010100010000010011011001111111111111000000110011000000001000000101000000000000000000000000