rstoolbox.analysis.positional_sequence_similarity

rstoolbox.analysis.positional_sequence_similarity(df, seqID=None, ref_seq=None, key_residues=None, matrix='BLOSUM62')

Per-position identity and similarity against a reference_sequence.

Given a data container with a set of sequences, it evaluates the percentage of identities and similarities that the whole set has against a reference_sequence. It does so per sequence position rather than per individual sequence.

In a way, this generates an extreme simplification of a SequenceFrame.
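
The per-position calculation described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the function name positional_similarity_sketch, the helper is_positive, and the POSITIVE_PAIRS set are hypothetical, and the set holds just a handful of residue pairs that score above zero in BLOSUM62 rather than the full matrix. It also assumes that a "positive" is an identical residue or a pair with a positive substitution score, and that all sequences have the same length as the reference.

```python
from typing import Dict, List, Union

# ASSUMPTION: a tiny, illustrative subset of residue pairs with a positive
# BLOSUM62 score. The real matrix contains many more entries.
POSITIVE_PAIRS = {frozenset(p) for p in ("IL", "IV", "LV", "KR", "DE", "ST", "FY")}


def is_positive(a: str, b: str) -> bool:
    """Hypothetical helper: identical residues or a positive-scoring pair."""
    return a == b or frozenset((a, b)) in POSITIVE_PAIRS


def positional_similarity_sketch(
    sequences: List[str], ref_seq: str
) -> List[Dict[str, Union[int, float]]]:
    """Return one row per reference position with the fraction of sequences
    that are identical or similar to the reference at that position."""
    total = len(sequences)
    rows = []
    for pos, ref_res in enumerate(ref_seq, start=1):
        # Collect the residue that every sequence has at this position.
        column = [seq[pos - 1] for seq in sequences]
        identity = sum(res == ref_res for res in column) / total
        positive = sum(is_positive(res, ref_res) for res in column) / total
        rows.append(
            {"position": pos, "identity_perc": identity, "positive_perc": positive}
        )
    return rows


if __name__ == "__main__":
    for row in positional_similarity_sketch(["KLVE", "RLIE", "KAVD", "RLVE"], "KLVE"):
        print(row)
```

Each row summarises one column of the alignment across the whole set, which is why the result has as many rows as the reference sequence has positions, regardless of how many sequences are compared.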

Parameters:
  • df (Union[DesignFrame, FragmentFrame]) – Data container.
  • seqID (str) – Identifier of the sequence of interest. Required when input is DesignFrame.
  • ref_seq (str) – Reference sequence. Required when input is FragmentFrame. Will overwrite the reference sequence of DesignFrame if provided.
  • key_residues (Union[int, list() of int, str, Selection]) – Residues of interest.
  • matrix (str) – Identifier of the matrix used to evaluate similarity. Default is BLOSUM62.
Returns:

DataFrame - where rows are sequence positions and columns are identity_perc and positive_perc.

Raises:
AttributeError: if the data passed is not in Union[DesignFrame, FragmentFrame]. It will not try to cast a provided DataFrame, as it would not be possible to know into which of the two possible inputs it needs to be cast.
AttributeError: if input is DesignFrame and seqID is not provided.
KeyError: if there is no sequence information for chain seqID of the decoys when input is DesignFrame.
AttributeError: if there is no reference_sequence for chain seqID of the decoys when input is DesignFrame.
AttributeError: if input is FragmentFrame and ref_seq is not provided.

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import positional_sequence_similarity
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': 'B'})
   ...: df.add_reference_sequence('B', df.get_sequence('B').values[0])
   ...: df = positional_sequence_similarity(df.iloc[1:], 'B')
   ...: df.head()
   ...: 
Out[1]: 
   identity_perc  positive_perc
1  0.4            0.4          
2  0.2            1.0          
3  0.8            0.8          
4  1.0            1.0          
5  1.0            1.0
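
Because the result is a regular DataFrame indexed by sequence position, it can be queried with ordinary pandas operations. The sketch below rebuilds the five rows shown above by hand (the variable names sim, conservative and fully_conserved are illustrative, not part of the library) and picks out positions where similarity exceeds identity, as well as positions that are fully conserved.

```python
import pandas as pd

# Hand-built copy of the five rows from the example output above.
sim = pd.DataFrame(
    {
        "identity_perc": [0.4, 0.2, 0.8, 1.0, 1.0],
        "positive_perc": [0.4, 1.0, 0.8, 1.0, 1.0],
    },
    index=[1, 2, 3, 4, 5],
)

# Positions where some decoys differ from the reference but only by
# similar residues: similarity is higher than identity.
conservative = sim[sim["positive_perc"] > sim["identity_perc"]]

# Positions where every decoy matches the reference exactly.
fully_conserved = sim.index[sim["identity_perc"] == 1.0].tolist()

print(conservative)
print(fully_conserved)
```

In these five rows, only position 2 has a higher similarity than identity, and positions 4 and 5 are fully conserved.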