rstoolbox.analysis.positional_sequence_similarity

rstoolbox.analysis.positional_sequence_similarity(df, seqID=None, ref_seq=None, key_residues=None, matrix='BLOSUM62')

Per-position identity and similarity against a reference_sequence.

Given a data container with a set of sequences, it evaluates the percentage of identities and similarities that the whole set has against a reference_sequence. It does so by sequence position rather than by individual sequence.

In a way, this generates an extreme simplification of a SequenceFrame.
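The per-position evaluation described above can be illustrated with a minimal, library-free sketch. It is not rstoolbox's implementation: the helper names (`toy_score`, `positional_similarity`) and the toy scoring table are hypothetical, and the rule that a position counts as "positive" when the substitution score is greater than zero is an assumption made for illustration.

```python
# Toy similarity scores (hypothetical values, NOT the real BLOSUM62 table).
TOY_MATRIX = {
    frozenset("IL"): 2,
    frozenset("KR"): 2,
    frozenset("DE"): 2,
}


def toy_score(a, b):
    """Score a residue pair: identical pairs score high, unlisted pairs negative."""
    if a == b:
        return 4
    return TOY_MATRIX.get(frozenset((a, b)), -1)


def positional_similarity(sequences, ref_seq):
    """Return {position: (identity_fraction, positive_fraction)}, 1-based positions.

    Assumption: a residue is 'positive' when its score against the
    reference residue is greater than zero (identities included).
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    if any(len(s) != len(ref_seq) for s in sequences):
        raise ValueError("all sequences must match the reference length")

    total = len(sequences)
    result = {}
    for i, ref_aa in enumerate(ref_seq):
        # Look at one column of the sequence set at a time.
        column = [s[i] for s in sequences]
        identical = sum(aa == ref_aa for aa in column)
        positive = sum(toy_score(aa, ref_aa) > 0 for aa in column)
        result[i + 1] = (identical / total, positive / total)
    return result


ref = "IKD"
seqs = ["IKD", "LRD", "LKA", "IKE"]
table = positional_similarity(seqs, ref)
for position, (identity, positive) in table.items():
    print(position, identity, positive)
```

Each entry summarizes one column of the sequence set, so the result has one row per position regardless of how many sequences are compared. The values are reported as fractions between 0 and 1, mirroring the range of the values in the example output on this page.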

Parameters:

df (Union[DesignFrame, FragmentFrame]) – Data container.
seqID (str) – Identifier of the sequence of interest. Required when input is DesignFrame.
ref_seq (str) – Reference sequence. Required when input is FragmentFrame. Will overwrite the reference sequence of DesignFrame if provided.
key_residues (Union[int, list of int, str, Selection]) – Residues of interest.
matrix (str) – Identifier of the matrix used to evaluate similarity. Default is BLOSUM62.

Returns:

DataFrame - where rows are sequence positions and columns are identity_perc and positive_perc.

Raises:

AttributeError:	if the data passed is not in Union[`DesignFrame`, `FragmentFrame`]. It will not try to cast a provided `DataFrame`, as it would not be possible to know into which of the two possible inputs it needs to be cast.
AttributeError:	if input is `DesignFrame` and `seqID` is not provided.
KeyError:	if there is no sequence information for chain `seqID` of the decoys when input is `DesignFrame`.
AttributeError:	if there is no `reference_sequence` for chain `seqID` of the decoys when input is `DesignFrame`.
AttributeError:	if input is `FragmentFrame` and `ref_seq` is not provided.

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import positional_sequence_similarity
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': 'B'})
   ...: df.add_reference_sequence('B', df.get_sequence('B').values[0])
   ...: df = positional_sequence_similarity(df.iloc[1:], 'B')
   ...: df.head()
   ...: 
Out[1]: 
   identity_perc  positive_perc
1  0.4            0.4          
2  0.2            1.0          
3  0.8            0.8          
4  1.0            1.0          
5  1.0            1.0