rstoolbox.analysis.sequential_frequencies

rstoolbox.analysis.sequential_frequencies(df, seqID, query='sequence', seqType='protein', cleanExtra=True, cleanUnused=-1)

Generates a SequenceFrame with the frequencies of the sequences in the DesignFrame that are identified by seqID.

If there is a reference_sequence for this seqID, it will also be attached to the SequenceFrame.

All letters in the sequence will be capitalized. All symbols that do not belong to string.ascii_uppercase will be transformed to “*”, as this is the symbol that the substitution matrices recognize as a gap.
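
The normalization described above can be illustrated with a minimal sketch. The helper name `normalize_sequence` is hypothetical and is not part of rstoolbox; it only mirrors the stated rule (capitalize, then map anything outside string.ascii_uppercase to “*”).

```python
import string


def normalize_sequence(seq):
    """Uppercase a sequence and replace every non A-Z symbol with '*'.

    Hypothetical helper for illustration; not the rstoolbox implementation.
    """
    allowed = set(string.ascii_uppercase)
    return "".join(c if c in allowed else "*" for c in seq.upper())


print(normalize_sequence("mk-tx?a"))  # MK*TX*A
```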

This function is directly accessible through some DesignFrame methods.

Parameters:

df (Union[DesignFrame, DataFrame]) – Data container.
seqID (str) – Identifier of the sequence of interest.
query (str) – Content type to load from the input data: sequence, structure or structure_prediction.
seqType (str) – Type of sequence: protein, dna, rna or protein_sse.
cleanExtra (bool) – Remove the non-regular amino/nucleic acids from the SequenceFrame if they are empty for all positions; in practice, this removes ambiguous and gap identifiers.
cleanUnused (float) – Remove the regular amino/nucleic acids from the SequenceFrame if their frequency is equal to or under this value; this targets fully empty positions to minimise the size of the matrix. The value is the threshold at or below which a position is considered empty. Thus, -1 applies no filter, while 0.3 treats all frequencies equal to or lower than 0.3 as empty.
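
To make the cleanUnused threshold concrete, here is a rough sketch on a plain pandas DataFrame. It assumes one possible reading of the description, namely that a residue column is dropped when its frequency is at or below the threshold at every position. The function `drop_unused_columns` and the toy frequencies are hypothetical; the actual rstoolbox behaviour may differ.

```python
import pandas as pd


def drop_unused_columns(freqs, threshold=-1):
    """Keep only the columns with at least one value strictly above threshold.

    Assumed interpretation of cleanUnused; not the rstoolbox implementation.
    Frequencies are never negative, so threshold=-1 keeps every column.
    """
    keep = (freqs > threshold).any(axis=0)
    return freqs.loc[:, keep]


# Toy frequencies: rows are positions, columns are residues (made-up values).
freqs = pd.DataFrame(
    {"K": [0.2, 0.7], "P": [0.3, 0.0], "T": [0.5, 0.3], "C": [0.0, 0.0]},
    index=[1, 2],
)

print(list(drop_unused_columns(freqs, -1).columns))   # ['K', 'P', 'T', 'C']
print(list(drop_unused_columns(freqs, 0).columns))    # ['K', 'P', 'T']
print(list(drop_unused_columns(freqs, 0.3).columns))  # ['K', 'T']
```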

Returns:

SequenceFrame

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: from rstoolbox.analysis import sequential_frequencies
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 500)
   ...: df = parse_rosetta_file("../rstoolbox/tests/data/input_2seq.minisilent.gz",
   ...:                         {'scores': ['score'], 'sequence': 'AB'})
   ...: df = sequential_frequencies(df, 'B')
   ...: df.head()
   ...: 
Out[1]: 
     C    D    S    Q         K    I         P    T    F    N    G    H    L         R         W    A    V    E    Y    M
1  0.0  0.0  0.0  0.0  0.166667  0.0  0.333333  0.5  0.0  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.666667  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.333333  0.000000  0.0  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0  0.000000  0.0  0.833333  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.166667  0.0  0.0  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.000000  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.0  1.0  0.0  0.0
5  0.0  0.0  0.0  0.0  0.000000  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.0  1.0  0.0  0.0
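
Each row of the table above is a sequence position and each column a residue, so every row holds the fraction of sequences carrying that residue at that position. The following standard-library sketch shows the same kind of calculation on a made-up set of equal-length sequences. The function `position_frequencies` and the toy data are hypothetical, and this is not how rstoolbox computes the SequenceFrame internally.

```python
from collections import Counter


def position_frequencies(sequences):
    """Return {position: {letter: frequency}} using 1-based positions.

    Hypothetical helper; assumes all sequences have the same length.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    total = len(sequences)
    freqs = {}
    # zip(*sequences) yields one tuple of letters per position.
    for pos, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        freqs[pos] = {letter: n / total for letter, n in counts.items()}
    return freqs


toy = ["TKP", "TRP", "PKW", "TKP"]  # made-up sequences, not the test data
for pos, row in position_frequencies(toy).items():
    print(pos, row)
```

Residues that never occur at a position are simply absent from that position's dictionary here, whereas the SequenceFrame shown above reports them as 0.0.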