Reading Rosetta outputs

One of the key advantages of rstoolbox is the ability to control the amount and type of data that is loaded from a silent/score file. This control is managed through a definition, a dictionary that describes the type of data that can be loaded.

Note

definition is meant to be applied to parse_rosetta_file().

As of now, there are 12 different options that can be combined into a definition:

definition term     description
scores              Basic selection of the scores to store. Default is all scores.
scores_ignore       Selection of specific scores to ignore.
scores_rename       Rename some score names to others.
scores_by_residue   Collect per-residue scores into a single array value.
scores_missing      Names of scores that might be missing in some decoys.
naming              Use the decoy identifier’s name to create extra score terms.
sequence            Pick sequence data from the silent file.
structure           Pick structural data from the silent file.
psipred             Pick PSIPRED data from the silent file.
dihedrals           Retrieve dihedral data from the silent file.
labels              Retrieve residue labels from the silent file.
graft_ranges        When using the MotifGraftMover, multi-columns will be created when more than one segment is grafted. Provide here the number of segments.

Tip

definition can be passed directly as a dictionary or can be saved as a JSON or YAML file and loaded from there.
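
As a minimal sketch of the tip above, a definition can be written to a JSON file with the standard library and read back unchanged. The file name and the particular combination of terms are assumptions made for this example only.

```python
import json
import os
import tempfile

# A definition combining two of the terms listed in the table above.
definition = {
    'scores': ['score', 'packstat', 'description'],
    'sequence': 'AB',
}

# Hypothetical file location, chosen for this example only.
path = os.path.join(tempfile.mkdtemp(), 'definition.json')
with open(path, 'w') as handle:
    json.dump(definition, handle, indent=2)

# Reading the file back yields the same dictionary.
with open(path) as handle:
    reloaded = json.load(handle)

print(reloaded == definition)  # True
```

Keeping the definition in a file makes it easy to reuse the same selection across several analyses. How the file is handed to parse_rosetta_file() should be checked against that function's reference.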

scores

This is the most basic parameter; it refers to the regular scores in the silent/score file and selects just the scores that are wanted for the analysis. There are three main ways to define scores. Provide a list naming the scores of interest:

{'scores': ['score', 'packstat', 'description']}

add an asterisk string if all scores are wanted (this is the default value for this parameter):

{'scores': '*'}

or add a minus sign, which will ignore all scores:

{'scores': '-'}

Example

In [1]: from rstoolbox.io import parse_rosetta_file
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: ifile = '../rstoolbox/tests/data/input_2seq.minisilent.gz'
   ...: definition1 = {'scores': ['score', 'packstat', 'description']}
   ...: df = parse_rosetta_file(ifile, definition1)
   ...: df.head()
   ...: 
Out[1]: 
     score  packstat                     description
0 -206.678  0.633     test_3lhp_binder_labeled_00001
1 -214.362  0.577     test_3lhp_binder_labeled_00002
2 -203.582  0.568     test_3lhp_binder_labeled_00003
3 -213.779  0.614     test_3lhp_binder_labeled_00004
4 -213.972  0.591     test_3lhp_binder_labeled_00005

In [2]: definition2 = {'scores': '*'}
   ...: df1 = parse_rosetta_file(ifile, definition2)
   ...: df2 = parse_rosetta_file(ifile)
   ...: df1.head()
   ...: 
Out[2]: 
     score    fa_atr   fa_rep   fa_sol  fa_intra_rep  fa_elec  pro_close  hbond_sr_bb  hbond_lr_bb  hbond_bb_sc  hbond_sc  dslf_fa13    rama   omega   fa_dun  p_aa_pp  yhh_planarity     ref  BUNS  B_ni_mtcontacts  B_ni_rmsd  B_ni_rmsd_threshold  B_ni_trials  GRMSD2Target  GRMSD2Template  LRMSD2Target  LRMSDH2Target  LRMSDLH2Target  cav_vol  design_score  packstat  rmsd_drift    time                     description
0 -206.678 -1510.021  268.657  853.020  2.921        -145.015  5.825     -150.177     -2.452       -13.326      -36.936    0.0       -28.499  42.312  551.509 -23.219   0.000         -21.277  22.0  57.0             0.568      5.0                  1.0          1.976         1.927           4.404         4.055          2.490           387.371 -255.445       0.633     1.677       3194.0  test_3lhp_binder_labeled_00001
1 -214.362 -1490.968  267.328  824.258  3.019        -133.421  6.018     -151.609     -2.452       -12.584      -33.021    0.0       -29.998  38.315  545.056 -23.612   0.003         -20.693  14.0  54.0             0.333      5.0                  1.0          2.659         2.417           4.469         4.124          2.730           332.657 -264.239       0.577     2.240       3210.0  test_3lhp_binder_labeled_00002
2 -203.582 -1483.595  268.234  831.771  2.823        -133.837  6.049     -151.148     -3.091       -11.769      -35.851    0.0       -27.703  38.159  542.102 -25.758   0.007         -19.974  14.0  56.0             0.264      5.0                  1.0          2.026         1.607           5.208         4.598          2.907           333.851 -256.270       0.568     1.522       3235.0  test_3lhp_binder_labeled_00003
3 -213.779 -1519.755  271.747  863.244  2.763        -149.878  5.899     -153.525     -2.452       -11.957      -38.971    0.0       -26.899  42.948  553.487 -23.821   0.020         -26.629  25.0  54.0             0.580      5.0                  1.0          2.407         2.047           5.728         4.866          3.002           280.594 -260.461       0.614     1.791       3237.0  test_3lhp_binder_labeled_00004
4 -213.972 -1504.434  267.568  841.170  2.782        -129.666  5.839     -150.024     -2.452       -11.061      -35.585    0.0       -30.954  40.120  545.857 -23.642   0.001         -29.490  22.0  51.0             0.233      5.0                  1.0          2.245         1.907           3.787         3.258          2.692           330.420 -267.847       0.591     1.745       3239.0  test_3lhp_binder_labeled_00005

In [3]: (df1.columns == df2.columns).all()
Out[3]: True

In [4]: definition3 = {'scores': '-'}
   ...: df3 = parse_rosetta_file(ifile, definition3)
   ...: df3.head()
   ...: 
Out[4]: 
Empty DesignFrame
Columns: []
Index: []

scores_ignore

This is basically the opposite of the previous one, meant for cases in which particular scores are to be ignored. It is useful when mixing data from different experiments in which one has generated more score data than the other (for example, one needed loop closure and includes some loop_closure scores). Extra scores can affect the concatenation of the different data containers as per pandas constraints. There are two ways to define scores_ignore, either list the scores to skip:

{'scores_ignore': ['loop_closure', 'packstat']}

or use an asterisk string to ignore all scores:

{'scores_ignore': '*'}
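
The two forms of scores_ignore can be pictured with a small stand-alone sketch. The helper apply_scores_ignore() below is hypothetical and is not part of rstoolbox; it only illustrates the selection rule described above on a plain list of score names.

```python
def apply_scores_ignore(columns, scores_ignore):
    """Return the score names that remain after applying ``scores_ignore``.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if scores_ignore == '*':
        # The asterisk ignores every score.
        return []
    ignored = set(scores_ignore)
    return [name for name in columns if name not in ignored]


columns = ['score', 'loop_closure', 'packstat', 'description']
kept = apply_scores_ignore(columns, ['loop_closure', 'packstat'])
print(kept)  # ['score', 'description']
```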

scores_rename

Retrieves a particular score term under a different name. Again, this is useful for merging data from multiple runs when some names do not match. It is defined as a dictionary in which keys are the original score names and values are the new names:

{'scores_rename': {'grmsd': 'globalRMSD', 'lrmsd': 'localRMSD'}}
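
The effect of the mapping can be sketched on a plain dictionary standing in for one decoy's scores. The helper rename_scores() and the numeric values are assumptions for illustration and do not come from the library.

```python
def rename_scores(row, scores_rename):
    """Return a copy of ``row`` with keys renamed according to ``scores_rename``.

    Hypothetical helper for illustration; names without an entry are kept as-is.
    """
    return {scores_rename.get(name, name): value for name, value in row.items()}


# Made-up score values for a single decoy.
row = {'score': -206.678, 'grmsd': 1.976, 'lrmsd': 4.404}
renamed = rename_scores(row, {'grmsd': 'globalRMSD', 'lrmsd': 'localRMSD'})
print(renamed)
# {'score': -206.678, 'globalRMSD': 1.976, 'localRMSD': 4.404}
```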

scores_by_residue

Scores that target per-residue values are normally ignored by the library. This parameter allows for them to be captured as a single vector score. As of now, the only available score is residue_ddg_:

{'scores_by_residue': ['residue_ddg_']}

as provided by Rosetta’s ddG mover:

<ddG name="(&string;)" per_residue_ddg="1" />

scores_missing

Sometimes a decoy is missing a column. It happens. By providing the name of those columns, this issue can be corrected so other scores are not mistakenly assigned:

{'scores_missing': ['rama_per_res_filter']}

naming

Naming conventions are important. As long as one keeps that in mind, one can turn particular identifiers inside a decoy description into new score columns, which makes it possible to group different decoys for analysis. As an example, let’s assume that the decoys in a set are named as follows:

nubinitio_auto_2015_binder_2pw9C_0001

In this case, three of the first four elements of the identifier relate to the conditions of the experiment. To capture them, an array needs to be defined. As long as a name is set for an element, that element will be captured as a new score. Elements can be skipped with an empty string. The array has to be long enough to cover all the elements of interest, but it does not need to have as many fields as there are elements in the identifier. Thus:

{'naming': ['experiment', 'fragments', '', 'binder']}

will create three new scores, experiment, fragments and binder, with values nubinitio, auto and binder, respectively.
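
The mapping between identifier elements and score names can be sketched as follows. The helper scores_from_name() is hypothetical, and the underscore separator is an assumption inferred from the example identifier above.

```python
def scores_from_name(description, naming, separator='_'):
    """Pair the elements of a decoy identifier with the names in ``naming``.

    Hypothetical helper for illustration. Empty names skip their element,
    and elements beyond the length of ``naming`` are not captured.
    """
    elements = description.split(separator)
    return {label: element for label, element in zip(naming, elements) if label}


extra = scores_from_name('nubinitio_auto_2015_binder_2pw9C_0001',
                         ['experiment', 'fragments', '', 'binder'])
print(extra)
# {'experiment': 'nubinitio', 'fragments': 'auto', 'binder': 'binder'}
```

Because zip() stops at the shorter sequence, the trailing 2pw9C and 0001 elements are simply left out, which mirrors the rule that the array does not need to cover the whole identifier.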

sequence

Sequence data is integrated into the silent file by default. It can be captured and used to evaluate mutants and sequence drift, among other things. To retrieve that data, a string with the identifiers of all the chains of interest needs to be provided, thus:

{'sequence': 'AB'}

will capture the sequence for chains A and B. Sequence data can then be accessed through the appropriate getter functions of DesignFrame and DesignSeries, and is expected by any of the sequence analysis functions and plots.

Alternatively, * can be used to indicate that all sequence chains should be retrieved and $ can be used to retrieve the data without keeping chain track.

Warning

The $ should only be used when the silent file contains multiple poses that each hold a single chain, but with a different chain identifier in each pose. Unexpected behaviour will occur otherwise.

structure

Secondary structure data can be loaded into the silent file by means of Rosetta’s WriteSSEMover:

<WriteSSEMover name="(&string;)" dssp="1" />

Similarly to sequence, it can be loaded as extra data by calling the chains of interest:

{'structure': 'AB'}

or using the available wildcards.

psipred

Secondary structure prediction data can be loaded into the silent file by means of Rosetta’s WriteSSEMover:

<WriteSSEMover name="(&string;)" cmd="/path/to/psipred" />

Similarly to sequence and structure, it can be loaded as extra data by calling the chains of interest:

{'psipred': 'AB'}

or using the available wildcards.

dihedrals

Phi and psi dihedral angle data can be loaded into the silent file by means of Rosetta’s WriteSSEMover:

<WriteSSEMover name="(&string;)" write_phipsi="1" />

Angle data from specific chains can be loaded as:

{'dihedrals': 'AB'}

or using the available wildcards.

The data will be loaded as a single array of floats.

labels

Residue labels make it possible to track residues through a simulation by a particular tag. They can also be used to highlight residues with particular properties during the simulation. They can be saved into the silent file with Rosetta’s DisplayPoseLabelsMover:

<DisplayPoseLabelsMover name="(&string;)" write="1" />

To retrieve that data, a list with the names of the labels of interest can be provided:

{'labels': ['MOTIF', 'CONTEXT', 'HOTSPOTS']}

graft_ranges

Rosetta’s MotifGraftMover allows the grafting of one or more segments into a target protein. The mover will generate several score terms with more than one column. In order to properly assign the repeated columns to the same header, the number of inserted segments needs to be provided:

{'graft_ranges': 2}

Warning

The scores targeted by this definition are not compatible with scores_rename.

Note

The ranges generated by MotifGraftMover are comma-separated start,end positions, such as 10,35. To cast this into the appropriate format to use with Selection and, thus, with the key_residues attribute present in some functions, one can call string replace: df['graft_out_scaffold_ranges'] = df['graft_out_scaffold_ranges'].str.replace(',', '-').