rstoolbox.utils.sequencing_enrichment

rstoolbox.utils.sequencing_enrichment(indata, enrichment=None, bounds=None, matches=None, seqID='A')

Retrieve data from multiple NGS files.

Allows obtaining data from multiple files while attaching them to two conditions: a primary one (key1) and a secondary one (key2).

For instance, assume the data were obtained by selecting sequences with two different binders, each at three different concentrations; we would define an indata dictionary such as:

{'binder1': {'conc1': 'file1.fastq', 'conc2': 'file2.fastq', 'conc3': 'file3.fastq'},
 'binder2': {'conc1': 'file4.fastq', 'conc2': 'file5.fastq', 'conc3': 'file6.fastq'}}

For each binder we can also calculate the enrichment between any two concentrations by defining an enrichment dictionary such as:

{'binder1': ['conc1', 'conc3'],
 'binder2': ['conc1', 'conc3']}
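
Because the two dictionaries must agree on their keys, it can help to check them before reading any file. The following is a minimal sketch of such a check; check_enrichment_keys is a hypothetical helper written for illustration and is not part of rstoolbox.

```python
def check_enrichment_keys(indata, enrichment):
    """Hypothetical helper (not part of rstoolbox).

    Returns a list of human-readable problems; an empty list means every
    binder and concentration named in `enrichment` also exists in `indata`.
    """
    problems = []
    for binder, concs in enrichment.items():
        if binder not in indata:
            problems.append(f"unknown binder: {binder}")
            continue
        if len(concs) != 2:
            problems.append(f"{binder}: expected two concentrations, got {len(concs)}")
        for conc in concs:
            if conc not in indata[binder]:
                problems.append(f"{binder}: unknown concentration {conc}")
    return problems


indata = {'binder1': {'conc1': 'file1.fastq', 'conc2': 'file2.fastq', 'conc3': 'file3.fastq'},
          'binder2': {'conc1': 'file4.fastq', 'conc2': 'file5.fastq', 'conc3': 'file6.fastq'}}
enrichment = {'binder1': ['conc1', 'conc3'],
              'binder2': ['conc1', 'conc3']}

print(check_enrichment_keys(indata, enrichment))
```

For the two dictionaries shown above the returned list is empty.
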
Parameters:
  • indata (dict) – First key is binder, second key is concentration, value is fastq file.
  • enrichment (dict) – Key is binder, value is list of two concentrations (min, max) to calculate enrichment.
  • bounds (list() of str) – N and C limits of the sequences. Follows the logic of adapt_length() with inclusive set to False.
  • matches (list() of str) – Sequence pattern to match. Follows the same logic as in translate_3frames().
Returns:

DesignFrame with the sequences, the count of each sequence per fastq file and, if requested, the enrichment per binder.

Example

(The sequence column is omitted from the output so that the differences are easier to see.)

In [1]: from rstoolbox.io import read_fastq
   ...: from rstoolbox.utils import sequencing_enrichment
   ...: import pandas as pd
   ...: pd.set_option('display.width', 1000)
   ...: pd.set_option('display.max_columns', 20)
   ...: indat = {'binder1': {'conc1': '../rstoolbox/tests/data/cdk2_rand_001.fasq.gz',
   ...:                      'conc2': '../rstoolbox/tests/data/cdk2_rand_002.fasq.gz',
   ...:                      'conc3': '../rstoolbox/tests/data/cdk2_rand_003.fasq.gz'},
   ...:          'binder2': {'conc1': '../rstoolbox/tests/data/cdk2_rand_004.fasq.gz',
   ...:                      'conc2': '../rstoolbox/tests/data/cdk2_rand_005.fasq.gz',
   ...:                      'conc3': '../rstoolbox/tests/data/cdk2_rand_006.fasq.gz'}}
   ...: df = sequencing_enrichment(indat)
   ...: df[[_ for _ in df.columns if _ != 'sequence_A']].head()
   ...: 
Out[1]: 
   description  binder1_conc1  binder1_conc2  binder1_conc3  binder2_conc1  binder2_conc2  binder2_conc3  len
0  0            4.0            1.0            0.0            1.0            0.0            3.0            304
1  1            4.0            2.0            1.0            2.0            1.0            0.0            304
2  2            3.0            2.0            4.0            1.0            1.0            1.0            304
3  3            3.0            1.0            1.0            1.0            0.0            3.0            304
4  4            3.0            0.0            1.0            2.0            2.0            1.0            298

In [2]: enrich = {'binder1': ['conc1', 'conc3'],
   ...:           'binder2': ['conc1', 'conc3']}
   ...: df = sequencing_enrichment(indat, enrich)
   ...: df[[_ for _ in df.columns if _ != 'sequence_A']].head()
   ...: 
Out[2]: 
   description  binder1_conc1  binder1_conc2  binder1_conc3  binder2_conc1  binder2_conc2  binder2_conc3  len  enrichment_binder1  enrichment_binder2
0  0            4.0            1.0            0.0            1.0            0.0            3.0            304 -1.00                0.333333          
1  1            4.0            2.0            1.0            2.0            1.0            0.0            304  4.00               -1.000000          
2  2            3.0            2.0            4.0            1.0            1.0            1.0            304  0.75                1.000000          
3  3            3.0            1.0            1.0            1.0            0.0            3.0            304  3.00                0.333333          
4  4            3.0            0.0            1.0            2.0            2.0            1.0            298  3.00                2.000000
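
The enrichment values in Out[2] are consistent with dividing the count at the first listed concentration by the count at the second, and reporting -1 when the second count is zero. This is inferred from the numbers above, not taken from the library source. A minimal sketch under that assumption, using the binder1 counts from the example and a hypothetical helper ratio_enrichment:

```python
import numpy as np
import pandas as pd

# Counts copied from the binder1 columns of the example output.
counts = pd.DataFrame({
    "binder1_conc1": [4.0, 4.0, 3.0, 3.0, 3.0],
    "binder1_conc3": [0.0, 1.0, 4.0, 1.0, 1.0],
})


def ratio_enrichment(df, binder, conc_a, conc_b):
    """Hypothetical helper (not part of rstoolbox).

    Assumed rule: counts at conc_a divided by counts at conc_b,
    with -1 wherever the conc_b count is zero.
    """
    num = df[f"{binder}_{conc_a}"].to_numpy(dtype=float)
    den = df[f"{binder}_{conc_b}"].to_numpy(dtype=float)
    out = np.full(num.shape, -1.0)            # default for zero denominators
    np.divide(num, den, out=out, where=den != 0)
    return pd.Series(out, index=df.index, name=f"enrichment_{binder}")


counts["enrichment_binder1"] = ratio_enrichment(counts, "binder1", "conc1", "conc3")
print(counts)
```

Under this assumed rule the computed column reproduces the enrichment_binder1 values shown in Out[2].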