tcrdist_rs¶
- tcrdist_rs.amino_acid_distance(s1, s2)¶
Compute the distance between two amino acids using BLOSUM62 substitution penalties.
For valid residue characters,
\[\begin{split}d(a, a') = \begin{cases} 0, & a = a' \\ \min(4, 4 - \mathrm{BLOSUM62}(a, a')), & \mathrm{else} \end{cases}\end{split}\]
This function is case-insensitive, i.e., lowercase and uppercase residues can be compared.
- Parameters:
s1 (str) – An amino acid residue.
s2 (str) – Another amino acid residue.
- Returns:
The distance between the two residues.
- Return type:
int
Examples
>>> s1 = "C"
>>> s2 = "a"
>>> assert(amino_acid_distance(s1, s2) == 4)
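As an illustration of the formula above, here is a minimal pure-Python sketch. The name `amino_acid_distance_sketch` and the `BLOSUM62_EXCERPT` table are hypothetical and hold only a handful of BLOSUM62 scores; the library itself presumably carries the full matrix.

```python
# A few BLOSUM62 scores, keyed by unordered residue pair. This excerpt is for
# illustration only; a real implementation needs the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    frozenset("CA"): 0,
    frozenset("IL"): 2,
    frozenset("IV"): 3,
    frozenset("DE"): 2,
}


def amino_acid_distance_sketch(a: str, b: str) -> int:
    """Return 0 for identical residues, else min(4, 4 - BLOSUM62(a, b))."""
    a, b = a.upper(), b.upper()  # the comparison ignores case
    if a == b:
        return 0
    score = BLOSUM62_EXCERPT[frozenset((a, b))]  # KeyError outside the excerpt
    return min(4, 4 - score)
```

With this excerpt, `amino_acid_distance_sketch("C", "a")` gives 4, in line with the documented example, because the BLOSUM62 score for C/A is 0.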
- tcrdist_rs.cdr1_distance(s1, s2)¶
Compute the CDR1 distance between V alleles using precomputed TCRdists.
TCRdists were precomputed using ntrim = ctrim = 0, dist_weight = 1, gap_penalty = 4, and fixed_gappos = True.
- Parameters:
s1 (str) – A V allele.
s2 (str) – Another V allele.
- Returns:
The distance between the V alleles’ CDR1s.
- Return type:
int
Examples
>>> s1 = "TRBV2*01"
>>> s2 = "TRBV6-2*01"
>>> assert(cdr1_distance(s1, s2) == 8)
- tcrdist_rs.cdr2_distance(s1, s2)¶
Compute the CDR2 distance between V alleles using precomputed TCRdists.
TCRdists were precomputed using ntrim = ctrim = 0, dist_weight = 1, gap_penalty = 4, and fixed_gappos = True.
- Parameters:
s1 (str) – A V allele.
s2 (str) – Another V allele.
- Returns:
The distance between the V alleles’ CDR2s.
- Return type:
int
Examples
>>> s1 = "TRBV2*01"
>>> s2 = "TRBV6-2*01"
>>> assert(cdr2_distance(s1, s2) == 24)
- tcrdist_rs.hamming(s1, s2)¶
Compute the Hamming distance between two strings.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings.
- Parameters:
s1 (str) – A string.
s2 (str) – Another string.
- Returns:
The Hamming distance between the two strings.
- Return type:
int
Examples
>>> s1 = "abcdefg"
>>> s2 = "abddefg"
>>> assert(hamming(s1, s2) == 1)
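For readers who want to cross-check results, the following is a rough pure-Python equivalent. The name `hamming_sketch` is hypothetical. Unlike the behavior described above, this sketch raises a `ValueError` on unequal lengths rather than terminating the program.

```python
def hamming_sketch(s1: str, s2: str) -> int:
    """Count the byte positions at which the two strings differ."""
    b1, b2 = s1.encode(), s2.encode()  # compare as byte strings
    if len(b1) != len(b2):
        raise ValueError("strings must have equal lengths")
    return sum(x != y for x, y in zip(b1, b2))
```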
- tcrdist_rs.hamming_bin_many_to_many(seqs1, seqs2, parallel=True)¶
Compute the Hamming distance between many strings and many other strings, counting the number of occurrences of each distance.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings. This method is faster and uses less memory than returning the full vector of distances and then counting occurrences.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The number of occurrences of Hamming distances, where the index gives the Hamming distance.
- Return type:
list of int
Examples
>>> seqs1 = ['abc', 'cdf', 'aaa', 'tfg']
>>> seqs2 = ['abc', 'cdf', 'ggg', 'uuu', 'otp']
>>> out = hamming_bin_many_to_many(seqs1, seqs2)
>>> assert out == [2, 0, 2, 16]
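The binning can be pictured with a short pure-Python sketch. The name `hamming_bin_sketch` is hypothetical, and the sketch assumes non-empty inputs of equal-length strings and one bin per possible distance (0 through the string length), which matches the shape of the example output but is not stated explicitly in this reference.

```python
def hamming_bin_sketch(seqs1, seqs2):
    """Count how often each Hamming distance occurs over all pairs."""
    length = len(seqs1[0])
    counts = [0] * (length + 1)  # index d holds the number of pairs at distance d
    for a in seqs1:
        for b in seqs2:
            d = sum(x != y for x, y in zip(a, b))
            counts[d] += 1
    return counts


seqs1 = ["abc", "cdf", "aaa", "tfg"]
seqs2 = ["abc", "cdf", "ggg", "uuu", "otp"]
bins = hamming_bin_sketch(seqs1, seqs2)  # [2, 0, 2, 16]
```

Only the counts are kept, so memory use grows with the string length rather than with the number of pairs.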
- tcrdist_rs.hamming_many_to_many(seqs1, seqs2, parallel=True)¶
Compute the Hamming distance between many strings and many others.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Hamming distances among the strings.
- Return type:
list of int
Examples
>>> seqs1 = ["abb", "abc"]
>>> seqs2 = ["abc", "abd", "fcd"]
>>> assert(hamming_many_to_many(seqs1, seqs2, parallel=False) == [1, 1, 3, 0, 1, 3])
- tcrdist_rs.hamming_matrix(seqs, parallel=True)¶
Compute the Hamming distance matrix on a sequence of strings.
This returns the upper right triangle of the distance matrix. The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings.
- Parameters:
seqs (sequence of str) – The strings to be compared. They must all be the same length and have an appropriate representation in bytes.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Hamming distances among the strings.
- Return type:
list of int
Examples
>>> seqs = ["abc", "abd", "fcd"]
>>> assert(hamming_matrix(seqs, parallel=False) == [1, 3, 2])
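Because only the upper right triangle is returned as a flat list, it can help to attach index pairs to the values. The sketch below assumes row-major ordering, i.e. (0, 1), (0, 2), ..., (1, 2), ..., which agrees with the example above; `label_upper_triangle` is a hypothetical helper, not part of the library.

```python
from itertools import combinations


def label_upper_triangle(n_items, dists):
    """Pair each flat distance with its (i, j) indices, i < j, row by row."""
    pairs = list(combinations(range(n_items), 2))
    if len(pairs) != len(dists):
        raise ValueError("expected n_items * (n_items - 1) / 2 distances")
    return [(i, j, d) for (i, j), d in zip(pairs, dists)]


# Using the output shown in the example above.
labeled = label_upper_triangle(3, [1, 3, 2])
# [(0, 1, 1), (0, 2, 3), (1, 2, 2)]
```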
- tcrdist_rs.hamming_neighbor_many_to_many(seqs1, seqs2, threshold, parallel=True)¶
Obtain the Hamming neighbors between a sequence of strings and another sequence of strings.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings. This function is preferable to computing the entire distance matrix and then identifying neighbors for both speed and memory.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
threshold (int) – The largest Hamming distance used to consider a pair of strings neighbors.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The 0th and 1st values of the sublist are the indices of the neighbors in their respective sequences and the 2nd value is the Hamming distance between the pair.
- Return type:
list of list of int
Examples
>>> seqs1 = ["bbi", "abd", "tih", "fcd", "abb"]
>>> seqs2 = ["bbb", "tjh"]
>>> assert(hamming_neighbor_many_to_many(seqs1, seqs2, 1, parallel=False) == [[0, 0, 1], [2, 1, 1], [4, 0, 1]])
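The neighbor search can be sketched in pure Python as a filtered double loop. The name `hamming_neighbors_sketch` is hypothetical, and the sketch treats the threshold as inclusive, which is how the example above behaves.

```python
def hamming_neighbors_sketch(seqs1, seqs2, threshold):
    """Return [i, j, distance] for every pair within the threshold."""
    neighbors = []
    for i, a in enumerate(seqs1):
        for j, b in enumerate(seqs2):
            d = sum(x != y for x, y in zip(a, b))
            if d <= threshold:  # keep only pairs close enough to be neighbors
                neighbors.append([i, j, d])
    return neighbors


seqs1 = ["bbi", "abd", "tih", "fcd", "abb"]
seqs2 = ["bbb", "tjh"]
found = hamming_neighbors_sketch(seqs1, seqs2, 1)
# [[0, 0, 1], [2, 1, 1], [4, 0, 1]]
```

Only the pairs that pass the threshold are stored, which is where the memory saving over a full distance matrix comes from.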
- tcrdist_rs.hamming_neighbor_matrix(seqs, threshold, parallel=True)¶
Obtain the Hamming neighbors from a sequence of strings, ignoring self-neighbors.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings. This function is preferable to computing the entire matrix and then identifying neighbors for both speed and memory.
- Parameters:
seqs (sequence of str) – The strings to be compared. They must all be the same length and have an appropriate representation in bytes.
threshold (int) – The largest Hamming distance used to consider a pair of strings neighbors.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The 0th and 1st values of the sublist are the indices of the neighbors and the 2nd value is the Hamming distance between the pair.
- Return type:
list of list of int
Examples
>>> seqs = ["abc", "abd", "fcd", "abb"]
>>> assert(hamming_neighbor_matrix(seqs, 1, parallel=False) == [[0, 1, 1], [0, 3, 1], [1, 3, 1]])
- tcrdist_rs.hamming_neighbor_one_to_many(seq, seqs, threshold, parallel=True)¶
Compute the Hamming neighbors between one string and many others.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings.
- Parameters:
seq (str) – The string against which all others will be compared.
seqs (sequence of str) – The other strings being compared.
threshold (int) – The largest Hamming distance used to consider a pair of strings neighbors.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The 0th value of the sublist gives the index in seqs of a string that is a neighbor of seq. The 1st value is the Hamming distance between the pair.
- Return type:
list of list of int
Examples
>>> seq = "abb"
>>> seqs = ["abc", "fcd", "abb"]
>>> assert(hamming_neighbor_one_to_many(seq, seqs, 1, parallel=False) == [[0, 1], [2, 0]])
- tcrdist_rs.hamming_neighbor_pairwise(seqs1, seqs2, threshold, parallel=True)¶
Obtain the Hamming neighbors between a sequence of strings and another sequence of strings elementwise.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings. This function is preferable to computing the entire distance matrix and then identifying neighbors for both speed and memory. If the sequences of strings differ in length, the sequence with the fewest items dictates how many comparisons are performed.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
threshold (int) – The largest Hamming distance used to consider a pair of strings neighbors.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The 0th value of the sublist gives the index of the neighbors (only one index is needed since they are compared elementwise) and the 1st value of the sublist is the Hamming distance between the pair.
- Return type:
list of list of int
Examples
>>> seqs1 = ["bbi", "abd", "tih", "fcd", "abb"]
>>> seqs2 = ["bbb", "tjh"]
>>> assert(hamming_neighbor_pairwise(seqs1, seqs2, 1, parallel=False) == [[0, 1]])
- tcrdist_rs.hamming_one_to_many(seq, seqs, parallel=True)¶
Compute the Hamming distance between one string and many others.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings.
- Parameters:
seq (str) – The string against which all others will be compared.
seqs (sequence of str) – The other strings being compared.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Hamming distances among the strings.
- Return type:
list of int
Examples
>>> seq = "abb"
>>> seqs = ["abc", "abd", "fcd"]
>>> assert(hamming_one_to_many(seq, seqs, parallel=False) == [1, 1, 3])
- tcrdist_rs.hamming_pairwise(seqs1, seqs2, parallel=True)¶
Compute the Hamming distance between two sequences of strings elementwise.
The strings being compared must have equal lengths or the program will terminate. Moreover, the strings must be representable as byte strings. If the sequences of strings differ in length, the sequence with the fewest items dictates how many comparisons are performed.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Hamming distances from the elementwise comparisons.
- Return type:
list of int
Examples
>>> seqs1 = ["abb", "abc"]
>>> seqs2 = ["abc", "abd", "fcd"]
>>> assert(hamming_pairwise(seqs1, seqs2, parallel=False) == [1, 1])
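The elementwise behavior, including truncation to the shorter input, mirrors Python's built-in `zip`. A minimal sketch, with the hypothetical name `hamming_pairwise_sketch`:

```python
def hamming_pairwise_sketch(seqs1, seqs2):
    """Compare seqs1[k] with seqs2[k]; zip stops at the shorter input."""
    return [
        sum(x != y for x, y in zip(a, b))
        for a, b in zip(seqs1, seqs2)
    ]


# The third item of the longer list is never compared.
pairwise = hamming_pairwise_sketch(["abb", "abc"], ["abc", "abd", "fcd"])
# [1, 1]
```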
- tcrdist_rs.levenshtein(s1, s2)¶
Compute the Levenshtein distance between two strings.
The strings must be representable as byte strings.
- Parameters:
s1 (str) – A string.
s2 (str) – Another string.
- Returns:
The Levenshtein distance between the two strings.
- Return type:
int
Examples
>>> s1 = "abcdefg"
>>> s2 = "abdcd defgggg"
>>> assert(levenshtein(s1, s2) == 6)
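For reference, the Levenshtein distance can be computed with the classic two-row dynamic program shown below. This is a textbook sketch under the hypothetical name `levenshtein_sketch`, not a description of how the library computes it.

```python
def levenshtein_sketch(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # deleting the first i characters of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

In the documented example the result is 6 because `"abcdefg"` is a subsequence of `"abdcd defgggg"`, so six insertions suffice, and the length difference of six makes fewer impossible.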
- tcrdist_rs.levenshtein_bin_many_to_many(seqs1, seqs2, parallel=True)¶
- tcrdist_rs.levenshtein_exp(s1, s2)¶
Compute the Levenshtein distance between two strings using an exponential search.
The strings must be representable as byte strings. This uses an exponential search to estimate the number of edits. It will be more efficient than levenshtein when the number of edits is small.
- Parameters:
s1 (str) – A string.
s2 (str) – Another string.
- Returns:
The Levenshtein distance between the two strings.
- Return type:
int
Examples
>>> s1 = "abcdefg"
>>> s2 = "abdcd defgggg"
>>> assert(levenshtein_exp(s1, s2) == 6)
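One common way to realize an exponential search is to guess an upper bound k on the number of edits, run a computation restricted to a diagonal band of width k, and double k until the guess is large enough. The sketch below illustrates that general idea only. The names `bounded_levenshtein` and `levenshtein_exp_sketch` are hypothetical, and the library's actual implementation may differ.

```python
def bounded_levenshtein(s1, s2, k):
    """Return the distance if it is <= k, otherwise None."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    inf = k + 1  # stands for "more than k edits"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Only cells with |i - j| <= k can hold a value <= k.
        lo, hi = max(1, i - k), min(m, i + k)
        # A full row is allocated for clarity; a tuned version stores the band only.
        curr = [inf] * (m + 1)
        if i <= k:
            curr[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, inf)
        prev = curr
    return prev[m] if prev[m] <= k else None


def levenshtein_exp_sketch(s1, s2):
    """Double the edit bound until the banded computation succeeds."""
    k = max(1, abs(len(s1) - len(s2)))  # the distance is at least the length gap
    while True:
        d = bounded_levenshtein(s1, s2, k)
        if d is not None:
            return d
        k *= 2
```

When the true distance is small, the band stays narrow and most of the table is never visited, which is the source of the speed-up mentioned above.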
- tcrdist_rs.levenshtein_exp_bin_many_to_many(seqs1, seqs2, parallel=True)¶
- tcrdist_rs.levenshtein_exp_many_to_many(seqs1, seqs2, parallel=True)¶
Compute the Levenshtein distance between many strings and many others using exponential search.
The strings must be representable as byte strings. This uses an exponential search to estimate the number of edits. It will be more efficient than levenshtein_many_to_many when the number of edits is small.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The second sequence of strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seqs1 = ["Sunday", "Saturday"]
>>> seqs2 = ["Monday", "Tuesday", "Wednesday"]
>>> assert(levenshtein_exp_many_to_many(seqs1, seqs2, parallel=False) == [2, 3, 5, 5, 5, 6])
- tcrdist_rs.levenshtein_exp_matrix(seqs, parallel=True)¶
Compute the Levenshtein distance matrix on a sequence of strings using exponential search.
The strings must be representable as byte strings. This uses an exponential search to estimate the number of edits. It will be more efficient than levenshtein_matrix when the number of edits is small.
- Parameters:
seqs (sequence of str) – The strings to be compared. They must have an appropriate representation as byte strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seqs = ["Monday", "Tuesday", "Wednesday", "Thursday"]
>>> assert(levenshtein_exp_matrix(seqs, parallel=False) == [4, 5, 5, 4, 2, 5])
- tcrdist_rs.levenshtein_exp_neighbor_many_to_many(seqs1, seqs2, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_exp_neighbor_matrix(seqs, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_exp_neighbor_one_to_many(seq, seqs, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_exp_neighbor_pairwise(seqs1, seqs2, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_exp_one_to_many(seq, seqs, parallel=True)¶
Compute the Levenshtein distance between one string and many others using exponential search.
The strings must be representable as byte strings. This uses an exponential search to estimate the number of edits. It will be more efficient than levenshtein_one_to_many when the number of edits is small.
- Parameters:
seq (str) – The string against which all others will be compared.
seqs (sequence of str) – The other strings being compared.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seq = "Sunday"
>>> seqs = ["Monday", "Tuesday", "Wednesday", "Thursday"]
>>> assert(levenshtein_exp_one_to_many(seq, seqs, parallel=False) == [2, 3, 5, 4])
- tcrdist_rs.levenshtein_exp_pairwise(seqs1, seqs2, parallel=True)¶
- tcrdist_rs.levenshtein_many_to_many(seqs1, seqs2, parallel=True)¶
Compute the Levenshtein distance between many strings and many others.
The strings must be representable as byte strings.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The second sequence of strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seqs1 = ["Sunday", "Saturday"]
>>> seqs2 = ["Monday", "Tuesday", "Wednesday"]
>>> assert(levenshtein_many_to_many(seqs1, seqs2, parallel=False) == [2, 3, 5, 5, 5, 6])
- tcrdist_rs.levenshtein_matrix(seqs, parallel=True)¶
Compute the Levenshtein distance matrix on a sequence of strings.
The strings must be representable as byte strings.
- Parameters:
seqs (sequence of str) – The strings to be compared. They must have an appropriate representation as byte strings.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seqs = ["Monday", "Tuesday", "Wednesday", "Thursday"]
>>> assert(levenshtein_matrix(seqs, parallel=False) == [4, 5, 5, 4, 2, 5])
- tcrdist_rs.levenshtein_neighbor_many_to_many(seqs1, seqs2, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_neighbor_matrix(seqs, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_neighbor_one_to_many(seq, seqs, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_neighbor_pairwise(seqs1, seqs2, threshold, parallel=True)¶
- tcrdist_rs.levenshtein_one_to_many(seq, seqs, parallel=True)¶
Compute the Levenshtein distance between one string and many others.
The strings must be representable as byte strings.
- Parameters:
seq (str) – The string against which all others will be compared.
seqs (sequence of str) – The other strings being compared.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The Levenshtein distances among the strings.
- Return type:
list of int
Examples
>>> seq = "Sunday"
>>> seqs = ["Monday", "Tuesday", "Wednesday", "Thursday"]
>>> assert(levenshtein_one_to_many(seq, seqs, parallel=False) == [2, 3, 5, 4])
- tcrdist_rs.levenshtein_pairwise(seqs1, seqs2, parallel=True)¶
- tcrdist_rs.phmc_distance(s1, s2)¶
Compute the pMHC distance between V alleles using precomputed TCRdists.
TCRdists were precomputed using ntrim = ctrim = 0, dist_weight = 1, gap_penalty = 4, and fixed_gappos = True.
- Parameters:
s1 (str) – A V allele.
s2 (str) – Another V allele.
- Returns:
The distance between the V alleles’ pMHCs.
- Return type:
int
Examples
>>> s1 = "TRBV2*01"
>>> s2 = "TRBV6-2*01"
>>> assert(phmc_distance(s1, s2) == 16)
- tcrdist_rs.tcrdist(s1, s2, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False)¶
Compute the tcrdist between two strings.
The strings must be representable as byte strings.
- Parameters:
s1 (str) – A string. Ideally, this should be a string of amino acid residues.
s2 (str) – A string. Ideally, this should be a string of amino acid residues.
dist_weight (int, default 3) – A weight applied to the mismatch distances. This weight is not applied to the gap penalties.
gap_penalty (int, default 12) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
- Returns:
The tcrdist between two strings.
- Return type:
int
Examples
>>> s1 = "CASRTGTVYEQYF"
>>> s2 = "CASSTLDRVYNSPLHF"
>>> dist_weight = 1
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist(s1, s2, dist_weight, gap_penalty, ntrim, ctrim, fixed_gappos)
>>> assert(dist == 40)
- tcrdist_rs.tcrdist_allele(s1, s2, phmc_weight=1, cdr1_weight=1, cdr2_weight=1, cdr3_weight=3, gap_penalty=4, ntrim=3, ctrim=2, fixed_gappos=False)¶
Compute the tcrdist between two CDR3-V allele pairs.
This incorporates differences between the pMHC, CDR1, CDR2, and CDR3.
- Parameters:
s1 (sequence of str) – A sequence of the CDR3 amino acid sequence and V allele.
s2 (sequence of str) – A sequence of the CDR3 amino acid sequence and V allele.
phmc_weight (int, default 1) – How much the difference in pMHCs contributes to the distance.
cdr1_weight (int, default 1) – How much the difference in CDR1s contributes to the distance.
cdr2_weight (int, default 1) – How much the difference in CDR2s contributes to the distance.
cdr3_weight (int, default 3) – How much the difference in CDR3s contributes to the distance.
gap_penalty (int, default 4) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
- Returns:
The tcrdist between two CDR3-V allele pairs.
- Return type:
int
Examples
>>> s1 = ["CASRTGTVYEQYF", "TRBV2*01"]
>>> s2 = ["CASSTLDRVYNSPLHF", "TRBV6-2*01"]
>>> phmc_weight = 1
>>> cdr1_weight = 1
>>> cdr2_weight = 1
>>> cdr3_weight = 3
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_allele(s1, s2, phmc_weight, cdr1_weight, cdr2_weight, cdr3_weight, gap_penalty, ntrim, ctrim, fixed_gappos)
>>> assert(dist == 168)
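The example values elsewhere in this reference suggest how the components combine. For the same pair, the pMHC, CDR1, and CDR2 distances are 16, 8, and 24, and the CDR3 tcrdist with dist_weight = 1 and gap_penalty = 4 is 40. Assuming the total is a plain weighted sum (an inference from those numbers, not something this page states), the arithmetic works out as follows; `combine_components` is a hypothetical helper.

```python
def combine_components(pmhc, cdr1, cdr2, cdr3,
                       pmhc_weight=1, cdr1_weight=1,
                       cdr2_weight=1, cdr3_weight=3):
    """Weighted sum of the four component distances (assumed form)."""
    return (pmhc_weight * pmhc
            + cdr1_weight * cdr1
            + cdr2_weight * cdr2
            + cdr3_weight * cdr3)


# Component values taken from the examples in this reference.
total = combine_components(pmhc=16, cdr1=8, cdr2=24, cdr3=40)
# 16 + 8 + 24 + 3 * 40 = 168
```

The result agrees with the value asserted in the example above.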
- tcrdist_rs.tcrdist_allele_many_to_many(seqs1, seqs2, phmc_weight=1, cdr1_weight=1, cdr2_weight=1, cdr3_weight=3, gap_penalty=4, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist between many CDR3-V allele pairs and many others.
This incorporates differences between the pMHC, CDR1, CDR2, and CDR3.
- Parameters:
seqs1 (sequence of sequence of str) – A sequence of sequences containing pairs of CDR3 amino acid sequences and V alleles.
seqs2 (sequence of sequence of str) – A sequence of sequences containing pairs of CDR3 amino acid sequences and V alleles.
phmc_weight (int, default 1) – How much the difference in pMHCs contributes to the distance.
cdr1_weight (int, default 1) – How much the difference in CDR1s contributes to the distance.
cdr2_weight (int, default 1) – How much the difference in CDR2s contributes to the distance.
cdr3_weight (int, default 3) – How much the difference in CDR3s contributes to the distance.
gap_penalty (int, default 4) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V allele pairs.
- Return type:
list of int
Examples
>>> seqs1 = [["CASRTGTVYEQYF", "TRBV2*01"], ["CASSYSEEPSSPLHF", "TRBV6-6*01"]]
>>> seqs2 = [["CASSTLDRVYNSPLHF", "TRBV6-2*01"], ["CASSESGGQVDTQYF", "TRBV6-4*01"], ["CASSPTGPTDTQYF", "TRBV18*01"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5*01"]]
>>> phmc_weight = 1
>>> cdr1_weight = 1
>>> cdr2_weight = 1
>>> cdr3_weight = 3
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_allele_many_to_many(seqs1, seqs2, phmc_weight, cdr1_weight, cdr2_weight, cdr3_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [168, 142, 134, 203, 104, 125, 143, 121])
- tcrdist_rs.tcrdist_allele_matrix(seqs, phmc_weight=1, cdr1_weight=1, cdr2_weight=1, cdr3_weight=3, gap_penalty=4, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist matrix on a sequence of CDR3-V allele pairs.
This returns the upper right triangle of the distance matrix. This incorporates differences between the pMHC, CDR1, CDR2, and CDR3.
- Parameters:
seqs (sequence of sequence of str) – A sequence containing sequences of CDR3 amino acid sequences and V alleles.
phmc_weight (int, default 1) – How much the difference in pMHCs contributes to the distance.
cdr1_weight (int, default 1) – How much the difference in CDR1s contributes to the distance.
cdr2_weight (int, default 1) – How much the difference in CDR2s contributes to the distance.
cdr3_weight (int, default 3) – How much the difference in CDR3s contributes to the distance.
gap_penalty (int, default 4) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V allele pairs.
- Return type:
list of int
Examples
>>> seqs = [["CASRTGTVYEQYF", "TRBV2*01"], ["CASSTLDRVYNSPLHF", "TRBV6-2*01"], ["CASSESGGQVDTQYF", "TRBV6-4*01"], ["CASSPTGPTDTQYF", "TRBV18*01"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5*01"]]
>>> phmc_weight = 1
>>> cdr1_weight = 1
>>> cdr2_weight = 1
>>> cdr3_weight = 3
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_allele_matrix(seqs, phmc_weight, cdr1_weight, cdr2_weight, cdr3_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [168, 142, 134, 203, 163, 169, 148, 116, 189, 198])
- tcrdist_rs.tcrdist_allele_one_to_many(seq, seqs, phmc_weight=1, cdr1_weight=1, cdr2_weight=1, cdr3_weight=3, gap_penalty=4, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist between one CDR3-V allele pair and many others.
This incorporates differences between the pMHC, CDR1, CDR2, and CDR3.
- Parameters:
seq (sequence of str) – A sequence containing a CDR3 amino acid sequence and V allele.
seqs (sequence of sequence of str) – A sequence of sequences containing pairs of CDR3 amino acid sequences and V alleles.
phmc_weight (int, default 1) – How much the difference in pMHCs contributes to the distance.
cdr1_weight (int, default 1) – How much the difference in CDR1s contributes to the distance.
cdr2_weight (int, default 1) – How much the difference in CDR2s contributes to the distance.
cdr3_weight (int, default 3) – How much the difference in CDR3s contributes to the distance.
gap_penalty (int, default 4) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V allele sequences.
- Return type:
list of int
Examples
>>> seq = ["CASRTGTVYEQYF", "TRBV2*01"]
>>> seqs = [["CASSTLDRVYNSPLHF", "TRBV6-2*01"], ["CASSESGGQVDTQYF", "TRBV6-4*01"], ["CASSPTGPTDTQYF", "TRBV18*01"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5*01"]]
>>> phmc_weight = 1
>>> cdr1_weight = 1
>>> cdr2_weight = 1
>>> cdr3_weight = 3
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_allele_one_to_many(seq, seqs, phmc_weight, cdr1_weight, cdr2_weight, cdr3_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [168, 142, 134, 203])
- tcrdist_rs.tcrdist_allele_pairwise(seqs1, seqs2, phmc_weight=1, cdr1_weight=1, cdr2_weight=1, cdr3_weight=3, gap_penalty=4, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.tcrdist_gene(s1, s2, ntrim=3, ctrim=2)¶
Compute the tcrdist between two CDR3-V gene pairs.
- Parameters:
s1 (sequence of str) – A sequence containing a CDR3 amino acid sequence and V gene pair.
s2 (sequence of str) – A sequence containing a CDR3 amino acid sequence and V gene pair.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
- Returns:
The tcrdist between two CDR3-V gene pairs.
- Return type:
int
Examples
>>> s1 = ["CASRTGTVYEQYF", "TRBV2"]
>>> s2 = ["CASSTLDRVYNSPLHF", "TRBV6-2"]
>>> ntrim = 3
>>> ctrim = 2
>>> dist = tcrdist_gene(s1, s2, ntrim, ctrim)
>>> assert(dist == 168)
- tcrdist_rs.tcrdist_gene_many_to_many(seqs1, seqs2, ntrim=3, ctrim=2, parallel=True)¶
Compute the tcrdist between many CDR3-V gene pairs and many others.
- Parameters:
seqs1 (sequence of sequence of str) – A sequence containing sequences of CDR3 amino acid sequence and V gene pairs.
seqs2 (sequence of sequence of str) – A sequence containing sequences of CDR3 amino acid sequence and V gene pairs.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V gene pairs.
- Return type:
list of int
Examples
>>> seqs1 = [["CASRTGTVYEQYF", "TRBV2"], ["CASSYSEEPSSPLHF", "TRBV6-6"]]
>>> seqs2 = [["CASSTLDRVYNSPLHF", "TRBV6-2"], ["CASSESGGQVDTQYF", "TRBV6-4"], ["CASSPTGPTDTQYF", "TRBV18"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5"]]
>>> ntrim = 3
>>> ctrim = 2
>>> dist = tcrdist_gene_many_to_many(seqs1, seqs2, ntrim, ctrim, parallel=False)
>>> assert(dist == [168, 142, 134, 203, 104, 125, 143, 121])
- tcrdist_rs.tcrdist_gene_matrix(seqs, ntrim=3, ctrim=2, parallel=True)¶
Compute the tcrdist matrix on a sequence of CDR3-V gene pairs.
This returns the upper right triangle of the distance matrix.
- Parameters:
seqs (sequence of sequence of str) – A sequence containing sequences of CDR3 amino acid sequence and V gene pairs.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V gene pairs.
- Return type:
list of int
Examples
>>> seqs = [["CASRTGTVYEQYF", "TRBV2"], ["CASSTLDRVYNSPLHF", "TRBV6-2"], ["CASSESGGQVDTQYF", "TRBV6-4"], ["CASSPTGPTDTQYF", "TRBV18"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5"]]
>>> ntrim = 3
>>> ctrim = 2
>>> dist = tcrdist_gene_matrix(seqs, ntrim, ctrim, parallel=False)
>>> assert(dist == [168, 142, 134, 203, 163, 169, 148, 116, 189, 198])
- tcrdist_rs.tcrdist_gene_neighbor(s1, s2, threshold, ntrim=3, ctrim=2)¶
Compute whether two CDR3-V gene pairs are neighbors with tcrdist_gene.
This function is quicker than using the tcrdist_gene function since it first computes whether the V genes are within the distance threshold and whether the difference in lengths won’t incur a penalty larger than the distance threshold. With these two checks, many unnecessary calculations are avoided.
- Parameters:
s1 (sequence of str) – A sequence containing a CDR3 amino acid sequence and V gene pair.
s2 (sequence of str) – A sequence containing a CDR3 amino acid sequence and V gene pair.
threshold (int) – The distance threshold that will be used to call sequences neighbors.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
- Returns:
Whether the two CDR3-V gene pairs have a tcrdist within the threshold.
- Return type:
bool
Examples
>>> s1 = ["CASRTGTVYEQYF", "TRBV2"]
>>> s2 = ["CASSTLDRVYNSPLHF", "TRBV6-2"]
>>> threshold = 20
>>> ntrim = 3
>>> ctrim = 2
>>> are_neighbors = tcrdist_gene_neighbor(s1, s2, threshold, ntrim, ctrim)
>>> assert(are_neighbors == False)
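The two early checks described above can be sketched as a short-circuiting function. Everything here is illustrative: `is_neighbor_sketch`, its parameters, and the numbers in the usage lines are hypothetical, and the sketch assumes the threshold is inclusive.

```python
def is_neighbor_sketch(v_dist, len1, len2, gap_penalty, threshold, full_distance):
    """Return True if the pair is within `threshold`, skipping work when possible.

    `full_distance` is a zero-argument callable that performs the expensive
    computation; it is invoked only if both cheap checks pass.
    """
    # Check 1: the V gene distance alone already exceeds the threshold.
    if v_dist > threshold:
        return False
    # Check 2: the length difference alone incurs too large a gap penalty.
    if gap_penalty * abs(len1 - len2) > threshold:
        return False
    return full_distance() <= threshold


calls = []


def expensive():
    calls.append(1)  # record that the full computation ran
    return 168


# Hypothetical inputs: the V gene distance rules the pair out immediately.
result = is_neighbor_sketch(v_dist=48, len1=13, len2=16, gap_penalty=12,
                            threshold=20, full_distance=expensive)
```

Here `result` is `False` and `expensive` is never called, which is the kind of saving the description refers to.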
- tcrdist_rs.tcrdist_gene_neighbor_many_to_many(seqs1, seqs2, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_gene_neighbor_matrix(seqs, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_gene_neighbor_one_to_many(seq, seqs, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_gene_neighbor_pairwise(seqs1, seqs2, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_gene_one_to_many(seq, seqs, ntrim=3, ctrim=2, parallel=True)¶
Compute the tcrdist between one CDR3-V gene pair and many others.
- Parameters:
seq (sequence of str) – A sequence containing a CDR3 amino acid sequence and V gene pair.
seqs (sequence of sequence of str) – A sequence containing sequences of CDR3 amino acid sequence and V gene pairs.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the CDR3-V gene pairs.
- Return type:
list of int
Examples
>>> seq = ["CASRTGTVYEQYF", "TRBV2"]
>>> seqs = [["CASSTLDRVYNSPLHF", "TRBV6-2"], ["CASSESGGQVDTQYF", "TRBV6-4"], ["CASSPTGPTDTQYF", "TRBV18"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5"]]
>>> ntrim = 3
>>> ctrim = 2
>>> dist = tcrdist_gene_one_to_many(seq, seqs, ntrim, ctrim, parallel=False)
>>> assert(dist == [168, 142, 134, 203])
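A common follow-up is to pair each returned distance with its input and keep only close matches. The sketch below uses the values from the example above and assumes that the i-th distance corresponds to the i-th entry of seqs. The threshold of 150 is a hypothetical cutoff chosen for illustration.

```python
# Inputs and distances copied from the example above.
seqs = [["CASSTLDRVYNSPLHF", "TRBV6-2"], ["CASSESGGQVDTQYF", "TRBV6-4"],
        ["CASSPTGPTDTQYF", "TRBV18"], ["CASSYPIEGGRAFTGELFF", "TRBV6-5"]]
dist = [168, 142, 134, 203]

threshold = 150  # hypothetical cutoff, for illustration only

# Keep the CDR3-V gene pairs whose distance is within the threshold.
neighbors = [pair for pair, d in zip(seqs, dist) if d <= threshold]

# Find the single closest entry together with its distance.
nearest, nearest_dist = min(zip(seqs, dist), key=lambda item: item[1])
```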
- tcrdist_rs.tcrdist_gene_pairwise(seqs1, seqs2, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_many_to_many(seqs1, seqs2, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist between many strings and many others.
The strings must be representable as byte strings.
- Parameters:
seqs1 (sequence of str) – The first sequence of strings.
seqs2 (sequence of str) – The other sequence of strings.
dist_weight (int, default 3) – A weight applied to the mismatch distances. This weight is not applied to the gap penalties.
gap_penalty (int, default 12) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the strings.
- Return type:
list of int
Examples
>>> seqs1 = ["CASSPTGPTDTQYF", "CASSYPIEGGRAFTGELFF"]
>>> seqs2 = ["CASRTGTVYEQYF", "CASSTLDRVYNSPLHF", "CASSESGGQVDTQYF"]
>>> dist_weight = 1
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_many_to_many(seqs1, seqs2, dist_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [28, 40, 19, 52, 41, 52])
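The result is a flat list, so it is often convenient to regroup it into one row per entry of seqs1. The sketch below assumes one distance per (seqs1[i], seqs2[j]) combination, stored row by row; that layout is an assumption. `to_rows` is a hypothetical helper and the numbers are placeholders, not real distances.

```python
def to_rows(flat, n_rows, n_cols):
    """Regroup a flat, row-major list into a list of n_rows rows."""
    if len(flat) != n_rows * n_cols:
        raise ValueError("length does not match n_rows * n_cols")
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Placeholder values standing in for a 2 x 3 many-to-many result.
flat = [10, 11, 12, 20, 21, 22]
rows = to_rows(flat, n_rows=2, n_cols=3)

# rows[i][j] is then the distance between seqs1[i] and seqs2[j].
second_vs_first = rows[1][0]
```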
- tcrdist_rs.tcrdist_matrix(seqs, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist matrix on a sequence of strings.
This returns the upper right triangle of the distance matrix. The strings must be representable as byte strings.
- Parameters:
seqs (sequence of str) – Iterable of strings.
dist_weight (int, default 3) – A weight applied to the mismatch distances. This weight is not applied to the gap penalties.
gap_penalty (int, default 12) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the strings.
- Return type:
list of int
Examples
>>> seqs = ["CASRTGTVYEQYF", "CASSTLDRVYNSPLHF", "CASSESGGQVDTQYF", "CASSPTGPTDTQYF"]
>>> dist_weight = 1
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_matrix(seqs, dist_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [40, 28, 28, 40, 40, 19])
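Four inputs yield six values, one per unordered pair. When a full square matrix is more convenient, the flat list can be expanded as sketched below. The sketch assumes the pairs are ordered (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3) and that the distance from a sequence to itself is zero. `to_square` is a hypothetical helper, not part of tcrdist_rs.

```python
from itertools import combinations


def to_square(condensed, n):
    """Expand a flattened upper triangle into a symmetric n x n matrix."""
    if len(condensed) != n * (n - 1) // 2:
        raise ValueError("length does not match n * (n - 1) / 2")
    square = [[0] * n for _ in range(n)]  # diagonal stays at zero
    # combinations() yields index pairs in the assumed row-major order.
    for (i, j), d in zip(combinations(range(n), 2), condensed):
        square[i][j] = d
        square[j][i] = d  # mirror into the lower triangle
    return square


# Distances copied from the example above (4 CDR3 sequences).
dist = [40, 28, 28, 40, 40, 19]
square = to_square(dist, 4)
```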
- tcrdist_rs.tcrdist_neighbor_many_to_many(seqs1, seqs2, threshold, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.tcrdist_neighbor_matrix(seqs, threshold, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.tcrdist_neighbor_one_to_many(seq, seqs, threshold, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.tcrdist_neighbor_pairwise(seqs1, seqs2, threshold, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.tcrdist_one_to_many(seq, seqs, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
Compute the tcrdist between one string and many others.
This returns one distance for each string in seqs. The strings must be representable as byte strings.
- Parameters:
seq (str) – The string against which all others will be compared.
seqs (sequence of str) – The other strings being compared.
dist_weight (int, default 3) – A weight applied to the mismatch distances. This weight is not applied to the gap penalties.
gap_penalty (int, default 12) – The penalty given to the difference in length of the strings.
ntrim (int, default 3) – The position at which the distance calculation will begin. This parameter must be >= 0.
ctrim (int, default 2) – The position, counted from the end, at which the calculation will end. This parameter must be >= 0.
fixed_gappos (bool, default False) – If True, insert gaps at a fixed position after the cysteine residue starting the CDR3 (typically position 6). If False, find the “optimal” position for inserting the gaps to make up the difference in length.
parallel (bool, default True) – Bool to specify if computation should be parallelized.
- Returns:
The TCRdists among the strings.
- Return type:
list of int
Examples
>>> seq = "CASSYPIEGGRAFTGELFF"
>>> seqs = ["CASRTGTVYEQYF", "CASSTLDRVYNSPLHF", "CASSESGGQVDTQYF", "CASSPTGPTDTQYF"]
>>> dist_weight = 1
>>> gap_penalty = 4
>>> ntrim = 3
>>> ctrim = 2
>>> fixed_gappos = False
>>> dist = tcrdist_one_to_many(seq, seqs, dist_weight, gap_penalty, ntrim, ctrim, fixed_gappos, parallel=False)
>>> assert(dist == [52, 41, 52, 48])
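The ntrim and ctrim parameters can be read as slice bounds on each CDR3. The sketch below shows one plausible reading, in which ntrim residues are skipped at the start and ctrim residues at the end. This is an interpretation of the parameter descriptions, not the library's code, and `trimmed_region` is a hypothetical helper.

```python
def trimmed_region(cdr3, ntrim=3, ctrim=2):
    """Return the part of a CDR3 that would enter the comparison.

    Assumes ntrim counts residues skipped from the start and ctrim
    counts residues skipped from the end.
    """
    if ntrim < 0 or ctrim < 0:
        raise ValueError("ntrim and ctrim must be >= 0")
    return cdr3[ntrim:len(cdr3) - ctrim]


# Sequences taken from the example above, with the default trims.
region_long = trimmed_region("CASSYPIEGGRAFTGELFF")
region_short = trimmed_region("CASRTGTVYEQYF")
```

With the defaults, the conserved leading "CAS" and the last two residues are left out of the compared region under this reading.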
- tcrdist_rs.tcrdist_paired_gene(s1, s2, ntrim=3, ctrim=2)¶
- tcrdist_rs.tcrdist_paired_gene_many_to_many(seqs1, seqs2, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_matrix(seqs, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_neighbor_many_to_many(seqs1, seqs2, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_neighbor_matrix(seqs, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_neighbor_one_to_many(seq, seqs, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_neighbor_pairwise(seqs1, seqs2, threshold, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_one_to_many(seq, seqs, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_paired_gene_pairwise(seqs1, seqs2, ntrim=3, ctrim=2, parallel=True)¶
- tcrdist_rs.tcrdist_pairwise(seqs1, seqs2, dist_weight=3, gap_penalty=12, ntrim=3, ctrim=2, fixed_gappos=False, parallel=True)¶
- tcrdist_rs.v_total_distance(s1, s2)¶
Compute the distance between V genes using precomputed TCRdists.
- Parameters:
s1 (str) – A V gene.
s2 (str) – A V gene.
- Returns:
The distance between two V genes.
- Return type:
int
Examples
>>> s1 = "TRBV2*01"
>>> s2 = "TRBV6-2*01"
>>> dist = v_total_distance(s1, s2)
>>> assert(isinstance(dist, int))
- tcrdist_rs.version()¶
Return the current version of the package.
- Parameters:
None
- Returns:
The package version, taken from the crate's CARGO_PKG_VERSION.
- Return type:
str