seqpdist

Calculate pairwise distance between sequences

Syntax

D = seqpdist(Seqs) D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) D = seqpdist(Seqs, ...'Method', MethodValue, ...) D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...) D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...) D = seqpdist(Seqs, ...'SquareForm', SquareFormValue ...) D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...)

Input Arguments

`Seqs`	Any of the following: Cell array of character vectors or string vector containing nucleotide or amino acid sequences Vector of structures containing a `Sequence` field Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence
`MethodValue`	Character vector or string that specifies the method to calculate pairwise distances. Default is `'Jukes-Cantor'`.
`IndelsValue`	Character vector or string that specifies how to treat sites with gaps. Default is `'score'`.
`OptArgsValue`	Character vector or cell array that specifies one or more input arguments required or accepted by the distance method specified by the `Method` property.
`PairwiseAlignmentValue`	Controls the global pairwise alignment of input sequences (using the `nwalign` function), while ignoring the multiple alignment of the input sequences (if any). Choices are `true` or `false`. Default is: `true` — When all input sequences do not have the same length. `false` — When all input sequences have the same length. Tip If your input sequences are the same length, `seqpdist` assumes they are aligned. If they are not aligned, do one of the following: Align the sequences before passing them to `seqpdist`, for example, using the `multialign` function. Set `PairwiseAlignment` to `true` when using `seqpdist`.
`UseParallelValue`	Controls the calculation of the pairwise distances using `parfor`-loops. When `true`, and Parallel Computing Toolbox™ is installed and a `parpool` is open, computation occurs in parallel. If there are no open `parpool`, but automatic creation is enabled in the Parallel Preferences, the default pool will be automatically open and computation occurs in parallel. If Parallel Computing Toolbox is installed, but there are no open `parpool` and automatic creation is disabled, then computation uses `parfor`-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses `parfor`-loops in serial mode. Default is `false`, which uses for-loops in serial mode.
`SquareFormValue`	Controls the conversion of the output into a square matrix. Choices are `true` or `false` (default).
`AlphabetValue`	Character vector or string specifying the type of sequence (nucleotide or amino acid). Choices are `'NT'` or `'AA'` (default).
`ScoringMatrixValue`	Either of the following: Character vector or string specifying the scoring matrix to use for the alignment. Choices for amino acid sequences are: `'BLOSUM62'` `'BLOSUM30'` increasing by `5` up to `'BLOSUM90'` `'BLOSUM100'` `'PAM10'` increasing by `10` up to `'PAM500'` `'DAYHOFF'` `'GONNET'` Default is: `'BLOSUM50'` — When `AlphabetValue` equals `'AA'` `'NUC44'` — When `AlphabetValue` equals `'NT'` Note The above scoring matrices, provided with the software, also include a structure containing a scale factor that converts the units of the output score to bits. You can also use the `'Scale'` property to specify an additional scale factor to convert the output score from bits to another unit. Matrix representing the scoring matrix to use for the alignment, such as returned by the `blosum`, `pam`, `dayhoff`, `gonnet`, or `nuc44` function. Note If you use a scoring matrix that you created or was created by one of the above functions, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix. You can use the `'Scale'` property to specify a scale factor to convert the output score to another unit. Note If you need to compile `seqpdist` into a stand-alone application or software component using MATLAB^® Compiler™, use a matrix instead of a character vector or string for `ScoringMatrixValue`.
`ScaleValue`	Positive value that specifies the scale factor used to return the score in arbitrary units. If the scoring matrix information also provides a scale factor, then both are used.
`GapOpenValue`	Positive integer that specifies the penalty for opening a gap in the alignment. Default is `8`.
`ExtendedGapValue`	Positive integer that specifies the penalty for extending a gap. Default is equal to `GapOpenValue`.

Output Arguments

D Vector that contains biological distances between each pair of sequences stored in the M elements of Seqs.

Description

D = seqpdist(Seqs) returns D, a vector containing biological distances between each pair of sequences stored in the M sequences of Seqs, a cell array of sequences, a vector of structures, or a matrix or sequences.

D is a 1-by-(M*(M-1)/2) row vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),..., (M,1),(3,2),...(M,2),...(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).

D = seqpdist(Seqs, ...'PropertyName', PropertyValue, ...) calls seqpdist with optional properties that use property name/property value pairs. Specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:

D = seqpdist(Seqs, ...'Method', MethodValue, ...) specifies a method to compute distances between each sequence pair. Choices are shown in the following tables.

Methods for Nucleotides and Amino Acids

Method Description

Method	Description
`p-distance`	Proportion of sites at which the two sequences are different. `p` is close to `1` for poorly related sequences, and `p` is close to `0` for similar sequences. d = p
`Jukes-Cantor` (default)	Maximum likelihood estimate of the number of substitutions between two sequences. `p` is described with the method `p-distance`. For nucleotides: `d = -3/4 log(1-p * 4/3)` For amino acids: `d = -19/20 log(1-p * 20/19)`
`alignment-score`	Distance (`d`) between two sequences (`1, 2`) is computed from the pairwise alignment score between the two sequences (`score12`), and the pairwise alignment score between each sequence and itself (`score11`, `score22`) as follows: d = (1-score12/score11)* (1-score12/score22) This option does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then `d = 0`

p-distance

Proportion of sites at which the two sequences are different. p is close to 1 for poorly related sequences, and p is close to 0 for similar sequences.

d = p

Jukes-Cantor (default)

Maximum likelihood estimate of the number of substitutions between two sequences. p is described with the method p-distance.

For nucleotides:

d = -3/4 log(1-p * 4/3)

For amino acids:

d = -19/20 log(1-p * 20/19)

alignment-score

Distance (d) between two sequences (

1,
2

) is computed from the pairwise alignment score between the two sequences (score12), and the pairwise alignment score between each sequence and itself (score11, score22) as follows:

d = (1-score12/score11)* (1-score12/score22)

This option does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then d = 0

Methods with No Scoring of Gaps (Nucleotides Only)

Method	Description
`Tajima-Nei`	Maximum likelihood estimate considering the background nucleotide frequencies. It can be computed from the input sequences or given by setting `OptArgs` to `[gA gC gG gT]`. `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.
`Kimura`	Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution.
`Tamura`	Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. GC content can be computed from the input sequences or given by setting `OptArgs` to the proportion of GC content (scalar value from `0` to `1`).
`Hasegawa`	Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the `OptArgs` property to `[gA gC gG gT]`.
`Nei-Tamura`	Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting the `OptArgs` property to `[gA gC gG gT]`.

Methods with No Scoring of Gaps (Amino Acids Only)

Method	Description
`Poisson`	Assumes that the number of amino acid substitutions at each site has a Poisson distribution.
`Gamma`	Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter `a`. Set `a` using the `OptArgs` property. Default is `2`.

You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:

function D = distfun(S1, S2, OptArgsValue)

The distfun function takes the following arguments:

S1 , S2 — Two sequences of the same length (nucleotide or amino acid).
OptArgsValue — Optional problem-dependent arguments.

The distfun function returns a scalar that represents the distance between S1 and S2.

D = seqpdist(Seqs, ...'Indels', IndelsValue, ...) specifies how to treat sites with gaps. Choices are:

score (default) — Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
pairwise-del — For every pairwise comparison, it ignores the sites with gaps.
complete-del — Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input Seqs.

D = seqpdist(Seqs, ...'OptArgs', OptArgsValue, ...) passes one or more arguments required or accepted by the distance method specified by the Method property. Use a character vector or cell array to pass one or more input arguments. For example, provide the nucleotide frequencies for the Tajima-Nei distance method, instead of computing them from the input sequences.

D = seqpdist(Seqs, ...'PairwiseAlignment', PairwiseAlignmentValue, ...) controls the global pairwise alignment of input sequences (using the nwalign function), while ignoring the multiple alignment of the input sequences (if any). Default is:

true — When all input sequences do not have the same length.
false — When all input sequences have the same length.

Tip

If your input sequences have the same length, seqpdist assumes they are aligned. If they are not aligned, do one of the following:

Align the sequences before passing them to seqpdist, for example, using the multialign function.
Set PairwiseAlignment to true when using seqpdist.

D = seqpdist(Seqs, ...'UseParallel', UseParallelValue, ...) specifies whether to use parfor-loops when calculating the pairwise distances. When true, and Parallel Computing Toolbox is installed and a parpool is open, computation occurs in parallel. If there are no open parpool, but automatic creation is enabled in the Parallel Preferences, the default pool will be automatically open and computation occurs in parallel. If Parallel Computing Toolbox is installed, but there are no open parpool and automatic creation is disabled, then computation uses parfor-loops in serial mode. If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode. Default is false, which uses for-loops in serial mode.

D = seqpdist(Seqs, ...'SquareForm', SquareFormValue ...) controls the conversion of the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Choices are true or false (default). Setting Squareform to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.

D = seqpdist(Seqs, ...'Alphabet', AlphabetValue, ...) specifies the type of sequence (nucleotide or amino acid). Choices are 'NT' or 'AA' (default).

The remaining input properties are available when the Method property equals 'alignment-score' or the PairwiseAlignment property equals true.

D = seqpdist(Seqs, ...'ScoringMatrix', ScoringMatrixValue, ...) specifies the scoring matrix to use for the global pairwise alignment. Default is:

'NUC44' — When AlphabetValue equals 'NT'.
'BLOSUM50' — When AlphabetValue equals 'AA'.

D = seqpdist(Seqs, ...'Scale', ScaleValue, ...) specifies the scale factor used to return the score in arbitrary units. Choices are any positive value. If the scoring matrix information also provides a scale factor, then both are used.

D = seqpdist(Seqs, ...'GapOpen', GapOpenValue, ...) specifies the penalty for opening a gap in the alignment. Choices are any positive integer. Default is 8.

D = seqpdist(Seqs, ...'ExtendGap', ExtendGapValue, ...) specifies the penalty for extending a gap in the alignment. Choices are any positive integer. Default is equal to GapOpenValue.

Examples

Read amino acid alignment data into a MATLAB structure.
```
seqs = fastaread('pf00002.fa');
```

For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.

dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250');

Force the realignment of each sequence pair ignoring the provided multiple alignment.

dist = seqpdist(seqs,'Method','alignment-score',...
                'Indels','pairwise-delete',...
                'ScoringMatrix','pam250',...
                'PairwiseAlignment',true);

Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.

dist = seqpdist(seqs,'Method','jukes-cantor',...
                'Indels','score',...
                'Scoringmatrix','pam250',...
                'PairwiseAlignment',true);

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To run in parallel, set 'UseParallel' to true.

For more information, see the 'UseParallel' name-value pair argument.

Version History

Introduced before R2006a

seqpdist

Syntax

Input Arguments

Output Arguments

Description

Examples

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

Version History

See Also

Topics

seqpdist

Syntax

Input Arguments

Output Arguments

Description

Examples

Extended Capabilities

Automatic Parallel Support Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

Version History

See Also

Topics

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.