# knnimpute

Impute missing data using nearest-neighbor method

## Syntax

``imputedData = knnimpute(data)``
``imputedData = knnimpute(data,k)``
``imputedData = knnimpute(data,k,Name,Value)``

## Description

`imputedData = knnimpute(data)` returns `imputedData` after replacing `NaN`s in the input `data` with the corresponding value from the nearest-neighbor column. If the corresponding value from the nearest-neighbor column is also `NaN`, the next nearest column is used. The function calculates the Euclidean distance between observation columns by using only the rows with no `NaN` values. Thus, the data must have at least one row that contains no `NaN`.
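The default rule described above can be illustrated with a short sketch in base MATLAB. This is not the `knnimpute` source, only one reading of the description; the variable names (`completeRows`, `C`, `order`, `filled`) are hypothetical.

```matlab
% Sketch of the default nearest-neighbor rule (not the knnimpute source).
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Distances between columns use only the rows that contain no NaN.
completeRows = ~any(isnan(A),2);
C = A(completeRows,:);

filled = A;
[r,c] = find(isnan(A));            % positions of the missing entries
for t = 1:numel(r)
    % Euclidean distance from column c(t) to every column.
    d = sqrt(sum(bsxfun(@minus, C, C(:,c(t))).^2, 1));
    d(c(t)) = Inf;                 % a column is not its own neighbor
    [~,order] = sort(d);           % nearest column first
    for j = order
        if ~isnan(A(r(t),j))       % skip neighbors that are also NaN
            filled(r(t),c(t)) = A(r(t),j);
            break
        end
    end
end
disp(filled)
```

For this matrix, the sketch fills the (3,1) entry with -1, the value taken from column 2.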

`imputedData = knnimpute(data,k)` replaces `NaN`s in `data` with a weighted mean of the `k` nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
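As a rough illustration of inverse-distance weighting, the following sketch combines three neighbor values by hand. The numbers and variable names are made up, the normalization step is an assumption, and the sketch assumes nonzero distances; how the function itself treats zero distances is not covered here.

```matlab
% Sketch: inverse-distance weighted mean of k = 3 neighbor values.
vals = [2 6 10];      % values taken from the three nearest columns (toy data)
d    = [1 2 4];       % distances to those columns (assumed nonzero)

w = 1 ./ d;           % weights inversely proportional to distance
w = w / sum(w);       % normalize so the weights sum to 1 (assumption)
estimate = sum(w .* vals);
```

The closest neighbor contributes most: here the weights are 4/7, 2/7, and 1/7, so the estimate is 30/7.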

`imputedData = knnimpute(data,k,Name,Value)` uses additional options specified by one or more name-value pair arguments. For example, `imputedData = knnimpute(data,k,'Distance','mahalanobis')` uses the Mahalanobis distance to compute the nearest-neighbor columns.

## Examples

The function `knnimpute` replaces NaNs in the input data with the corresponding value from the nearest-neighbor column. Consider the following matrix.

`A = [1 2 5;4 5 7;NaN -1 8;7 6 0]`
```
A = 4×3

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0
```

`A(3,1)` is `NaN`. Because column 2 is the closest column to column 1 in Euclidean distance, `knnimpute` replaces the (3,1) entry with the corresponding entry from column 2, which is -1.

`results = knnimpute(A)`
```
results = 4×3

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0
```

The data must have at least one row without any NaN values for `knnimpute` to work. If all rows have NaN values, you can add a row where every observation (column) has identical values and call `knnimpute` on the updated matrix to replace the NaN values with the average of all column values for a given row.

`B = [NaN 2 1; 3 NaN 1; 1 8 NaN]`
```
B = 3×3

   NaN     2     1
     3   NaN     1
     1     8   NaN
```
`B(4,:) = ones(1,3)`
```
B = 4×3

   NaN     2     1
     3   NaN     1
     1     8   NaN
     1     1     1
```
`imputed = knnimpute(B)`
```
imputed = 4×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
    1.0000    1.0000    1.0000
```

You can then remove the added row.

`imputed(4,:) = []`
```
imputed = 3×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
```

Load a sample biological data set and impute missing values in `yeastvalues`, where each row represents a gene and each column represents an experimental condition or observation.

`load yeastdata`

Remove data for empty spots where gene labels are set to `'EMPTY'`.

```
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
```

`knnimpute` uses the next nearest column if the corresponding value from the nearest-neighbor column is also NaN. However, if every value in a row is NaN, the function issues a warning for that row and keeps the row in the returned output rather than deleting it. The sample data contains some rows that are entirely NaN. Remove those rows to avoid the warnings.

`yeastvalues(~any(~isnan(yeastvalues),2),:) = [];`
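The mask `~any(~isnan(yeastvalues),2)` selects the rows in which no entry is non-NaN, that is, the rows that are entirely NaN. A minimal sketch on a toy matrix (the matrix `X` and the variable `allNaNRows` are hypothetical) shows an equivalent form that may be easier to read:

```matlab
% Toy matrix: row 2 is entirely NaN, row 1 is only partly missing.
X = [1 NaN; NaN NaN; 3 4];

allNaNRows = all(isnan(X),2);      % same mask as ~any(~isnan(X),2)
sameMask = isequal(allNaNRows, ~any(~isnan(X),2));

X(allNaNRows,:) = [];              % drop only the all-NaN rows
```

Rows that are only partly missing, such as the first row of `X`, are kept so that they can still be imputed.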

Impute missing values.

`imputedData1 = knnimpute(yeastvalues);`

Check whether any NaN values remain after imputation.

`sum(any(isnan(imputedData1),2))`
```
ans = 0
```

Use the five nearest-neighbor columns to impute each missing value.

`imputedData2 = knnimpute(yeastvalues,5);`

Change the distance metric to the Minkowski distance.

`imputedData3 = knnimpute(yeastvalues,5,'Distance','minkowski');`

You can also specify the parameter for the distance metric. For instance, specify a different exponent (say 5) for the Minkowski distance.

`imputedData4 = knnimpute(yeastvalues,5,'Distance','minkowski','DistArgs',5);`
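For reference, the standard definition of the Minkowski distance between two observation vectors $x$ and $y$ of length $n$ with exponent $p$ is given below. The symbols are introduced here for illustration only and do not appear in the function's interface.

$$
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
$$

With $p = 2$ this reduces to the Euclidean distance, and with $p = 1$ to the city block distance. Larger exponents, such as the 5 used above, put more weight on the largest coordinate differences.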

## Input Arguments

`data`: Input data, specified as a matrix. The data must have at least one row that contains no `NaN` because the function calculates the Euclidean distance between observation columns by using only the rows with no `NaN` values.

Data Types: `double`

`k`: Number of nearest neighbors, specified as a positive integer.

Data Types: `double`

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: ```imputedData = knnimpute(data,k,'Distance','mahalanobis')```

`'Distance'`: Distance metric, specified as a character vector, string, or function handle, as described in the following table.

Use the `'DistArgs'` name-value argument together with `'Distance'` to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: ```output = knnimpute(data,3,'Distance','minkowski','DistArgs',5)```.

| Value | Description |
| --- | --- |
| `'euclidean'` | Euclidean distance (default). |
| `'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
| `'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `S = nanstd(X)`. Use `'DistArgs'` to specify another value for `S`. |
| `'mahalanobis'` | Mahalanobis distance using the sample covariance of `X`, `C = nancov(X)`. Use `'DistArgs'` to specify another value for `C`, where the matrix `C` is symmetric and positive definite. |
| `'cityblock'` | City block distance. |
| `'minkowski'` | Minkowski distance. The default exponent is 2. Use `'DistArgs'` to specify a different exponent `P`, where `P` is a positive scalar. |
| `'chebychev'` | Chebychev distance (maximum coordinate difference). |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
| `'hamming'` | Hamming distance, which is the percentage of coordinates that differ. |
| `'jaccard'` | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |
| `'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
| `@distfun` | Custom distance function handle. |

A custom distance function has the form

```
function D2 = distfun(ZI,ZJ)
% calculation of distance
...
```

where

• `ZI` is a `1`-by-`n` vector containing a single observation.

• `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations.

• `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`.

If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.

See `pdist` for more details.
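As an illustration of the `ZI`, `ZJ`, and `D2` shapes described above, the following sketch defines a hypothetical custom distance as an anonymous function. It computes the city block distance by hand, which is only a stand-in for whatever metric you actually need; the toy inputs are made up.

```matlab
% Hypothetical custom distance: city block (L1) distance.
% ZI is 1-by-n, ZJ is m2-by-n, and the result is m2-by-1.
distfun = @(ZI,ZJ) sum(abs(bsxfun(@minus, ZJ, ZI)), 2);

ZI = [0 0 0];            % a single observation
ZJ = [1 2 3; -1 0 1];    % two observations to compare against
D2 = distfun(ZI,ZJ);     % one distance per row of ZJ

% Assumed usage with knnimpute (not run here; requires the toolbox):
% imputedData = knnimpute(data,3,'Distance',distfun);
```

Checking a handle on small inputs like this, before passing it to `knnimpute`, is a quick way to confirm that it returns an `m2`-by-`1` column vector.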

Example: `'Distance','cosine'`

Data Types: `char` | `string` | `function_handle`

`'DistArgs'`: Distance metric parameter values, specified as a positive scalar or cell array of values. Use `'DistArgs'` together with `'Distance'` to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: ```output = knnimpute(data,3,'Distance','minkowski','DistArgs',5)```

Example: `'DistArgs',3`

Data Types: `double` | `cell`

`'Weights'`: Weights used in the weighted mean calculation, specified as a numeric vector of length `k`.

Example: `'Weights',[0.3 0.5 0.2]`

Data Types: `double`

`'Median'`: Flag to use the median of `k` nearest neighbors instead of the weighted mean, specified as `true` or `false`.

Example: `'Median',true`

Data Types: `logical`
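To make the difference between the two estimators concrete, the sketch below combines three made-up neighbor values by hand, once with user-supplied weights and once with the median. The variable names are hypothetical, and dividing by the sum of the weights is an assumption about normalization rather than documented behavior.

```matlab
% Sketch: two ways to combine k = 3 neighbor values (toy numbers).
vals = [2 6 10];

w = [0.3 0.5 0.2];                            % weights, as passed to 'Weights'
weightedEstimate = sum(w .* vals) / sum(w);   % normalization is an assumption

medianEstimate = median(vals);                % estimator chosen by 'Median',true
```

The median ignores the weights and is less sensitive to a single outlying neighbor value than the weighted mean.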

## Output Arguments

`imputedData`: Imputed data, returned as a numeric matrix in which the NaNs of the input `data` are replaced with values from the nearest-neighbor columns.

## Version History

Introduced before R2006a