# ANOMALY

**AXCEL.ANOMALY function**

Anomaly detection is one of the unsupervised machine learning algorithms. This function applies anomaly detection which is the identification of unusual entities, observations, or events that raise suspicions by differing significantly from the majority of the data.

**Syntax**

AXCEL.ANOMALY(data, contamination, [KeepOut], [Method], [probability], [NAsValue])

The AXCEL.ANOMALY function syntax has the following arguments:

**data** Required. data must be a table of attributes that are considered to be used in anomaly detection. For example, we would like to exercise a fair lending review program inside a bank. We cannot review thousands of the loans already issued. As a result, we would like to identify loans with a higher risk of unfair lending and review them to make sure that the underwriting process is in compliance and fair. For this purpose, we have a list of loans issued by the bank and we would like to find the loans with abnormal interest rates based on borrowers and loan attributes for which we run an anomaly detection model. Here is an example:

interest | FICO | Debt_to_Income | loan_amout | term |

2.2 | 730 | 0.31 | 995000 | 15-year |

2.8 | 710 | 0.38 | 990000 | 15-year |

2.4 | 740 | 0.35 | 850000 | 30-year |

… | … | … | … | … |

If you add “.n” at the end of a column, you can drop it out from your analysis. For instance, if you would like to drop FICO in the above table you can change *FICO* to *FICO.n*:

FICO.n |

730 |

710 |

740 |

… |

If you have categorical variables, you can add “.f” at the end of the variable name. For instance, in the above example, if you run the ANOMALY function on that data, you get an error since you have non-numeric values in the *term* column. If you want to keep this variable in your analysis, you can add “.f” to this column name to transfer it to a categorical variable as shown below:

term.f |

15-year |

15-year |

30-year |

… |

Axcel uses Binary Encoding for the categorical variables. If you would like to use other techniques such as One-Hot encoding or REplacing, you need to produce them separately and add them to your dataset.

**contamination** Required. This is the percentage of data that you expect to be identified as part of anomaly detection. In our loan review example, if we assume that we would like to review a maximum of 5% of the loans in our book, we set the contamination rate to 5.

**Method** Optional. Default is K-nearest-neighbors methodology. Axcel supports several anomaly detection methodologies that you can choose here. Here is the list of available methods

"KNN" or "knn" for K-nearest-neighbors (default) "AB" or "ab" for Angle-base Outlier Detection "FB" or "fb" for Feature Bagging "HB" or "hb" for Histogram-Based Outlier Detection "IF" or "if" for Isolation Forest Outlier Detection "LOF" or "lof" for Local Outlier Factor "MCD" or "mcd" for Minimum Covariance Determinant "OCS" or "ocs" for One-class SVM Detection "PCA" or "pca" for Principal Component Analysis

**Probability** Optional. Default is False. If set TRUE, Axcel gives the probability of anomaly for each observation. Otherwise, it reports the outcome in binary format, i.e., 0 and 1.

**NAsValue: **Optional. If it is not set any value, Axcel replaces the average value of each column for missing values to minimize the bias. If you set a number, Axcel replaces all NAs with that number. To make it transparent for analysis, if there is an NA value in any column, Axcel reports a separate column called Has_NA and shows TRUE for those rows which have one or more NA values.

when you type *=AXCEL.ANOMALY* in an Excel cell, the IntelliSense guides you through required and optional (shown in [] brackets) inputs:

In the example above, we have:

=AXCEL.ANOMALY(B1:CD988, 5,,TRUE)

This means that our data is located at cell B1 through CD1081, we expect 5% contamination in our data, we use the default methodology which is K-nearest-neighbors. We set Probability to TRUE. For NA values, we skipped this entry to use default functionality which replaces NAs by the average of the values in the column. In this example, we use *Mice Protein Expression* dataset from UCI. Here is the output of this function:

As you can see, in a new column called *Anomaly *the probability of anomaly for each experiment is reported. Also, since the dataset has NA values a new column of Has_NA is also reported. Also, please note that for columns *Genotype, Treatment, Behavior,* and *Class* “.f” is added to the names to identify them as categorical variables for Axcel.