ANOMALY

You are here:

AXCEL.ANOMALY function

Anomaly detection is one of the unsupervised machine learning algorithms. This function applies anomaly detection which is the identification of unusual entities, observations, or events that raise suspicions by differing significantly from the majority of the data.

Syntax

AXCEL.ANOMALY(data, contamination, [KeepOut], [Method], [probability], [NAsValue])


The AXCEL.ANOMALY function syntax has the following arguments:

data Required. data must be a table of attributes that are considered to be used in anomaly detection. For example, we would like to exercise a fair lending review program inside a bank. We cannot review thousands of the loans already issued. As a result, we would like to identify loans with a higher risk of unfair lending and review them to make sure that the underwriting process is in compliance and fair. For this purpose, we have a list of loans issued by the bank and we would like to find the loans with abnormal interest rates based on borrowers and loan attributes for which we run an anomaly detection model. Here is an example:

interestFICODebt_to_Incomeloan_amoutterm
2.27300.3199500015-year
2.87100.3899000015-year
2.47400.3585000030-year

If you add “.n” at the end of a column, you can drop it out from your analysis. For instance, if you would like to drop FICO in the above table you can change FICO to FICO.n:

FICO.n
730
710
740

If you have categorical variables, you can add “.f” at the end of the variable name. For instance, in the above example, if you run the ANOMALY function on that data, you get an error since you have non-numeric values in the term column. If you want to keep this variable in your analysis, you can add “.f” to this column name to transfer it to a categorical variable as shown below:

term.f
15-year
15-year
30-year

Axcel uses Binary Encoding for the categorical variables. If you would like to use other techniques such as One-Hot encoding or REplacing, you need to produce them separately and add them to your dataset.

contamination Required. This is the percentage of data that you expect to be identified as part of anomaly detection. In our loan review example, if we assume that we would like to review a maximum of 5% of the loans in our book, we set the contamination rate to 5.

Method Optional. Default is K-nearest-neighbors methodology. Axcel supports several anomaly detection methodologies that you can choose here. Here is the list of available methods

"KNN" or "knn" for K-nearest-neighbors (default)
"AB" or "ab" for Angle-base Outlier Detection
"FB" or "fb" for Feature Bagging
"HB" or "hb" for Histogram-Based Outlier Detection
"IF" or "if" for Isolation Forest Outlier Detection
"LOF" or "lof" for Local Outlier Factor 
"MCD" or "mcd" for Minimum Covariance Determinant
"OCS" or "ocs" for One-class SVM Detection
"PCA" or "pca" for Principal Component Analysis

Probability Optional. Default is False. If set TRUE, Axcel gives the probability of anomaly for each observation. Otherwise, it reports the outcome in binary format, i.e., 0 and 1.

NAsValue: Optional. If it is not set any value, Axcel replaces the average value of each column for missing values to minimize the bias. If you set a number, Axcel replaces all NAs with that number. To make it transparent for analysis, if there is an NA value in any column, Axcel reports a separate column called Has_NA and shows TRUE for those rows which have one or more NA values.


when you type =AXCEL.ANOMALY in an Excel cell, the IntelliSense guides you through required and optional (shown in [] brackets) inputs:

In the example above, we have:

=AXCEL.ANOMALY(B1:CD988, 5,,TRUE)

This means that our data is located at cell B1 through CD1081, we expect 5% contamination in our data, we use the default methodology which is K-nearest-neighbors. We set Probability to TRUE. For NA values, we skipped this entry to use default functionality which replaces NAs by the average of the values in the column. In this example, we use Mice Protein Expression dataset from UCI. Here is the output of this function:

As you can see, in a new column called Anomaly the probability of anomaly for each experiment is reported. Also, since the dataset has NA values a new column of Has_NA is also reported. Also, please note that for columns Genotype, Treatment, Behavior, and Class “.f” is added to the names to identify them as categorical variables for Axcel.