Anomaly Detection: Isolation Forest

The IForest algorithm was proposed by Dr. Fei Tony Liu while he was studying at Monash University, supervised by Prof. Kai-Ming Ting and Prof. Zhi-Hua Zhou. The first version, published at ICDM 2008, won the conference's best paper award of that year. In industry, IForest is a practical, effective, and time-efficient algorithm that handles high-dimensional and massive data well. Here is a brief summary of IForest.

ITree

When it comes to a forest, we must first pay attention to its trees. So, before saying anything about the Isolation Forest (IForest), we should understand how to build an Isolation Tree (ITree). An ITree is a random binary tree in which every node has either two children or none (a leaf node). Given a data set D, and assuming all attributes of D are continuous variables, an ITree is built as follows:

  • Randomly select an attribute Attr;
  • Randomly select a value X of Attr;
  • Split D on Attr, placing the records whose Attr value is less than X in the left child, and the records whose Attr value is greater than or equal to X in the right child;

Then construct the left child and the right child recursively until one of the following conditions is met (a code sketch follows the list):

  • The input data set has only one record, or several identical records;
  • The tree has reached its height limit;
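To make the construction concrete, here is a minimal Python sketch of the procedure above. The ITreeNode class and build_itree function are illustrative names rather than anything from the paper, and treating a constant-valued attribute as a stopping condition is a simplification.

```python
import random

class ITreeNode:
    """A node of an ITree: an internal split node or a leaf."""
    def __init__(self, size=0, attr=None, split=None, left=None, right=None):
        self.size = size    # number of records in a leaf (used later to adjust h(x))
        self.attr = attr    # index of the attribute Attr used for splitting
        self.split = split  # split value X
        self.left = left
        self.right = right

def build_itree(data, height, height_limit):
    """Recursively build an ITree from a list of records (lists of numbers)."""
    # Stop on a single record, identical records, or the height limit.
    if height >= height_limit or len(data) <= 1 or all(r == data[0] for r in data):
        return ITreeNode(size=len(data))
    attr = random.randrange(len(data[0]))          # randomly select an attribute
    lo = min(r[attr] for r in data)
    hi = max(r[attr] for r in data)
    if lo == hi:                                   # constant attribute: cannot split on it
        return ITreeNode(size=len(data))
    split = random.uniform(lo, hi)                 # randomly select a value X of Attr
    left = [r for r in data if r[attr] < split]    # Attr < X goes to the left child
    right = [r for r in data if r[attr] >= split]  # Attr >= X goes to the right child
    return ITreeNode(attr=attr, split=split,
                     left=build_itree(left, height + 1, height_limit),
                     right=build_itree(right, height + 1, height_limit))
```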

[Figure: an example ITree]

Once an ITree is built, we can use it for prediction. Prediction consists of walking a test record down the ITree to see which leaf node it falls into. An ITree can detect abnormal data under the assumption that anomalies are rare and are therefore separated into leaf nodes very quickly, so we can use the path length h(x) from the root node to the leaf node to decide whether a record is abnormal. For a data set containing n records, the minimum height of the tree is log(n) and the maximum height is n-1. However, normalizing by log(n) or by n-1 yields a score that is neither bounded nor convenient to compare. Here is a slightly more elaborate normalization formula:



$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + \xi$$

$$s(x, n) = 2^{-\frac{h(x)}{c(n)}}$$


s(x, n) is the anomaly score of record x in an ITree built from n training records. s(x, n) lies in [0, 1]: the closer it is to 1, the more likely x is an abnormal point, and the closer it is to 0, the more likely x is a normal point. The Greek letter ξ denotes Euler's constant (about 0.5772). If most s(x, n) values of the training samples are close to 0.5, the data set contains no obvious anomalies.
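As a quick sketch under the definitions above (the function names are mine, not the paper's), the normalization term and the score can be computed directly:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used to estimate harmonic numbers

def harmonic(i):
    """Approximate the i-th harmonic number: H(i) = ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Normalization term: the average path length for an ITree built on n records."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(h_x, n):
    """s(x, n) = 2^(-h(x)/c(n)): near 1 is abnormal, near 0 is normal."""
    return 2.0 ** (-h_x / c(n))
```

A record isolated after very few splits gets a short h(x) and hence a score well above 0.5, while a record buried deep in the tree scores closer to 0.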
However, since both the attribute and its split value are selected at random, a single ITree is by no means reliable on its own; it is the combination of many ITrees that becomes powerful and trustworthy.

IForest

Now let's talk about how to build an IForest from a given data set D containing N records. The method is similar to that of Random Forest: each tree is built from a sub-sample of the data, which keeps the trees different from one another. But the sub-sample size used by IForest need not equal N and can be far smaller than N. In fact, a sub-sample of more than 256 records brings little improvement and only wastes computation time. Why does IForest perform worse with more data? The two plots below give the answer:


[Figure: detection results when training on the full data set (left) versus on a sub-sample (right)]

The left plot shows the result of training on the full data set, and the right plot the result of training on a sub-sample only. Blue represents normal samples and red represents abnormal samples. Clearly, before sampling the normal and abnormal samples overlap and are very difficult to separate; after sampling, they can be separated cleanly.

Besides limiting the sub-sample size, we need to set a maximum height for each ITree: l = ceiling(log2(ψ)), where ψ is the sub-sample size. Since abnormal samples are few, their paths are comparatively short, and we only need to distinguish normal from abnormal records, we can restrict our attention to the part of each ITree below its average height, which makes the algorithm more efficient. However, this adjustment requires an improved calculation of the path length h(x), described further below. First, let's look at the pseudo-code of IForest:


[Figure: pseudo-code of the IForest algorithm]
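The pseudo-code image is not reproduced here, but a minimal Python sketch with the same overall structure might look as follows, reusing build_itree from the ITree section and assuming the recommended sub-sample size of 256:

```python
import math
import random

def build_iforest(data, n_trees=100, sample_size=256):
    """Build an IForest: train each ITree on a random sub-sample of the data."""
    sample_size = min(sample_size, len(data))
    height_limit = math.ceil(math.log2(sample_size))  # l = ceiling(log2(psi))
    forest = []
    for _ in range(n_trees):
        sample = random.sample(data, sample_size)     # sub-sample without replacement
        forest.append(build_itree(sample, 0, height_limit))
    return forest
```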

After the IForest is built, predicting on the test set requires integrating the results of the individual ITrees:



$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$


E(h(x)) denotes the average path length of record x over all the ITrees. In addition, the calculation of h(x) itself needs a correction: when a leaf node is created, the algorithm records the number Y of training records that the leaf contains, and that number is used to estimate the average depth of the subtree that was never built. h(x) is then calculated as follows:


$$h(x) = e + c(Y)$$

where e is the number of edges x traverses from the root to its leaf, and c(Y) is the normalization term defined earlier.
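Continuing the same sketch, the adjusted path length and the forest-level score could be implemented like this (path_length and forest_score are my names; c is the function defined above):

```python
def path_length(x, node, e=0):
    """h(x) = e + c(Y): edges traversed plus the estimated depth below the leaf."""
    if node.left is None and node.right is None:  # reached a leaf containing Y records
        return e + c(node.size)
    if x[node.attr] < node.split:
        return path_length(x, node.left, e + 1)
    return path_length(x, node.right, e + 1)

def forest_score(x, forest, sample_size):
    """s(x, n) = 2^(-E(h(x))/c(n)), averaging h(x) over all ITrees."""
    avg_h = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-avg_h / c(sample_size))
```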

Processing High-Dimensional Data

When processing high-dimensional data, the algorithm needs a refinement: we need not use all the attributes of the samples. Instead, we can exploit the kurtosis coefficient to pick out the most valuable attributes, and then build each ITree as in a random forest, from random records and random attributes.
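As a rough illustration of this idea (the exact selection rule here is my assumption, not a prescription from the paper), one could rank the attributes by kurtosis with scipy.stats.kurtosis and keep only the highest-scoring ones before building each tree:

```python
import numpy as np
from scipy.stats import kurtosis

def select_attributes(data, n_keep):
    """Return the indices of the n_keep attributes with the highest kurtosis."""
    scores = kurtosis(np.asarray(data), axis=0)  # one kurtosis value per attribute
    return np.argsort(scores)[::-1][:n_keep]     # indices sorted by descending score
```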

Other Instructions

This algorithm is essentially unsupervised learning, so the data need not be labeled beforehand. Sometimes abnormal data is so rare that only a handful of abnormal samples can be obtained for training. The paper mentions that building the IForest from normal samples alone is feasible; although the effect is somewhat reduced, the algorithm still performs well, and it can be improved further by adjusting the sub-sample size.
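In practice, rather than implementing all of this from scratch, one would typically reach for scikit-learn's IsolationForest, whose max_samples parameter corresponds to the sub-sample size discussed above. A minimal usage example (the synthetic data is only for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.randn(1000, 2)                     # unlabeled, mostly normal data
X_test = np.vstack([rng.randn(10, 2),            # normal-looking points
                    rng.uniform(4, 6, (5, 2))])  # far-away, likely abnormal points

clf = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
clf.fit(X_train)                                 # unsupervised: no labels required
print(clf.predict(X_test))                       # 1 = normal, -1 = anomaly
print(clf.score_samples(X_test))                 # higher means more normal
```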