
IForest

ikpykit.anomaly.IForest

IForest(
    n_estimators=100,
    max_samples="auto",
    contamination=0.1,
    max_features=1.0,
    bootstrap=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
)

Bases: OutlierMixin, BaseEstimator

Wrapper of scikit-learn Isolation Forest for anomaly detection.

The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, serves as a measure of normality and as the decision function. Random partitioning produces noticeably shorter paths for anomalies, so samples for which the forest collectively produces shorter path lengths are highly likely to be anomalies.
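To make the path-length idea concrete, here is a minimal pure-Python sketch that isolates one point by repeated random splits. It is not the scikit-learn implementation, which also subsamples, limits tree depth, and normalizes path lengths. The helper names `isolation_path_length` and `mean_path_length` are hypothetical and chosen only for illustration.

```python
import random


def isolation_path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Randomly select a feature, then a split value between its min and max.
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        # All remaining samples share this value; this sketch simply stops.
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains `point`.
    if point[feature] < split:
        subset = [row for row in data if row[feature] < split]
    else:
        subset = [row for row in data if row[feature] >= split]
    return isolation_path_length(point, subset, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many independent random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


X = [[-1.1, 0.2], [0.3, 0.5], [0.5, 1.1], [100.0, 90.0]]
outlier_depth = mean_path_length(X[3], X)  # far-away point
inlier_depth = mean_path_length(X[1], X)   # point inside the cluster
print(outlier_depth, inlier_depth)
```

On this toy data the distant point [100, 90] is usually separated by the very first split, whereas [0.3, 0.5] lies between other samples on both features and therefore always needs at least two splits.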

Parameters:

Name Type Description Default
n_estimators int

The number of base estimators (trees) in the ensemble.

100
max_samples "auto", int or float

The number of samples to draw from X to train each base estimator.
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples.
- If "auto", then max_samples=min(256, n_samples).

"auto"
contamination "auto" or float

The proportion of outliers in the data set. Used to define the threshold on the scores of the samples.
- If "auto", the threshold is determined as in the original paper [1].
- If float, the contamination should be in the range (0, 0.5].

0.1
max_features int or float

The number of features to draw from X to train each base estimator.
- If int, then draw max_features features.
- If float, then draw max_features * X.shape[1] features.

1.0
bootstrap bool

If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.

False
n_jobs int

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

1
random_state int, RandomState instance or None

Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest. Pass an int for reproducible results across multiple function calls.

None
verbose int

Controls the verbosity of the tree building process.

0
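As a rough illustration of the max_samples rules listed above, the sketch below resolves each accepted form to a per-tree sample count. The function `resolve_max_samples` is a hypothetical helper, not part of ikpykit or scikit-learn, and truncating the float case to an integer is an assumption of this sketch.

```python
def resolve_max_samples(max_samples, n_samples):
    """Return the per-tree sample count implied by `max_samples` (illustrative helper)."""
    if max_samples == "auto":
        # "auto" caps the subsample at 256 samples.
        return min(256, n_samples)
    if isinstance(max_samples, int):
        # An int is taken as an absolute number of samples.
        return max_samples
    # A float is a fraction of the training set; truncation is assumed here.
    return int(max_samples * n_samples)


print(resolve_max_samples("auto", 1000))
print(resolve_max_samples("auto", 100))
print(resolve_max_samples(0.5, 1000))
print(resolve_max_samples(64, 1000))
```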

Attributes:

Name Type Description
detector_ IsolationForest

The underlying scikit-learn IsolationForest object.

is_fitted_ bool

Indicates whether the estimator has been fitted.

References

[1] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). "Isolation forest." In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.

[2] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2012). "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 1-39.

Examples:

>>> from ikpykit.anomaly import IForest
>>> import numpy as np
>>> X = np.array([[-1.1, 0.2], [0.3, 0.5], [0.5, 1.1], [100, 90]])
>>> clf = IForest(contamination=0.25).fit(X)
>>> clf.predict([[0.1, 0.3], [0, 0.7], [90, 85]])
array([ 1,  1, -1])
Source code in ikpykit/anomaly/_iforest.py
def __init__(
    self,
    n_estimators=100,
    max_samples="auto",
    contamination=0.1,
    max_features=1.0,
    bootstrap=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
):
    self.contamination = contamination
    self.n_estimators = n_estimators
    self.max_samples = max_samples
    self.max_features = max_features
    self.bootstrap = bootstrap
    self.n_jobs = n_jobs
    self.random_state = random_state
    self.verbose = verbose

fit

fit(X, y=None)

Fit the isolation forest model.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

The input samples. Use dtype=np.float32 for maximum efficiency.

required
y Ignored

Not used, present for API consistency by convention.

None

Returns:

Name Type Description
self object

Fitted estimator.

Source code in ikpykit/anomaly/_iforest.py
def fit(self, X, y=None):
    """
    Fit the isolation forest model.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples. Use ``dtype=np.float32`` for maximum
        efficiency.

    y : Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.
    """
    # Check data
    X = check_array(X, accept_sparse=False)

    self.detector_ = IsolationForest(
        n_estimators=self.n_estimators,
        max_samples=self.max_samples,
        contamination=self.contamination,
        max_features=self.max_features,
        bootstrap=self.bootstrap,
        n_jobs=self.n_jobs,
        random_state=self.random_state,
        verbose=self.verbose,
    )

    self.detector_.fit(X=X, y=None, sample_weight=None)
    self.is_fitted_ = True

    return self

predict

predict(X)

Predict if a particular sample is an outlier or not.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

The input samples.

required

Returns:

Name Type Description
is_inlier ndarray of shape (n_samples,)

The predicted labels. +1 for inliers, -1 for outliers.

Source code in ikpykit/anomaly/_iforest.py
def predict(self, X):
    """
    Predict if a particular sample is an outlier or not.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples.

    Returns
    -------
    is_inlier : ndarray of shape (n_samples,)
        The predicted labels. +1 for inliers, -1 for outliers.
    """
    check_is_fitted(self, "is_fitted_")
    return self.detector_.predict(X)

decision_function

decision_function(X)

Compute the anomaly score for each sample.

The anomaly score of an input sample is computed as the mean anomaly score of the trees in the forest.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

The input samples.

required

Returns:

Name Type Description
scores ndarray of shape (n_samples,)

The anomaly score of the input samples. The lower, the more abnormal. Negative scores represent outliers, positive scores represent inliers.
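The sign convention described above can be checked with a short sketch. Because the wrapper delegates to its detector_ attribute, the example calls scikit-learn's IsolationForest directly; it assumes scikit-learn and NumPy are installed, and the data and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[-1.1, 0.2], [0.3, 0.5], [0.5, 1.1], [100, 90]])
detector = IsolationForest(contamination=0.25, random_state=0).fit(X)

X_new = np.array([[0.1, 0.3], [0, 0.7], [90, 85]])
scores = detector.decision_function(X_new)
labels = detector.predict(X_new)

# Negative decision scores map to -1 (outlier); all others map to +1 (inlier).
labels_from_scores = np.where(scores < 0, -1, 1)
print(scores, labels, labels_from_scores)
```

Working with the scores rather than the labels is useful when samples need to be ranked by abnormality or when a custom threshold is preferred over the one implied by contamination.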

Source code in ikpykit/anomaly/_iforest.py
def decision_function(self, X):
    """
    Compute the anomaly score for each sample.

    The anomaly score of an input sample is computed as
    the mean anomaly score of the trees in the forest.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples.

    Returns
    -------
    scores : ndarray of shape (n_samples,)
        The anomaly score of the input samples.
        The lower, the more abnormal. Negative scores represent outliers,
        positive scores represent inliers.
    """
    check_is_fitted(self, "is_fitted_")
    return self.detector_.decision_function(X)

score_samples

score_samples(X)

Return the raw anomaly score of samples.

The anomaly score of an input sample is computed as the mean anomaly score of the trees in the forest.

Parameters:

Name Type Description Default
X array-like of shape (n_samples, n_features)

The input samples.

required

Returns:

Name Type Description
scores ndarray of shape (n_samples,)

The raw anomaly score of the input samples. The lower, the more abnormal.
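The sketch below shows how the raw scores relate to decision_function in scikit-learn, where the decision function is the raw score shifted by the fitted `offset_` attribute. It calls scikit-learn's IsolationForest directly (through the wrapper, the same object would be reached as `clf.detector_`), and the synthetic data and seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 points from a standard normal cluster plus one distant point.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

detector = IsolationForest(contamination="auto", random_state=0).fit(X)
raw = detector.score_samples(X)           # raw scores, never positive
shifted = detector.decision_function(X)   # raw scores minus offset_

# With contamination="auto", scikit-learn fixes the offset at -0.5.
print(detector.offset_)
print(np.allclose(shifted, raw - detector.offset_))
```

Raw scores are convenient for comparing samples across models fitted with different contamination values, since they do not depend on the threshold.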

Source code in ikpykit/anomaly/_iforest.py
def score_samples(self, X):
    """
    Return the raw anomaly score of samples.

    The anomaly score of an input sample is computed as
    the mean anomaly score of the trees in the forest.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples.

    Returns
    -------
    scores : ndarray of shape (n_samples,)
        The raw anomaly score of the input samples.
        The lower, the more abnormal.
    """
    check_is_fitted(self, "is_fitted_")
    # Check data
    X = check_array(X, accept_sparse=False)
    return self.detector_.score_samples(X)