IDKC

ikpykit.cluster.IDKC

IDKC(
    n_estimators,
    max_samples,
    method,
    k,
    kn,
    v,
    n_init_samples,
    init_center=None,
    is_post_process=True,
    random_state=None,
)

Bases: BaseEstimator, ClusterMixin

Isolation Distributional Kernel Clustering.

A clustering algorithm that leverages Isolation Kernels to transform data into a feature space where cluster structures are more distinguishable. The algorithm first constructs Isolation Kernel representations, then performs clustering in this transformed space using a threshold-based assignment mechanism.
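To make the idea of an Isolation Kernel representation more concrete, here is a minimal standard-library sketch based on nearest-neighbour partitioning, one common way to build such a kernel. It is not ikpykit's implementation, and whether it matches any of the `method` options is not confirmed; the helper names `fit_partitions`, `cell_ids`, and `similarity` are hypothetical. Each estimator draws `max_samples` points, every point is mapped to the cell of its nearest drawn point, and the similarity of two points is the fraction of estimators in which they share a cell.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - w) ** 2 for u, w in zip(a, b))


def fit_partitions(data, n_estimators, max_samples, seed=0):
    """Draw `max_samples` reference points per estimator (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.sample(data, max_samples) for _ in range(n_estimators)]


def cell_ids(point, partitions):
    """For each estimator, return the index of the reference point nearest to `point`."""
    return [
        min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        for centers in partitions
    ]


def similarity(x, y, partitions):
    """Fraction of estimators in which x and y fall into the same cell."""
    ids_x = cell_ids(x, partitions)
    ids_y = cell_ids(y, partitions)
    shared = sum(a == b for a, b in zip(ids_x, ids_y))
    return shared / len(partitions)


data = [(1, 2), (1, 4), (5, 2), (5, 5), (1, 0), (5, 0)]
partitions = fit_partitions(data, n_estimators=100, max_samples=3, seed=42)

near = similarity((1, 2), (1, 4), partitions)  # two points that are close together
far = similarity((1, 2), (5, 5), partitions)   # two points that are far apart
print(near, far)
```

Under this construction a point is always fully similar to itself, and every similarity lies between 0 and 1. More estimators give a finer-grained and more stable estimate.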

Parameters:

Name Type Description Default
n_estimators int

Number of base estimators in the ensemble for the Isolation Kernel. Higher values generally lead to more stable results but increase computation time.

required
max_samples int

Number of samples to draw from X to train each base estimator in the Isolation Kernel. Controls the granularity of the kernel representation.

required
method (inne, anne, iforest)

Method used to calculate the Isolation Kernel:

- 'inne': Isolation Nearest Neighbor Ensemble
- 'anne': Approximate Nearest Neighbor Ensemble
- 'iforest': Isolation Forest

required
k int

Number of clusters to form in the dataset.

required
kn int

Number of nearest neighbors used for local contrast density calculation during initialization. Higher values consider more neighbors when determining density.

required
v float

Decay factor (0 < v < 1) for reducing the similarity threshold during clustering. Smaller values cause faster decay, leading to more aggressive cluster assignments.

required
n_init_samples int or float

If int, the number of samples to consider when initializing cluster centers. If float, the fraction of the total samples to consider. Larger values may produce better initial centers but increase computation.

required
init_center int or array-like of shape (k,)

Index or indices of initial cluster centers. If None, centers are selected automatically based on density and distance considerations.

None
is_post_process bool

Whether to perform post-processing refinement of clusters through iterative reassignment. Improves cluster quality but adds computational overhead.

True
random_state int, RandomState instance or None

Controls the randomness of the algorithm. Pass an int for reproducible results.

None
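The role of the decay factor `v` can be illustrated with a toy sketch. It assumes the threshold is multiplied by `v` after every round and that the point-to-cluster similarities stay fixed, which is a simplification: the real algorithm works on Isolation Kernel representations, and its exact update rule is not shown on this page. The function `assign_with_decay` and its `initial_threshold` and `min_threshold` arguments are hypothetical.

```python
def assign_with_decay(similarities, v, initial_threshold, min_threshold=1e-3):
    """Toy threshold-based assignment (hypothetical helper, not ikpykit code).

    similarities[i][c] is the similarity of point i to cluster c.
    Returns the labels (-1 for unassigned points) and the number of rounds run.
    """
    labels = [-1] * len(similarities)
    threshold = initial_threshold
    rounds = 0
    while threshold > min_threshold and -1 in labels:
        for i, row in enumerate(similarities):
            if labels[i] != -1:
                continue  # already assigned in an earlier round
            best = max(range(len(row)), key=row.__getitem__)
            if row[best] > threshold:
                labels[i] = best
        threshold *= v  # assumed multiplicative decay
        rounds += 1
    return labels, rounds


similarities = [
    [0.90, 0.10],
    [0.20, 0.60],
    [0.05, 0.30],
    [0.00, 0.00],  # similar to no cluster, so it stays unassigned
]

labels_slow, rounds_slow = assign_with_decay(similarities, v=0.5, initial_threshold=0.8)
labels_fast, rounds_fast = assign_with_decay(similarities, v=0.1, initial_threshold=0.8)
print(labels_slow, rounds_slow)
print(labels_fast, rounds_fast)
```

In this toy setting both values of `v` produce the same labels, but the smaller `v` reaches them in fewer rounds because the threshold drops faster.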

Attributes:

Name Type Description
clusters_ list of KCluster objects

The cluster objects containing assignment and centroid information.

it_ int

Number of iterations performed during the initial clustering phase.

labels_ ndarray of shape (n_samples,)

Cluster labels for each point. Points not assigned to any cluster have label -1 (outliers).

is_fitted_ bool

Whether the model has been fitted to data.
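Because unassigned points carry the label -1, it is often useful to separate them from the real clusters when inspecting `labels_`. The sketch below works on a plain list of labels; `summarize_labels` is a hypothetical helper, and the label values are made up for illustration. For a NumPy array, `labels_.tolist()` gives a list in this form.

```python
from collections import Counter


def summarize_labels(labels):
    """Return ({cluster_label: size}, n_outliers), treating -1 as the outlier label."""
    counts = Counter(labels)
    n_outliers = counts.pop(-1, 0)  # remove the outlier bucket if present
    return dict(counts), n_outliers


labels = [1, 1, 0, 0, 1, -1]  # illustrative values, not real model output
cluster_sizes, n_outliers = summarize_labels(labels)
print(cluster_sizes, n_outliers)
```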

Examples:

>>> from ikpykit.cluster import IDKC
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [5, 2], [5, 5],  [1, 0], [5, 0]])
>>> clustering = IDKC(
...     n_estimators=100, max_samples=3, method='anne',
...     k=2, kn=5, v=0.5, n_init_samples=4, random_state=42
... )
>>> clustering.fit_predict(X)
array([1, 1, 0, 0, 1, 0])

References

[1] Ye Zhu, Kai Ming Ting (2023). Kernel-based Clustering via Isolation Distributional Kernel. Information Systems.

Source code in ikpykit/cluster/_idkc.py
def __init__(
    self,
    n_estimators,
    max_samples,
    method,
    k,
    kn,
    v,
    n_init_samples,
    init_center=None,
    is_post_process=True,
    random_state=None,
):
    self.n_estimators = n_estimators
    self.max_samples = max_samples
    self.method = method
    self.k = k
    self.kn = kn
    self.v = v
    self.n_init_samples = n_init_samples
    self.is_post_process = is_post_process
    self.init_center = init_center
    self.random_state = random_state
    self.clusters_ = []
    self.it_ = 0
    self.labels_ = None
    self.data_index = None

n_it property

n_it

Get number of iterations performed during clustering.

fit

fit(X, y=None)

Fit the IDKC clustering model on data X.

Parameters:

Name Type Description Default
X ndarray of shape (n_samples, n_features)

The input instances to cluster.

required
y Ignored

Not used, present for API consistency by convention.

None

Returns:

Name Type Description
self object

Fitted estimator.

Source code in ikpykit/cluster/_idkc.py
def fit(self, X, y=None):
    """Fit the IDKC clustering model on data X.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        The input instances to cluster.
    y : Ignored
        Not used, present for API consistency by convention.

    Returns
    -------
    self : object
        Fitted estimator.
    """
    X = check_array(X)
    if self.n_init_samples <= 0:
        raise ValueError(
            f"Number of initial samples n_init_samples={self.n_init_samples} must be greater than 0"
        )
    elif isinstance(self.n_init_samples, numbers.Integral):
        if self.n_init_samples > X.shape[0]:
            # Warn before clamping so the message reports the original value.
            # `warn` returns None, so it is called rather than raised.
            warn(
                f"Number of initial samples n_init_samples={self.n_init_samples} is greater than the number of samples in the dataset. Setting n_init_samples to {X.shape[0]}"
            )
            self.n_init_samples = X.shape[0]
        else:
            self.n_init_samples = int(self.n_init_samples)
    elif isinstance(self.n_init_samples, float):
        if not (0 < self.n_init_samples <= 1):
            raise ValueError(
                f"Fraction of initial samples n_init_samples={self.n_init_samples} must be between 0 and 1"
            )
        self.n_init_samples = int(self.n_init_samples * X.shape[0])
    self.data_index = np.arange(X.shape[0])
    isokernel = IsoKernel(
        method=self.method,
        max_samples=self.max_samples,
        n_estimators=self.n_estimators,
        random_state=self.random_state,
    )
    data_ik = isokernel.fit_transform(X)
    self._fit(data_ik)

    # Apply post-processing if requested
    if self.is_post_process:
        self._post_process(data_ik)

    self.is_fitted_ = True
    self.labels_ = self._get_labels(X)
    return self
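The handling of `n_init_samples` in `fit` can be read as a small pure function. The sketch below mirrors that validation using only the standard library; `resolve_n_init_samples` is a hypothetical helper, and the final `TypeError` for unsupported types is an assumption that the source above does not make.

```python
import numbers
import warnings


def resolve_n_init_samples(n_init_samples, n_samples):
    """Turn an int count or float fraction into an absolute sample count."""
    if n_init_samples <= 0:
        raise ValueError(f"n_init_samples={n_init_samples} must be greater than 0")
    if isinstance(n_init_samples, numbers.Integral):
        if n_init_samples > n_samples:
            warnings.warn(
                f"n_init_samples={n_init_samples} exceeds the number of samples; "
                f"using {n_samples} instead"
            )
            return n_samples
        return int(n_init_samples)
    if isinstance(n_init_samples, float):
        if not 0 < n_init_samples <= 1:
            raise ValueError(f"n_init_samples={n_init_samples} must be between 0 and 1")
        # int() truncates, so a small fraction of a small dataset can resolve to 0.
        return int(n_init_samples * n_samples)
    # Assumption: reject other types explicitly instead of passing them through.
    raise TypeError("n_init_samples must be an int or a float")


print(resolve_n_init_samples(4, 6))    # an int is used as given
print(resolve_n_init_samples(0.5, 6))  # a float is a fraction of the dataset
```

One consequence worth noting is the truncation: with 6 samples, a fraction of 0.1 resolves to 0 initial samples, so very small fractions are best avoided on small datasets.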

predict

predict(X)

Predict the cluster labels for each point in X.

Parameters:

Name Type Description Default
X ndarray of shape (n_samples, n_features)

The input instances to predict cluster labels for.

required

Returns:

Name Type Description
labels ndarray of shape (n_samples,)

Cluster labels for each point. Points not assigned to any cluster have label -1 (outliers).

Source code in ikpykit/cluster/_idkc.py
def predict(self, X):
    """Predict the cluster labels for each point in X.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        The input instances to predict cluster labels for.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
        Cluster labels for each point. Points not assigned to any cluster
        have label -1 (outliers).
    """
    X = check_array(X)

    # Call the helper; returning the bare attribute would hand back the bound method.
    return self._get_labels(X)