StreaKHC
ikpykit.stream.STREAMKHC
STREAMKHC(
method="anne",
n_estimators=200,
max_samples="auto",
max_leaf=5000,
random_state=None,
)
Bases: BaseEstimator, ClusterMixin
Streaming Hierarchical Clustering Based on Point-Set Kernel.
This algorithm performs hierarchical clustering on streaming data using isolation kernel techniques. It builds a tree structure that adapts as new data points arrive, allowing for efficient online clustering.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | The method used to calculate the Isolation Kernel. Possible values are 'inne' and 'anne'. | "anne" |
n_estimators | int | The number of base estimators in the isolation kernel. | 200 |
max_samples | str, int or float | The number of samples to draw from X to train each base estimator. - If int, then draw | "auto" |
max_leaf | int | Maximum number of data points to maintain in the clustering tree. When exceeded, the oldest points will be removed. | 5000 |
random_state | int, RandomState instance or None | Controls the randomness of the estimator. | None |
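The `max_leaf` description implies first-in, first-out eviction once the cap is reached. The sketch below illustrates that behaviour with a plain `deque`. `BoundedPointBuffer` is a hypothetical helper written for this illustration, not part of ikpykit, and the library's actual tree maintenance may differ.

```python
from collections import deque


class BoundedPointBuffer:
    """Hypothetical helper: keeps at most `max_leaf` points, dropping the oldest first."""

    def __init__(self, max_leaf):
        self.max_leaf = max_leaf
        self.points = deque()

    def add(self, point):
        """Store a point and return the points evicted to stay within the cap."""
        self.points.append(point)
        evicted = []
        while len(self.points) > self.max_leaf:
            evicted.append(self.points.popleft())  # oldest point leaves first
        return evicted


buffer = BoundedPointBuffer(max_leaf=3)
evicted_log = [buffer.add(i) for i in range(5)]
```

After five insertions with a cap of three, only the three most recent points remain in the buffer.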
Attributes:
Name | Type | Description |
---|---|---|
tree_ | INODE | The root node of the hierarchical clustering tree. |
iso_kernel_ | IsoKernel | The isolation kernel used for data transformation. |
point_counter_ | int | Counter tracking the total number of points processed. |
n_features_in_ | int | Number of features seen during fit. |
Examples:
>>> from ikpykit.stream import STREAMKHC
>>> import numpy as np
>>> # Generate sample data
>>> X = np.random.rand(100, 10) # 100 samples with 10 features
>>> y = np.random.randint(0, 3, size=100) # Optional class labels
>>> # Initialize and fit the model with a batch
>>> clusterer = STREAMKHC(n_estimators=100, random_state=42)
>>> clusterer = clusterer.fit(X, y)
>>> # Process new streaming data
>>> new_data = np.random.rand(1, 10) # 1 new sample with 10 features
>>> new_labels = np.random.randint(0, 3, size=1) # Optional class labels
>>> clusterer = clusterer.fit_online(new_data, new_labels)
>>> # Calculate clustering purity (if labels were provided)
>>> purity = clusterer.get_purity()
References
.. [1] Xin Han, Ye Zhu, Kai Ming Ting, De-Chuan Zhan, Gang Li (2022) Streaming Hierarchical Clustering Based on Point-Set Kernel. Proceedings of The ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Source code in ikpykit/stream/cluster/_streakhc.py
fit
fit(X, y=None)
Fit the model with a batch of data points.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The input data points. | required |
y | array-like of shape (n_samples,), optional | The labels of the data points. They are not used for clustering, only for calculating purity. If not provided, the model will generate a tree with a single label. | None |
Returns:
Name | Type | Description |
---|---|---|
self | STREAMKHC | Returns self. |
Raises:
Type | Description |
---|---|
ValueError | If parameters are invalid or data has incorrect shape. |
fit_online
fit_online(X, y=None)
Fit the model with a stream of data points.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | The input data points. | required |
y | array-like of shape (n_samples,), optional | The labels of the data points. They are not used for clustering, only for calculating purity. If not provided, the model will generate a tree with a single label. | None |
Returns:
Name | Type | Description |
---|---|---|
self | STREAMKHC | Returns self. |
Raises:
Type | Description |
---|---|
NotFittedError | If the model has not been initialized with fit. |
ValueError | If X has a different number of features than seen during fit. |
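Because `fit_online` raises `NotFittedError` until `fit` has been called, a stream has to be split into one initial batch followed by online chunks. The sketch below shows one way to plan that split. `stream_in_batches`, `warmup_size`, and `batch_size` are hypothetical names introduced here for illustration, not part of ikpykit.

```python
from itertools import islice


def stream_in_batches(points, warmup_size, batch_size):
    """Yield ("fit", batch) once, then ("fit_online", batch) for each later chunk."""
    iterator = iter(points)
    warmup = list(islice(iterator, warmup_size))
    if warmup:
        yield "fit", warmup
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield "fit_online", batch


# Ten incoming points: four for the initial fit, the rest in chunks of three.
plan = [
    (method_name, len(batch))
    for method_name, batch in stream_in_batches(range(10), warmup_size=4, batch_size=3)
]
```

With a real estimator, each step of the plan would presumably become a call such as `getattr(clusterer, method_name)(np.asarray(batch))`, so that `fit` always runs before the first `fit_online`.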
get_purity
get_purity()
Calculate the purity of the clustering tree.
Returns:
Type | Description |
---|---|
float | The purity score of the clustering tree. |
Raises:
Type | Description |
---|---|
NotFittedError | If the model has not been initialized. |
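This page does not state how the purity score is defined. One common choice for hierarchical clusterings is dendrogram purity: for every pair of same-label points, take the subtree rooted at their lowest common ancestor and measure the fraction of its leaves that share that label, then average over all such pairs. The sketch below assumes that definition and a hypothetical nested-tuple binary tree whose leaves are labels. It does not use the library's INODE class and may not match what `get_purity` returns.

```python
from collections import Counter


def dendrogram_purity(tree):
    """Dendrogram purity of a binary tree given as nested 2-tuples of leaf labels."""
    total_score = 0.0
    total_pairs = 0

    def visit(node):
        nonlocal total_score, total_pairs
        if not isinstance(node, tuple):
            return Counter([node])  # a leaf holds a single label
        left, right = (visit(child) for child in node)
        merged = left + right
        size = sum(merged.values())
        # Pairs whose lowest common ancestor is this node have one leaf on each side.
        for label in left.keys() & right.keys():
            pairs = left[label] * right[label]
            total_score += pairs * merged[label] / size
            total_pairs += pairs
        return merged

    visit(tree)
    # Undefined when no two leaves share a label.
    return total_score / total_pairs if total_pairs else float("nan")


pure = dendrogram_purity((("a", "a"), ("b", "b")))
mixed = dendrogram_purity((("a", "b"), ("a", "b")))
```

In the first tree each label sits in its own subtree, so the score is 1.0. In the second, same-label points only meet at the root, where each label covers half the leaves, so the score is 0.5.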
serialize_tree
serialize_tree(path)
Serialize the clustering tree to a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The file path to save the serialized tree. | required |
Raises:
Type | Description |
---|---|
NotFittedError | If the model has not been initialized. |
visualize_tree
visualize_tree(path)
Visualize the clustering tree using Graphviz.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The file path to save the visualization. | required |
Raises:
Type | Description |
---|---|
NotFittedError | If the model has not been initialized. |
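For readers unfamiliar with Graphviz, the sketch below shows how a small tree can be turned into DOT source, the text format Graphviz renders. It assumes a hypothetical nested-tuple tree with simple string labels. `tree_to_dot` is an illustration only, and the file that `visualize_tree` writes may look different.

```python
from itertools import count


def tree_to_dot(tree):
    """Return Graphviz DOT source for a nested-tuple tree whose leaves are labels."""
    ids = count()
    lines = ["digraph tree {"]

    def visit(node):
        node_id = f"n{next(ids)}"
        if isinstance(node, tuple):
            # Internal nodes are drawn as unlabeled points.
            lines.append(f'  {node_id} [label="", shape=point];')
            for child in node:
                child_id = visit(child)
                lines.append(f"  {node_id} -> {child_id};")
        else:
            # Leaves show their label; labels are assumed to contain no quotes.
            lines.append(f'  {node_id} [label="{node}", shape=box];')
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


dot_source = tree_to_dot((("a", "b"), "c"))
```

Saving `dot_source` to a file such as `tree.dot` and running `dot -Tpng tree.dot -o tree.png` would render it, provided the Graphviz command-line tools are installed.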