
One-class Classification (Ep 2/2)

Read Ep 1 here -> https://www.nerd-data.com/one-class-classification-ep1/

Isolation Forest

Isolation Forest, abbreviated iForest, is a tree-based anomaly detection algorithm.
It models normal data in a way that isolates anomalies, which are few in number and take values that differ from the normal points in the feature space.
The tree structures are built by repeatedly splitting the data to isolate individual points. Anomalies are isolated after only a few splits, so they sit at shallow depths in the trees, while normal points are harder to separate and end up at greater depths. It can be implemented in scikit-learn as follows.
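Before the step-by-step walkthrough, here is a minimal sketch of the shallow-depth idea. The toy data, the far-away point at (8, 8), and the fixed `random_state` are assumptions made for illustration only. It relies on `score_samples`, which returns lower values for points that are isolated after fewer splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (an assumption for illustration): a normal cluster plus one far-away point
rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
far_point = np.array([[8.0, 8.0]])
data = np.vstack((normal, far_point))

# Fit the forest; random_state is fixed only to make the sketch repeatable
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(data)

# score_samples: the lower the score, the more abnormal the point
# (it was isolated at a shallower average depth across the trees)
scores = forest.score_samples(data)
print('Mean score of the cluster: %.3f' % scores[:-1].mean())
print('Score of the far point:    %.3f' % scores[-1])
```

The far point should receive a noticeably lower score than the cluster average, which is the signal that `predict` later turns into the -1 label.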

Define the model used to detect outliers:

model = IsolationForest(contamination=0.02)

In this case, the model is fit on the majority class only, leaving the outliers out:

trainX = trainX[trainy==0]
model.fit(trainX)

As in the one-class classification example using SVM, the model predicts +1 for inliers and -1 for outliers. The test set labels therefore have to be remapped to the same convention (majority class to +1, outliers to -1) before evaluating the model.
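The remapping can also be written as a single step. The sketch below uses a small hypothetical label array (not the article's dataset) and `numpy.where`, which is one possible alternative to two in-place assignments.

```python
import numpy as np

# Hypothetical ground-truth labels: 0 = majority (normal), 1 = minority (outlier)
labels = np.array([0, 0, 1, 0, 1])

# Map to the detector's convention in one step: outlier -> -1, inlier -> +1
remapped = np.where(labels == 1, -1, 1)
print(remapped)
```

With two in-place assignments the order matters: relabelling 0 to 1 first would make the original 1s indistinguishable from the new ones. The one-step version avoids that and leaves the original array untouched.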

👨🏻‍💻 Full code

# Use isolation forest for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest

# Define the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
 n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
 
# Split train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# Define an outlier detection model
model = IsolationForest(contamination=0.02)

# Fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)

# Detect outliers in the test set
yhat = model.predict(testX)

# Relabel the test set to match the model output: inliers +1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1

# Evaluate the model
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)

This example trains an Isolation Forest model on the training data in an unsupervised manner, then classifies the test data as inliers or outliers (anomalies) and evaluates the model's performance.
Results may vary depending on the randomness of the algorithm, the evaluation procedure, or numerical precision. Try running the example two or three times and compare the average result.
In this case, the F1 score is 0.088.
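One way to follow the advice about repeated runs is to loop over a few seeds and average the scores. This is a minimal sketch: the seed values are arbitrary assumptions, and passing `random_state` to `IsolationForest` is an addition that the example above does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest

# Same synthetic dataset and split as the example above
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# Train on the majority class only; remap test labels to +1 (inlier) / -1 (outlier)
train_majority = trainX[trainy == 0]
test_labels = np.where(testy == 1, -1, 1)

# Repeat the fit with different seeds (arbitrary values chosen for illustration)
scores = []
for seed in [1, 2, 3]:
    model = IsolationForest(contamination=0.02, random_state=seed)
    model.fit(train_majority)
    yhat = model.predict(testX)
    scores.append(f1_score(test_labels, yhat, pos_label=-1))

print('F1 per run:', ['%.3f' % s for s in scores])
print('Mean F1: %.3f (std %.3f)' % (np.mean(scores), np.std(scores)))
```

Reporting the mean together with the spread gives a fairer picture than a single run, especially with so few minority examples in the test set.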

Minimum Covariance Determinant

If the input variables follow a Gaussian distribution, simple statistical methods can be used to detect outliers.

For example, if a dataset has two input variables and both are Gaussian, the feature space forms a multi-dimensional Gaussian, and knowledge of this distribution can be used to identify values that lie far from it (outliers).

This approach works by defining a hypersphere (ellipsoid) that covers the normal data; any data point that falls outside the ellipsoid is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant (MCD).
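The ellipsoid idea can be sketched directly with scikit-learn's `MinCovDet` estimator and a Mahalanobis-distance cutoff. The toy data, the distant point at (6, -6), and the 0.999 chi-squared quantile are assumptions chosen for illustration, not values from the article.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Toy Gaussian data (an assumption for illustration) plus one distant point
rng = np.random.RandomState(0)
normal = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.6], [0.6, 1.0]], size=500)
data = np.vstack((normal, [[6.0, -6.0]]))

# Robust estimate of location and covariance via MCD
mcd = MinCovDet(random_state=0).fit(data)

# Squared Mahalanobis distance of each point from the robust centre
d2 = mcd.mahalanobis(data)

# For Gaussian data, d2 roughly follows a chi-squared distribution with
# n_features degrees of freedom; the 0.999 quantile is an arbitrary cutoff
threshold = chi2.ppf(0.999, df=data.shape[1])
is_outlier = d2 > threshold
print('Points flagged as outliers:', int(is_outlier.sum()))
print('Distant point flagged:', bool(is_outlier[-1]))
```

A constant squared Mahalanobis distance traces an ellipse in two dimensions, so thresholding `d2` amounts to drawing the boundary described above. `EllipticEnvelope`, used in the full code below, wraps the same kind of estimate and derives its cutoff from the `contamination` parameter instead.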

Data with such a well-behaved distribution may be hard to come by, but a transform can be applied to make the variables more Gaussian before using MCD.
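As a rough sketch of that idea, a power transform can be chained in front of the detector. The log-normal toy data, the choice of the Yeo-Johnson method, and the helper `skewness` are assumptions for illustration; the article does not say which transform to use.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.covariance import EllipticEnvelope
from sklearn.pipeline import make_pipeline

def skewness(values):
    # Sample skewness: third standardized moment (0 for a symmetric distribution)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Skewed toy data (an assumption for illustration): log-normal features
rng = np.random.RandomState(0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=(1000, 2))

# Yeo-Johnson power transform; the output is standardized by default
transformer = PowerTransformer(method='yeo-johnson')
transformed = transformer.fit_transform(skewed)
print('Skewness before: %.2f' % skewness(skewed[:, 0]))
print('Skewness after:  %.2f' % skewness(transformed[:, 0]))

# Chain the transform with the MCD-based detector
pipeline = make_pipeline(PowerTransformer(method='yeo-johnson'),
    EllipticEnvelope(contamination=0.01, random_state=0))
pipeline.fit(skewed)
yhat = pipeline.predict(skewed)
print('Flagged as outliers:', int((yhat == -1).sum()))
```

Putting both steps in one pipeline keeps the transform fitted on the training data only, so the same mapping is reused when predicting on new samples.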

👨🏻‍💻 Full code

# Use elliptic envelope for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.covariance import EllipticEnvelope

# Simulate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
 n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
 
# Split train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# Define an outlier detection model
model = EllipticEnvelope(contamination=0.01)

# Fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)

# Detect outliers in the test set
yhat = model.predict(testX)

# Relabel the test set to match the model output: inliers +1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1

# Evaluate the model
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)

The F1 score here is 0.157 (results may vary depending on the randomness of the algorithm).

Local Outlier Factor

A simple way to identify outliers is to look for samples that lie far from the other samples in the feature space.

This works well for feature spaces with low dimensionality (few features), but it becomes less reliable as the number of features grows, a problem known as the curse of dimensionality.
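The effect can be seen in a small experiment. The sketch below uses uniformly random points (an assumption for illustration; `distance_contrast` is a hypothetical helper) and compares the nearest and farthest distance from a query point as the number of features grows.

```python
import numpy as np

rng = np.random.RandomState(0)

def distance_contrast(n_features, n_samples=500):
    # Ratio of the nearest to the farthest distance from one random query
    # point to a set of uniformly random points
    points = rng.uniform(size=(n_samples, n_features))
    query = rng.uniform(size=n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for n_features in [2, 10, 100, 1000]:
    ratio = distance_contrast(n_features)
    print('%4d features: nearest/farthest distance ratio = %.3f' % (n_features, ratio))
```

As the ratio approaches 1, the nearest and farthest points are almost equally far away, so "far from the other samples" carries less and less information.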

Local Outlier Factor (LOF) is a technique that uses the idea of nearest neighbors for outlier detection. Each sample is assigned a score for how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Samples with larger scores are more likely to be outliers.
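To make the scoring concrete, here is a minimal sketch on toy data (a dense cluster plus one isolated point at (10, 10), both assumptions for illustration). scikit-learn exposes the negated score as `negative_outlier_factor_`, so the sign is flipped to get the usual LOF value.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (an assumption for illustration): a dense cluster plus one isolated point
rng = np.random.RandomState(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
data = np.vstack((cluster, [[10.0, 10.0]]))

# n_neighbors=20 is scikit-learn's default, written out for clarity
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(data)

# Flip the sign of the stored scores: values near 1 are typical,
# clearly larger values mean the sample is more isolated
lof_scores = -lof.negative_outlier_factor_
print('Median LOF of the cluster: %.2f' % np.median(lof_scores[:-1]))
print('LOF of the isolated point: %.2f' % lof_scores[-1])
print('Label of the isolated point:', labels[-1])
```

The isolated point should get the largest score in the set and the label -1, while points inside the cluster stay close to 1.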

👨🏻‍💻 Full code

# LOF for imbalanced classification
from numpy import vstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# Make a prediction with a LOF model
def lof_predict(model, trainX, testX):
	# create one large dataset
	composite = vstack((trainX, testX))
	# make prediction on composite dataset
	yhat = model.fit_predict(composite)
	# return just the predictions on the test set
	return yhat[len(trainX):]

# Simulate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)

# Split train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# Define an outlier detection model
model = LocalOutlierFactor(contamination=0.01)

# Get examples for just the majority class
trainX = trainX[trainy==0]

# Detect outliers in the test set
yhat = lof_predict(model, trainX, testX)

# Relabel the test set to match the model output: inliers +1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1

# Evaluate the model
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)

The F1 score here is 0.138 (results may vary depending on the evaluation procedure or numerical precision).
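The `lof_predict` helper above refits the model on the training and test data stacked together. A possible alternative, sketched here as an assumption rather than as the article's method, is scikit-learn's `novelty=True` mode, which fits on the training data only and then exposes `predict` for unseen samples. Its score will not necessarily match the one reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# Same synthetic dataset and split as the example above
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# Majority-class training data; test labels remapped to +1 (inlier) / -1 (outlier)
train_majority = trainX[trainy == 0]
test_labels = np.where(testy == 1, -1, 1)

# novelty=True enables predict() on new data after fitting on the training set
model = LocalOutlierFactor(novelty=True, contamination=0.01)
model.fit(train_majority)
yhat = model.predict(testX)

# Evaluate with the outlier class (-1) as the positive label
score = f1_score(test_labels, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```

In this mode the test samples never influence the neighborhood structure, which is closer to how the Isolation Forest and elliptic envelope examples are evaluated.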

******

Reference - https://machinelearningmastery.com/one-class-classification-algorithms/
