KNN (K-Nearest Neighbor) Algorithm
1. Definition
- Measure the distance between a new data point and the existing data points, then classify the new point according to the classes of its nearest neighbors
- K is the number of neighbors considered; since the result of KNN changes with K, choosing K is the most important step
1) Advantages
- Makes no assumption about the underlying data distribution
- Simple and intuitive to understand
- A good classifier when many samples are available
2) Disadvantages
- Choosing the optimal K is difficult
- Prediction can be slow when the dataset is large
- Because no particular distribution is assumed, many samples are needed for good accuracy
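The idea above can be sketched from scratch. The helper `knn_predict` and the toy points below are illustrative (not part of the original post or of scikit-learn); the function classifies a query point by majority vote among its K nearest neighbors:

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Euclidean distance from the query to every training point
    dists = [sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5
             for p in train_points]
    # Indices of the k nearest training points
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Majority vote among their labels decides the class
    votes = [train_labels[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

points = [(0, 0), (1, 1), (9, 9), (10, 10)]
labels = [0, 0, 1, 1]
print(knn_predict(points, labels, (2, 2), k=3))  # -> 0 (two of its three neighbors are class 0)
```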

2. Distance Metrics for KNN
1) Euclidean distance: d(x, y) = sqrt(Σ(x_i - y_i)^2)
2) Manhattan distance: d(x, y) = Σ|x_i - y_i|
-> Euclidean is used more often in practice. Try both and keep the one that gives the better predictions.
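A quick comparison of the two metrics on a pair of 2-D points (a small illustrative sketch; the points are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-axis distances

print(euclidean)  # 5.0 (a 3-4-5 right triangle)
print(manhattan)  # 7.0
```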

3. Normalization
When the features being compared have very different scales, normalize them so that every feature contributes evenly to the distance.
Use MinMaxScaler
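A minimal sketch of this normalization step (the toy matrix is illustrative): MinMaxScaler rescales each feature column to the [0, 1] range, so the large-scale second column no longer dominates the distance.

```python
from sklearn.preprocessing import MinMaxScaler

X = [[1.0, 100.0],
     [2.0, 300.0],
     [3.0, 500.0]]

scaler = MinMaxScaler()               # maps each column's min to 0 and max to 1
X_scaled = scaler.fit_transform(X)
print(X_scaled)                       # both columns now span [0, 1]
```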
4. Using the KNN Algorithm
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)  # n_neighbors = K
# metric='minkowski' with p=1 (Manhattan) or p=2 (Euclidean)
classifier.fit(X, y)  # X: array of sample points (one row per sample) / y: labels, e.g. 0 or 1
guesses = classifier.predict(unknown_points)  # unknown_points: the samples to classify
classifier.score(X_test, y_test)  # accuracy on held-out data
5. KNN Algorithm Examples
1)
from sklearn.neighbors import KNeighborsClassifier
X = [[0],[1],[2],[3]]
y = [0,0,1,1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
2)
classifier = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="euclidean")
training_points = [
[0.5,0.2,0.1],
[0.9,0.7,0.3],
[0.4,0.5,0.7]
]
training_labels = [0,1,1]
classifier.fit(training_points, training_labels)
unknown_points = [
[0.2,0.1,0.7],
[0.4,0.7,0.6],
[0.5,0.8,0.1]
]
guesses = classifier.predict(unknown_points)
print(guesses)
# The unknown points have no true labels, so accuracy cannot be computed on them.
# To check accuracy, score the model on data whose labels are known:
print("Training accuracy:", classifier.score(training_points, training_labels))
3) The iris dataset
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
iris = datasets.load_iris()
print(iris.feature_names)
X = iris.data[:, :4]
y = iris.target
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X = pd.DataFrame(X_scaled, columns = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])
X.head()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))
# K-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')  # cv: number of folds
print(scores)
print(scores.mean())
-> K-fold cross-validation: evaluates model performance by splitting the data into k folds and validating on each fold in turn
4) The breast_cancer dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd
breast_cancer = load_breast_cancer()
X_Data = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target  # keep as a 1-D array so fit() does not warn about column vectors
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_Data)
X_scaled = scaler.transform(X_Data)
X = pd.DataFrame(X_scaled)
X.columns = breast_cancer.feature_names
X.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
import matplotlib.pyplot as plt
k_list = range(1,10)
accuracies = []
for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    accuracies.append(classifier.score(X_test, y_test))
plt.plot(k_list, accuracies)
plt.xlabel("K")
plt.ylabel("Validation Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
# cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
print(scores)
print(scores.mean())
from sklearn import model_selection
import matplotlib.pyplot as plt
k_range = range(1,30)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = model_selection.cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())
plt.plot(k_range, k_scores, marker='o', color='green', linestyle='dashed', markersize=5)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()

-> From the plot, choose the K with the highest cross-validated accuracy, just before accuracy starts to drop (here K = 16)
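Rather than reading the best K off the plot by eye, it can also be picked programmatically with argmax. A small sketch (the score values below are hypothetical, not the iris/breast_cancer results above):

```python
import numpy as np

# Hypothetical mean cross-validation accuracies for k = 1..5
k_range = range(1, 6)
k_scores = [0.90, 0.93, 0.95, 0.94, 0.92]

# Index of the highest score maps back to the corresponding K
best_k = list(k_range)[int(np.argmax(k_scores))]
print(best_k)  # -> 3
```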
6. KNN Regression
- from sklearn import neighbors
- Choose the K with the smallest RMSE
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn import neighbors
from math import sqrt
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt
boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
k_range = range(1,20)
rmse_val = []
for K in k_range:
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    error = sqrt(mean_squared_error(y_test, pred))
    rmse_val.append(error)
    print('RMSE value for k =', K, 'is:', error)
plt.plot(k_range, rmse_val, marker='o', color='green', linestyle='dashed', markersize=5)
plt.xlabel('Value of K for KNN')
plt.ylabel('RMSE')
plt.show()