
KNN (K-Nearest Neighbor) Algorithm

 

1. Definition

 

- Measure the distance between a new data point and the existing data points, look at the classes of the nearest ones, and classify the new point by majority vote (see the sketch below)

- K is the number of neighbors considered; the result of KNN changes with K, so choosing K is the most important step
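To make the definition concrete, here is a minimal from-scratch sketch of the idea. The helper knn_predict and the toy points are made up for illustration; the real examples below use scikit-learn.

import numpy as np
from collections import Counter

# hand-rolled KNN: distances -> k nearest -> majority vote
def knn_predict(train_X, train_y, new_point, k=3):
    dists = np.sqrt(((np.asarray(train_X) - np.asarray(new_point)) ** 2).sum(axis=1))
    nearest_labels = [train_y[i] for i in np.argsort(dists)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(knn_predict([[0, 0], [1, 1], [5, 5], [6, 6]], [0, 0, 1, 1], [1, 2]))   # -> 0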

 

1) Advantages

- Works regardless of how the data is distributed

- Simple and intuitive to understand

- A good classifier when there are many samples

 

2) Disadvantages

- Hard to choose the optimal K

- Prediction can be slow on large datasets, since distances to every stored point must be computed

- Because no particular distribution is assumed, many samples are needed for good accuracy

 

 

 

2. Distance Formulas for KNN

 

1) Euclidean distance: d(p, q) = sqrt(Σᵢ (pᵢ - qᵢ)²)

2) Manhattan distance: d(p, q) = Σᵢ |pᵢ - qᵢ|

-> Euclidean distance is the more common choice in practice. Try both and keep the one that predicts better.
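Both are easy to check by hand; a quick numpy sketch on two made-up points:

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

print(np.sqrt(np.sum((p - q) ** 2)))   # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(np.sum(np.abs(p - q)))           # Manhattan: 3 + 4 = 7.0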

 

3. Normalization

 

When the features being compared have very different scales, a normalization step is needed so that every feature contributes evenly to the distance.

Use MinMaxScaler.
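A minimal sketch of what MinMaxScaler does; the toy array is made up, and each column is rescaled to the [0, 1] range independently:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])   # two features on very different scales

scaler = MinMaxScaler()
print(scaler.fit_transform(data))   # each column mapped to [0, 1]
# [[0.   0. ]
#  [0.5  0.5]
#  [1.   1. ]]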

 

 

4. Using the KNN Algorithm

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 3)	# n_neighbors = the value of K
# metric='minkowski' with p=1 (Manhattan) or p=2 (Euclidean)

classifier.fit(X, y)	# X: an array of points, one row per sample / y: labels, e.g. 0 or 1
guesses = classifier.predict(new_points)	# new_points: the samples to classify
classifier.score(X_test, y_test)	# accuracy on a labeled test set
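For example, to try Manhattan distance instead of the default Euclidean (a small sketch on made-up data):

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# with metric='minkowski', p=1 is Manhattan distance and p=2 is Euclidean
clf = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=1)
clf.fit(X, y)
print(clf.predict([[1.5, 1.5]]))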

 

5. KNN Examples

 

1)

from sklearn.neighbors import KNeighborsClassifier

X = [[0],[1],[2],[3]]
y = [0,0,1,1]

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

print(neigh.predict([[1.1]]))	# -> [0] : the 3 nearest points to 1.1 are 1, 0, 2 with labels 0, 0, 1
print(neigh.predict_proba([[0.9]]))	# -> [[0.66666667 0.33333333]] : 2 of the 3 neighbors have label 0

 

2)

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="euclidean")

training_points = [
    [0.5,0.2,0.1],
    [0.9,0.7,0.3],
    [0.4,0.5,0.7]
]
training_labels = [0,1,1]
classifier.fit(training_points, training_labels)
unknown_points = [
    [0.2,0.1,0.7],
    [0.4,0.7,0.6],
    [0.5,0.8,0.1]
]

guesses = classifier.predict(unknown_points)
print(guesses)

# accuracy can only be measured against known labels, and the unknown points have none,
# so score the classifier on the training data instead
from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(training_labels, classifier.predict(training_points)))

-> weights="distance" makes closer neighbors count more in the vote; the default, weights="uniform", gives every neighbor an equal vote.

 

3) The iris dataset

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

iris = datasets.load_iris()

print(iris.feature_names)

X = iris.data[:, :4]	# all four features
y = iris.target

# rescale each feature to [0, 1] before computing distances
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

X = pd.DataFrame(X_scaled, columns = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])

X.head()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)

print(clf.score(x_test, y_test))

# K-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')	# cv: number of folds
print(scores)
print(scores.mean())

-> K-fold cross-validation: split the data into k folds, train on k-1 of them and validate on the remaining one in turn, then average the scores to evaluate model performance.

 

 

4) The breast_cancer dataset

from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()

X_Data = pd.DataFrame(breast_cancer.data)
y = breast_cancer.target	# keep y as a 1-d array of labels
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_Data)
X_scaled = scaler.transform(X_Data)

X = pd.DataFrame(X_scaled)

X.columns = breast_cancer.feature_names
X.head()
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=3)

classifier.fit(X_train, y_train)
import matplotlib.pyplot as plt

k_list = range(1,10)
accuracies = []

for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    accuracies.append(classifier.score(X_test, y_test))

plt.plot(k_list, accuracies)
plt.xlabel("K")
plt.ylabel("Validation Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
# cross validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
print(scores)
print(scores.mean())
from sklearn import model_selection
import matplotlib.pyplot as plt

k_range = range(1,30)
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = model_selection.cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())

plt.plot(k_range, k_scores, marker='o', color='green', linestyle='dashed', markersize=5)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()

 

-> Read the optimal K off the graph: pick the K with the best cross-validated accuracy, just before the curve starts to drop (K = 16 here); see the sketch below for picking it programmatically.
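A one-liner for picking K by code rather than by eye, reusing k_range and k_scores from the loop above:

import numpy as np

best_k = k_range[int(np.argmax(k_scores))]	# K with the highest mean CV accuracy
print(best_k, max(k_scores))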

 

 

6. KNN Regression

 

- from sklearn import neighbors (the regressor is neighbors.KNeighborsRegressor)

- Choose the K with the smallest RMSE (root mean squared error: the square root of the mean squared prediction error)

from sklearn.datasets import load_boston	# note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt

boston = load_boston()

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

k_range = range(1,20)
rmse_val = []

for K in k_range:
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    error = sqrt(mean_squared_error(y_test, pred))	# RMSE for this K
    rmse_val.append(error)
    print('RMSE value for k = ', K, 'is: ', error)

plt.plot(k_range, rmse_val, marker='o', color='green', linestyle='dashed', markersize=5)
plt.xlabel('Value of K for KNN')
plt.ylabel('RMSE')
plt.show()
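As with classification, the best K can also be read off programmatically (a one-liner reusing k_range and rmse_val from above):

import numpy as np

best_k = k_range[int(np.argmin(rmse_val))]	# K with the lowest RMSE
print(best_k, min(rmse_val))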
