Random Forest

BASEMENT

2_34 2020. 10. 10. 19:23

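1. Concept

- A method for overcoming the overfitting tendency of a single Decision Tree

- An ensemble technique that improves learning performance by applying many decision trees to the data at once

- From the same data, 30 or more datasets are created by sampling with replacement (bootstrap sampling); a decision tree is fitted to each one, and the results are then aggregated

- Used for classification, clustering, prediction, and checking feature importance

- Bagging : combining base classifiers that were each trained on a slightly different training set produced by the bootstrap

- The bias (error) of the trees stays about the same while the variance is reduced, so the forest performs better than its individual trees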
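The bootstrap-and-aggregate procedure described above can be sketched by hand with plain decision trees. This is only an illustration under assumptions that are not in the post: the `make_moons` data, the choice of `n_trees = 30` (echoing the "30 or more" above), the seeds, and a hard majority vote. scikit-learn's own `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration, not the post's data)
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 30  # hypothetical choice
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" adds the per-split feature randomization
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregate: majority vote (labels are 0/1, so threshold the mean vote at 0.5)
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
bagged_acc = np.mean(y_pred == y_test)
print("one tree:", single_acc, "bagged:", bagged_acc)
```

Comparing `single_acc` with `bagged_acc` over several seeds gives a rough feel for the variance reduction; a single run is not conclusive.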

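1) Advantages

- Maximizes diversity among the trees, so predictive power tends to be good

- Decisions are made by combining the predictions of many trees, so the results are also stable

- The randomization makes the forest robust to data that contains noise

2) Disadvantages

- Because the decision comes from many trees, the model loses interpretability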

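2. Variable importance

- Unlike linear or logistic regression models, it does not tell you how statistically significant each individual variable is (because of the random sampling)

- Out-Of-Bag (OOB) : the observations that were not drawn during bootstrap sampling. They are used to estimate the misclassification rate expected on test data and to estimate variable importance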
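As a small sketch of the OOB idea, scikit-learn can score each training sample using only the trees that did not see it when `oob_score=True` is set. The dataset, split, and seeds below are assumptions chosen for illustration. Note that `feature_importances_` in scikit-learn is impurity-based and is not computed from the OOB samples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# oob_score=True evaluates every training row with the trees whose
# bootstrap sample did not contain that row
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print("OOB accuracy:  {:.3f}".format(forest.oob_score_))
print("test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
```

The OOB accuracy is obtained without touching the test set, so it can serve as a rough stand-in for a validation score.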

3. Using random forest

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)

forest.score(X_train, y_train)              # accuracy on the training set
y_pred = forest.predict(X_test)
metrics.accuracy_score(y_test, y_pred)      # accuracy on the test set

- n_estimators : sets the number of trees

4. Random forest classification example

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

x_data = np.array([
    [2,1],
    [3,2],
    [3,4],
    [5,6],
    [7,5],
    [2,1],
    [8,9],
    [9,10],
    [6,12],
    [7,2],
    [6,10],
    [3,4]
])
y_data = np.array([0,0,1,1,1,0,1,1,1,1,1,0])

Label = ['Y', 'N']

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=4)

model = RandomForestClassifier()
model.fit(x_train, y_train)

print(model.score(x_train, y_train))	# rule of thumb used in this post: a score of roughly 80-90% is adequate for modeling
print(model.score(x_test, y_test))

x_test = np.array([
    [2,2]
])
y_predict = model.predict(x_test)
print(Label[y_predict[0]])

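# Passing [2,2] shows which Label the new point is assigned to

- score : the model's performance metric (mean accuracy for a classifier) -> use cross-validation for a more reliable estimate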
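A minimal sketch of the cross-validation suggested above, reusing the same twelve points. The values `cv=3`, `n_estimators=100`, and `random_state=0` are assumptions; with only four samples in class 0, the stratified split cannot use more than four folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

x_data = np.array([[2, 1], [3, 2], [3, 4], [5, 6], [7, 5], [2, 1],
                   [8, 9], [9, 10], [6, 12], [7, 2], [6, 10], [3, 4]])
y_data = np.array([0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0])

model = RandomForestClassifier(n_estimators=100, random_state=0)

# For a classifier, an integer cv uses stratified folds;
# each fold is held out once and scored by accuracy
scores = cross_val_score(model, x_data, y_data, cv=3)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With so few samples the fold scores vary a lot, which is why a single train/test split on this data says little about the model.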

2) make_moons data (non-linear data)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

plt.scatter(X[:,0], X[:,1], marker='o', c=y, s=100, edgecolor="k", linewidth=2)
plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)

import matplotlib.pyplot as plt
import numpy as np
import mglearn

fig, axes = plt.subplots(2, 3, figsize=(20,10))

for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
    ax.set_title("tree {}".format(i))
    mglearn.plots.plot_tree_partition(X, y, tree, ax=ax)

mglearn.plots.plot_2d_separator(forest, X, fill=True, ax=axes[-1,-1], alpha=.4)
axes[-1,-1].set_title("random")
mglearn.discrete_scatter(X[:,0], X[:,1], y)

3) breast_cancer data

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("train acc: {:.3f}".format(forest.score(X_train, y_train)))
print("test acc: {:.3f}".format(forest.score(X_test, y_test)))

# Plot the feature importances

n_feature = cancer.data.shape[1]
index = np.arange(n_feature)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
forest.fit(X_train, y_train)

plt.barh(index, forest.feature_importances_, align='center')
plt.yticks(index, cancer.feature_names)
plt.ylim(-1, n_feature)
plt.xlabel('feature importance', size=15)
plt.ylabel('feature', size=15)
plt.show()

cf) n_jobs : the number of CPU cores to use. n_jobs=-1 uses all cores on the machine

4) iris data

from sklearn import datasets

iris = datasets.load_iris()

print(iris.target_names)
print(iris.feature_names)

data = pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})
data.head()

from sklearn.ensemble import RandomForestClassifier

X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)

clf = RandomForestClassifier(n_estimators=100, random_state=4)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)


from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
feature_imp

import seaborn as sns
%matplotlib inline

sns.barplot(x=feature_imp, y=feature_imp.index)

plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Visualizing Important Features')
plt.show()

# RandomForest - GridSearch

from sklearn.model_selection import GridSearchCV

params = {'n_estimators':[10,100], 'max_depth':[6,8,10,12], 'min_samples_leaf':[8,12,18], 'min_samples_split':[8,16,20]}

# Create the RandomForestClassifier object, then run GridSearchCV
rf_clf = RandomForestClassifier(random_state = 4, n_jobs=-1)

grid_cv = GridSearchCV(rf_clf, param_grid=params, cv=5, n_jobs=-1)
grid_cv.fit(X_train, y_train)

print('Best hyperparameters:', grid_cv.best_params_)
print('Best cross-validated accuracy: {:.4f}'.format(grid_cv.best_score_))
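`best_score_` is the mean cross-validated score on the training folds, so it is worth checking the refitted model on the held-out test set as well. The sketch below assumes the same iris split as above but uses a deliberately reduced parameter grid (an assumption, to keep the run short).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=6)

# Reduced grid for illustration only
params = {'n_estimators': [10, 100], 'max_depth': [6, 8]}

grid_cv = GridSearchCV(RandomForestClassifier(random_state=4),
                       param_grid=params, cv=5)
grid_cv.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train
best_model = grid_cv.best_estimator_
test_acc = best_model.score(X_test, y_test)
print('best params:', grid_cv.best_params_)
print('test accuracy: {:.4f}'.format(test_acc))
```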

5. Random forest regressor (regression) example

1) boston data

import numpy as np
from sklearn.ensemble import RandomForestRegressor  # regression forest (model)
from sklearn.model_selection import train_test_split
# load_boston was removed in scikit-learn 1.2; this example needs an older version
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

X, y = load_boston(return_X_y=True)     # return_X_y=True returns X and y directly

# Equivalent: load the full dataset object to also get the feature names
boston = load_boston()
X = boston.data
y = boston.target

colnames = boston.feature_names

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = RandomForestRegressor()
model.fit(X = x_train, y = y_train)

y_pred = model.predict(x_test)

# MSE must come from mean_squared_error (not mean_absolute_error) for the RMSE to be valid
mse = mean_squared_error(y_test, y_pred)
print('mse: ', mse)
rmse = np.sqrt(mse)
print('rmse: ', rmse)

import matplotlib.pyplot as plt

imp = model.feature_importances_
plt.barh(range(13), imp)        # feature importances
plt.yticks(range(13), colnames) # tick labels (feature names)
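Because `load_boston` is not available in recent scikit-learn releases, here is a sketch of the same regression workflow on the bundled `load_diabetes` data. This is a swapped-in dataset, not the one used in this post, and the seeds are also assumptions. It computes MAE, MSE, and RMSE side by side so the three metrics are not confused.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

mae = mean_absolute_error(y_test, y_pred)   # mean of |error|
mse = mean_squared_error(y_test, y_pred)    # mean of error squared
rmse = np.sqrt(mse)                         # same units as the target

print('mae: ', mae)
print('mse: ', mse)
print('rmse:', rmse)
```

RMSE is always at least as large as MAE on the same predictions, so taking the square root of an MAE value would understate the error.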

2) wine data

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

wine_data = pd.read_csv('winequality-white.csv', delimiter=';', dtype=float)
wine_data.head(10)

x_data = wine_data.iloc[:,0:-1]
y_data = wine_data.iloc[:,-1]

from sklearn.metrics import mean_squared_error

xTrain, xTest, yTrain, yTest = train_test_split(x_data, y_data, test_size=0.3, random_state=531)

# Check how the MSE changes with the number of trees
mseOos = []
nTreeList = range(50,500,10)
for iTrees in nTreeList:
    depth = None
    maxFeat = 4
    wineRFModel = RandomForestRegressor(n_estimators=iTrees, 
                                        max_depth=depth, 
                                        max_features=maxFeat, 
                                        oob_score=False, 
                                        random_state=531)
    wineRFModel.fit(xTrain, yTrain)
    # Accumulate the MSE on the test set
    prediction = wineRFModel.predict(xTest)
    mseOos.append(mean_squared_error(yTest, prediction))

plt.plot(nTreeList, mseOos)
plt.xlabel('Number of Trees in Ensemble')
plt.ylabel('Mean Squared Error')
plt.show()

# Check the feature importances

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

featureImportance = wineRFModel.feature_importances_
sns.barplot(x=featureImportance, y=x_data.columns)

plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Visualizing Important Features')
plt.show()