[ML/python] Handling Categorical and Ordinal Variables (Dummy Encoding, Replacement)

fiftyline 2025. 2. 9. 17:26

This post covers how to handle categorical and ordinal variables during machine learning preprocessing.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("mldata.csv")
X = df.drop('y', axis=1)  # features
y = df['y']               # label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50)

Dummy Encoding

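from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a NumPy array, True returns a sparse matrix
# (the parameter was named `sparse` before scikit-learn 1.2)
one = OneHotEncoder(sparse_output=False)
one.fit(X_train[['grade', 'class', 'gender']])  # the encoder must be fitted before asking for names
one.get_feature_names_out()  # dummy column names, built from the original column names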

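from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Scale the numeric columns and one-hot encode the categorical columns in one step
ct = make_column_transformer(
    (StandardScaler(), ['score', 'weight']),
    (OneHotEncoder(sparse_output=False), ['grade', 'class', 'gender'])
)
ct.fit(X_train)  # learn scaling statistics and categories from the training data only
X_train_trans = ct.transform(X_train)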

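# Using the feature_engine library
from feature_engine.encoding import OneHotEncoder as OHE
# top_categories=5: create dummies only for the 5 most frequent categories of each variable
# (per the feature_engine docs, drop_last is only used when top_categories is None)
dummy_model = OHE(top_categories=5, drop_last=True).fit(X_train)
Z_train = dummy_model.transform(X_train)
Z_test = dummy_model.transform(X_test)
# Input and output are both DataFrames, the non-encoded columns are kept, and the number of
# encoded values can be capped: only the top N categories get a dummy column, and every
# other category is left as all zeros rather than a separate "Other" column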
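
The idea of encoding only the most frequent categories can be sketched with pandas alone. This is not feature_engine's implementation, just an illustration of the concept; the helper `top_n_dummies` and the `city` column are hypothetical.

```python
import pandas as pd

def top_n_dummies(train_col, other_col, n):
    """One-hot encode only the n most frequent categories seen in train_col."""
    top = train_col.value_counts().index[:n]  # categories sorted by frequency
    return pd.DataFrame(
        {f"{train_col.name}_{cat}": (other_col == cat).astype(int) for cat in top},
        index=other_col.index,
    )

# Hypothetical toy data: "a" appears 3 times, "b" twice, "c" once.
train_city = pd.Series(["a", "a", "a", "b", "b", "c"], name="city")
test_city = pd.Series(["a", "c", "d"], name="city")

train_dummies = top_n_dummies(train_city, train_city, n=2)
test_dummies = top_n_dummies(train_city, test_city, n=2)
print(train_dummies)
print(test_dummies)
```

The top categories are chosen from the training column only, so rare or unseen values in the test data ("c" and "d" here) end up as rows of zeros.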

Replacement Using the Label (y)

# How the method works
S = df.groupby('x1')['y'].mean()  # mean of y for each value of the categorical variable x1
df['x1'].replace(S.to_dict())  # convert to a dict, then replace each x1 value with a continuous value (the mean of y)
# -------------------------------
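
A small worked example may help here. The toy data below is made up for illustration, and the sketch uses `Series.map` in place of `replace`; for categories present in the mapping the result is the same.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary label.
toy = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "c"],
    "y": [1, 0, 1, 1, 0],
})

# Mean of y per category: a -> 0.5, b -> 1.0, c -> 0.0
S = toy.groupby("x1")["y"].mean()

# Each category is swapped for its mean label value
encoded = toy["x1"].map(S)
print(encoded.tolist())  # [0.5, 0.5, 1.0, 1.0, 0.0]
```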

# Join X and y so the means can be computed
train = pd.concat([X_train, y_train], axis=1)

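# Replace only the categorical (object-dtype) variables
X_train, X_test = X_train.copy(), X_test.copy()  # work on copies to avoid SettingWithCopyWarning
for col, dtype in zip(X_train.columns, X_train.dtypes):
    if dtype == object:
        # mean of y per category, computed from the training data only
        S = train.groupby(col)['y'].mean().to_dict()
        X_train[col] = X_train[col].map(S)
        # categories that never appeared in the training data become NaN
        X_test[col] = X_test[col].map(S)

# display() is available in Jupyter/IPython; use print() in a plain script
display(X_train['x1'].head())
display(X_test['x1'].head())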
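
A category that appears only in the test data has no training mean to be replaced with. One common way to handle this, shown here as an assumption rather than something the post prescribes, is to fall back to the overall mean of y in the training data. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
train_toy = pd.DataFrame({"x1": ["a", "a", "b", "b"], "y": [1, 0, 1, 1]})
test_toy = pd.DataFrame({"x1": ["a", "b", "z"]})  # "z" never appears in training

means = train_toy.groupby("x1")["y"].mean()  # a -> 0.5, b -> 1.0
global_mean = train_toy["y"].mean()          # 0.75

# map() yields NaN for unseen categories; fillna() substitutes the global mean
test_encoded = test_toy["x1"].map(means).fillna(global_mean)
print(test_encoded.tolist())  # [0.5, 1.0, 0.75]
```

Both the per-category means and the fallback value come from the training data, so nothing about the test labels leaks into the features.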

Reference: GIL's LAB, 「파이썬을 활용한 머신러닝 자동화 시스템 구축」 (Building an Automated Machine Learning System with Python), 위키북스 (2022)
