Python (Colab): Titanic Data Set (LabelEncoder)

2023. 6. 12. 16:24ㆍPython / Machine Learning and Deep Learning

1. Basic setup

import numpy as np

import pandas as pd

df = pd.read_csv('https://bit.ly/fc-ml-titanic')

df.head()

2. Inspecting the data

df.info()

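3. Dataset column descriptions

* PassengerId : passenger ID
* Survived : whether the passenger survived (0 = no, 1 = yes)
* Pclass : ticket class (1, 2, or 3)
* Name : name
* Sex : sex
* Age : age
* SibSp : number of siblings and spouses aboard
* Parch : number of parents and children aboard
* Ticket : ticket number
* Fare : fare paid
* Cabin : cabin number
* Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)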

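4. Choosing the columns for the training and validation data

Independent variable columns : sex, fare, age, ticket class

Dependent variable column : survival

# Independent variables (features)

feature = ['Sex','Fare','Age','Pclass']

# Dependent variable (label)

label = ['Survived']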

df[feature].head()

df[label].head()

5. Checking and handling missing values

df.isnull().sum()

# Age, Cabin, and Embarked contain NaN values

# Fill the missing Age values with the mean age

df['Age'] = df['Age'].fillna(df['Age'].mean())
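To make the mean-fill step concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Age' column (made-up values, not real Titanic data)
toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0, np.nan]})

# Count the missing values per column, as df.isnull().sum() does
missing_before = toy.isnull().sum()

# mean() skips NaN by default, so the mean here is (20 + 40) / 2 = 30
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

print(missing_before['Age'])   # 2
print(toy['Age'].tolist())     # [20.0, 30.0, 40.0, 30.0]
```

Filling with the mean keeps every row but pulls the filled ages toward the centre of the distribution; the median is a common alternative when the column is skewed.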

6. Label encoding

Converting string values into numbers

1) Using apply(function)

def convert_sex(data):
    # Encode 'male' as 1 and every other value as 0
    if data == 'male':
        return 1
    else:
        return 0

df['Sex'] = df['Sex'].apply(convert_sex)
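The same conversion can also be written with a dictionary and `Series.map`. This is an alternative sketch, not the approach used in this post; the `sex` Series and `sex_map` dictionary are hypothetical names introduced for the example.

```python
import pandas as pd

# Toy stand-in for the 'Sex' column (made-up values)
sex = pd.Series(['male', 'female', 'female', 'male'])

# Explicit mapping from each string to its numeric code
sex_map = {'male': 1, 'female': 0}

# map() looks each value up in the dictionary
encoded = sex.map(sex_map)

print(encoded.tolist())   # [1, 0, 0, 1]
```

One difference to keep in mind: a value missing from the dictionary becomes NaN with `map`, whereas the `else` branch of the function above turns every non-'male' value into 0.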

2) Using the LabelEncoder module

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit_transform(df['Embarked'])

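7. Checking what the LabelEncoder numbers mean

The classes_ attribute lists the original values; the position of each value in the array is its encoded number.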

le.classes_
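As a small illustration of how `classes_` relates to the encoded numbers, here is a sketch on a toy list of port codes. The `ports` list and the name `le_demo` are assumptions made for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy port codes (made-up sample, no missing values)
ports = ['S', 'C', 'Q', 'S', 'C']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(ports)

# classes_ is sorted, and each index is the code assigned to that value
print(list(le_demo.classes_))   # ['C', 'Q', 'S']
print(codes.tolist())           # [2, 0, 1, 2, 0]

# inverse_transform turns the codes back into the original strings
restored = le_demo.inverse_transform(codes)
print(list(restored))           # ['S', 'C', 'Q', 'S', 'C']
```

The toy list has no missing values. The real Embarked column does contain NaN, so its `classes_` may show an extra entry for it, depending on the scikit-learn version.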

8. One-hot encoding

✔ One-hot encoding

> Each distinct category gets its own column; a row holds 1 in the column that matches its value and 0 in all the others.

Label encoding into a new derived variable

df['Embarked_num'] = LabelEncoder().fit_transform(df['Embarked'])

df.head()

One-hot encoding the port of embarkation

pd.get_dummies(df['Embarked'])
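The sketch below shows what the one-hot result looks like on a toy column. The `embarked` Series is made up; `dtype=int` and `prefix='Embarked'` are optional arguments added here so that the output is 0/1 integers with descriptive column names.

```python
import pandas as pd

# Toy stand-in for the 'Embarked' column (made-up values)
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(embarked, prefix='Embarked', dtype=int)

print(onehot.columns.tolist())   # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(onehot.values.tolist())    # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Unlike label encoding, this does not imply any order between the ports (C < Q < S), which is usually the reason to prefer it for nominal categories. With the default settings, a row whose value is NaN gets 0 in every column.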

9. Splitting the data into training and validation sets before training

from sklearn.model_selection import train_test_split

# Hold out 20% for validation; fixing random_state gives the same split on every run

X_train, X_test, y_train, y_test = train_test_split(
    df[feature],
    df[label],
    test_size=0.2,
    random_state=10,
)
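To check what the split produces, here is a minimal sketch on a ten-row toy table. The `toy` DataFrame and its values are hypothetical, chosen only so that the sizes are easy to verify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the Titanic features and label
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})
X = toy[['Fare', 'Pclass']]
y = toy['Survived']

# test_size=0.2 keeps 2 of the 10 rows for validation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=10)

print(X_tr.shape, X_te.shape)           # (8, 2) (2, 2)
print(X_te.index.equals(X_te2.index))   # True
```

Here `y` is passed as a one-dimensional Series (`toy['Survived']`) rather than a one-column DataFrame. Both work with `train_test_split`, but some scikit-learn estimators warn when the label is given as a column vector.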
