Your Second Machine Learning Project: House Price Prediction
The dataset can be downloaded from:
First, import the packages the program needs:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
Preparing the data
Load the data and print the shapes of the training and test sets.
x = pd.read_csv('./data/train.csv', index_col='Id')
x_test = pd.read_csv('./data/test.csv', index_col='Id')
print('Train data size:{}'.format(x.shape))
print('Test data size:{}'.format(x_test.shape))
Train data size:(1460, 80)
Test data size:(1459, 79)
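Drop the rows (samples) whose target value is missing, then assign the target to the variable y and remove it from the training data so that the training and test sets have the same number of features.
Then split the training data into a training set and a validation set, 80% and 20% respectively.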
x.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = x['SalePrice']
x.drop(['SalePrice'], axis=1, inplace=True)
x_train, x_val, y_train, y_val = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=0)
print('Train data size:{}'.format(x_train.shape))
print('Validation data size:{}'.format(x_val.shape))
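To make the 80/20 split concrete, here is a minimal sketch on placeholder data with the same shape as the training set after `SalePrice` is removed (1460 rows, 79 features). The names `toy_x`, `toy_y` and the zero-filled values are assumptions made only for illustration; the real data comes from `train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real features and target (values are placeholders)
toy_x = pd.DataFrame(np.zeros((1460, 79)))
toy_y = pd.Series(np.zeros(1460))

# Same arguments as in the post: 80% training rows, 20% validation rows
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_x, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

print('Train data size:{}'.format(toy_x_train.shape))
print('Validation data size:{}'.format(toy_x_val.shape))
```

Because `random_state` is fixed, the same rows land in each part on every run, which keeps later comparisons between preprocessing strategies fair.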
Data preprocessing
Count the columns that contain missing values
miss_count = x_train.isnull().sum()
cols_with_missing = list(miss_count[miss_count > 0].index)
print('Number of columns with missing values:{}'.format(len(cols_with_missing)))
print('Missing features:', cols_with_missing)
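Besides listing the affected columns, it can help to see how much of each column is missing before choosing a strategy. The following is a small sketch on a made-up frame; the column names are borrowed from the dataset, but the values and the `summary` table are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two of three columns
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
})

miss_count = toy.isnull().sum()    # number of missing entries per column
miss_share = toy.isnull().mean()   # fraction of missing entries per column

# Keep only the columns that have at least one missing value, worst first
summary = pd.DataFrame({"missing": miss_count, "share": miss_share})
summary = summary[summary["missing"] > 0].sort_values("share", ascending=False)
print(summary)
```

A column that is mostly empty is a natural candidate for dropping, while a column with only a few gaps is usually worth imputing.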
Count the columns that hold categorical variables
categorical_features = (x_train.dtypes == 'object')
categorical_features_list = list(categorical_features[categorical_features].index)
print('Number of categorical features:{}'.format(len(categorical_features_list)))
print('Categorical features:', categorical_features_list)
There are several main ways to handle missing values:
1. Drop the columns that contain missing values;
# Get the names of columns with missing values
column_missing = [col for col in x_train.columns if x_train[col].isnull().any()]
# Drop those columns from the training and validation data
reduced_x_train = x_train.drop(column_missing, axis=1)
reduced_x_val = x_val.drop(column_missing, axis=1)
2. Fill the missing values with a special value, for example the median;
# The median is only defined for numeric data, so impute the numeric columns only
numeric_cols = x_train.select_dtypes(exclude=['object']).columns
my_impute = SimpleImputer(strategy='median')
imputed_X_train = pd.DataFrame(my_impute.fit_transform(x_train[numeric_cols]))
imputed_X_valid = pd.DataFrame(my_impute.transform(x_val[numeric_cols]))
# Imputation removed the column names; put them back
imputed_X_train.columns = numeric_cols
imputed_X_valid.columns = numeric_cols
3. An improved version of special-value filling: also set a flag that marks which values were imputed
# Copy the numeric columns to avoid changing the original data
# (the default mean strategy of SimpleImputer needs numeric input)
x_train_plus = x_train.select_dtypes(exclude=['object']).copy()
x_val_plus = x_val.select_dtypes(exclude=['object']).copy()
# Make new columns indicating what will be imputed
numeric_missing = [col for col in cols_with_missing if col in x_train_plus.columns]
for col in numeric_missing:
    x_train_plus[col + '_was_missing'] = x_train_plus[col].isnull()
    x_val_plus[col + '_was_missing'] = x_val_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_x_train_plus = pd.DataFrame(my_imputer.fit_transform(x_train_plus))
imputed_x_val_plus = pd.DataFrame(my_imputer.transform(x_val_plus))
# Imputation removed column names; put them back
imputed_x_train_plus.columns = x_train_plus.columns
imputed_x_val_plus.columns = x_val_plus.columns
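The three strategies above are easiest to judge by training the same model on each variant and comparing the validation error. Below is a rough sketch of that comparison on synthetic numeric data. The helper `score_dataset`, the random data and the choice of `RandomForestRegressor` are assumptions for illustration, not part of the original post; `add_indicator=True` is used as a compact way to get the "was missing" flags.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_dataset(X_tr, X_va, y_tr, y_va):
    """Hypothetical helper: fit a small forest and return the validation MAE."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))


# Synthetic numeric data; roughly 20% of column "b" is blanked out afterwards
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = 3 * X["a"] + X["b"] + rng.normal(scale=0.1, size=200)
X.loc[rng.random(200) < 0.2, "b"] = np.nan
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.8, random_state=0)

# Strategy 1: drop every column that has a missing value in the training part
cols_missing = [c for c in X_tr.columns if X_tr[c].isnull().any()]
mae_drop = score_dataset(X_tr.drop(columns=cols_missing),
                         X_va.drop(columns=cols_missing), y_tr, y_va)

# Strategy 2: fill missing values with the training median
median_imputer = SimpleImputer(strategy="median")
mae_median = score_dataset(median_imputer.fit_transform(X_tr),
                           median_imputer.transform(X_va), y_tr, y_va)

# Strategy 3: median fill plus one indicator column per imputed feature
flag_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_tr_flag = flag_imputer.fit_transform(X_tr)
X_va_flag = flag_imputer.transform(X_va)
mae_flag = score_dataset(X_tr_flag, X_va_flag, y_tr, y_va)

print("drop:", mae_drop, "median:", mae_median, "median + flag:", mae_flag)
```

Which strategy wins depends on the data, so the numbers from this synthetic example say nothing about the house price dataset itself; the same comparison would need to be rerun there.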
There are several main ways to handle categorical variables:
1. Drop the categorical columns;
drop_X_train = x_train.select_dtypes(exclude=['object'])
drop_X_valid = x_val.select_dtypes(exclude=['object'])
2. Use Label Encoding;
When encoding a feature, check whether the set of categories in the training data is exactly the same as in the test data, no more and no fewer. If the two sets differ, simply drop that column.
# All categorical columns
object_cols = [col for col in x.columns if x[col].dtype == "object"]
# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if
                   set(x[col]) == set(x_test[col])]
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be label encoded:\n', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:\n', bad_label_cols)
from sklearn.preprocessing import LabelEncoder
# Drop categorical columns that will not be encoded
label_x_train = x_train.drop(bad_label_cols, axis=1)
label_x_val = x_val.drop(bad_label_cols, axis=1)
label_x_test = x_test.drop(bad_label_cols, axis=1)
print(label_x_train.shape)
print(label_x_val.shape)
print(label_x_test.shape)
# label_x_train = x_train.select_dtypes(include=['object'])
# label_x_val = x_val.select_dtypes(include=['object'])
# label_x_test = x_test.select_dtypes(include=['object'])
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in good_label_cols:
    # Fit on the full training data so that labels which only occur
    # in the validation split are still known to the encoder
    label_encoder.fit(x[col])
    label_x_train[col] = label_encoder.transform(x_train[col])
    label_x_val[col] = label_encoder.transform(x_val[col])
    label_x_test[col] = label_encoder.transform(x_test[col])
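Dropping every column whose categories differ between the two sets can throw away a lot of information. One possible alternative, shown here only as a sketch and not as the approach used in this post, is scikit-learn's `OrdinalEncoder`, which can map unseen categories to a reserved value (available in scikit-learn 0.24 and later). The toy frames and the choice of `-1` as the reserved value are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "PosN" appears in the validation data only
train = pd.DataFrame({"Condition2": ["Norm", "Feedr", "Norm", "Artery"]})
valid = pd.DataFrame({"Condition2": ["Norm", "PosN"]})

# Categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train)
valid_enc = encoder.transform(valid)

print(encoder.categories_)   # categories learned from the training data
print(train_enc.ravel())     # integer codes for the training rows
print(valid_enc.ravel())     # the unseen "PosN" becomes -1
```

The learned categories are sorted alphabetically, so the integer codes carry no real ordering; tree-based models usually cope with that, while linear models may not.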
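3. Use One-Hot Encoding;
Before applying One-Hot encoding, check the cardinality of each column. Only the low-cardinality columns are One-Hot encoded; for high-cardinality columns Label Encoding is more convenient (the code below simply drops them).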
# Get number of unique entries in each column with categorical data
# (use x_train here: label_x_train no longer contains the dropped columns)
object_nunique = [x_train[col].nunique() for col in object_cols]
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
print(sorted(d.items(), key=lambda item: item[1]))
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if x[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
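The cardinality check can be tried on a tiny frame to see how columns are sorted into the two groups. This is a sketch under assumed data: the toy values and the cutoff of 3 are chosen only so that the example has one column on each side, whereas the post uses a cutoff of 10.

```python
import pandas as pd

# Hypothetical toy frame: one low-cardinality and one high-cardinality text column
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Treat every non-numeric column as categorical
toy_object_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Number of distinct values per categorical column, smallest first
cardinality = toy[toy_object_cols].nunique().sort_values()
print(cardinality)

threshold = 3  # illustrative cutoff for this tiny example only
toy_low = cardinality[cardinality < threshold].index.tolist()
toy_high = cardinality[cardinality >= threshold].index.tolist()
print("one-hot encode:", toy_low)
print("drop or label encode:", toy_high)
```

One-Hot encoding adds one column per category, so a column with hundreds of distinct values would widen the feature matrix considerably; that is the reason for the cutoff.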
from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
print(x[low_cardinality_cols].describe())
print(x_test[low_cardinality_cols].describe())
OH_col_train = pd.DataFrame(oh_encoder.fit_transform(x[low_cardinality_cols]))
OH_col_valid = pd.DataFrame(oh_encoder.transform(x_test[low_cardinality_cols]))
OH_col_train.index = x.index
OH_col_valid.index = x_test.index
num_x_train = x.drop(low_cardinality_cols, axis=1)
num_x_valid = x_test.drop(low_cardinality_cols, axis=1)
num_x_train = num_x_train.drop(high_cardinality_cols, axis=1)
num_x_valid = num_x_valid.drop(high_cardinality_cols, axis=1)
OH_x_train = pd.concat([num_x_train, OH_col_train], axis=1)
OH_x_valid = pd.concat([num_x_valid, OH_col_valid], axis=1)
Building the model with a pipeline
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in x_train.columns if
                    x_train[cname].nunique() < 10 and
                    x_train[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in x_train.columns if
                  x_train[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = x_train[my_cols].copy()
X_valid = x_val[my_cols].copy()
X_test = x_test[my_cols].copy()
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
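A preprocessor like this is typically chained with a model in a single `Pipeline`, so that imputation, encoding and fitting happen in one call. The following is a minimal sketch on a made-up six-row frame. The toy data, the names `toy_preprocessor` and `model_pipeline`, and the use of `RandomForestRegressor` in place of XGBoost are all assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one numeric and one categorical column, both with gaps
X_toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0, 14260.0, 14115.0],
    "Street": ["Pave", "Pave", "Grvl", np.nan, "Pave", "Grvl"],
})
y_toy = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

# Same structure as the preprocessor in the post, with explicit column lists
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Street"]),
])

# Chain preprocessing and model so both are fitted together
model_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

model_pipeline.fit(X_toy, y_toy)
preds = model_pipeline.predict(X_toy.head(2))
print(preds)
```

The benefit of this design is that the raw validation or test frame can be passed straight to `predict`; the imputer and encoder fitted on the training data are applied automatically, which reduces the risk of leaking validation statistics into preprocessing.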
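Training with XGBoost
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
# Apply the preprocessing defined above so the model receives numeric input
X_train_prep = preprocessor.fit_transform(X_train)
X_valid_prep = preprocessor.transform(X_valid)
# Define the model (xgboost 1.6 and later take early_stopping_rounds in the constructor)
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                          early_stopping_rounds=5)
# Fit the model, using the validation set to decide when to stop adding trees
my_model_2.fit(X_train_prep,
               y_train,
               eval_set=[(X_valid_prep, y_val)],
               verbose=False)
# Get predictions
predictions_2 = my_model_2.predict(X_valid_prep)
# Calculate MAE (true values first, predictions second)
mae_2 = mean_absolute_error(y_val, predictions_2)
# Print MAE
print("Mean Absolute Error:", mae_2)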
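For readers new to the metric, the mean absolute error is simply the average of the absolute differences between true and predicted values. A short worked sketch with made-up prices (the three values below are illustrative, not results from the model) shows that the manual calculation and `mean_absolute_error` agree:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted sale prices
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 330000])

# Absolute errors are 10000, 10000 and 30000; their mean is 50000 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print("Manual MAE:", manual_mae)
print("sklearn MAE:", sklearn_mae)
```

Because the error is measured in the same unit as the target, an MAE on this dataset can be read directly as an average deviation in sale price.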
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!