Your Second Machine Learning Project: House Price Prediction

The dataset can be downloaded from:

train.csv

test.csv

First, import the packages the program needs:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

Preparing the Data

Load the data and print the shapes of the training and test sets.

x = pd.read_csv('./data/train.csv', index_col='Id')
x_test = pd.read_csv('./data/test.csv', index_col='Id')
print('Train data size:{}'.format(x.shape))
print('Test data size:{}'.format(x_test.shape))

Train data size:(1460, 80)
Test data size:(1459, 79)

Drop the rows (samples) whose target value is missing, assign the target column to the variable y, and then delete it from the training data so that the training and test sets have the same number of features. Finally, split the training data into a training set and a validation set, at 80% and 20% respectively.

x.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = x['SalePrice']
x.drop(['SalePrice'], axis=1, inplace=True)

x_train, x_val, y_train, y_val = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=0)
print('Train data size:{}'.format(x_train.shape))
print('Validation data size:{}'.format(x_val.shape))

Data Preprocessing

Count the columns that contain missing values:

miss_count = x_train.isnull().sum()
cols_with_missing = list(miss_count[miss_count > 0].index)
print('Number of columns with missing values:{}'.format(len(cols_with_missing)))
print('Missing features:', cols_with_missing)

Count the columns that contain categorical variables:

categorical_features = (x_train.dtypes == 'object')
categorical_features_list = list(categorical_features[categorical_features].index)
print('Number of categorical features:{}'.format(len(categorical_features_list)))
print('Categorical features:', categorical_features_list)

There are several main ways to handle missing values:

1. Drop the columns that contain missing values;

# Get the names of columns with missing values
column_missing = [col for col in x_train.columns if x_train[col].isnull().any()]

# Drop those columns in the training and validation data
reduced_x_train = x_train.drop(column_missing, axis=1)
reduced_x_val = x_val.drop(column_missing, axis=1)

2. Fill the missing values with a statistic, for example the median;

# Median imputation only applies to numeric data, so restrict to numeric columns
num_x_train = x_train.select_dtypes(exclude=['object'])
num_x_val = x_val.select_dtypes(exclude=['object'])

# Imputation
my_imputer = SimpleImputer(strategy='median')
imputed_x_train = pd.DataFrame(my_imputer.fit_transform(num_x_train))
imputed_x_val = pd.DataFrame(my_imputer.transform(num_x_val))

# Imputation removed the column names; put them back
imputed_x_train.columns = num_x_train.columns
imputed_x_val.columns = num_x_val.columns

3. An improved version of imputation that also adds indicator columns flagging which values were missing.

# Work on copies of the numeric columns, since the default mean strategy
# also requires numeric data
x_train_plus = x_train.select_dtypes(exclude=['object']).copy()
x_val_plus = x_val.select_dtypes(exclude=['object']).copy()

# Make new columns indicating what will be imputed
num_cols_with_missing = [col for col in cols_with_missing if col in x_train_plus.columns]
for col in num_cols_with_missing:
    x_train_plus[col + '_was_missing'] = x_train_plus[col].isnull()
    x_val_plus[col + '_was_missing'] = x_val_plus[col].isnull()

# Imputation (SimpleImputer defaults to the mean)
my_imputer = SimpleImputer()
imputed_x_train_plus = pd.DataFrame(my_imputer.fit_transform(x_train_plus))
imputed_x_val_plus = pd.DataFrame(my_imputer.transform(x_val_plus))

# Imputation removed the column names; put them back
imputed_x_train_plus.columns = x_train_plus.columns
imputed_x_val_plus.columns = x_val_plus.columns
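
To compare the three strategies, it helps to score each preprocessed dataset with the same benchmark model. Below is a minimal sketch; the choice of RandomForestRegressor and MAE as the yardstick is an assumption, not something fixed by the steps above.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(x_tr, x_va, y_tr, y_va):
    # Train a small random forest and report its MAE on the validation set
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(x_tr, y_tr)
    preds = model.predict(x_va)
    return mean_absolute_error(y_va, preds)

# Approach 1 still contains categorical columns, so compare on numeric data only
print('MAE (drop columns):',
      score_dataset(reduced_x_train.select_dtypes(exclude=['object']),
                    reduced_x_val.select_dtypes(exclude=['object']),
                    y_train, y_val))
print('MAE (median imputation):',
      score_dataset(imputed_x_train, imputed_x_val, y_train, y_val))
print('MAE (imputation + indicators):',
      score_dataset(imputed_x_train_plus, imputed_x_val_plus, y_train, y_val))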

There are several main ways to handle categorical variables:

1. Drop the categorical columns outright;

drop_x_train = x_train.select_dtypes(exclude=['object'])
drop_x_val = x_val.select_dtypes(exclude=['object'])

2. Use label encoding;

When encoding the features, make sure that every category appearing in the validation and test data also appears in the training data the encoder is fitted on; columns for which this does not hold are simply dropped.

# All categorical columns
object_cols = [col for col in x_train.columns if x_train[col].dtype == "object"]

# Columns that can be safely label encoded: no missing values in the training
# data, and every category in the validation/test data also appears in training
good_label_cols = [col for col in object_cols
                   if not x_train[col].isnull().any() and
                      set(x_val[col]).issubset(set(x_train[col])) and
                      set(x_test[col]).issubset(set(x_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))

print('Categorical columns that will be label encoded:\n', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:\n', bad_label_cols)

from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_x_train = x_train.drop(bad_label_cols, axis=1)
label_x_val = x_val.drop(bad_label_cols, axis=1)
label_x_test = x_test.drop(bad_label_cols, axis=1)
print(label_x_train.shape)
print(label_x_val.shape)
print(label_x_test.shape)

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_x_train[col] = label_encoder.fit_transform(x_train[col])
    label_x_val[col] = label_encoder.transform(x_val[col])
    label_x_test[col] = label_encoder.transform(x_test[col])
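
A side note: LabelEncoder is really designed for encoding the target, which is why it has to be refitted column by column above. For features, scikit-learn's OrdinalEncoder does the same job on all columns in a single fit; a minimal sketch of the equivalent step:

from sklearn.preprocessing import OrdinalEncoder

# One fit covers all of the safe columns at once
ordinal_encoder = OrdinalEncoder()
label_x_train[good_label_cols] = ordinal_encoder.fit_transform(x_train[good_label_cols])
label_x_val[good_label_cols] = ordinal_encoder.transform(x_val[good_label_cols])
label_x_test[good_label_cols] = ordinal_encoder.transform(x_test[good_label_cols])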

3. Use one-hot encoding;

Before applying one-hot encoding, inspect each categorical column's cardinality (its number of unique values). Only low-cardinality columns are one-hot encoded here; for high-cardinality columns, which would explode into many new columns, label encoding is more convenient.

# Get the number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: x_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print the number of unique entries by column, in ascending order
print(sorted(d.items(), key=lambda item: item[1]))

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if x[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

from sklearn.preprocessing import OneHotEncoder

# Inspect the categories in the columns to be encoded
print(x[low_cardinality_cols].describe())
print(x_test[low_cardinality_cols].describe())

# handle_unknown='ignore' zeroes out categories not seen during fit
# (on scikit-learn < 1.2, pass sparse=False instead of sparse_output=False)
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(oh_encoder.fit_transform(x[low_cardinality_cols]))
OH_cols_test = pd.DataFrame(oh_encoder.transform(x_test[low_cardinality_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = x.index
OH_cols_test.index = x_test.index

# Remove all categorical columns (low-cardinality ones are replaced by the
# one-hot columns; high-cardinality ones are dropped)
num_x_train = x.drop(object_cols, axis=1)
num_x_test = x_test.drop(object_cols, axis=1)

# Add the one-hot columns to the numerical features
OH_x_train = pd.concat([num_x_train, OH_cols_train], axis=1)
OH_x_test = pd.concat([num_x_test, OH_cols_test], axis=1)
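
One wrinkle in the block above: the new one-hot columns get plain integer names, which then sit alongside the original string column names. If your scikit-learn version provides get_feature_names_out (1.0 and later), you can assign readable names right after the transform, before the concat step:

# Name each column after its source feature and category, e.g. 'MSZoning_RL'
OH_cols_train.columns = oh_encoder.get_feature_names_out(low_cardinality_cols)
OH_cols_test.columns = oh_encoder.get_feature_names_out(low_cardinality_cols)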

Building a Model with a Pipeline

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
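
On its own, the preprocessor does nothing until it is bundled with a model. As a quick sketch of how the pieces connect (using a RandomForestRegressor as a stand-in here; the next section trains XGBoost instead):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling so that fit/predict run both steps
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])

my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
print('MAE:', mean_absolute_error(y_val, preds))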

Training with XGBoost

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# XGBoost cannot consume the raw categorical columns here, so first run the
# data through the preprocessor built above
X_train_prep = preprocessor.fit_transform(X_train)
X_valid_prep = preprocessor.transform(X_valid)

# Define the model (on XGBoost >= 1.6, early_stopping_rounds belongs in the
# constructor; older versions accept it as a fit() argument)
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                          early_stopping_rounds=5)

# Fit the model, watching the validation set for early stopping
my_model_2.fit(X_train_prep, y_train,
               eval_set=[(X_valid_prep, y_val)],
               verbose=False)

# Get predictions
predictions_2 = my_model_2.predict(X_valid_prep)

# Calculate and print MAE
mae_2 = mean_absolute_error(y_val, predictions_2)
print("Mean Absolute Error:", mae_2)