Kaggler Log - Day 3

Progress 24/12/13

Yesterday's recap:
Pandas course: 4/6 completed
Improved "Titanic"

Today's log
Finished the Pandas course
Intermediate Machine Learning

Pandas course

Data Types and Missing Values
Renaming and Combining

data.col.dtype
data.dtypes
# a str column is shown as object dtype
# after astype("str") the dtype is still object
data.col.astype('float64')  # change the type

# Missing Values
data[pd.isnull(data.col)]
data.col.fillna("Unknown")  # fill NaN in col with the str "Unknown"
data.col.replace("old_value", "new_value")

# rename
data.rename(columns={"old_col_name": "new_col_name"}, index={0: "firstEntry"})
# the row and column indexes can have their own names too
data.rename_axis("row_index_name", axis="rows").rename_axis("col_index_name", axis="columns")

# combining
# three methods: concat(), join(), merge()
# concat can stack two DataFrames if they have the same columns
pd.concat([df1, df2])
# join can combine different DataFrame objects which have an index in common
left = data.set_index(['title', 'trending_date'])
right = data.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
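
To see these calls on something concrete, here is a minimal sketch on toy data. The frames and column names (df1, df2, "title", "views") are made up for illustration and are not from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the column names are made up for illustration.
df1 = pd.DataFrame({"title": ["A", "B"], "views": [10, np.nan]})
df2 = pd.DataFrame({"title": ["C"], "views": [30.0]})

# concat stacks frames that share the same columns.
stacked = pd.concat([df1, df2], ignore_index=True)

# fillna replaces the missing entry; astype then changes the dtype.
views_filled = stacked["views"].fillna(0).astype("int64")

# rename changes a column label without touching the data.
renamed = stacked.rename(columns={"views": "view_count"})

# join aligns two frames on their shared index; the suffixes disambiguate
# columns that have the same name on both sides.
left = df1.set_index("title")
right = pd.DataFrame({"title": ["A", "B"], "views": [7, 8]}).set_index("title")
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

print(views_filled.tolist())
print(list(joined.columns))
```

Note that rename and fillna return new objects here; the original frames stay unchanged unless the result is assigned back.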

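About count and size

count() reports the number of non-null values and gives one result per column.
size(), by contrast, gives only a single integer (per group, when used after groupby): the number of rows.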
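
A small sketch of the difference, assuming both are called on a groupby; the toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing price in group "a".
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "price": [1.0, np.nan, 3.0],
    "points": [80, 85, 90],
})

grouped = df.groupby("group")
counts = grouped.count()  # DataFrame: non-null values per column, per group
sizes = grouped.size()    # Series: number of rows per group, NaN rows included

print(counts)
print(sizes)
```

For group "a", count() gives 1 for price and 2 for points, while size() gives 2.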

Pandas is a very important library for data processing. It makes the work far more efficient and is compatible with many machine learning libraries.

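Intermediate Machine Learning

After this course you will be able to:

Handle the kinds of data often found in real-world datasets (missing values, categorical variables)
Design workflows that improve model code
Use more advanced model validation techniques (cross-validation)
Build the SOTA models that are often used on Kaggle (XGBoost)
Avoid common and important data science mistakes (leakage)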

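Introduction

The course uses the dataset from the Housing Prices Competition for Kaggle Learn Users, predicting house prices from 79 explanatory variables.

Walked through the basic workflow once and made one successful submission.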

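Missing Values

For all sorts of real-world reasons, many datasets contain a large number of missing values. Most machine learning libraries cannot handle missing values, so you have to preprocess them yourself.
Three strategies are covered below:

Strategy 1: drop the columns that contain missing values. This is acceptable when most of a column is missing; otherwise it can throw away a lot of useful information.
Strategy 2: imputation. Fill the missing positions directly with the mean or a more reasonable value. The filled-in values are not very accurate, but the trained model usually does better than with strategy 1.
Strategy 3: an extension to imputation. Add new columns that flag which rows were imputed. How much this helps depends on the data.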

Comparing the strategies

# Strategy 1: drop columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
# MAE from Approach 1 (Drop columns with missing values): 183550.22137772635

# Strategy 2: simple imputation
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
# MAE from Approach 2 (Imputation): 178166.46269899711

# Strategy 3: an extension to imputation
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
# MAE from Approach 3 (An Extension to Imputation): 178927.503183954

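Trying to explain why strategy 2 beats strategy 1 (in the output below, none of the three affected columns is missing more than half of its 10864 values, so dropping them discards mostly valid data):

# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])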
"""
(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
"""

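As a quick check on the numbers above, the fraction of missing values per column can be worked out directly. The counts are copied from the printed output; nothing else is assumed.

```python
# Numbers copied from the output above.
n_rows = 10864
missing_counts = {"Car": 49, "BuildingArea": 5156, "YearBuilt": 4307}

# Fraction of each column that is missing.
missing_fraction = {col: n / n_rows for col, n in missing_counts.items()}

for col, frac in missing_fraction.items():
    print(f"{col}: {frac:.1%} missing")
```

Every fraction is below one half, which fits the observation that dropping these columns costs more than imputing them.
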
Hands-on preprocessing workflow:

import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Step 1, initial exploration: data size and missing values
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Step 2: use the scoring function to compare the results of the different strategies
...

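Although the earlier guess was that filling in imputed values would beat dropping columns, the result was exactly the opposite. Possible reasons:

The imputed values may have introduced noise.
The imputation method may also not suit this dataset, that is, filling the gaps with the mean may not be reasonable.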
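
One way to probe the second point is to compare the mean with the median on a skewed column. This is only a sketch on made-up numbers, not a result from the housing data; the series name lot_frontage is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one very large value pulls the mean upward.
lot_frontage = pd.Series([60.0, 65.0, 70.0, np.nan, 75.0, 313.0])

mean_fill = lot_frontage.mean()      # sensitive to the outlier
median_fill = lot_frontage.median()  # robust to the outlier

filled_with_mean = lot_frontage.fillna(mean_fill)
filled_with_median = lot_frontage.fillna(median_fill)

print(mean_fill, median_fill)
```

When the two fill values differ a lot, as they do here, the mean may be a poor stand-in for a typical row, and a median (or another strategy) could be worth trying.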

Final workflow:

Pick your best preprocessing approach to produce X_train, X_valid, y_train, y_valid
Validate the model
Process X_test the same way and generate the final submission file
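
The three steps above can be sketched end to end as follows. The data is a hypothetical toy stand-in for the real splits, and mean imputation is just one possible choice of preprocessing; the Id and SalePrice names follow the CSV columns used earlier in this log.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

# Hypothetical toy data standing in for the real train/valid/test splits.
X_train = pd.DataFrame({"LotArea": [8000, 9000, 10000, 11000],
                        "LotFrontage": [60.0, np.nan, 70.0, 80.0]})
y_train = pd.Series([150000, 160000, 175000, 190000])
X_valid = pd.DataFrame({"LotArea": [9500], "LotFrontage": [np.nan]})
y_valid = pd.Series([170000])
X_test = pd.DataFrame({"LotArea": [8500, 10500], "LotFrontage": [65.0, np.nan]},
                      index=pd.Index([1461, 1462], name="Id"))

# 1. Preprocess: fit the imputer on training data only, then reuse it.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_valid_imp = pd.DataFrame(imputer.transform(X_valid), columns=X_valid.columns)

# 2. Validate the model.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train_imp, y_train)
mae = mean_absolute_error(y_valid, model.predict(X_valid_imp))
print("Validation MAE:", mae)

# 3. Apply the same fitted imputer to the test set and build the submission.
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns,
                          index=X_test.index)
submission = pd.DataFrame({"Id": X_test_imp.index,
                           "SalePrice": model.predict(X_test_imp)})
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
```

The key design point is that the test set goes through the imputer that was fitted on the training data, never a freshly fitted one.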

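Free practice

Following the workflow above, choose the best missing-value handling you can.
First read the dataset description to decide which features to train on and how to fill the missing values.

My understanding of the features chosen in the example:
['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
Features I chose:
For convenience, the original features plus any of the columns with missing values
Filling strategy for the columns with missing values:
['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

LotFrontage: linear feet of street connected to the property, numeric. The value should exist and is just not recorded, so fill with the mean.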

count    956.000000
mean      69.614017
std       22.946069
min       21.000000
25%       59.000000
50%       69.000000
75%       80.000000
max      313.000000
Name: LotFrontage, dtype: float64

MasVnrArea: masonry veneer area, numeric. A missing value most likely means there really is none, so fill with 0.

GarageYrBlt: the year the garage was built. It may well be missing because there is no garage at all, so the filling strategy here is tricky.

count    1110.000000
mean     1978.140541
std        24.877265
min      1900.000000
25%      1961.000000
50%      1979.000000
75%      2002.000000
max      2010.000000
Name: GarageYrBlt, dtype: float64

Problem

Filling missing time values

Detailed check:

G_info = X_train[['GarageYrBlt', 'GarageCars', 'GarageArea']]
# rows where the build year is missing
G_info_focus = G_info[G_info.GarageYrBlt.isnull()]
# rows where the build year is present
G = G_info[~G_info.GarageYrBlt.isnull()]
# print(G_info_focus)
print(G_info_focus.loc[G_info_focus.GarageArea != 0])
print(G.loc[G.GarageArea == 0])

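It turns out that whenever the build year is missing, the garage area is 0, and otherwise the garage area is greater than zero. So this column is left out for now; strategies for filling missing time values can be studied later.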
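
The finding above can also be checked with a single boolean comparison. The sketch below uses hypothetical toy rows, and the HasGarage flag is an assumption of mine about one way to keep the information, not something the log actually did.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the pattern found above.
X = pd.DataFrame({
    "GarageYrBlt": [1990.0, np.nan, 2005.0, np.nan],
    "GarageArea": [400, 0, 520, 0],
})

year_missing = X["GarageYrBlt"].isnull()
no_garage = X["GarageArea"] == 0

# True when "year missing" and "area is zero" mark exactly the same rows.
pattern_holds = bool((year_missing == no_garage).all())

# One option (an assumption, not what the log did): keep the information
# as a 0/1 flag and drop the year column itself.
X_flagged = X.drop(columns=["GarageYrBlt"]).assign(HasGarage=(~no_garage).astype(int))

print(pattern_holds)
print(X_flagged)
```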

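Strategy settled:
On top of the original features, add two of the columns with missing values, filled with the mean and with 0 respectively.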

mean_imputer = SimpleImputer(strategy='mean')
zero_imputer = SimpleImputer(strategy='constant', fill_value=0)

X_train_copy = X_train.copy()
X_valid_copy = X_valid.copy()

# Fit on the training data only, then reuse the fitted imputers on the validation data
X_train_copy[["MasVnrArea"]] = zero_imputer.fit_transform(X_train_copy[["MasVnrArea"]])
X_train_copy[["LotFrontage"]] = mean_imputer.fit_transform(X_train_copy[["LotFrontage"]])
X_valid_copy[["MasVnrArea"]] = zero_imputer.transform(X_valid_copy[["MasVnrArea"]])
X_valid_copy[["LotFrontage"]] = mean_imputer.transform(X_valid_copy[["LotFrontage"]])

# features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'LotFrontage', 'MasVnrArea']
old_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
# drop some
# features = [fea for fea in X_train_copy.columns if fea != "GarageYrBlt"]
# missing_col_names is not defined in this excerpt; presumably the list of columns to exclude
features = [fea for fea in X_train_copy.columns if fea not in missing_col_names]

old_fea: MAE = 23740
fea: MAE = 23778
The result is actually slightly worse than before; filling in these two columns made little difference.

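Consider removing some columns from the full feature set

All features minus the only column that still has missing values: 17924
Also minus the two imputed columns: 17837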

Note: on SimpleImputer usage, call fit_transform on the training set and transform everywhere else!!!
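
A minimal sketch of why this matters, on hypothetical toy data where the training and validation means differ a lot:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with very different means in train and valid.
X_train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan]})
X_valid = pd.DataFrame({"LotFrontage": [200.0, np.nan]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(X_train)  # learns mean = 70 from train
valid_filled = imputer.transform(X_valid)      # fills with the train mean

# For contrast: refitting on validation data would fill with 200 instead,
# so the two sets would no longer be preprocessed in the same way.
valid_refit = SimpleImputer(strategy="mean").fit_transform(X_valid)

print(valid_filled[1, 0], valid_refit[1, 0])
```

With transform, the validation gap is filled with the value learned from training data; with a second fit_transform, the fill value comes from the validation set itself.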

A problem came up:

Problem

The test set has missing values in new columns!!!
How to handle it?
For now, handle it the simple way: remove these features from features
"BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "GarageCars", "GarageArea"

After removing the columns with missing values in test: 18934
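
Finding the columns that only have gaps in the test set does not have to be done by eye. A minimal sketch, assuming toy frames in place of the real X_train and X_test:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: "GarageCars" is complete in train but not in test.
X_train = pd.DataFrame({"LotArea": [8000, 9000], "GarageCars": [1.0, 2.0]})
X_test = pd.DataFrame({"LotArea": [8500, 9500], "GarageCars": [np.nan, 2.0]})

train_missing = set(X_train.columns[X_train.isnull().any()])
test_missing = set(X_test.columns[X_test.isnull().any()])

# Columns that only have gaps in the test set.
new_missing_cols = sorted(test_missing - train_missing)

# Simple handling, as in the log: drop those columns from the feature list.
features = [col for col in X_train.columns if col not in new_missing_cols]

print(new_missing_cols)
print(features)
```

Dropping is the simplest option; imputing those columns with values learned from the training set would be an alternative that keeps the features.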

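The rank finally rose to 1964
First place reaches a score of 0
Second place is at seven hundred and something
Third place is already just over ten thousand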