Progress 24/12/13
Yesterday's review:
Pandas
Course: 4/6 completed
Improved the "Titanic" project
Today's log
Pandas
Course completed
Intermediate Machine Learning
Pandas course
- Data types and missing values
- Renaming and combining
# Data types
data.col.dtype  # dtype of a single column
data.dtypes  # dtype of every column
# A str column is shown as object type
# astype("str") also yields dtype object
data.col.astype('float64')  # change the type
# Missing values
data[pd.isnull(data.col)]
data.col.fillna("Unknown")  # fill NaN in col with the string "Unknown"
data.col.replace("old_value", "new_value")
# Renaming
data.rename(columns={"old_col_name": "new_col_name"}, index={0: "firstEntry"})
# The row and column indexes can have their own names too
data.rename_axis("row_index_name", axis="rows").rename_axis("col_index_name", axis="columns")
# Combining
# Three methods: concat(), join(), merge()
# concat can stack two DataFrames if they have the same columns
pd.concat([df1, df2])
# join can combine different DataFrame objects which have an index in common
left = data.set_index(['title', 'trending_date'])
right = data.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
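To make the difference between concat and join concrete, here is a minimal sketch on two toy DataFrames. The frames and their names (df_can, df_uk, views) are made up for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for two regional tables
df_can = pd.DataFrame({"title": ["a", "b"], "views": [10, 20]})
df_uk = pd.DataFrame({"title": ["b", "c"], "views": [5, 7]})

# concat stacks the rows; both frames share the same columns
stacked = pd.concat([df_can, df_uk], ignore_index=True)

# join aligns rows on the index; the suffixes disambiguate the shared column name
left = df_can.set_index("title")
right = df_uk.set_index("title")
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

print(stacked)
print(joined)
```

join defaults to a left join, so title "a" is kept with a missing views_UK and title "c" is dropped.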
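About count and size
After a groupby, count()
returns the number of non-null values, with one result for each column,
while size()
returns a single integer per group: the number of rows, nulls included.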
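The difference shows up most clearly after a groupby. A minimal sketch on a made-up frame (the column names group, x and y are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "x"
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "x": [1.0, np.nan, 3.0],
    "y": [1, 2, 3],
})

counts = df.groupby("group").count()  # DataFrame: non-null values per column
sizes = df.groupby("group").size()    # Series: number of rows per group

print(counts)
print(sizes)
```

For group "a", count reports 1 for x and 2 for y, while size reports 2 because it counts rows regardless of missing values.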
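Pandas is one of the most important libraries for data processing. It speeds up everyday work considerably and is compatible with many machine learning libraries.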
Intermediate Machine Learning
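After this course I will be able to:
- Handle data types often found in real-world datasets (missing values, categorical variables)
- Design workflows that improve model code
- Use more advanced model validation techniques (cross-validation)
- Build SOTA models that are widely used on Kaggle (XGBoost)
- Avoid common and important data science mistakes (leakage)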
Introduction
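Uses the dataset from the Housing Prices Competition for Kaggle Learn Users,
predicting house prices from 79 explanatory variables.
Walked through the basic workflow once and made one successful submission.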
Missing Values
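For all sorts of real-world reasons, many datasets contain a large number of missing values. Most machine learning libraries cannot handle missing values directly, so they have to be preprocessed by hand.
Three strategies:
- Strategy 1: drop the columns that contain missing values. Acceptable when most of a column is missing; otherwise it may throw away a lot of useful information.
- Strategy 2: imputation. Fill the missing positions directly with the mean or a more sensible value. The filled values are not very accurate, but the trained model is usually better than with Strategy 1.
- Strategy 3: extended imputation. In addition to imputing, add new columns that mark which rows were imputed. How much this helps depends on the data.
Comparing the strategies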
# Strategy 1: drop columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
# MAE from Approach 1 (Drop columns with missing values): 183550.22137772635
# Strategy 2: simple imputation
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
# MAE from Approach 2 (Imputation): 178166.46269899711
# Strategy 3: extended imputation
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
# MAE from Approach 3 (An Extension to Imputation): 178927.503183954
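Strategy 3 can also be written more compactly with SimpleImputer's add_indicator option, which appends the "was missing" indicator columns itself. A minimal sketch on made-up data (the toy frames and their column names are assumptions, not the course data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only column "b" has missing values in training
X_train_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})
X_valid_toy = pd.DataFrame({"a": [4.0], "b": [np.nan]})

# add_indicator=True appends one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_out = imputer.fit_transform(X_train_toy)
valid_out = imputer.transform(X_valid_toy)

print(train_out)
print(valid_out)
```

The output is a NumPy array with the imputed features first and the indicator columns last, so the column names still have to be restored by hand.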
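Trying to explain why Strategy 2 beats Strategy 1: the output below shows that every column with missing values is missing in fewer than half of its rows, so dropping those columns throws away mostly valid data.
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data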
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
"""
(10864, 12)
Car 49
BuildingArea 5156
YearBuilt 4307
dtype: int64
"""
Hands-on preprocessing workflow:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
# Step 1: initial exploration, data size and missing values
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
# Step 2: use the scoring function to compare the results of the different strategies ...
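For the Step 1 exploration, the missing counts can also be expressed as fractions of the rows, which makes it easier to judge how costly dropping a column would be. A minimal sketch on a made-up frame (column names borrowed from the housing data, values invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up
X_toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageYrBlt": [np.nan, 1999.0, np.nan, 1978.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing entries in each column
missing_fraction = X_toy.isnull().mean()
print(missing_fraction[missing_fraction > 0])
```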
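Although I had expected imputation to beat dropping columns, the result was exactly the opposite. Possible reasons:
- The imputed values may have introduced noise.
- The imputation method may not suit this dataset, i.e. filling the gaps with the mean is not reasonable here.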
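One way to probe the second explanation is to score several fill strategies side by side. A minimal sketch on synthetic data: the data generation, the helper name score_strategy and the chosen strategies are all assumptions for illustration; on the real data one would reuse score_dataset with X_train and X_valid.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (made up): three numeric features, roughly 20% of one column knocked out
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y = 3 * X["f0"] + X["f1"] + rng.normal(scale=0.1, size=200)
X.loc[rng.random(200) < 0.2, "f0"] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def score_strategy(strategy):
    """Impute with the given SimpleImputer strategy and return the validation MAE."""
    imputer = SimpleImputer(strategy=strategy)
    X_tr_imp = imputer.fit_transform(X_tr)  # learn fill values from training data
    X_va_imp = imputer.transform(X_va)      # reuse them on validation data
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr_imp, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va_imp))

results = {s: score_strategy(s) for s in ["mean", "median", "most_frequent"]}
print(results)
```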
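Final workflow:
- Pick the preprocessing approach that works best and generate
X_train, X_valid, y_train, y_valid
- Validate the model
- Process X_test in the same way and generate the final submission file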
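The workflow above can be sketched end to end as follows. The toy frames stand in for the real train, validation and test splits, the Id/SalePrice output layout is assumed from the competition's submission format, and the file name submission.csv is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

# Hypothetical toy splits standing in for X_train, X_valid, X_test
X_train = pd.DataFrame({"LotArea": [8000.0, 9500.0, np.nan, 7000.0],
                        "YearBuilt": [1990, 2001, 1975, 1960]})
y_train = pd.Series([200000, 250000, 150000, 120000])
X_valid = pd.DataFrame({"LotArea": [np.nan, 8800.0], "YearBuilt": [1985, 2005]})
y_valid = pd.Series([180000, 260000])
X_test = pd.DataFrame({"LotArea": [9100.0, np.nan], "YearBuilt": [1999, 1970]},
                      index=pd.Index([1461, 1462], name="Id"))

# 1. Preprocess: fit on train, then apply the same fitted imputer everywhere else
imputer = SimpleImputer(strategy="mean")
final_X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
final_X_valid = pd.DataFrame(imputer.transform(X_valid), columns=X_valid.columns)
final_X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# 2. Validate the model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)
print("MAE:", mean_absolute_error(y_valid, model.predict(final_X_valid)))

# 3. Predict on the test set and build the submission table
output = pd.DataFrame({"Id": X_test.index, "SalePrice": model.predict(final_X_test)})
# output.to_csv("submission.csv", index=False)  # placeholder file name
print(output)
```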
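Free practice
Following the workflow above, choose the best missing-value handling I can find.
First read the dataset description to decide which features to train on and how to fill the missing values.
-
My reading of the features selected in the example
['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
-
Features I selected
For convenience, the original features plus any of the columns with missing values
-
Missing-value imputation strategy:
['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
- LotFrontage: linear feet of street connected to the property, numeric. The value should exist and was simply not recorded, so fill with the mean.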
count 956.000000
mean 69.614017
std 22.946069
min 21.000000
25% 59.000000
50% 69.000000
75% 80.000000
max 313.000000
Name: LotFrontage, dtype: float64
- MasVnrArea: masonry veneer area, numeric. A missing value most likely means there is no veneer at all, so fill with 0.
count 956.000000
mean 69.614017
std 22.946069
min 21.000000
25% 59.000000
50% 69.000000
75% 80.000000
max 313.000000
Name: LotFrontage, dtype: float64
- GarageYrBlt: year the garage was built. A missing value may well mean there is no garage at all, which makes the imputation strategy tricky.
count 1110.000000
mean 1978.140541
std 24.877265
min 1900.000000
25% 1961.000000
50% 1979.000000
75% 2002.000000
max 2010.000000
Name: GarageYrBlt, dtype: float64
Question
How to impute missing year values
Closer inspection:
G_info = X_train[['GarageYrBlt', 'GarageCars', 'GarageArea']]
G_info_focus = G_info[G_info.GarageYrBlt.isnull()]
G = G_info[~G_info.GarageYrBlt.isnull()]
# print(G_info_focus)
print(G_info_focus.loc[G_info_focus.GarageArea!=0])
print(G.loc[G.GarageArea==0])
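Finding: whenever the build year is missing, the garage area is 0; conversely, whenever the year is present, the garage area is greater than zero. I'll leave this column out for now and look into imputation strategies for missing time values later.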
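The same check can be written as a single boolean comparison. A minimal sketch on a made-up frame (toy values; on the real data the frame would be X_train):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: a missing year coincides with zero garage area
G_toy = pd.DataFrame({
    "GarageYrBlt": [1999.0, np.nan, 1978.0, np.nan],
    "GarageCars": [2, 0, 1, 0],
    "GarageArea": [480, 0, 240, 0],
})

year_missing = G_toy["GarageYrBlt"].isnull()
no_garage = G_toy["GarageArea"] == 0

# True only if "year missing" and "area is zero" agree on every row
print((year_missing == no_garage).all())
```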
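Final strategy:
On top of the original features, add two columns that contain missing values, filling one with the mean and the other with 0.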
mean_imputer = SimpleImputer(strategy='mean')
zero_imputer = SimpleImputer(strategy='constant', fill_value=0)
X_train_copy = X_train.copy()
X_valid_copy = X_valid.copy()
# Fit the imputers on the training data only, then reuse them on the validation data
X_train_copy[["MasVnrArea"]] = zero_imputer.fit_transform(X_train_copy[["MasVnrArea"]])
X_train_copy[["LotFrontage"]] = mean_imputer.fit_transform(X_train_copy[["LotFrontage"]])
X_valid_copy[["MasVnrArea"]] = zero_imputer.transform(X_valid_copy[["MasVnrArea"]])
X_valid_copy[["LotFrontage"]] = mean_imputer.transform(X_valid_copy[["LotFrontage"]])
# features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'LotFrontage', 'MasVnrArea']
old_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
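# drop some
# features = [fea for fea in X_train_copy.columns if fea != "GarageYrBlt"]
# missing_col_names: the names of the columns to exclude (its definition is not shown in these notes)
features = [fea for fea in X_train_copy.columns if fea not in missing_col_names]
- old_features: MAE = 23740
- features: MAE = 23778
The result is slightly worse than before; the two imputed columns add little.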
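The per-column fills (mean for LotFrontage, 0 for MasVnrArea) can also be bundled into one object with scikit-learn's ColumnTransformer. This is an alternative to the column-by-column assignment in these notes, sketched on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with the two columns discussed above plus one complete column
X_tr = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                     "MasVnrArea": [np.nan, 100.0, 0.0],
                     "LotArea": [8000, 9000, 10000]})
X_va = pd.DataFrame({"LotFrontage": [np.nan],
                     "MasVnrArea": [np.nan],
                     "LotArea": [7500]})

filler = ColumnTransformer(
    transformers=[
        ("mean_fill", SimpleImputer(strategy="mean"), ["LotFrontage"]),
        ("zero_fill", SimpleImputer(strategy="constant", fill_value=0), ["MasVnrArea"]),
    ],
    remainder="passthrough",  # keep all other columns unchanged
)

# Output columns follow the transformer order, then the passthrough columns
out_cols = ["LotFrontage", "MasVnrArea", "LotArea"]
X_tr_filled = pd.DataFrame(filler.fit_transform(X_tr), columns=out_cols)
X_va_filled = pd.DataFrame(filler.transform(X_va), columns=out_cols)
print(X_va_filled)
```

Because the transformer is fitted once on the training frame, the validation and test frames are filled with the training mean automatically.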
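Next, try removing some columns from the full feature set
- All features minus the only column that still has missing values: 17924
- Also minus the two imputed columns: 17837
Note: SimpleImputer usage: call fit_transform on the training data and transform everywhere else!!!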
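The reason for this rule: fit_transform learns the fill value from whatever data it is given, so calling it on the validation or test set silently replaces the value learned from training. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the training mean is 20, the validation mean would be 100
train = pd.DataFrame({"x": [10.0, 30.0, np.nan]})
valid = pd.DataFrame({"x": [100.0, np.nan]})

imputer = SimpleImputer(strategy="mean")
imputer.fit_transform(train)            # learns the training mean
print(imputer.statistics_)              # fill value learned from train

right = imputer.transform(valid)        # fills with the training mean
wrong = SimpleImputer(strategy="mean").fit_transform(valid)  # refits on valid

print(right[1, 0], wrong[1, 0])
```

The correctly transformed gap receives the training mean, while the refitted imputer fills it with a statistic taken from the validation data itself.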
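A problem came up:
Problem
The test set has missing values in columns that had none in training!!!
How to handle it?
Keep it simple for now and remove these features from features
"BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "GarageCars", "GarageArea"
After removing the columns with missing values in test: 18934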
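Finding the columns with gaps in the test set can be automated rather than listed by hand. A minimal sketch on made-up frames (toy data; on the real data the frames would be X_train and X_test):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: GarageArea is complete in train but has a gap in test
X_train_toy = pd.DataFrame({"LotArea": [8000, 9000], "GarageArea": [480.0, 240.0]})
X_test_toy = pd.DataFrame({"LotArea": [7000, 8500], "GarageArea": [np.nan, 300.0]})

# Columns with at least one missing value in the test set
cols_missing_in_test = [c for c in X_test_toy.columns if X_test_toy[c].isnull().any()]

# Keep only the features that are complete in the test set
features = [c for c in X_train_toy.columns if c not in cols_missing_in_test]
print(cols_missing_in_test, features)
```

An alternative that keeps these features is to fit a SimpleImputer on all training columns: it learns a fill value for every column, including those with no gaps in training, and transform then fills the gaps in the test set.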
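My leaderboard rank finally rose to 1964
First place gets down to 0
Second place is a little over 700
Third place is already just over 10,000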