[Statistical Methods] Variable Selection with LASSO

2025/10/25 17:55:56  Source: https://blog.csdn.net/weixin_46623488/article/details/146992958

The most basic package for fitting LASSO in R is glmnet, loaded with library(glmnet).

For a pure LASSO analysis, alpha must be set to 1.
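For reference, the elastic-net objective that glmnet documents for a Gaussian response shows why. Here N is the number of observations, and β0 and β are the intercept and the coefficient vector. With α = 1 the ridge term drops out and only the L1 (LASSO) penalty remains; with α = 0 the model is pure ridge regression.

```latex
\min_{\beta_0,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
```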

Standardize the data: LASSO is sensitive to the scale of the features, so the data should be standardized (mean 0, variance 1).

Pass the lambda.min or lambda.1se value obtained from cv.glmnet to
glmnet::glmnet(lambda = ???)

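# Load the package and the data (mtcars as an example)
library(glmnet)
data(mtcars)
x <- as.matrix(mtcars[, -1])  # feature matrix (mpg is the response)
y <- mtcars$mpg

# Choose the optimal lambda by cross-validation (automatic LASSO)
cv_fit <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_fit$lambda.min

# Train the final model with the optimal lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

# Inspect the selected variables (the intercept is listed too)
selected_vars <- rownames(coef(final_model))[coef(final_model)[, 1] != 0]
print(selected_vars)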
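The choice between lambda.min and lambda.1se changes how many variables survive. The sketch below compares the two on the same mtcars data. The helper nonzero_vars() is a hypothetical name introduced here for illustration, and the seed value is arbitrary (cross-validation folds are random, so results vary with the seed).

```r
library(glmnet)

data(mtcars)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

set.seed(123)  # arbitrary seed so the random CV folds are reproducible
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Hypothetical helper: predictors with non-zero coefficients at a given lambda
nonzero_vars <- function(fit, s) {
  b <- as.matrix(coef(fit, s = s))
  setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
}

vars_min <- nonzero_vars(cv_fit, "lambda.min")
vars_1se <- nonzero_vars(cv_fit, "lambda.1se")

cat("lambda.min =", cv_fit$lambda.min, "->", length(vars_min), "variables\n")
cat("lambda.1se =", cv_fit$lambda.1se, "->", length(vars_1se), "variables\n")
```

lambda.1se is the largest lambda whose cross-validated error is within one standard error of the minimum, so it is never smaller than lambda.min and usually gives a sparser model.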

Manually standardize the feature matrix:
x_scaled <- scale(x)
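A quick way to confirm that scale() did what is expected is to check the column means and standard deviations afterwards. This is a minimal check in base R on the mtcars features; the variable names col_means and col_sds are introduced here for illustration.

```r
data(mtcars)
x <- as.matrix(mtcars[, -1])

# Center each column to mean 0 and scale to standard deviation 1
x_scaled <- scale(x)

col_means <- colMeans(x_scaled)
col_sds <- apply(x_scaled, 2, sd)

print(round(col_means, 10))  # all (numerically) zero
print(col_sds)               # all one
```

Note that glmnet also standardizes x internally by default (standardize = TRUE) and reports coefficients on the original scale, so manual scaling mainly matters when that option is turned off or when scaled coefficients are wanted.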

Categorical variables: comparing encodings (test)


library(glmnet)
data(iris)
str(iris$Species)
df <- iris

# Encoding 1: dummy-code Species with model.matrix()
design_matrix <- model.matrix(~ Species, data = df)
x <- as.matrix(data.frame(Sepal.Width = df$Sepal.Width, Petal.Length = df$Petal.Length,
                          Petal.Width = df$Petal.Width, design_matrix))
fit1 <- cv.glmnet(x = x, y = df$Sepal.Length)
fit1
plot(fit1)

# Encoding 2: treat Species as a single numeric column (1, 2, 3)
iris$Species_num <- as.numeric(iris$Species)
x2 <- as.matrix(iris[, c(2, 3, 4, 6)])  # column 6 is Species_num; column 5 is the factor itself
fit2 <- cv.glmnet(x = x2, y = iris$Sepal.Length)
fit2
plot(fit2)
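The difference between the two encodings is easier to see by looking at the columns they produce, without fitting anything. A small base R sketch (the names dummies and species_num are introduced here for illustration):

```r
data(iris)

# Dummy (treatment) coding: an intercept plus one 0/1 column per non-reference level
dummies <- model.matrix(~ Species, data = iris)
print(colnames(dummies))
print(unique(dummies))

# Integer coding: one column with the level codes 1, 2, 3
species_num <- as.numeric(iris$Species)
print(table(iris$Species, species_num))
```

With dummy coding, LASSO can keep or drop each level contrast separately. With integer coding, the model is forced to assume that the three species are equally spaced on a single linear scale, which is rarely meaningful for an unordered factor.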

Esophageal cancer example

# -----01-Lasso----
library(glmnet)
df <- read.csv("tab.csv")
names(df)

# Split the rows into training (70%) and test (30%) sets
set.seed(123)
train_index <- caret::createDataPartition(1:nrow(df), p = 0.7, list = TRUE)[["Resample1"]]
test_index <- setdiff(1:nrow(df), train_index)

# Convert the categorical columns to factors and dummy-code them
df[, 4:15] <- lapply(df[, 4:15], as.factor)
paste(names(df[, 4:15]), collapse = "+")  # builds the right-hand side of the formula below
design_matrix <- model.matrix(~ Smoking_status+Alcohol_consumption+Tea_consumption+Sex+Ethnic.group+Residence+Education+Marital.status+History_of_diabetes+Family_history_of_cancer+Occupation+Physical_Activity, data = df)

# Standardize the continuous columns and check one of them
df[, 16:48] <- scale(df[, 16:48])
summary(df$AAvsEPA); sd(df$AAvsEPA)
x <- as.matrix(data.frame(df[, 16:48], design_matrix))

# First search for lambda with 5-fold cross-validation on the training set
fit1 <- cv.glmnet(x = x[train_index, ], y = df[train_index, ]$Group,
                  alpha = 1, nfolds = 5, type.measure = "mse", family = "binomial")
plot(fit1)
fit1
mean(fit1$cvm)

# Variables retained at lambda.1se
best_lambda <- fit1$lambda.1se
lasso_coefs <- coef(fit1, s = best_lambda)
selected_vars <- rownames(lasso_coefs)[lasso_coefs[, 1] != 0]
print("Selected variables in test prediction:")
print(selected_vars)

# Test-set MSE on predicted probabilities (assumes Group is coded 0/1)
lasso_pred <- predict(fit1, s = best_lambda, newx = x[test_index, ], type = "response")
mse <- mean((lasso_pred - df[test_index, ]$Group)^2)
cat("Test MSE:", mse, "\n")

# Regularization path on all rows; family = "cox" needs a survival response
# (time and status), so the binary Group outcome is fitted as binomial
fit <- glmnet(x, df$Group, family = "binomial", maxit = 1000)
plot(fit)

# Re-run glmnet on the training set (using the same lambda values)
final_model <- glmnet(x[train_index, ], df[train_index, ]$Group,
                      lambda = fit1$lambda, alpha = 1, family = "binomial")
plot(final_model, label = TRUE)
plot(final_model, xvar = "lambda", label = TRUE)
plot(final_model, xvar = "dev", label = TRUE)
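One detail worth checking in a binomial LASSO is the scale of the predictions. For family = "binomial", predict() returns the linear predictor (log-odds) unless type = "response" is requested, so an error measure computed against a 0/1 outcome should use the probabilities. Since tab.csv is not available here, the sketch below uses simulated data; every name in it (x_sim, y_sim, cv_sim, and so on) and the simulation settings are assumptions for illustration only.

```r
library(glmnet)

set.seed(1)  # arbitrary seed
n <- 200
p <- 10
x_sim <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))

# Simulated binary outcome driven only by v1 and v2
y_sim <- rbinom(n, 1, plogis(1.5 * x_sim[, 1] - 1.5 * x_sim[, 2]))

cv_sim <- cv.glmnet(x_sim, y_sim, family = "binomial", alpha = 1, nfolds = 5)

# Default scale is the linear predictor (log-odds); "response" gives probabilities
link_pred <- predict(cv_sim, newx = x_sim, s = "lambda.1se", type = "link")
prob_pred <- predict(cv_sim, newx = x_sim, s = "lambda.1se", type = "response")

print(range(link_pred))  # unbounded log-odds
print(range(prob_pred))  # probabilities between 0 and 1

# Mean squared error of the probabilities against the 0/1 outcome (Brier score)
brier <- mean((prob_pred - y_sim)^2)
cat("In-sample Brier score:", brier, "\n")
```

The score here is computed on the same rows used for fitting, purely to keep the sketch short; on real data it should be computed on the held-out test rows.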

Feature selection
We found 44 potential features, including demographics and clinical and laboratory variables (Table 1). We performed feature selection using the least absolute shrinkage and selection operator (LASSO), which is among the most widely used feature selection techniques. LASSO constructs a penalty function that compresses some of the regression coefficients, i.e., it forces the sum of the absolute values of the coefficients to be less than some fixed value while setting some regression coefficients at zero, thus obtaining a more refined model. LASSO retains the advantage of subset shrinkage as a biased estimator that deals with data with complex covariance. This algorithm uses LassoCV, a fivefold cross-validation approach, to automatically eliminate factors with zero coefficients (Python version: sklearn 0.22.1).
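Why the L1 penalty sets some coefficients to exactly zero can be seen in the simplest special case. When the predictors are orthonormal, the LASSO estimate is the soft-thresholded least-squares estimate: each coefficient is pulled toward zero by a fixed amount and becomes exactly zero if it is smaller than that amount. The sketch below illustrates only this special case; the function name soft_threshold, the input values, and the threshold are hypothetical.

```r
# Soft-thresholding operator: shrink toward zero by gamma, clip at zero
soft_threshold <- function(z, gamma) {
  sign(z) * pmax(abs(z) - gamma, 0)
}

# Hypothetical least-squares coefficients under an orthonormal design
ols_coefs <- c(-3, -0.5, 0.2, 2)

# Coefficients smaller than the threshold in absolute value are set to zero
lasso_coefs <- soft_threshold(ols_coefs, gamma = 1)
print(lasso_coefs)
```

The two small coefficients are removed entirely, while the two large ones are kept but shrunk by the threshold, which is the "shrinkage and selection" behavior the passage describes.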

2.2.2. Feature Selection.
Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. The LASSO regression model improves prediction performance by adjusting the hyperparameter λ to shrink the regression coefficients toward zero and selecting the feature set that performs best in DN prediction. The best λ was chosen as the value with the minimum mean error under 10-fold cross-validation.

Detailed steps were as follows: (1) Screening characteristic factors: First, R software (glmnet 4.1.2) was used to conduct least absolute shrinkage and selection operator (LASSO) regression analysis, adjusting the variable screening and model complexity. Then, the LASSO regression results were used to conduct multifactor logistic regression analysis with SPSS, and finally, the characteristic factors with p < 0.05 were obtained. (2) Data division: Python (0.22.1) random number method was used to randomly divide the gout patients into a training set and a test set at a ratio of 7:3, with 491 in the training set and 211 in the test set. (3) Classified multi-model comprehensive analysis: eXtreme Gradient Boosting (XGBoost)
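The 7:3 division in step (2) can be sketched in base R. The excerpt reports that Python was used for this step, so the code below is an R stand-in, not the authors' procedure. The total of 702 patients is inferred from the reported 491 + 211; the seed and the variable names are arbitrary.

```r
set.seed(2024)  # arbitrary seed for a reproducible split

n_patients <- 702  # 491 + 211, as reported in the excerpt

# Draw 70% of the row indices at random for training; the rest form the test set
train_idx <- sample(seq_len(n_patients), size = floor(0.7 * n_patients))
test_idx <- setdiff(seq_len(n_patients), train_idx)

cat("Training set:", length(train_idx), "\n")
cat("Test set:", length(test_idx), "\n")
```

Rounding 70% of 702 down gives 491 training rows and leaves 211 test rows, which matches the sizes stated in the excerpt.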
