Python Web Scraping in Practice: Getting Past Anti-Scraping Mechanisms

2025/12/2 5:30:57 Source: https://blog.csdn.net/2509_91142605/article/details/146937217

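Most developers know that scraping web pages often means running into a site's anti-scraping mechanisms. On a recent e-commerce data scraping project, this "roadblock" gave me a really hard time. Here is how I got past it.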

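The task was to scrape detailed product information from an e-commerce platform. I started with a simple Python scraper that used the requests library to fetch the page and the BeautifulSoup library to parse it. The code looked like this: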

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each product card is a <div class="product-item">
products = soup.find_all('div', class_='product-item')
for product in products:
    name = product.find('a', class_='product-name').text
    price = product.find('span', class_='product-price').text
    print(f'Product name: {name}, price: {price}')

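I ran the code full of hope, and all I got back was a CAPTCHA page with no product information at all. Clearly the site had detected that the request came from a scraper and triggered its anti-scraping mechanism.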
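Before trying countermeasures, it helps to detect the block page in code rather than by eye. The following is a minimal sketch, not the check the site itself uses: the function name `looks_blocked`, the status codes, and the keyword list are all assumptions for illustration, and `product-item` is the class name from the script above.

```python
# Hypothetical heuristics; a real site may signal a block differently.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_KEYWORDS = ("captcha", "verification code")


def looks_blocked(status_code, html, expected_marker="product-item"):
    """Guess whether a response is a block/CAPTCHA page instead of real content."""
    # Status codes commonly used for "forbidden" and "too many requests".
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Keywords that often appear on challenge pages (assumed list).
    lowered = html.lower()
    if any(keyword in lowered for keyword in BLOCK_KEYWORDS):
        return True
    # If the markup we expect to parse is missing, treat the page as unusable.
    return expected_marker not in html
```

With the script above, this could be called as `looks_blocked(response.status_code, response.text)` right after the request, so the scraper stops early instead of silently parsing an empty result.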

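My first idea was to disguise the request headers so the request looked like it came from a browser. I passed a headers argument to requests.get():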

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

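I thought that would settle it, but the request was still blocked. The site's anti-scraping mechanism was evidently more sophisticated than I had assumed.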
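When a header tweak does not help, one thing worth ruling out is that the headers are not being sent as intended (a misspelled header name such as `User - Agent` is silently sent as a different, meaningless header). The sketch below uses a `requests.Session` so the same headers go out on every request, and previews them without touching the network. The helper names `make_session` and `preview_headers` and the `Accept` / `Accept-Language` values are assumptions added for illustration; nothing here is guaranteed to get past any particular site's checks.

```python
import requests

# The User-Agent is the one from the article; the other two values are assumed examples.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_session(headers=BROWSER_HEADERS):
    """Create a session that sends the same browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def preview_headers(session, url):
    """Return the headers a GET to `url` would carry, without sending anything."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return dict(prepared.headers)
```

Calling `preview_headers(make_session(), url)` shows exactly what would be sent; the actual request is then `session.get(url)`.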

After some digging, I found that the site also checks request frequency. So I added time.sleep() to the code to space out the requests:

import time

for _ in range(10):
    response = requests.get(url, headers=headers)
    # Process the response data here
    time.sleep(3)  # wait 3 seconds before the next request

This let me retrieve part of the data, but the scraping was far too slow to be practical.
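The fixed pause above can be wrapped in a small helper. The sketch below is one possible variation, not what the original script did: it adds a random jitter on top of the base delay so requests are not perfectly periodic, and it skips the pause after the last URL. The names `jittered_delay` and `crawl_politely`, the default values, and the injectable `fetch` and `sleep` parameters are all assumptions made so the logic can be exercised without real network calls.

```python
import random
import time


def jittered_delay(base=3.0, jitter=1.5):
    """Return a delay in seconds between `base` and `base + jitter`."""
    return base + random.uniform(0, jitter)


def crawl_politely(urls, fetch, base=3.0, jitter=1.5, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        results.append(fetch(url))
        # No need to wait after the final request.
        if index < len(urls) - 1:
            sleep(jittered_delay(base, jitter))
    return results
```

In the scraper, `fetch` could be something like `lambda u: requests.get(u, headers=headers)`. This does not make the crawl faster; it only avoids one wasted pause and a fixed rhythm.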

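Later I discovered that the site loads its data dynamically with JavaScript. The requests library cannot execute JavaScript, so the page content it fetched was incomplete. I brought in the Selenium library, which drives a real browser that does execute JavaScript. After installing selenium and ChromeDriver, I changed the code to this: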

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source  # page source as rendered by the browser
soup = BeautifulSoup(html, 'html.parser')

products = soup.find_all('div', class_='product-item')
for product in products:
    name = product.find('a', class_='product-name').text
    price = product.find('span', class_='product-price').text
    print(f'Product name: {name}, price: {price}')
driver.quit()

This time it worked: the scraper got past the anti-scraping mechanism and retrieved the product data.
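One detail to watch in the Selenium version: driver.page_source is read right after driver.get(), and content loaded by JavaScript may not be in the DOM yet at that moment. A hedged sketch of a more defensive variant follows. It waits explicitly for the first product element and skips product cards that are missing a name or price instead of raising an error. The function names `fetch_rendered_html` and `parse_products` and the 10-second timeout are assumptions; the CSS classes are the ones used in the article. Selenium is imported inside the function so the parsing helper can be used without a browser installed.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load `url` in Chrome and return the HTML once a product card is present."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Raises TimeoutException if no product card appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.product-item"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def parse_products(html):
    """Extract (name, price) pairs, skipping cards with a missing field."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for product in soup.find_all("div", class_="product-item"):
        name = product.find("a", class_="product-name")
        price = product.find("span", class_="product-price")
        if name is None or price is None:
            continue
        items.append((name.get_text(strip=True), price.get_text(strip=True)))
    return items
```

Usage would then be `parse_products(fetch_rendered_html(url))`, which keeps browser handling and parsing separate.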

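This experience drove home that scrapers and anti-scraping measures are locked in a kind of silent "war". When you hit a problem, don't panic: read up on it and keep trying new approaches, and you will usually find a way through.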
