Middleware sits between the engine and the downloader, and between the engine and the spider. It is the main extension point for handling Scrapy's two key objects, Request and Response, as well as the data object Item.
It is therefore easy to see why middleware falls into two categories: the kind between the engine and the downloader is called downloader middleware, and the kind between the engine and the spider is called spider middleware. Both kinds process Requests and Responses, but because they sit in different positions their main responsibilities differ.
Spider middleware (mainly filtering): its main role with respect to Request and Response is filtering. It can drop requests for particular URL paths, filter out particular page responses, and discard items that lack certain required fields (although pipelines can also filter items).
Downloader middleware (mainly processing): its main role is transformation, such as adding a proxy, User-Agent, or cookies to a Request, and pre-processing a Response by decoding, decompressing, or reformatting it.
Below we look at both kinds of middleware in detail.
Spider middleware
When a Scrapy project is created, a spider middleware template and a downloader middleware template are generated automatically in middlewares.py. Here is the spider middleware template:
class ProxyExampleSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
process_spider_input(response, spider)
Called for each Response just before it is passed to the spider's parse callback; it should return None or raise an exception.
process_spider_output(response, result, spider)
Called after the parse callback has processed the Response; it receives the Response and the iterable result produced by the callback (result may yield Request or Item objects). This is where item filtering typically happens (see the sketch after these method descriptions).
process_spider_exception(response, exception, spider)
Called when a spider middleware or the spider's parse callback raises an exception. It should return None or an iterable of Request, dict or Item objects; if it returns None, the exception keeps being handled by the exception handlers of the other spider middlewares.
from_crawler(cls, crawler)
This class method reads parameters from the crawler settings to configure the middleware.
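To make these hooks concrete, here is a minimal, hypothetical sketch of a spider middleware in the filtering role described above; the class name and the 'title' field are assumptions, not part of the Scrapy API:

class FilterSpiderMiddleware(object):
    # Hypothetical sketch: drops items missing a 'title' field.

    def process_spider_input(self, response, spider):
        # Returning None lets the response continue on to the spider callback.
        return None

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            # Drop dict items that lack a 'title' field; pass everything else through.
            if isinstance(item_or_request, dict) and not item_or_request.get('title'):
                spider.logger.debug('Dropping item without title from %s' % response.url)
                continue
            yield item_or_request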
Let's look at the source of a built-in spider middleware:
scrapy.spidermiddlewares.httperror.HttpErrorMiddleware
It filters out all failed (error) HTTP responses, so spiders do not have to deal with those requests.
import logging

from scrapy.exceptions import IgnoreRequest

logger = logging.getLogger(__name__)


class HttpError(IgnoreRequest):
    """A non-200 response was filtered"""

    def __init__(self, response, *args, **kwargs):
        self.response = response
        super(HttpError, self).__init__(*args, **kwargs)


class HttpErrorMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.handle_httpstatus_all = settings.getbool('HTTPERROR_ALLOW_ALL')
        self.handle_httpstatus_list = settings.getlist('HTTPERROR_ALLOWED_CODES')

    def process_spider_input(self, response, spider):
        if 200 <= response.status < 300:  # common case
            return
        meta = response.meta
        if 'handle_httpstatus_all' in meta:
            return
        if 'handle_httpstatus_list' in meta:
            allowed_statuses = meta['handle_httpstatus_list']
        elif self.handle_httpstatus_all:
            return
        else:
            allowed_statuses = getattr(spider, 'handle_httpstatus_list', self.handle_httpstatus_list)
        if response.status in allowed_statuses:
            return
        raise HttpError(response, 'Ignoring non-200 response')

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, HttpError):
            spider.crawler.stats.inc_value('httperror/response_ignored_count')
            spider.crawler.stats.inc_value(
                'httperror/response_ignored_status_count/%s' % response.status
            )
            logger.info(
                "Ignoring response %(response)r: HTTP status code is not handled or not allowed",
                {'response': response}, extra={'spider': spider},
            )
            return []
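As the source shows, a spider can opt in to receiving specific non-200 responses through handle_httpstatus_list, either as a spider attribute or per request via meta. A small illustrative sketch (the spider name and URL are placeholders):

import scrapy

class StatusSpider(scrapy.Spider):
    # Hypothetical spider: name and URL are placeholders.
    name = 'status_demo'
    handle_httpstatus_list = [404]  # spider-level whitelist read by HttpErrorMiddleware
    start_urls = ['http://example.com/missing-page']

    def parse(self, response):
        # The same effect is possible per request:
        # scrapy.Request(url, meta={'handle_httpstatus_list': [404]})
        self.logger.info('Got status %s for %s', response.status, response.url)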
To enable a custom spider middleware, add a dict named SPIDER_MIDDLEWARES to settings.py. The keys are middleware paths and the values are priorities: the smaller the number, the closer the middleware is to the engine, so its process_spider_input() runs earlier; the larger the number, the closer it is to the spider, so its process_spider_output() runs earlier.
Built-in spider middlewares:
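For example, the hypothetical FilterSpiderMiddleware sketched earlier could be enabled like this (the module path myproject.middlewares is a placeholder):

# settings.py -- project path is a placeholder
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.FilterSpiderMiddleware': 543,
}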
SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
Downloader middleware
Downloader middleware is used far more often than spider middleware; it is the main battleground for anti-scraping countermeasures. A downloader middleware template is likewise generated when the project is created:
class ProxyExampleDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
from_crawler(cls, crawler)
This class method is usually the entry point for accessing settings and signals.
process_request(self, request, spider)
Called before the engine sends the request to the downloader, and used to modify the Request. It must return None, a Request object, a Response object, or raise IgnoreRequest.
If it returns None, the process_request methods of the remaining downloader middlewares run in turn, until the request reaches the built-in download handler, which performs the actual download and produces a Response.
If it returns a Request object, the remaining process_request methods are skipped and the returned Request is sent back to the scheduler to queue up for a fresh pass through process_request.
If it returns a Response, things are even simpler: it effectively tells the engine "I already have the result", no further process_request methods run, and the process_response methods are invoked in turn instead.
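A minimal sketch of these return values, using a hypothetical in-memory cache, might look like this:

from scrapy.http import HtmlResponse

class LocalCacheMiddleware(object):
    # Hypothetical in-memory cache mapping url -> HTML text.
    my_cache = {}

    def process_request(self, request, spider):
        cached = self.my_cache.get(request.url)
        if cached is not None:
            # Returning a Response skips the downloader and any remaining
            # process_request calls; process_response runs next.
            return HtmlResponse(url=request.url, body=cached, encoding='utf-8', request=request)
        # Returning None lets the next middleware (and finally the downloader) handle it.
        return None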
process_response(request, response, spider)
Once a Response has been produced (by the downloader, or by a process_request that returned one), the process_response pipeline steps in to handle it.
If it returns a Response object, that Response is handed to the next middleware's process_response().
If it returns a Request object, the middleware chain stops and the returned Request is rescheduled for download.
If it raises IgnoreRequest, the request's errback (Request.errback) is called; if nothing handles the exception, it is ignored.
Finally, if none of the middlewares object, the engine hands the Response over to the spider's parse callback.
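For illustration, a hypothetical middleware that re-schedules a request when the response looks like a ban page (the marker string is an assumption) could be:

class BanRetryMiddleware(object):

    def process_response(self, request, response, spider):
        if b'Access Denied' in response.body:  # hypothetical ban-page marker
            spider.logger.warning('Ban page at %s, retrying', request.url)
            retry_req = request.copy()
            retry_req.dont_filter = True  # let the retry bypass the duplicate filter
            # Returning a Request stops the chain and re-schedules the download.
            return retry_req
        # Returning the Response passes it on to the next middleware and the engine.
        return response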
process_exception(request, exception, spider)
Called when a download handler or a process_request() method (from another downloader middleware) raises an exception, including IgnoreRequest. It usually returns None, in which case the exception keeps being processed by the remaining middlewares' process_exception methods.
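A hypothetical sketch of process_exception that retries the request through another proxy (get_random_proxy() is an assumed helper, not a Scrapy API):

from twisted.internet.error import ConnectionRefusedError, TimeoutError

class ProxyRetryMiddleware(object):

    def get_random_proxy(self):
        # Assumed helper: would normally pull an address from a proxy pool.
        return 'http://127.0.0.1:8888'

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, ConnectionRefusedError)):
            spider.logger.info('Download failed (%r), retrying with another proxy', exception)
            retry_req = request.copy()
            retry_req.meta['proxy'] = self.get_random_proxy()
            retry_req.dont_filter = True
            # Returning a Request stops the process_exception chain and re-schedules it;
            # returning None would let the remaining middlewares keep handling the exception.
            return retry_req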
To sum up: downloader middleware forms two assembly lines, one for process_request and one for process_response, with the built-in downloader at the far end of both. When a Request arrives, it is inspected and processed station by station along the process_request line. One station may decide material is missing and send it back for another pass through the line (return a Request); another may declare it already a finished product and stick a Response label on it straight away (return a Response); another may mark it as a reject (raise IgnoreRequest). If no station objects, the last, built-in station performs the download and labels the result a Response, which then enters the process_response line.
On the process_response line the handling is analogous: if a station decides the product is still a half-finished Request, it goes back to the process_request line for more work; if a station confirms it really is a Response, it is passed on to the next station, until the last built-in station confirms it and hands it to the engine.
Middleware is exactly this assembly-line idea: a custom middleware is just one more station on the line. You can give that station an order and privileges, and it can stamp the product with different labels, each of which sends it down a different line for further processing.
Source for some commonly used custom middlewares:
User-Agent middleware
from faker import Faker

class UserAgent_Middleware():

    def process_request(self, request, spider):
        f = Faker()
        agent = f.firefox()
        request.headers['User-Agent'] = agent
Proxy IP middleware
import requests

class Proxy_Middleware():

    def process_request(self, request, spider):
        try:
            xdaili_url = spider.settings.get('XDAILI_URL')
            r = requests.get(xdaili_url)
            proxy_ip_port = r.text
            request.meta['proxy'] = 'https://' + proxy_ip_port
        except requests.exceptions.RequestException:
            print('Failed to get a proxy IP from Xdaili!')
            spider.logger.error('Failed to get a proxy IP from Xdaili!')
Integrating Selenium with Scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from gp.configs import *


class ChromeDownloaderMiddleware(object):

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # headless mode, no browser window
        if CHROME_PATH:
            options.binary_location = CHROME_PATH
        if CHROME_DRIVER_PATH:
            self.driver = webdriver.Chrome(chrome_options=options, executable_path=CHROME_DRIVER_PATH)  # initialize the Chrome driver
        else:
            self.driver = webdriver.Chrome(chrome_options=options)  # initialize the Chrome driver

    def __del__(self):
        self.driver.close()

    def process_request(self, request, spider):
        try:
            print('Chrome driver begin...')
            self.driver.get(request.url)  # load the requested URL
            return HtmlResponse(url=request.url, body=self.driver.page_source, request=request,
                                encoding='utf-8', status=200)  # return the rendered HTML
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')
A cookie middleware for Scrapy
import json
import logging

import requests


class WeiBoMiddleWare(object):

    def __init__(self, cookies_pool_url):
        self.logging = logging.getLogger("WeiBoMiddleWare")
        self.cookies_pool_url = cookies_pool_url

    def get_random_cookies(self):
        try:
            response = requests.get(self.cookies_pool_url)
        except Exception as e:
            self.logging.info('Get Cookies failed: {}'.format(e))
        else:
            # In a middleware, the cookies attached to a request must be a dict;
            # a raw cookie string cannot be assigned directly.
            cookies = json.loads(response.text)
            self.logging.info('Get Cookies success: {}'.format(response.text))
            return cookies

    @classmethod
    def from_settings(cls, settings):
        obj = cls(
            cookies_pool_url=settings['WEIBO_COOKIES_URL']
        )
        return obj

    def process_request(self, request, spider):
        request.cookies = self.get_random_cookies()
        return None
The code above comes from https://blog.csdn.net/BF02jgtRS00XKtCx/article/details/82141627 and https://blog.csdn.net/qq_42336549/article/details/80991814 and is reproduced here as examples.
To enable them, configure a dict named DOWNLOADER_MIDDLEWARES in settings.py. The keys are middleware paths and the values are priorities: the smaller the number, the closer the middleware is to the engine, so its process_request() runs earlier; the larger the number, the closer it is to the downloader, so its process_response() runs earlier.
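For example, enabling the custom middlewares shown above might look like this (the module paths are placeholders):

# settings.py -- module paths are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgent_Middleware': 543,
    'myproject.middlewares.Proxy_Middleware': 544,
}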
The built-in downloader middlewares are as follows:
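For reference, the DOWNLOADER_MIDDLEWARES_BASE defaults in Scrapy 1.x look roughly like this (the exact set and numbers vary slightly between versions):

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}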
Finally, here is the source of one more built-in downloader middleware, CookiesMiddleware.
It is enabled by default; Scrapy tracks all cookies sent in requests (the Cookie request header) and received in responses (the Set-Cookie response header).
import logging
from collections import defaultdict

import six

from scrapy.exceptions import NotConfigured
from scrapy.http import Response
from scrapy.http.cookies import CookieJar
from scrapy.utils.python import to_native_str

logger = logging.getLogger(__name__)


class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']
        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)
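Since the middleware keeps a defaultdict of CookieJar objects keyed by request.meta['cookiejar'], separate cookie sessions can be maintained per request. A small, hypothetical usage sketch (spider name and URLs are placeholders):

import scrapy

class MultiSessionSpider(scrapy.Spider):
    # Hypothetical spider showing two independent cookie sessions.
    name = 'multi_session_demo'

    def start_requests(self):
        for session_id in (1, 2):
            yield scrapy.Request('http://example.com/login',
                                 meta={'cookiejar': session_id},
                                 callback=self.after_login,
                                 dont_filter=True)

    def after_login(self, response):
        # Reuse the same jar so this session's cookies follow the next request.
        yield scrapy.Request('http://example.com/profile',
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_profile)

    def parse_profile(self, response):
        self.logger.info('Profile page fetched with cookie jar %s', response.meta['cookiejar'])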