Multiple spiders
When a project contains several spiders, each one can use its own pipeline. Do not register the pipelines globally in settings.py; instead, define a separate pipeline class for each spider in pipelines.py, then point each spider at its pipeline in the corresponding spider file:
import scrapy


class PythonSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/nav/python']
    # per-spider override: items from this spider go only to csdnspider
    custom_settings = {
        'ITEM_PIPELINES': {'csdnspider.pipelines.csdnspider': 301},
    }


class JavaSpider(scrapy.Spider):
    name = 'java'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/nav/java']
    # per-spider override: items from this spider go only to javaspider
    custom_settings = {
        'ITEM_PIPELINES': {'csdnspider.pipelines.javaspider': 300},
    }
The key point is the custom_settings attribute, which overrides the project-wide settings for that spider only.
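For completeness, a minimal pipelines.py sketch is shown below. The class names csdnspider and javaspider simply match the custom_settings entries above; the process_item bodies are placeholders for whatever processing your project actually does:

# pipelines.py (sketch; replace the print calls with real processing)
class csdnspider:
    def process_item(self, item, spider):
        print('python item:', item)
        return item


class javaspider:
    def process_item(self, item, spider):
        print('java item:', item)
        return item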
Running multiple spiders
In the project package, create a commands folder (at the same level as the spiders folder) and add a crawlall.py file to it:
from scrapy.commands import BaseRunSpiderCommand


class Command(BaseRunSpiderCommand):
    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run all of the spiders"

    def run(self, args, opts):
        # get the list of all spiders registered in the project
        spd_loader_list = self.crawler_process.spider_loader.list()
        # schedule every spider, then start the crawl once
        for spname in spd_loader_list:
            print("Running spider: " + spname)
            self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()
Then add the following line to settings.py (replace projectname with your project's package name):
COMMANDS_MODULE = 'projectname.commands'
Open a terminal, cd into the project directory, and run scrapy -h. If crawlall shows up in the list of commands, the setup worked. If it does not, try adding an __init__.py file to the commands directory.
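Assuming the project package is called csdnspider (matching the pipeline paths above), the resulting layout and the single command that runs every spider would look roughly like this; a custom command is named after its module file, so crawlall.py registers a crawlall command:

csdnspider/
    scrapy.cfg
    csdnspider/
        __init__.py
        settings.py          # contains COMMANDS_MODULE = 'csdnspider.commands'
        pipelines.py
        commands/
            __init__.py
            crawlall.py
        spiders/
            __init__.py
            ...

# run all spiders with one command
scrapy crawlall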