Custom Configuration in the Scrapy Framework


Multiple spiders

To create multiple spiders and give each one its own pipeline, do not register the pipelines globally in settings.py. Instead, first define a separate pipeline class for each spider in pipelines.py, then modify the corresponding spider file.
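A minimal sketch of what pipelines.py could look like, assuming the project is named csdnspider and keeping the lowercase class names csdnspider and javaspider that the ITEM_PIPELINES paths below refer to (process_item is the standard Scrapy pipeline hook):

# pipelines.py -- one pipeline class per spider (illustrative sketch)
class csdnspider:
    def process_item(self, item, spider):
        # handle items from the Python spider here (e.g. clean or store them)
        return item

class javaspider:
    def process_item(self, item, spider):
        # handle items from the Java spider here
        return item

With the pipelines defined, point each spider at its own pipeline through the custom_settings attribute in the spider file: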

import scrapy

class PythonSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/nav/python']
    custom_settings = {
        # use only this spider's pipeline for its items
        'ITEM_PIPELINES': {'csdnspider.pipelines.csdnspider': 301},
    }

class JavaSpider(scrapy.Spider):
    name = 'java'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/nav/java']
    custom_settings = {
        # use only this spider's pipeline for its items
        'ITEM_PIPELINES': {'csdnspider.pipelines.javaspider': 300},
    }

The key point is the custom_settings attribute: settings defined there apply only to that spider and override the project-wide values from settings.py. Each spider can still be run individually with scrapy crawl csdn or scrapy crawl java.
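custom_settings is not limited to ITEM_PIPELINES; any Scrapy setting can be overridden per spider the same way. As an illustrative sketch (the FEEDS entry below is an assumption, not part of the original project, and requires Scrapy 2.1 or later), each spider could also export its items to its own file:

import scrapy

class JavaSpider(scrapy.Spider):
    name = 'java'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/nav/java']
    custom_settings = {
        'ITEM_PIPELINES': {'csdnspider.pipelines.javaspider': 300},
        # hypothetical per-spider export target
        'FEEDS': {'java_items.json': {'format': 'json'}},
    }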

Running multiple spiders

To run all spiders with a single command, create a commands folder inside the project package (at the same level as the spiders folder) and add a crawlall.py file to it:

from scrapy.commands import BaseRunSpiderCommand

class Command(BaseRunSpiderCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run all of the spiders"

    def run(self, args, opts):
        # get the names of all spiders registered in the project
        spd_loader_list = self.crawler_process.spider_loader.list()
        # schedule every spider, then start the reactor once
        for spname in spd_loader_list:
            print("Running spider: " + spname)
            self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

Then register the custom command module by adding the following line to settings.py:

COMMANDS_MODULE = 'projectname.commands'

Open a terminal, cd into the project directory, and run scrapy -h. If crawlall appears in the list of available commands, the setup works; running scrapy crawlall then starts every spider in the project. If the command does not show up, try adding an __init__.py file to the commands directory so that it is recognized as a Python package.
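For reference, assuming the project (and its inner package) is named projectname, the resulting layout looks roughly like this:

projectname/
    scrapy.cfg
    projectname/
        __init__.py
        settings.py          # contains COMMANDS_MODULE = 'projectname.commands'
        pipelines.py
        commands/
            __init__.py
            crawlall.py
        spiders/
            __init__.py
            ...              # the spider files shown above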