Scrapy Crawler Guide: A Complete Tutorial

scrapy note

command

Global commands:

startproject: creates a Scrapy project named project_name inside the project_name directory.
```
scrapy startproject myproject
```
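settings: prints the value of a Scrapy setting. Inside a project it shows the project's value; otherwise it shows the Scrapy default.
runspider: runs a spider written in a single Python file, without having to create a project.
shell: starts the Scrapy shell for the given URL, or an empty shell if no URL is given.
fetch: downloads the given URL with the Scrapy downloader and writes the content to standard output.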
```
scrapy fetch --nolog --headers http://www.example.com/
```
view: opens the given URL in a browser, showing it as your Scrapy spider would "see" it.
```
scrapy view http://www.example.com/some/page.html
```
version: prints the Scrapy version.

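Project-only commands:

crawl: starts crawling with a spider.
scrapy crawl myspider
check: runs contract checks.
scrapy check -l
list: lists all available spiders in the current project, one spider per line.
edit: opens the given spider in the editor defined by the EDITOR setting.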

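parse: fetches the given URL and parses it with the spider that handles it. If you pass the --callback option, that spider method is used; otherwise parse is used.

--spider=SPIDER: bypass spider autodetection and force the use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through the pipelines
--rules or -r: use CrawlSpider rules to discover the callback used for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level up to which links are followed (default: 1)
--verbose or -v: display information for each depth level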
scrapy parse http://www.example.com/ -c parse_item

genspider: creates a new spider in the current project.

scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic example example.com

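deploy: deploys the project to a Scrapyd server (newer Scrapy releases dropped this command in favor of scrapyd-deploy from the scrapyd-client package).
bench: runs a quick benchmark test.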

Using selectors

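from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Build a selector from a text string
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()

# Build a selector from a response; encoding is required because body is a str
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()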

Scrapy provides two convenient shortcuts: response.xpath() and response.css()

>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()
>>> response.xpath('//a[contains(@href,"image")]/@href').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
>>> response.xpath('//a[contains(@href,"image")]/img/@src').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()

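Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: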

links = response.xpath('//a[contains(@href,"image")]')
for index, link in enumerate(links):
    # Each link is itself a selector, so it can be queried again
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)

Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of strings, so you cannot build nested .re() calls.

>>> response.xpath('//a[contains(@href,"image")]/text()').re(r'Name:\s*(.*)')

Using relative XPaths

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside divs
...     print(p.extract())
>>> for p in divs.xpath('p'):  # extracts only the <p> that are direct children of divs
...     print(p.extract())

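The test() function, called as re:test() through the EXSLT regular expressions namespace, can be very useful when XPath's starts-with() or contains() are not sufficient. For example: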

>>> sel.xpath('//li//@href').extract()
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()

XPATH TIPS

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
Beware of the difference between //node[1] and (//node)[1].
When selecting by class, be as specific as necessary.
When querying by class, consider using CSS.
Learn to use all the different axes
Useful trick to get text content

Item Loaders

populate items

from scrapy.loader import ItemLoader
from myproject.items import Product  # assumes Product is defined in the project's items module

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Item Pipeline

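Typical uses of item pipelines:
cleaning HTML data
validating scraped data (checking that items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped items in a database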

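Writing your own item pipeline

Every item pipeline component must implement process_item(self, item, spider), which Scrapy calls for each item. The method must either return an Item (or any subclass) object or raise a DropItem exception. Dropped items are not processed by later pipeline components.
Parameters:

item (Item object): the scraped item
spider (Spider object): the spider that scraped the item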

Write items to MongoDB

import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        collection_name = item.__class__.__name__
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.db[collection_name].insert_one(dict(item))
        return item

To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

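The integer value assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.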
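
The ordering rule can be mimicked in a few lines of plain Python. This is only an illustration of the rule, not Scrapy's internal code; the setting is deliberately written out of order to show that the numbers, not the position in the dict, decide the sequence.

```python
# Same pipelines as above, listed out of order on purpose.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Sort by the integer value: lower numbers run first.
run_order = [
    path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```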

Practical tips

Running multiple spiders in the same process

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished