Crawling cnblogs Blog Posts with Scrapy (Saving as JSON)

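This article crawls the title, URL, and summary of each post from one author's cnblogs article list.

The original author was crawling his own blog, so I have left the target unchanged; the point here is only to learn and record another way of using Scrapy.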

1. Create the project

tenshine@tenshine:~$ scrapy startproject cnblogs
2015-09-08 10:20:43 [scrapy] INFO: Scrapy 1.0.2 started (bot: scrapybot)
2015-09-08 10:20:43 [scrapy] INFO: Optional features available: ssl, http11
2015-09-08 10:20:43 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'cnblogs' created in:
    /home/tenshine/cnblogs
You can start your first spider with:
    cd cnblogs
    scrapy genspider example example.com

Directory structure

tenshine@tenshine:~$ tree cnblogs/
cnblogs/
├── cnblogs
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── cnblogs.py
│       └── __init__.py
└── scrapy.cfg
2 directories, 7 files

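Writing items.py

Here we extract the article title, article link, and article summary.

import scrapy


class CnblogsItem(scrapy.Item):
    # Fields collected for each post on the list page
    title = scrapy.Field()  # post title
    link = scrapy.Field()   # post URL
    desc = scrapy.Field()   # post summary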

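Writing pipelines.py

import codecs
import json


class CnblogsPipeline(object):
    def __init__(self):
        # Output file: one JSON object per line, UTF-8 encoded
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on each pipeline when the spider finishes;
        # a method named spider_closed would not be called automatically.
        self.file.close()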
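
To see what the pipeline writes for each item, here is a minimal standalone sketch of the same json.dumps call. The sample dict and its values are made up for illustration; only the serialization step mirrors the pipeline.

```python
import json

# Hypothetical item contents, for illustration only.
sample_item = {
    "title": "Scrapy 入门",
    "link": "http://example.com/post/1",
    "desc": "摘要",
}


def to_json_line(item):
    """Serialize one item as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(dict(item), ensure_ascii=False) + "\n"


line = to_json_line(sample_item)
# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample_item)

print(line, end="")
print(escaped)
```

Both forms decode to the same data; ensure_ascii=False only makes the file easier to read by eye.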

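Modifying settings.py

settings.py is the configuration file and mainly holds the spider's configuration. Parameters that change often can be set in the spider class itself, while parameters that hardly change during a crawl belong in settings.py.
Uncomment ITEM_PIPELINES and change its contents to:
'cnblogs.pipelines.CnblogsPipeline': 300, where CnblogsPipeline is the class defined in pipelines.py.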
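
After that change, the relevant part of settings.py would look roughly like the fragment below. This is a sketch of just this one setting; the rest of the generated file is left as it is.

```python
# settings.py (fragment)
# Maps the pipeline's import path to its order value.
ITEM_PIPELINES = {
    'cnblogs.pipelines.CnblogsPipeline': 300,
}
```

The number 300 sets the order in which pipelines run: lower values run first, and values are conventionally chosen in the 0-1000 range.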

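2. Writing the spider

This spider crawls the title, URL, and summary of every post from the article list pages. The list spans more than one page, and this is where it differs from crawling csdn blog posts: there, the spider finds the URL of the next post at a fixed location and requests it directly. Here, Scrapy collects all the links on each page and then uses our custom rules to select which of those URLs the spider visits next.

Since the crawling approach has changed, the spider's base class changes too: it now derives from CrawlSpider (from scrapy.spiders import CrawlSpider), and the callback can no longer be parse; it has to be defined and specified explicitly.
Create cnblogs_spider.py in the spiders folder with the following code: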

from scrapy.selector import Selector
try:
    from scrapy.spiders import Spider
except ImportError:
    # Fallback for older Scrapy versions
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.spiders import CrawlSpider, Rule
# SgmlLinkExtractor is discouraged in newer Scrapy releases, where
# scrapy.linkextractors.LinkExtractor is the recommended replacement.
from scrapy.linkextractors.sgml import SgmlLinkExtractor as sle
from cnblogs.items import CnblogsItem
class CnblogsSpider(CrawlSpider):
    # Name of the spider
    name = "cnblogs"
    # Domains allowed to be crawled; links to other domains are dropped
    allowed_domains = ["cnblogs.com"]
    # Entry URLs for the crawl
    start_urls = ["http://www.cnblogs.com/rwxwsblog/default.html?page=1"]
    # Rules for which URLs to follow, with parse_item as the callback
    rules = [
        # Note the escaped "?": when copying the URL, the ? must be escaped for the regex.
        Rule(sle(allow=(r"/rwxwsblog/default.html\?page=\d{1,}",)),
             follow=True,
             callback='parse_item')
    ]
    # Callback that extracts the items from one list page
    def parse_item(self, response):
        items = []
        sel = Selector(response)
        #base_url = get_base_url(response)  # the URL this page came from
        #print base_url
        # "div.day div.postTitle": all divs with class postTitle nested
        # (the space means descendant) inside a div with class day
        postTitle = sel.css('div.day div.postTitle')
        postCon = sel.css('div.postCon div.c_b_p_desc')
        for index in range(len(postTitle)):
            item = CnblogsItem()
            item['title'] = postTitle[index].css("a").xpath('text()').extract()[0]
            item['link'] = postTitle[index].css('a').xpath('@href').extract()[0]
            item['desc'] = postCon[index].xpath('text()').extract()[0]
            items.append(item)
        return items

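First, how this spider runs. The first page of the author's article list serves as the spider's start_urls. If we did not say which URLs the spider should enter next, it would by default enter every URL it collects, which is useless to us because we only want it to visit the author's article list pages. The list pages follow a regular pattern, http://www.cnblogs.com/rwxwsblog/default.html?page=N, where N is the number of the list page. So we define a rule with the regular expression /rwxwsblog/default.html\?page=\d{1,}, in which \d{1,} matches one or more digits [0-9]. The spider uses this rule to decide which URLs to enter.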
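
The effect of the pattern can be checked on its own with Python's re module. This is a small sketch: the candidate URLs are invented for illustration, and using re.search here is my assumption about how the pattern is applied to each link.

```python
import re

# Same pattern as in the rule, written as a raw string.
PAGE_PATTERN = re.compile(r"/rwxwsblog/default.html\?page=\d{1,}")

# Hypothetical URLs, for illustration only.
candidates = [
    "http://www.cnblogs.com/rwxwsblog/default.html?page=2",
    "http://www.cnblogs.com/rwxwsblog/default.html?page=13",
    "http://www.cnblogs.com/rwxwsblog/p/4567052.html",
    "http://www.cnblogs.com/rwxwsblog/default.html?page=",
]

# Keep only the URLs the pattern matches somewhere in the string.
followed = [url for url in candidates if PAGE_PATTERN.search(url)]

for url in followed:
    print(url)
```

Only the two list-page URLs with a page number pass; the article URL and the URL with an empty page value are skipped.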

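With the above, the spider can reach every one of the author's article list pages. The next step is to extract the information we need from each list page. Here XPath and CSS selectors are used together; if you are unfamiliar with them, look up XPath and CSS selectors.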

3. Running

Enter the cnblogs project directory and run the following, where cnblogs is the name defined in the spider:

scrapy crawl cnblogs

The results of the run are saved in the JSON file (cnblogs.json, as set in the pipeline).
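
Because the pipeline writes one JSON object per line, the file is read back line by line rather than with a single json.load call. The sketch below assumes that layout; the file it reads is a temporary stand-in with made-up contents, not real crawl output.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a JSON-lines file: one JSON object per non-empty line."""
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Stand-in for cnblogs.json, filled with invented sample records.
samples = [
    {"title": "示例标题一", "link": "http://example.com/1", "desc": "示例摘要一"},
    {"title": "示例标题二", "link": "http://example.com/2", "desc": "示例摘要二"},
]
path = os.path.join(tempfile.mkdtemp(), "cnblogs.json")
with open(path, "w", encoding="utf-8") as f:
    for record in samples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

items = load_items(path)
for item in items:
    print(item["title"], item["link"])
```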

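4. Analyzing CrawlSpider

Concept and purpose
CrawlSpider is a subclass of Spider. First, a word on Spider: it is the base class of all spiders and is designed to crawl only the pages in the start_urls list. For extracting links from the crawled pages and continuing to crawl them, the CrawlSpider class is the better fit.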

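Usage
Its biggest difference from the Spider class is the additional rules attribute, which defines how links are extracted. rules contains one or more Rule objects. The Rule constructor is:

def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):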

where:

link_extractor is a LinkExtractor that defines which links to extract.
callback: the function named here is used as the callback for each link that link_extractor extracts.

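Note on the callback parameter:
When writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will fail to run.
follow: specifies whether links extracted from a response by this rule should be followed. When callback is None, follow defaults to True; otherwise it defaults to False.
process_links: mainly used to filter the links obtained by link_extractor.
process_request: mainly used to filter the requests extracted by this rule.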
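
As an illustration of process_links, the hook receives the list of links the extractor found and returns the list that should actually be followed. The sketch below runs without Scrapy: FakeLink is a hypothetical stand-in that models only the url attribute of a link object, and drop_first_page is an invented example filter, not part of the original project.

```python
from collections import namedtuple

# Hypothetical stand-in for a link object; only the `url` attribute is modeled.
FakeLink = namedtuple("FakeLink", ["url"])


def drop_first_page(links):
    """Example process_links hook: skip page=1, which is already the start URL."""
    return [link for link in links if not link.url.endswith("page=1")]


links = [
    FakeLink("http://www.cnblogs.com/rwxwsblog/default.html?page=1"),
    FakeLink("http://www.cnblogs.com/rwxwsblog/default.html?page=2"),
    FakeLink("http://www.cnblogs.com/rwxwsblog/default.html?page=11"),
]

kept = drop_first_page(links)
for link in kept:
    print(link.url)
```

In a real spider such a function would be handed to the rule through the process_links argument shown in the constructor.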

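LinkExtractor
As the name suggests, a link extractor. Its job is to extract links from the response object; those links are then crawled.
The desired links can be extracted with SgmlLinkExtractor:
SgmlLinkExtractor(allow=(),deny=(),allow_domains=(),deny_domains=(),deny_extensions=None,restrict_xpaths=(),tags=('a','area'),attrs=('href'),canonicalize=True,unique=True,process_value=None)
- allow: URLs matching the regular expression (or list of regular expressions) given here are extracted; if it is empty, every URL matches.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
- allow_domains: the domains whose links will be extracted.
- deny_domains: the domains whose links will never be extracted.
- restrict_xpaths: XPath expressions that work together with allow to filter the links.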
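
How allow and deny interact can be pictured with a simplified model. This is a sketch of the semantics described in the list, not Scrapy's actual implementation: should_extract is a hypothetical helper, it ignores the domain, extension, and XPath options, and the URLs are invented.

```python
import re


def should_extract(url, allow=(), deny=()):
    """Simplified model: deny wins, and an empty allow accepts every URL."""
    if any(re.search(pattern, url) for pattern in deny):
        return False
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


page_pattern = r"/rwxwsblog/default.html\?page=\d{1,}"
list_page = "http://www.cnblogs.com/rwxwsblog/default.html?page=2"
article = "http://www.cnblogs.com/rwxwsblog/p/4567052.html"

print(should_extract(list_page, allow=[page_pattern]))
print(should_extract(article, allow=[page_pattern]))
print(should_extract(article))
print(should_extract(list_page, allow=[page_pattern], deny=[r"page=2$"]))
```

The last call shows a URL that matches allow but is still rejected because it also matches deny.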

GitHub repository for this project: https://github.com/lowkeynic4/crawl/tree/master/cnblogs

Reposted from http://www.cnblogs.com/rwxwsblog/p/4567052.html, with my own added notes and changes.