Scraping cnblogs blog posts with Scrapy (saving to MySQL)

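This post follows on from the previous one, "Scraping cnblogs blog posts with Scrapy (saving to JSON)".

Most of the project is the same as in that post, so I will go straight to what differs: the part that processes the data.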

1. Add a new field to the item

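linkmd5id = scrapy.Field()
This field holds the MD5 hash of each article's URL and serves as the unique key. If no row with that key exists in the MySQL database yet, the whole record is inserted; if one already exists, the record is updated instead, since the title or summary may have changed in the meantime.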
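
As a minimal sketch of how such a key can be derived, the snippet below hashes a URL with hashlib. The helper name link_md5_id and the example URL are hypothetical and used only for illustration; the sketch assumes the URL is a text string that can be encoded as UTF-8.

```python
from hashlib import md5


def link_md5_id(url):
    """Return the 32-character hex MD5 digest of a URL (hypothetical helper)."""
    # hashlib works on bytes, so the text URL is encoded first.
    return md5(url.encode("utf-8")).hexdigest()


# Placeholder URL, used only for illustration.
example_url = "https://www.cnblogs.com/example/p/1.html"
example_id = link_md5_id(example_url)
```

The same URL always produces the same 32-character digest, so the value can serve as a fixed-width key for deduplication.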

2. Modify pipelines.py

Because the data now goes into a database, it can no longer be handled the way the JSON output was. The code is as follows:

from twisted.enterprise import adbapi
from datetime import datetime
from hashlib import md5
import MySQLdb
import MySQLdb.cursors
class MySQLStoreCnblogsPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool
    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)
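    # Compute the MD5 hash of the URL.
    def _get_linkmd5id(self, item):  # hash the URL so the same article is not stored twice
        # md5() needs bytes; this assumes item['link'] is a text string
        return md5(item['link'].encode('utf-8')).hexdigest()
    # Called by Scrapy for every item.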
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_upinsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d
    # Update the existing row or insert a new one.
    def _do_upinsert(self, conn, item, spider):
        linkmd5id = self._get_linkmd5id(item)
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        conn.execute("""select 1 from cnblogsinfo where linkmd5id = %s""", (linkmd5id, ))
        ret = conn.fetchone()
        if ret:
            conn.execute("""update cnblogsinfo set title = %s, description = %s, link = %s, updated = %s where linkmd5id = %s""",
                         (item['title'], item['desc'], item['link'], now, linkmd5id))
        else:
            conn.execute("""insert into cnblogsinfo(linkmd5id, title, description, link, updated) values(%s, %s, %s, %s, %s)""",
                         (linkmd5id, item['title'], item['desc'], item['link'], now))
    def _handle_error(self, failure, item, spider):  # error handling
        # Log the failure instead of discarding it silently.
        spider.logger.error(failure)

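The code above first reads the database configuration from settings.py (added in step 3 below).
_get_linkmd5id computes the MD5 hash of the article's URL.
Before writing to the database, the pipeline checks whether that MD5 value is already stored there. If it is not, the item is new and is inserted directly; if it is, the existing row is updated.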
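
The check-then-write logic described above can be tried out in isolation. The following is only a sketch: it uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL, so the placeholders are ? rather than %s, and the function name upsert_post and the sample values are hypothetical.

```python
import sqlite3


def upsert_post(conn, linkmd5id, title, description, link, updated):
    """Insert a row if the key is new, otherwise update it (illustrative helper)."""
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM cnblogsinfo WHERE linkmd5id = ?", (linkmd5id,))
    if cur.fetchone():
        # The key already exists, so refresh the stored fields.
        cur.execute(
            "UPDATE cnblogsinfo SET title = ?, description = ?, link = ?, "
            "updated = ? WHERE linkmd5id = ?",
            (title, description, link, updated, linkmd5id),
        )
        action = "update"
    else:
        # First time this key is seen, so store a new row.
        cur.execute(
            "INSERT INTO cnblogsinfo (linkmd5id, title, description, link, updated) "
            "VALUES (?, ?, ?, ?, ?)",
            (linkmd5id, title, description, link, updated),
        )
        action = "insert"
    conn.commit()
    return action


# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cnblogsinfo (linkmd5id TEXT PRIMARY KEY, title TEXT, "
    "description TEXT, link TEXT, updated TEXT)"
)
first = upsert_post(conn, "abc", "Old title", "summary",
                    "http://example.com/post", "2015-01-01 00:00:00")
second = upsert_post(conn, "abc", "New title", "summary",
                     "http://example.com/post", "2015-01-02 00:00:00")
rows = conn.execute("SELECT title FROM cnblogsinfo").fetchall()
```

MySQL also supports INSERT ... ON DUPLICATE KEY UPDATE, which could do the same job in a single statement; the two-step form is shown here because it mirrors the pipeline code.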

3. Modify settings.py

In ITEM_PIPELINES, change the entry to 'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 300,
Then add the database configuration as follows:

# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'cnblogsdb'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
# end of MySQL database configure setting
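
If you would rather not hard-code credentials, one possible variation is to read them from environment variables and fall back to the defaults shown here. This is only a sketch: the helper load_mysql_settings is hypothetical, and reusing the setting names as environment variable names is an assumption, not something the original project does.

```python
import os


def load_mysql_settings(env=None):
    """Build the MySQL settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return {
        # Each value falls back to the default used in this post.
        "MYSQL_HOST": env.get("MYSQL_HOST", "localhost"),
        "MYSQL_DBNAME": env.get("MYSQL_DBNAME", "cnblogsdb"),
        "MYSQL_USER": env.get("MYSQL_USER", "root"),
        "MYSQL_PASSWD": env.get("MYSQL_PASSWD", "root"),
    }


# With nothing set, the defaults from this post are returned.
defaults = load_mysql_settings({})
# A single override leaves the other values untouched.
overridden = load_mysql_settings({"MYSQL_HOST": "db.internal"})
```

In settings.py the returned values could then be assigned to MYSQL_HOST, MYSQL_DBNAME, MYSQL_USER and MYSQL_PASSWD.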

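Fill in these settings to match your own machine, of course. The database used here is cnblogsdb, so you need to create that database and its table first. The SQL statements are as follows: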

CREATE DATABASE cnblogsdb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
USE cnblogsdb;
CREATE TABLE `cnblogsinfo` (
  `linkmd5id` char(32) NOT NULL COMMENT 'MD5 hash of the url, used as id',
  `title` text COMMENT 'title',
  `description` text COMMENT 'description',
  `link` text COMMENT 'url link',
  `updated` datetime DEFAULT NULL COMMENT 'last update time',
  PRIMARY KEY (`linkmd5id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

This SQL file is included in the project directory.

Those are all the changes relative to the earlier project that saved to JSON.

4. Run

scrapy crawl cnblogs

After the crawl finishes, the scraped rows appear in the cnblogsinfo table of the database.

The GitHub repository for this project: https://github.com/lowkeynic4/crawl/tree/master/cnblogs%28mysql%29