Lately, friends in chat groups and fellow bloggers have all been asking the same thing: "Why doesn't CSDN have a backup feature for blogs?" To some extent this shows that people value their articles as knowledge products more and more, and pay more attention to data safety.
So I tried writing a tool dedicated to backing up CSDN blogs.
Calling it the "core" is a bit of a stretch. Strictly speaking it is just a small pile of code, nothing grand.
Your first question is probably why a login module is needed at all.
The reason is that without logging in, the article API does not return the article content. So, to keep things simple, the tool obtains a logged-in session and uses it to fetch the articles.
Don't worry about the safety of your username and password: the tool does not store any of your information. You can use it with confidence (check the code if you don't believe me).
The login module's code is also simple; it just simulates logging in to CSDN.
```python
# coding: utf8
# @Author: 郭 璞
# @File: login.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: CSDN login for returning the same session for backing up the blogs.
import json

import requests
from bs4 import BeautifulSoup


class Login(object):
    """
    Get a logged-in session for backing up blogs. Requires the username
    and password of your account.
    """

    def __init__(self, username, password):
        if username and password:
            self.username = username
            self.password = password
            # the common headers for the login operation
            self.headers = {
                'Host': 'passport.csdn.net',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
        else:
            raise Exception('Need your username and password!')

    def login(self):
        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'lt' token used by the login web flow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # assemble the form data for the login POST
        self.token = soup.find('input', {'name': 'lt'})['value']
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit',
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)
        # return the logged-in session
        return self.session if response.status_code == 200 else None

    def getSource(self, url):
        """
        Test helper; safe to delete.
        :param url:
        :return:
        """
        username, id = url.split('/')[3], url.split('/')[-1]
        backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }
```
Through the simulated login we obtain a logged-in session, which will be needed later.
At first I planned to fetch the page source directly, parse out the article sections, and convert the HTML to a Markdown file through some custom logic. But for complex, deeply nested HTML, and especially for tables, that turned out to be more than I could handle; it was technically quite difficult.
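To see why that route gets painful, here is a toy sketch of such an HTML-to-Markdown converter using only the standard library (this is an illustration I wrote for this point, not code from the tool). It handles a few flat tags fine, but every nested structure or table would need its own special case:

```python
# Toy HTML -> Markdown converter; handles only h1, p, and bold tags.
# Real article HTML (nested lists, tables, code blocks) quickly breaks
# this kind of tag-by-tag translation, which is why the approach was dropped.
from html.parser import HTMLParser


class ToyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.out.append('# ')
        elif tag in ('b', 'strong'):
            self.out.append('**')
        elif tag == 'p':
            self.out.append('\n')

    def handle_endtag(self, tag):
        if tag in ('b', 'strong'):
            self.out.append('**')
        elif tag in ('h1', 'p'):
            self.out.append('\n')

    def handle_data(self, data):
        self.out.append(data)


parser = ToyMarkdown()
parser.feed('<h1>Title</h1><p>Some <b>bold</b> text.</p>')
print(''.join(parser.out))
```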
Then, quite by accident, I discovered an API that returns the article's JSON data, including the title and the original Markdown content:
```python
'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
```
That is extremely convenient. The concrete backup logic follows.
```python
# coding: utf8
# @Author: 郭 璞
# @File: backup.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Back up the blog: fetch and store the markdown file.
import json
import os
import re


class Backup(object):
    """
    Build the special url for getting the markdown file.
    """

    def __init__(self, session, backupurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # construct the url from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, id = backupurl.split('/')[3], backupurl.split('/')[-1]
        self.backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        self.session = session

    def getSource(self):
        # get the title and content for the assigned url
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=self.backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }

    def downloadpic(self, picurl, outputpath):
        tempheaders = self.headers
        tempheaders['Host'] = 'img.blog.csdn.net'
        tempheaders['Upgrade-Insecure-Requests'] = '1'
        response = self.session.get(url=picurl, headers=tempheaders)
        # normalize the path separator for the current OS
        outputpath = outputpath.replace(os.sep, '/')
        if response.status_code == 200:
            with open(outputpath, 'wb') as f:
                f.write(response.content)
            print("{} saved in {} succeed!".format(picurl, outputpath))
        else:
            raise Exception("Picture Url: {} downloading failed!".format(picurl))

    def getpicurls(self):
        # non-greedy group, so several images on one line don't merge into one match
        pattern = re.compile(r"!\[.*?\]\((.*?)\)")
        markdowncontent = self.getSource()['markdowncontent']
        return re.findall(pattern=pattern, string=markdowncontent)

    def backup(self, outputpath='./'):
        try:
            source = self.getSource()
            foldername = os.path.join(outputpath, source['title'])
            if not os.path.exists(foldername):
                os.mkdir(foldername)
            # write the markdown file
            filename = os.path.join(foldername, source['title'])
            with open(filename + ".md", 'w', encoding='utf8') as f:
                f.write(source['markdowncontent'])
            # save the pictures
            imgfolder = os.path.join(foldername, 'img')
            if not os.path.exists(imgfolder):
                os.mkdir(imgfolder)
            for index, picurl in enumerate(self.getpicurls()):
                imgpath = os.path.join(imgfolder, str(index) + '.png')
                try:
                    self.downloadpic(picurl=picurl, outputpath=imgpath)
                except Exception:
                    # may raise e.g. requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
                    pass
        except Exception as e:
            print('Hmm, something went wrong again. Details: {}'.format(e))
```
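The image-link extraction in `getpicurls` is worth exercising in isolation. A quick sketch on sample Markdown shows why the capture group must be non-greedy: a greedy `(.*)` would swallow everything up to the last `)` on the line.

```python
import re

# Non-greedy capture of the URL inside ![alt](url).
pattern = re.compile(r"!\[.*?\]\((.*?)\)")

# sample markdown with two images on one line (illustrative data only)
markdown = "intro ![a](http://img.blog.csdn.net/1.png) text ![b](http://img.blog.csdn.net/2.png)"
picurls = re.findall(pattern, markdown)
print(picurls)
# ['http://img.blog.csdn.net/1.png', 'http://img.blog.csdn.net/2.png']
```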
In principle the blog-scanning module does not require login: starting from your username, it can walk the paginated article lists and collect all the blog links. Save the links, then loop over them with the backup logic above.
```python
# coding: utf8
# @Author: 郭 璞
# @File: blogscan.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Scan your blog domain and collect the links of all your blogs.
import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Scan for all blogs.
    """

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(re.compile(r'(\d+)'), pagecontainer.find('span').get_text())[-1]
        # construct the list pages, e.g. http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # get the blog links on each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                for alink in alinks:
                    link = self.rooturl + alink.find('a').attrs['href']
                    self.bloglinks.append(link)
            except Exception as e:
                print('Something unexpected happened!\n{}'.format(e))
                continue
        return self.bloglinks
```
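The page-count step above simply takes the last number that appears in the pagelist text. Assuming that span reads something like "160条 共11页" (my guess at CSDN's markup at the time, not a captured response), the extraction works like this:

```python
import re

# The pagination text is assumed to end with the total page count;
# the sample string is illustrative, not real CSDN output.
pagetext = '160条 共11页'
pages = re.findall(re.compile(r'(\d+)'), pagetext)[-1]
print(pages)  # '11'
```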
With that, the three main modules are done.
Next, a demonstration of how to use the tool.
```python
# coding: utf8
# @Author: 郭 璞
# @File: Main.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: The entrance of this blog backup tool.
import getpass
import random
import time

from csdnbackup.backup import Backup
from csdnbackup.blogscan import BlogScanner
from csdnbackup.login import Login

username = input('Enter your username: ')
password = getpass.getpass(prompt='Enter your password: ')

loginer = Login(username=username, password=password)
session = loginer.login()

scanner = BlogScanner(username)
links = scanner.scan()

for link in links:
    backupper = Backup(session=session, backupurl=link)
    timefeed = random.choice([1, 2, 3, 4, 5, 6, 7, 8])
    print('Sleeping for a random {} seconds'.format(timefeed))
    time.sleep(timefeed)
    backupper.backup(outputpath='./')
```

Run it with:

```
python Main.py
```
Below is the result of a run.
Finally, a reflection on where this tool still falls short.
And lastly, here is the source link; if you find it interesting, please give it a star.
https://github.com/guoruibiao/csdn-blog-backup-tool
Source: http://blog.csdn.net/marksinoberg/article/details/70946107