Lately, friends in chat groups and fellow bloggers have all been asking the same thing: "Why doesn't CSDN have a backup feature for blogs?" To some extent this shows that people value their articles as knowledge products more and more, and pay more attention to data safety.
So I tried writing a tool dedicated to backing up CSDN blogs.
Calling it the "core" is a bit of a stretch. Strictly speaking it is just a small pile of code, nothing grand.
Your first question is probably why a login module is needed at all.
The reason is that without logging in, the article API does not return the article content. So, to keep things simple, the tool obtains a logged-in session and uses it to fetch the articles.
Don't worry about the safety of your username and password: the tool does not store any of your information. You can use it with confidence (check the code if you don't believe me).
The login module's code is also simple; it just simulates logging in to CSDN.
```python
# coding: utf8
# @Author: 郭 璞
# @File: login.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: CSDN login for returning the same session for backing up the blogs.
import json

import requests
from bs4 import BeautifulSoup


class Login(object):
    """
    Get a logged-in session for backing up blogs. Requires the username
    and password of your account.
    """

    def __init__(self, username, password):
        if username and password:
            self.username = username
            self.password = password
            # the common headers for the login operation
            self.headers = {
                'Host': 'passport.csdn.net',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
        else:
            raise Exception('Need your username and password!')

    def login(self):
        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'lt' token used by the login web flow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # assemble the form data for the login POST
        self.token = soup.find('input', {'name': 'lt'})['value']
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit',
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)
        # return the logged-in session
        return self.session if response.status_code == 200 else None

    def getSource(self, url):
        """
        Test helper; safe to delete.
        :param url:
        :return:
        """
        username, id = url.split('/')[3], url.split('/')[-1]
        backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }
```
Through the simulated login we obtain a logged-in session, which will be needed later.
At first I planned to fetch the page source directly, parse out the article sections, and convert the HTML to a Markdown file through some custom logic. But for complex, deeply nested HTML, and especially for tables, that turned out to be more than I could handle; it was technically quite difficult.
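To see why that route gets painful, here is a toy sketch of such an HTML-to-Markdown converter using only the standard library (this is an illustration I wrote for this point, not code from the tool). It handles a few flat tags fine, but every nested structure or table would need its own special case:

```python
# Toy HTML -> Markdown converter; handles only h1, p, and bold tags.
# Real article HTML (nested lists, tables, code blocks) quickly breaks
# this kind of tag-by-tag translation, which is why the approach was dropped.
from html.parser import HTMLParser


class ToyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.out.append('# ')
        elif tag in ('b', 'strong'):
            self.out.append('**')
        elif tag == 'p':
            self.out.append('\n')

    def handle_endtag(self, tag):
        if tag in ('b', 'strong'):
            self.out.append('**')
        elif tag in ('h1', 'p'):
            self.out.append('\n')

    def handle_data(self, data):
        self.out.append(data)


parser = ToyMarkdown()
parser.feed('<h1>Title</h1><p>Some <b>bold</b> text.</p>')
print(''.join(parser.out))
```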
Then, quite by accident, I discovered an API that returns the article's JSON data, including the title and the original Markdown content:
```python
'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
```
That is extremely convenient. The concrete backup logic follows.
```python
# coding: utf8
# @Author: 郭 璞
# @File: backup.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Back up the blog: fetch and store the markdown file.
import json
import os
import re


class Backup(object):
    """
    Build the special url for getting the markdown file.
    """

    def __init__(self, session, backupurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # construct the url from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, id = backupurl.split('/')[3], backupurl.split('/')[-1]
        self.backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        self.session = session

    def getSource(self):
        # get the title and content for the assigned url
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=self.backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }

    def downloadpic(self, picurl, outputpath):
        tempheaders = self.headers
        tempheaders['Host'] = 'img.blog.csdn.net'
        tempheaders['Upgrade-Insecure-Requests'] = '1'
        response = self.session.get(url=picurl, headers=tempheaders)
        # normalize the path separator for the current OS
        outputpath = outputpath.replace(os.sep, '/')
        if response.status_code == 200:
            with open(outputpath, 'wb') as f:
                f.write(response.content)
            print("{} saved in {} succeed!".format(picurl, outputpath))
        else:
            raise Exception("Picture Url: {} downloading failed!".format(picurl))

    def getpicurls(self):
        # non-greedy group, so several images on one line don't merge into one match
        pattern = re.compile(r"!\[.*?\]\((.*?)\)")
        markdowncontent = self.getSource()['markdowncontent']
        return re.findall(pattern=pattern, string=markdowncontent)

    def backup(self, outputpath='./'):
        try:
            source = self.getSource()
            foldername = os.path.join(outputpath, source['title'])
            if not os.path.exists(foldername):
                os.mkdir(foldername)
            # write the markdown file
            filename = os.path.join(foldername, source['title'])
            with open(filename + ".md", 'w', encoding='utf8') as f:
                f.write(source['markdowncontent'])
            # save the pictures
            imgfolder = os.path.join(foldername, 'img')
            if not os.path.exists(imgfolder):
                os.mkdir(imgfolder)
            for index, picurl in enumerate(self.getpicurls()):
                imgpath = os.path.join(imgfolder, str(index) + '.png')
                try:
                    self.downloadpic(picurl=picurl, outputpath=imgpath)
                except Exception:
                    # may raise e.g. requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
                    pass
        except Exception as e:
            print('Hmm, something went wrong again. Details: {}'.format(e))
```
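The image-link extraction in `getpicurls` is worth exercising in isolation. A quick sketch on sample Markdown shows why the capture group must be non-greedy: a greedy `(.*)` would swallow everything up to the last `)` on the line.

```python
import re

# Non-greedy capture of the URL inside ![alt](url).
pattern = re.compile(r"!\[.*?\]\((.*?)\)")

# sample markdown with two images on one line (illustrative data only)
markdown = "intro ![a](http://img.blog.csdn.net/1.png) text ![b](http://img.blog.csdn.net/2.png)"
picurls = re.findall(pattern, markdown)
print(picurls)
# ['http://img.blog.csdn.net/1.png', 'http://img.blog.csdn.net/2.png']
```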
In principle the blog-scanning module does not require login: starting from your username, it can walk the paginated article lists and collect all the blog links. Save the links, then loop over them with the backup logic above.
```python
# coding: utf8
# @Author: 郭 璞
# @File: blogscan.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Scan your blog domain and collect the links of all your blogs.
import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Scan for all blogs.
    """

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(re.compile(r'(\d+)'), pagecontainer.find('span').get_text())[-1]
        # construct the list pages, e.g. http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # get the blog links on each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                for alink in alinks:
                    link = self.rooturl + alink.find('a').attrs['href']
                    self.bloglinks.append(link)
            except Exception as e:
                print('Something unexpected happened!\n{}'.format(e))
                continue
        return self.bloglinks
```
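The page-count step above simply takes the last number that appears in the pagelist text. Assuming that span reads something like "160条 共11页" (my guess at CSDN's markup at the time, not a captured response), the extraction works like this:

```python
import re

# The pagination text is assumed to end with the total page count;
# the sample string is illustrative, not real CSDN output.
pagetext = '160条 共11页'
pages = re.findall(re.compile(r'(\d+)'), pagetext)[-1]
print(pages)  # '11'
```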
With that, the three main modules are done.
Next, a demonstration of how to use the tool.
```python
# coding: utf8
# @Author: 郭 璞
# @File: Main.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: The entrance of this blog backup tool.
import getpass
import random
import time

from csdnbackup.backup import Backup
from csdnbackup.blogscan import BlogScanner
from csdnbackup.login import Login

username = input('Enter your username: ')
password = getpass.getpass(prompt='Enter your password: ')

loginer = Login(username=username, password=password)
session = loginer.login()

scanner = BlogScanner(username)
links = scanner.scan()

for link in links:
    backupper = Backup(session=session, backupurl=link)
    timefeed = random.choice([1, 2, 3, 4, 5, 6, 7, 8])
    print('Sleeping for a random {} seconds'.format(timefeed))
    time.sleep(timefeed)
    backupper.backup(outputpath='./')
```

Run it with:

```
python Main.py
```
Below is the result of a run.
Finally, a reflection on where this tool still falls short.
And lastly, here is the source link; if you find it interesting, please give it a star.
https://github.com/guoruibiao/csdn-blog-backup-tool
Source: http://blog.csdn.net/marksinoberg/article/details/70946107