Python crawler approach

Tags: list, string, tuple, string data types, HTML parsing, number, iterable objects, requests

Python 2

Crawler: a program that collects data from web pages

Crawler modules: urllib, urllib2, re, bs4, requests, scrapy, lxml

1. urllib
2. requests
3. bs4
4. Regular expressions: re
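
As a small illustration of the last module in the list, the sketch below uses re to pull link targets and link text out of a piece of page source. It is written for Python 3, whereas these notes target Python 2, and the HTML fragment and the names html, link_pattern and links are made up for the example.

```python
import re

# Hypothetical HTML fragment, used only for this illustration.
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Non-greedy groups capture each link target and its visible text.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# With two groups, findall returns a list of (href, text) tuples.
links = link_pattern.findall(html)

for href, text in links:
    print(href, text)
```

Regular expressions work well for simple, regular markup like this. For nested or irregular HTML, a parser such as bs4 is usually the safer choice.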

Five data types

(1) Number

(2) String

(3) List []. In Python 2, Chinese text inside an iterable object is a unicode object.

(4) Tuple ()

(5) Dictionary {}
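
The five types above can be sketched in a few lines. The variable names and values are invented for illustration, and the code is written for Python 3. Under Python 2, which these notes target, the u"..." strings would be unicode objects, and printing the whole list would show escape sequences instead of readable Chinese, which is why the loop prints the items one by one.

```python
# -*- coding: utf-8 -*-
# All names and values below are hypothetical examples.
number = 42                           # Number
title = u"爬虫"                        # String
pages = [u"首页", u"新闻"]              # List
size = (800, 600)                     # Tuple
headers = {"User-Agent": "demo/0.1"}  # Dictionary

# Record the type name of each value.
values = (number, title, pages, size, headers)
type_names = [type(value).__name__ for value in values]
print(type_names)

# Printing each element separately shows readable text in both
# Python 2 and Python 3.
for page in pages:
    print(page)
```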

Crawler approach:

1. Static pages: open the page with urlopen, then get the source with read().

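2. requests (module): send a get/post request, then get the source from the text attribute or the content attribute (content is recommended). Both are attributes, not methods.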

3. bs4 can parse HTML and XML.
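
Steps 1 and 2 can be sketched roughly as follows. This is a minimal sketch for Python 3, where urlopen lives in urllib.request (under Python 2 it is urllib2.urlopen). The function names, the encoding parameter and the 10-second timeout are assumptions made for the example, not part of the original notes.

```python
from urllib.request import urlopen  # urllib2.urlopen in Python 2


def fetch_with_urlopen(url, encoding="utf-8"):
    """Step 1: open the page with urlopen and read its source."""
    with urlopen(url, timeout=10) as response:
        # read() returns bytes; decode them with the assumed encoding.
        return response.read().decode(encoding)


def fetch_with_requests(url, encoding="utf-8"):
    """Step 2: send a GET request and decode the raw bytes ourselves."""
    import requests  # third-party; imported here so step 1 works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.content is the raw bytes of the body, while response.text
    # is a str decoded with the encoding that requests guessed.
    return response.content.decode(encoding)
```

Either function returns the page source as a string, for example fetch_with_requests("http://example.com"), where the URL is only a placeholder. Decoding content explicitly avoids garbled Chinese when requests guesses the wrong encoding.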

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# 1. Parse from a string
# html = "<div>2018.1.8 14:03</div>"
# soup = BeautifulSoup(html, 'html.parser')  # parse the page
# print(soup.div)

# 2. Read from a file
with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.prettify())

4. Extract the information you need.
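
One way step 4 might look with bs4 is sketched below. The HTML fragment, the post class name and the variable names are all hypothetical. In a real crawler the html string would be the page source obtained in the earlier steps.

```python
from bs4 import BeautifulSoup

# Hypothetical page source standing in for the result of the earlier steps.
html = """
<div class="post"><a href="/a.html">First post</a></div>
<div class="post"><a href="/b.html">Second post</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect a (title, link) pair from the first anchor in every div.post.
posts = []
for div in soup.find_all("div", class_="post"):
    anchor = div.find("a")
    if anchor is not None:
        posts.append((anchor.get_text(strip=True), anchor["href"]))

for title, href in posts:
    print(title, href)
```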


Source: http://www.bubuko.com/infodetail-2464320.html
