1. Parsing JSON data
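Python converts JSON objects to dictionaries, JSON arrays to lists, and JSON strings to Python strings.
The following example uses Python's JSON parsing library to handle the different data types that can appear in a JSON string: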
- >>> import json
- >>> jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],"arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
- >>> jsonObj = json.loads(jsonString)
- >>> print(jsonObj.get("arrayOfNums"))
- [{'number': 0}, {'number': 1}, {'number': 2}]
- >>> print(jsonObj.get("arrayOfNums")[1])
- {'number': 1}
- >>> print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number"))
- 3
- >>> print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
- pear
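The first line of output is a list of dictionaries, the second is a single dictionary, the third is an integer (the sum of the numbers held in the second and third dictionaries of that list, 1 + 2), and the fourth is a string.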
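The type mapping described above can be checked directly. The following is a minimal sketch; the `sample` string and its keys are made up for illustration and are not part of the original example.

```python
import json

# Made-up document that contains one value of each JSON type.
sample = ('{"obj": {"k": "v"}, "arr": [1, 2], "text": "hi", '
          '"int": 7, "real": 1.5, "flag": true, "nothing": null}')
parsed = json.loads(sample)

# Map each key to the name of the Python type that json.loads produced.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)
```

Besides the three conversions mentioned in the text, JSON numbers become `int` or `float`, `true`/`false` become `bool`, and `null` becomes `None`.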
Using Python's JSON parsing function to decode the response, the following program prints the country code for the IP address 50.78.253.58.
- # -*- coding: utf-8 -*-
- import json
- from urllib.request import urlopen
- def getCountry(ipAddress):
-     # Fetch the geolocation record for this IP and decode the JSON body
-     response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
-     responseJson = json.loads(response)
-     return responseJson.get("country_code")
- print(getCountry("50.78.253.58"))
- >>>
- US
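The parsing step of getCountry can also be tried without network access. The sketch below separates parsing from fetching; the function name `parse_country_code` and the sample payload are hypothetical, and a real service may return a different set of fields.

```python
import json

def parse_country_code(response_text):
    """Return the "country_code" field of a JSON response, or None."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        # The body was not valid JSON (for example, an HTML error page).
        return None
    if not isinstance(data, dict):
        return None
    return data.get("country_code")

# Made-up payload for illustration only.
sample_response = '{"ip": "50.78.253.58", "country_code": "US"}'
print(parse_country_code(sample_response))
print(parse_country_code("not json"))
```

Returning None for an unparseable body lets the caller skip that address instead of crashing in the middle of a crawl.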
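2. Edit history pages of Wikipedia articles
Next, build a basic Wikipedia scraper that locates each article's edit history page, extracts the IP addresses listed in that history, and looks up the country code each IP address belongs to.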
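- # -*- coding: utf-8 -*-
- import re
- import datetime
- import random
- import json
- from urllib.request import urlopen
- from urllib.error import HTTPError
- from bs4 import BeautifulSoup
- # Seed with a number: random.seed() rejects datetime objects on Python 3.11+
- random.seed(datetime.datetime.now().timestamp())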
- def getLinks(articleUrl):
-     html = urlopen("http://en.wikipedia.org"+articleUrl)
-     bsObj = BeautifulSoup(html, "lxml")
-     # Keep only article links: hrefs that start with /wiki/ and contain no colon
-     return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
- def getHistoryIPs(pageUrl):
-     # Edit history page URLs have the form:
-     # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
-     pageUrl = pageUrl.replace("/wiki/", "")
-     historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
-     print("history url is: "+historyUrl)
-     html = urlopen(historyUrl)
-     bsObj = BeautifulSoup(html, "lxml")
-     # Find the links whose class attribute is "mw-anonuserlink";
-     # they show an IP address in place of a user name
-     ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
-     addressList = set()
-     for ipAddress in ipAddresses:
-         addressList.add(ipAddress.get_text())
-     return addressList
- def getCountry(ipAddress):
-     try:
-         response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
-     except HTTPError:
-         return None
-     responseJson = json.loads(response)
-     return responseJson.get("country_code")
- links = getLinks("/wiki/Python_(programming_language)")
- while(len(links) > 0):
-     for link in links:
-         print("-------------------")
-         historyIPs = getHistoryIPs(link.attrs["href"])
-         for historyIP in historyIPs:
-             # print(historyIP)
-             country = getCountry(historyIP)
-             if country is not None:
-                 print(historyIP+" is from "+country)
-     newLink = links[random.randint(0, len(links)-1)].attrs["href"]
-     links = getLinks(newLink)
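The program first fetches the edit history of every article linked from the starting article (the Python programming language article in this example). It then picks one of those linked articles at random as the new starting point, fetches the edit history of every article linked from that page, and looks up the country or region of each editor's IP address. This repeats until it reaches a page with no links to other Wikipedia articles.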
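The control flow of this crawl can be sketched on a small in-memory graph, without any network access. Everything here is an assumption made for illustration: `LINK_GRAPH`, `random_walk`, and the `max_steps` cap are hypothetical names, and the cap is an addition that the program above does not have.

```python
import random

# Hypothetical link graph standing in for Wikipedia: page -> linked pages.
LINK_GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],  # no outgoing article links, so the walk stops here
}

def random_walk(start, get_links, rng, max_steps=100):
    """Process every page linked from the current page, then hop to a random one."""
    processed = []
    links = get_links(start)
    steps = 0
    while links and steps < max_steps:
        processed.extend(links)               # stands in for fetching each edit history
        links = get_links(rng.choice(links))  # random next starting point
        steps += 1
    return processed

visited = random_walk("A", LINK_GRAPH.__getitem__, random.Random(0))
print(visited)
```

On real Wikipedia pages a page without article links is rare, so a step limit like `max_steps` is one way to keep such a crawl from running indefinitely.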
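The getHistoryIPs function searches for every link with the class mw-anonuserlink (these show an anonymous editor's IP address rather than a user name) and returns the addresses as a set.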
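The same extraction idea can be tried offline on a small HTML fragment. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages; the class name `AnonLinkParser` and the sample markup are made up for illustration and only imitate the structure of a history page.

```python
from html.parser import HTMLParser

class AnonLinkParser(HTMLParser):
    """Collect the text of <a> tags whose class list includes mw-anonuserlink."""

    def __init__(self):
        super().__init__()
        self.addresses = set()
        self._in_anon_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "mw-anonuserlink" in classes:
            self._in_anon_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anon_link = False

    def handle_data(self, data):
        if self._in_anon_link and data.strip():
            self.addresses.add(data.strip())

# Made-up fragment: two anonymous edits from one IP and one registered user.
sample_html = (
    '<a class="mw-userlink mw-anonuserlink" href="#">168.216.130.133</a>'
    '<a class="mw-userlink" href="#">SomeUser</a>'
    '<a class="mw-anonuserlink" href="#">168.216.130.133</a>'
)
parser = AnonLinkParser()
parser.feed(sample_html)
parser.close()
print(parser.addresses)
```

Because the addresses are stored in a set, an IP address that made several edits is reported only once, which also avoids repeated country lookups for the same address.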
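Once the IP addresses from the edit history have been collected, they are passed to the getCountry function from the previous section to look up the country or region each address belongs to.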
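This combining step can be sketched with the lookup passed in as a parameter, so that it runs offline. The names `countries_for` and `FAKE_TABLE` are hypothetical; the two table entries are taken from the sample output in this post, and 10.0.0.1 is an arbitrary address with no entry.

```python
def countries_for(addresses, lookup):
    """Map each address to its country code, skipping failed lookups."""
    results = {}
    for address in sorted(addresses):
        country = lookup(address)
        if country is not None:
            results[address] = country
    return results

# Offline table standing in for the geolocation service (assumption).
FAKE_TABLE = {"168.216.130.133": "US", "223.104.186.241": "CN"}

found = countries_for(
    {"168.216.130.133", "223.104.186.241", "10.0.0.1"},
    FAKE_TABLE.get,  # returns None for unknown addresses, like a failed lookup
)
for address, country in found.items():
    print(address + " is from " + country)
```

In the real program, `getCountry` would be passed in place of `FAKE_TABLE.get`.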
Part of the output is shown below:
- -------------------
- history url is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history
- 168.216.130.133 is from US
- 223.104.186.241 is from CN
- 31.203.136.191 is from KW
- 192.117.105.47 is from IL
- 193.80.242.220 is from AT
- 223.230.96.108 is from IN
- 39.36.182.41 is from PK
- 68.151.180.83 is from CA
- 218.17.157.55 is from CN
- 110.55.67.15 is from PH
- 42.111.56.168 is from IN
- 92.115.222.143 is from MD
- 197.255.127.246 is from GH
- 2605:6000:ec0f:c800:edfd:179f:b648:b4b9 is from US
- 2a02:c7d:a492:f200:e126:2b36:53ca:513a is from GB
- -------------------
- history url is: http://en.wikipedia.org/w/index.php?title=Object-oriented_programming&action=history
- 103.74.23.139 is from PK
- 217.225.8.24 is from DE
- 223.230.215.145 is from IN
- 162.204.116.16 is from US
- 170.142.177.246 is from US
- 205.251.185.250 is from US
- 117.239.185.50 is from IN
- 119.152.87.84 is from PK
- 93.136.125.208 is from HR
- 113.199.249.237 is from NP
- 112.200.199.62 is from PH
- 103.241.244.36 is from IN
- 27.251.109.234 is from IN
- 103.16.68.215 is from IN
- 121.58.212.157 is from PH
- 2605:a601:474:600:2088:fbde:7512:53b2 is from US
- -------------------
Source: http://www.bubuko.com/infodetail-2579769.html