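1.1 Analyzing Univariate Data with Charts

I. Getting the data

Only a small amount of data is used here, but following the usual approach it is still fetched with a crawler.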

import urllib.request
import re

def crawler(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    print(type(html))
    pat = r'<tr align="center">(.*?)</tr>'
    re_html = re.compile(pat, re.S)  # re.S lets "." match newlines as well
    trslist = re_html.findall(html)  # one match per table row
    x = []
    y = []
    for tr in trslist:
        re_i = re.compile(r'<div align="center">(.*?)</div>', re.S)
        i = re_i.findall(tr)
        # Take the two fields needed from each row: the year and the number of requests
        x.append(int(i[1].strip()))
        # An empty cell means the value is missing; use 0 in its place
        y.append(int(i[2].strip()) if i[2].strip() else 0)
    # The printed result shows that the first and fourth rows are wrong: in the page
    # source, the category name of those two rows is not wrapped in a centered div,
    # so the fields shift by one. For simplicity, these two rows are set by hand.
    print(x, y)
    x[0] = 1946
    y[0] = 41
    x[3] = 1949
    y[3] = 28
    return x, y

url = "http://www.presidency.ucsb.edu/data/sourequests.php"
x, y = crawler(url)

Data obtained (as printed by print(x, y), before the first and fourth entries are corrected by hand):

x:[41, 1947, 1948, 28, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975,
1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
1991, 1992, 1993, 1994, 1995, 1996, 1997]
y:[16, 23, 16, 17, 20, 11, 19, 14, 39, 32, 0, 14, 0, 16, 6, 25, 24, 18, 17, 38, 31, 27, 26,
17, 21, 20, 17, 23, 16, 13, 13, 21, 11, 13, 11, 8, 8, 14, 9, 7, 5, 5, 54, 34, 18, 20, 27,
30, 22, 25, 19, 26]
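The extraction step can be tried without network access. The following is a minimal sketch, not the crawler itself: SAMPLE_HTML is a made-up fragment that only imitates the structure the two regular expressions expect, and the helper name parse_rows is an assumption introduced for illustration.

```python
import re

# Hypothetical fragment written by hand to imitate the row/cell structure
# that the regular expressions look for; it is not copied from the real page.
SAMPLE_HTML = """
<table>
<tr align="center">
  <td><div align="center">name-a</div></td>
  <td><div align="center"> 1947 </div></td>
  <td><div align="center"> 23 </div></td>
</tr>
<tr align="center">
  <td><div align="center">name-b</div></td>
  <td><div align="center">1956</div></td>
  <td><div align="center"></div></td>
</tr>
</table>
"""

ROW_RE = re.compile(r'<tr align="center">(.*?)</tr>', re.S)
CELL_RE = re.compile(r'<div align="center">(.*?)</div>', re.S)

def parse_rows(html):
    """Return (years, counts); an empty count cell becomes 0."""
    years, counts = [], []
    for row in ROW_RE.findall(html):
        cells = [cell.strip() for cell in CELL_RE.findall(row)]
        years.append(int(cells[1]))
        counts.append(int(cells[2]) if cells[2] else 0)
    return years, counts

years, counts = parse_rows(SAMPLE_HTML)
print(years, counts)
```

The second row has an empty count cell, so it shows how a missing value ends up as 0 in y.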

II. Plotting the data to observe the trend

import numpy as np
import matplotlib.pyplot as plt

plt.figure(1)
plt.title("All data")
plt.plot(x, y, 'ro')
plt.xlabel('year')
plt.ylabel('No Presidential Request')

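Drawing the data as a scatter plot and looking at its distribution reveals one very large outlier and two outliers at zero (values that were missing when the data was collected and were filled with 0).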
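The three points that stand out in the scatter plot can also be located numerically. A minimal sketch in plain Python, assuming the corrected values returned by crawler (first entry 1946/41, fourth entry 1949/28); the lists are retyped here by hand, and the names years and counts are introduced only for this example.

```python
# Years 1946..1997 and the corrected request counts, retyped from the data above.
years = list(range(1946, 1998))
counts = [41, 23, 16, 28, 20, 11, 19, 14, 39, 32, 0, 14, 0, 16, 6, 25, 24, 18,
          17, 38, 31, 27, 26, 17, 21, 20, 17, 23, 16, 13, 13, 21, 11, 13, 11,
          8, 8, 14, 9, 7, 5, 5, 54, 34, 18, 20, 27, 30, 22, 25, 19, 26]
assert len(years) == len(counts)

# Years whose count is 0, i.e. the values that were missing on the page.
zero_years = [year for year, count in zip(years, counts) if count == 0]

# The largest count and the year it belongs to.
peak = max(counts)
peak_year = years[counts.index(peak)]

print(zero_years, peak, peak_year)
```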

III. Computing percentiles

# Compute each percentile with numpy's percentile function
perc_25 = np.percentile(y, 25)
perc_50 = np.percentile(y, 50)
perc_75 = np.percentile(y, 75)
print("25th Percentile = %.2f" % perc_25)
print("50th Percentile = %.2f" % perc_50)
print("75th Percentile = %.2f" % perc_75)
'''
Result:
25th Percentile = 13.00
50th Percentile = 18.50
75th Percentile = 25.25
'''
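
With the percentile values computed, draw each of them on the plot. To get the lines onto the original figure, the plotting calls have to be executed together with the earlier ones:

# Draw horizontal lines at the 25th, 50th and 75th percentiles
# ----------------------------------------
plt.figure(1)
plt.title("All data")
plt.plot(x, y, 'ro')
plt.xlabel('year')
plt.ylabel('No Presidential Request')
# ----------------------------------------
plt.axhline(perc_25, label='25th perc', c='r')
plt.axhline(perc_50, label='50th perc', c='g')
plt.axhline(perc_75, label='75th perc', c='m')
plt.legend(loc='best')

IV. Checking for outliers

# Check the plot for outliers; if there are any, hide them with numpy's mask functions
# The 0 values are the fill for data that was missing at collection time; the plot also
# shows that the point y=54 lies far above the rest, so it is treated as an outlier too
# At first the y == 0 points were not removed. The mask works on arrays, while y was
# still a plain list, so this conversion was added; after that the two zeros were gone.
y = np.array(y)
y_masked = np.ma.masked_where(y==0, y)
y_masked = np.ma.masked_where(y_masked==54, y_masked)
print(type(y),type(y_masked))
''' <class 'numpy.ndarray'> <class 'numpy.ma.core.MaskedArray'>
'''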

Redraw the plot:

# Redraw the plot
plt.figure(2)
plt.title("Masked data")
plt.plot(x, y_masked, 'ro')
plt.xlabel('year')
plt.ylabel('No Presidential Request')
plt.ylim(0, 60)
# Draw horizontal lines at the 25th, 50th and 75th percentiles
plt.axhline(perc_25, label='25th perc', c='r')
plt.axhline(perc_50, label='50th perc', c='g')
plt.axhline(perc_75, label='75th perc', c='m')
plt.legend(loc='best')
plt.show()

The final plot is the result after removing the three outliers: the two 0 values and the 54.

V. Key points

plot

plt.close('all')  # close all previously opened figures
plt.figure(1)  # number the figure; useful when drawing several figures
plt.title('All data')  # set the title
plt.plot(x, y, 'ro')  # "ro" means draw with red (r) circle markers (o)

Percentiles

Sort a set of n observations by value. The value at the p% position is called the p-th percentile. p=50 is the median, p=0 is the minimum, and p=100 is the maximum.
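To make the definition concrete, here is a minimal sketch in plain Python that interpolates linearly between neighbouring ranks, which is the default rule of np.percentile. The function name percentile is introduced only for this example, and y is assumed to be the corrected list (y[0] = 41, y[3] = 28), retyped by hand.

```python
import math

def percentile(values, p):
    """p-th percentile (0 <= p <= 100), interpolating linearly between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100.0  # fractional index into the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Corrected request counts for 1946..1997, retyped from the data above.
y = [41, 23, 16, 28, 20, 11, 19, 14, 39, 32, 0, 14, 0, 16, 6, 25, 24, 18,
     17, 38, 31, 27, 26, 17, 21, 20, 17, 23, 16, 13, 13, 21, 11, 13, 11,
     8, 8, 14, 9, 7, 5, 5, 54, 34, 18, 20, 27, 30, 22, 25, 19, 26]

for p in (0, 25, 50, 75, 100):
    print("%3dth Percentile = %.2f" % (p, percentile(y, p)))
```

For p = 25, 50 and 75 this gives 13.00, 18.50 and 25.25, the same values reported in section III; p = 0 and p = 100 give the minimum 0 and the maximum 54.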

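plt.axhline()

Given a y position, draws a horizontal line across the full width of the axes, from the left edge to the right edge

label sets the name shown in the legend

c sets the line color

e.g.: perc_25 = 13.00
plt.axhline(perc_25, label='25th perc', c='r')

legend(loc)

plt.legend() displays the labels of the plotted elements in a legend

loc='best' lets pyplot choose the best placement, so the legend does not get in the way of reading the plot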
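A small sketch of how axhline and legend behave, using matplotlib's object-oriented interface and the off-screen Agg backend so that no window is needed. The three data points are taken from the start of the series; the variable names are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; nothing is displayed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1946, 1947, 1948], [41, 23, 16], 'ro')  # no label, so not listed in the legend
line = ax.axhline(13.0, label='25th perc', c='r')
legend = ax.legend(loc='best')

# The x coordinates of the line are fractions of the axes width (0 = left edge,
# 1 = right edge), which is why the line always spans the whole plot.
line_x = list(line.get_xdata())
line_y = list(line.get_ydata())
legend_labels = [text.get_text() for text in legend.get_texts()]
print(line_x, line_y, legend_labels)
```

Only artists that were given a label appear in the legend, so the scatter points are left out here.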

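numpy mask functions

Removing outliers

y_masked = np.ma.masked_where(y==0, y)

ma.masked_where takes two arguments, a condition and an array. It hides the elements of the array that satisfy the condition, so they do not have to be deleted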
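A short sketch of what hiding means in practice, on a handful of values picked from the series (the short array is chosen only for illustration). It also shows why the list had to be converted first: comparing a plain list with 0 gives a single False instead of an elementwise result.

```python
import numpy as np

y = np.array([39, 32, 0, 14, 0, 16, 54])

# Hide the zeros and the 54 in one step; y itself is left unchanged.
hidden = np.ma.masked_where((y == 0) | (y == 54), y)

print(hidden)               # masked entries are shown as --
print(hidden.count())       # number of values that are still visible
print(hidden.compressed())  # the visible values as a plain ndarray
print(hidden.mean())        # statistics skip the masked entries
print(y.size)               # the original array still has all its elements

# With a plain list the comparison is not elementwise:
list_comparison = ([39, 32, 0] == 0)
print(list_comparison)
```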

Source: https://www.cnblogs.com/yudanqu/p/9727257.html
