当前位置：

首页
/
IT
/
程序
/
Python
/
处理特殊字段

处理特殊字段

智能决策上手系列教程索引

在上一篇, 我们读取了数百条拉勾网关于 "人工智能" 职位招聘的信息, 存储了数百个职位信息文件, 这一篇我们提取一些有用的数据放到 Excel 文件进行观察和整理, 分析.

上一篇: 网络数据抓取 - 拉勾网职位列表和详情 - requests 案例

确定需要提取的数据

参照上一篇, 我们使用下面的代码随便读取并输出一个职位数据文件 xxxxx.JSON 的数据, 注意目录和文件名可能需要调整:

#cell-1
import JSON
def readJob(fileName):
    with open(fileName,'r') as f:
        job=JSON.load(f)
        print(JSON.dumps(job,indent=2,ensure_ascii=False))
readJob('./data/lagou_ai/jobs/2363876.JSON')

运行得到一个 JSON 对象, 开始部分类似:

image.PNG

我们只提取以下几个内容:

职位编号 positionId

职位名 positionName

工作经验 workYear

学历要求 education

发布时间 createTime

薪资 salary

工作地 city

公司名 companyFullName

公司规模 companySize

公司融资 financeStage

职位分类 firstType

职位二级分类 secondType

职位详情 details

在上面工作经验, 薪资, 公司规模都是一个范围, 这不太方便分析, 我们把它拆开, 比如薪资 "15k-30k" 拆分成 salaryLow=15 和 salaryHigh=30:

#cell-0.5
def range2two(s, unit):
    s = s.replace(unit, '')  #去掉单位 k, 年, 人
    s = s.replace('','')  #去掉空格
    l = s.split('-')
    res = {
        'low': str(l[0]),
        'high': str(l[1]) if len(l)> 1 else 'None'
    }
    print(res)
range2two('12k-20k','k')
range2two('13','k')

最后两行是测试, 运行输出以下内容表示正常:

image.PNG

测试成功后整理代码:

#cell-0.5
def range2two(s, unit):
    s = s.replace(unit, '')  #去掉单位 k, 年, 人
    s = s.replace('','')  #去掉空格
    l = s.split('-')
    res = {
        'low': str(l[0]),
        'high': str(l[1]) if len(l)> 1 else 'None'
    }
    return res

完善 readJob 函数, 返回 CSV 的一行字符串

修改和完善提取数据的代码, 把表头分为可以直接读取的 labels 和需要转化的 labels2:

#cell-1
import JSON
labels=['positionId','positionName','workYear','education','createTime','salary','city','companyFullName','companySize','financeStage','firstType','secondType','details']
labels2=['salary_low','salary_high','workYear_low','workYear_high','companySize_low','companySize_high']
def readJob(fileName):
    with open(fileName,'r') as f:
        job=JSON.load(f)
        line=[]
        for key in labels:
            line.append(str(job[key]).replace(',',',')) #添加所有 labels 的字段, 用中文逗号替换英文逗号, 避免分割混乱
        salaries=range2two(job['salary'],'k')
        workYears=range2two(job['workYear'],'年')
        companySizes=range2two(job['companySize'],'人')
        line+=[salaries['low'],salaries['high']]
        line+=[workYears['low'],workYears['high']]
        line+=[companySizes['low'],companySizes['high']]
        return ','.join(line)
test=readJob('./data/lagou_ai/jobs/2363876.JSON')
print(test)

因为. CSV 文件使用英文逗号分割每个表格单元的, 所以如果数据里面有英文逗号就会造成混乱,.replace(',',',') 可以用中文逗号替换掉英文逗号, 避免混乱.

最后两行是测试, 注意输出结果应该有 6 个单独的数字:

image.PNG

读取并保存一个职位到 CSV 文件

测试成功后去掉结尾 test 那两行.

新建一个 cell 编写存储. CSV 文件的代码:

#cell-3
with open('./data/lagou_ai/jobs.CSV', 'w', encoding="gb18030") as f:
    text = ''    text +=','.join(labels + labels2)
    text += '\n'
    text += readJob('./data/lagou_ai/jobs/2363876.JSON')
    f.write(text)
    f.close()
    print('>>OK!')

这将生成一个 jobs.CSV 文件, 用 Excel 打开它可以看到有两行内容, 注意检查最后几列是否正常:

image.PNG

读取全部职位存储到 jobs.CSV

首先要取得 jobs 文件夹下所有的文件名, 然后循环操作就可以了.

#cell-3
import os
files = os.listdir('./data/lagou_ai/jobs/')  #获取所有文件列表
with open('./data/lagou_ai/jobs.CSV', 'w', encoding="gb18030") as f:
    text = ''    text +=','.join(labels + labels2)
    for fname in files:
        if not fname.find('.JSON')==-1:
            text += '\n'
            text += readJob('./data/lagou_ai/jobs/' + fname)
    f.write(text)
    f.close()
    print('>>OK!')

代码说明:

os.listdir('./data/lagou_ai/jobs/')

得到这个文件夹下所有的文件的列表, 甚至包含了隐藏文件. 类似

['2178923982.JSON','237218937.JSON',...]
if not fname.find('.JSON')==-1:

如果不是 xxx.JSON 格式的文件名就不执行操作, 比如对于隐藏文件就不操作.'123'.find('23') 等于 1, 左数从 0 开始, 第 1 位置上就是 23,'123'.find('a') 等于 - 1, 因为根本找不到.

运行上面代码得到一个 Excel 表, 包含了数百行职位信息.

image.PNG

清理 Excel 表中错误数据

我们看到其实搜索 "人工智能" 得到的很多职位和人工智能都不相关, 甚至有很多销售, 行政类的职位.

我们可以根据 firstType 和 secondType 过滤到这些错误数据, 把它们直接删除.

点击 Excel 表格列的顶端, 然后 [开始 - 排序和筛选] , 升序降序任意, 弹窗默认 [扩展选定区域] , 然后确定, 相同 secondType 的就会排在一起, 根据需要可以把 "销售, 运营, 推广, 行政" 等明显有问题的职位删除掉, 得到比较有效的内容.

image.PNG

如果只想保留 secondaType 是 "人工智能" 的职位, 可以把它们一起复制剪切出来另存一个. CSV 表格.

如果你想用代码直接实现也可以, 需要增加 if job['secondtype']=='人工智能':, 具体请自己试试看~

回顾总结

总体代码 (主义文件夹路径):

# coding: utf-8
# ### 读取一个文件
# In[16]:
def range2two(s, unit):
    s = s.replace(unit, '')  #去掉单位 k, 年, 人
    s = s.replace('','')  #去掉空格
    l = s.split('-')
    res = {
        'low': str(l[0]),
        'high': str(l[1]) if len(l)> 1 else 'None'
    }
    return res
# In[25]:
import JSON
labels = [
    'positionId', 'positionName', 'workYear', 'education', 'createTime',
    'salary', 'city', 'companyFullName', 'companySize', 'financeStage',
    'firstType', 'secondType', 'details'
]
labels2 = [
    'salary_low', 'salary_high', 'workYear_low', 'workYear_high',
    'companySize_low', 'companySize_high'
]
def readJob(fileName):
    with open(fileName, 'r') as f:
        job = JSON.load(f)
        line = []
        for key in labels:
            line.append(str(job[key]).replace(',',','))  #添加所有 labels 的字段, 用中文逗号替换英文逗号, 避免分割混乱
        salaries = range2two(job['salary'], 'k')
        workYears = range2two(job['workYear'], '年')
        companySizes = range2two(job['companySize'], '人')
        line += [salaries['low'], salaries['high']]
        line += [workYears['low'], workYears['high']]
        line += [companySizes['low'], companySizes['high']]
        return ','.join(line)
# In[37]:
import os
files = os.listdir('./data/lagou_ai/jobs/')  #获取所有文件列表
with open('./data/lagou_ai/jobs.CSV', 'w', encoding="gb18030") as f:
    text = ''    text +=','.join(labels + labels2)
    for fname in files:
        if not fname.find('.JSON')==-1:
            text += '\n'
            text += readJob('./data/lagou_ai/jobs/' + fname)
    f.write(text)
    f.close()
    print('>>OK!')

来源: http://www.jianshu.com/p/353374a0f5b7

与本文相关文章

暂无,快来抢沙发吧！