I. Programming environment
- Win10
- Python3.6
- Jupyter Notebook
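- Graphviz (for an introduction and installation instructions, see https://www.jianshu.com/p/b559dc689b7f)
II. Data source
http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html
Copy the data from this page into a CSV file and name it dataset_uncleaned.csv.
III. Cleaning the data
1 Put each disease and its symptoms into a dictionary, with the disease as the key and the list of symptoms as the value.
Note that some disease and symptom cells contain the special character '^'. Replace it with '_' first, then split.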
- import csv
- from collections import defaultdict
- # Values treated as empty cells: "" and a non-breaking space, whether decoded
- # correctly ("\xa0") or read as Latin-1 from its UTF-8 bytes ("\xc2\xa0").
- BLANK = ("", "\xa0", "\xc2\xa0")
- def return_list(disease):
-     disease_list = []
-     match = disease.replace('^','_').split('_')
-     ctr = 1
-     for group in match:
-         # Keep every second token: the name that follows each code
-         if ctr%2==0:
-             disease_list.append(group)
-         ctr = ctr + 1
-     return disease_list
- with open("Scraped-Data/dataset_uncleaned.csv") as csvfile:
-     reader = csv.reader(csvfile)
-     disease=""
-     weight = 0
-     disease_list = []
-     dict_wt = {}
-     dict_=defaultdict(list)
-     for row in reader:
-         # A non-empty first column starts a new disease (or group of diseases)
-         if row[0] not in BLANK:
-             disease = row[0]
-             disease_list = return_list(disease)
-             weight = row[1]
-         if row[2] not in BLANK:
-             symptom_list = return_list(row[2])
-             for d in disease_list:
-                 for s in symptom_list:
-                     dict_[d].append(s)
-                 dict_wt[d] = weight
- print (dict_)
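To make the splitting rule concrete, here is a minimal sketch of the same parsing step applied to a single cell. The sample string and the function name `parse_names` are illustrative assumptions, not part of the original notebook; the sketch assumes each cell follows a `CODE_name` pattern, with several entries joined by '^'.

```python
def parse_names(cell):
    """Return the names from a cell shaped like 'CODE_name^CODE_name'."""
    # Normalise the '^' separator to '_', giving alternating code/name tokens.
    tokens = cell.replace("^", "_").split("_")
    # Names sit at the odd indices (1, 3, ...), right after each code.
    return tokens[1::2]

# Hypothetical cell in the assumed format.
sample = "UMLS:C0011570_depression mental^UMLS:C0011581_depressive disorder"
names = parse_names(sample)
print(names)
```

The slice `[1::2]` picks the same tokens as the counter-based loop above. Either version would mis-split a name that itself contains '_', so it is worth checking the raw data for that case.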
2 Write each disease, symptom, and sample count to dataset_clean.csv. Note that each disease has one sample count and multiple symptoms.
- with open("Scraped-Data/dataset_clean.csv","w") as csvfile:
-     writer = csv.writer(csvfile)
-     for key,values in dict_.items():
-         for v in values:
-             # One row per (disease, symptom) pair, plus the disease's sample count
-             writer.writerow([key,v,dict_wt[key]])
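Note: on Windows, the CSV file now shows a blank line after every row, because the file was opened without newline=''. This can be left alone for now; a later step takes care of it.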
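If you would rather avoid the blank lines at the source, the usual approach with the csv module is to open the output file with `newline=""`. The sketch below shows this on a scratch file; the rows and the file name `demo_clean.csv` are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows in the disease / symptom / count layout.
rows = [["disease a", "symptom x", "10"], ["disease a", "symptom y", "10"]]

# Scratch location for the demo; the article writes to Scraped-Data/ instead.
path = os.path.join(tempfile.mkdtemp(), "demo_clean.csv")

# newline="" stops Python from translating the "\r\n" that csv.writer emits,
# which is what produces "\r\r\n" (a visible blank line) on Windows.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the raw bytes back to inspect the line endings.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

With this change each row ends in a single "\r\n" on every platform, so no cleanup pass is needed afterwards.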
3 Add column headers to dataset_clean.csv
- import pandas as pd
- columns = ['Source','Target','Weight']
- data = pd.read_csv("Scraped-Data/dataset_clean.csv",names=columns, encoding ="ISO-8859-1")
- data.head()
- data.to_csv("Scraped-Data/dataset_clean.csv",index=False)
The blank lines after each row are now gone, because pandas.read_csv skips blank lines by default.
4 Label the data and save it to nodetable.csv
The table has three columns. The first column, Id, is the name of a disease or symptom. The second column, Label, is identical to Id. The third column, Attribute, states whether the Id is a disease or a symptom.
- slist = []
- dlist = []
- with open("Scraped-Data/nodetable.csv","w") as csvfile:
-     writer = csv.writer(csvfile)
-     for key,values in dict_.items():
-         for v in values:
-             # Write each symptom only the first time it is seen
-             if v not in slist:
-                 writer.writerow([v,v,"symptom"])
-                 slist.append(v)
-         # Write each disease only once
-         if key not in dlist:
-             writer.writerow([key,key,"disease"])
-             dlist.append(key)
- nt_columns = ['Id','Label','Attribute']
- nt_data = pd.read_csv("Scraped-Data/nodetable.csv",names=nt_columns, encoding ="ISO-8859-1")
- nt_data.head()
- nt_data.to_csv("Scraped-Data/nodetable.csv",index=False)
IV. Analyzing the cleaned data
- data = pd.read_csv("Scraped-Data/dataset_clean.csv", encoding ="ISO-8859-1")
- len(data['Source'].unique())
- len(data['Target'].unique())
- df = pd.DataFrame(data)
- df_1 = pd.get_dummies(df.Target)
- df_1
- df
- df_s = df['Source']
- df_pivoted = pd.concat([df_s,df_1], axis=1)
- df_pivoted.drop_duplicates(keep='first',inplace=True)
- df_pivoted
- len(df_pivoted)
- cols = df_pivoted.columns[1:]  # skip 'Source' so that cols holds only the symptom (feature) columns
- print(cols)
- df_pivoted = df_pivoted.groupby('Source').sum()
- df_pivoted = df_pivoted.reset_index()
- df_pivoted
- len(df_pivoted)
- df_pivoted.to_csv("Scraped-Data/df_pivoted.csv")
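This code explores the data, for example how many distinct diseases and how many distinct symptoms there are. It then builds a table in which each disease's own symptoms are marked 1 and all other symptoms are marked 0, and saves the merged result to df_pivoted.csv.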
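The reshaping is easier to follow on a tiny table. The sketch below applies the same get_dummies / concat / groupby sequence to a made-up edge list; the disease and symptom names are illustrative only.

```python
import pandas as pd

# Hypothetical edge list in the same Source/Target layout as dataset_clean.csv.
edges = pd.DataFrame({
    "Source": ["flu", "flu", "cold", "cold"],
    "Target": ["fever", "cough", "cough", "sneezing"],
})

# One indicator column per symptom, one row per (disease, symptom) pair.
dummies = pd.get_dummies(edges["Target"])

# Attach the disease name, then collapse to a single row per disease.
pivot = (
    pd.concat([edges["Source"], dummies], axis=1)
    .groupby("Source")
    .sum()
    .reset_index()
)
print(pivot)
```

Each row of `pivot` is one disease, and each symptom column holds 1 if the disease lists that symptom and 0 otherwise. This is the shape the classifiers in the next sections expect as input.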
V. Training a model with naive Bayes
- import pandas as pd
- import seaborn as sns
- import matplotlib.pyplot as plt
- %matplotlib inline
- from sklearn.naive_bayes import MultinomialNB
- # sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
- from sklearn.model_selection import train_test_split
- x = df_pivoted[cols]
- y = df_pivoted['Source']
- x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
- mnb = MultinomialNB()
- mnb = mnb.fit(x_train, y_train)
- mnb.score(x_test, y_test)
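The score is 0, which means the model has no predictive power here.
This is because each of the 149 rows corresponds to a different disease, so the third of the diseases held out for testing never appears in the training set, and the algorithm cannot predict a disease it has never seen.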
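The effect can be reproduced without any model at all. The following sketch splits 149 unique placeholder labels in roughly the same proportion and checks how many test labels also occur in the training part; the label names are stand-ins, and the manual shuffle only approximates what train_test_split does.

```python
import random

# Hypothetical stand-ins for the 149 disease labels, one row per disease.
labels = ["disease_{}".format(i) for i in range(149)]

random.seed(42)
shuffled = labels[:]
random.shuffle(shuffled)

# Hold out roughly one third, mirroring test_size=0.33.
n_test = round(len(shuffled) * 0.33)
test_labels = set(shuffled[:n_test])
train_labels = set(shuffled[n_test:])

# Test labels the classifier has seen during training, and so could predict.
seen_in_training = test_labels & train_labels
print(len(test_labels), len(train_labels), len(seen_in_training))
```

Because every label occurs exactly once, the overlap is always empty, whatever the seed. A classifier can only output labels it was trained on, so accuracy on the held-out part is necessarily 0.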
Instead, train on all of the data and predict on all of the data:
- mnb_tot = MultinomialNB()
- mnb_tot = mnb_tot.fit(x, y)
- mnb_tot.score(x, y)
The score is 0.8993288590604027.
Print the diseases that were predicted incorrectly:
- disease_pred = mnb_tot.predict(x)
- disease_real = y.values
- for i in range(0, len(disease_real)):
-     if disease_pred[i]!=disease_real[i]:
-         print ('Pred: {0} Actual:{1}'.format(disease_pred[i].ljust(30), disease_real[i]))
Output:
- Pred: HIV Actual:acquired immuno-deficiency syndrome
- Pred: biliary calculus Actual:cholelithiasis
- Pred: coronary arteriosclerosis Actual:coronary heart disease
- Pred: depression mental Actual:depressive disorder
- Pred: HIV Actual:hiv infections
- Pred: carcinoma breast Actual:malignant neoplasm of breast
- Pred: carcinoma of lung Actual:malignant neoplasm of lung
- Pred: carcinoma prostate Actual:malignant neoplasm of prostate
- Pred: carcinoma colon Actual:malignant tumor of colon
- Pred: candidiasis Actual:oralcandidiasis
- Pred: effusion pericardial Actual:pericardial effusion body substance
- Pred: malignant neoplasms Actual:primary malignant neoplasm
- Pred: sepsis (invertebrate) Actual:septicemia
- Pred: sepsis (invertebrate) Actual:systemic infection
- Pred: tonic-clonic epilepsy Actual:tonic-clonic seizures
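The 15 mismatches listed above account for the score exactly: 134 of 149 correct is 0.8993. One plausible explanation, which the original does not state, is that disease names taken from the same '^'-joined cell receive identical symptom lists during cleaning, so no classifier can tell them apart. The sketch below shows one way to check this; the helper name and the toy dictionary are assumptions standing in for `dict_`.

```python
from collections import defaultdict

def groups_with_identical_symptoms(disease_to_symptoms):
    """Group diseases whose symptom sets are exactly equal."""
    by_symptoms = defaultdict(list)
    for disease, symptoms in disease_to_symptoms.items():
        by_symptoms[frozenset(symptoms)].append(disease)
    # Only groups with more than one disease are ambiguous.
    return [sorted(names) for names in by_symptoms.values() if len(names) > 1]

# Hypothetical miniature of dict_: two names share one symptom list.
toy = {
    "HIV": ["fever", "night sweat"],
    "hiv infections": ["fever", "night sweat"],
    "asthma": ["wheezing", "cough"],
}
dupes = groups_with_identical_symptoms(toy)
print(dupes)

# A group of size k forces at least k - 1 errors when scoring on the same rows.
min_errors = sum(len(group) - 1 for group in dupes)
print(min_errors)
```

Running the helper on the real `dict_` would show whether the mismatched pairs above fall into such groups. If they do, the score reflects duplicate labels in the data rather than a weakness of the model.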
VI. Training a model with a decision tree
- from sklearn.tree import DecisionTreeClassifier, export_graphviz
- dt = DecisionTreeClassifier()
- clf_dt=dt.fit(x,y)
- print ("Accuracy:", clf_dt.score(x,y))
The score is 0.8993288590604027, the same as the naive Bayes result above.
Next, visualize the nodes of the decision tree.
1 Generate tree.dot
- from sklearn import tree
- from sklearn.tree import export_graphviz
- export_graphviz(dt,
-                 out_file='DOT-files/tree.dot',
-                 feature_names=cols)
A tree.dot file now appears in the DOT-files directory under the project directory.
Open a cmd terminal, change into the directory that contains tree.dot (DOT-files/), and run
dot -Tpng tree.dot -o ..\tree.png
This produces tree.png.
If tree.dot is too large, however, dot may fail with an out-of-memory error:
dot: failure to create cairo surface: out of memory
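One possible workaround, not taken from the original, is to export only the top levels of the tree using the `max_depth` parameter of `export_graphviz`, which produces a much smaller DOT file. The sketch below uses the iris dataset as stand-in data, since the disease/symptom matrix is not available here.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data; the article's tree is fitted on the disease/symptom matrix.
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# out_file=None returns the DOT source as a string instead of writing a file.
full_dot = export_graphviz(clf, out_file=None, feature_names=iris.feature_names)

# max_depth limits how many levels are drawn; the fitted tree itself is unchanged.
short_dot = export_graphviz(
    clf, out_file=None, feature_names=iris.feature_names, max_depth=2
)

print(len(full_dot), len(short_dot))
```

The truncated export still shows the most informative splits near the root. Whether this is enough to avoid the cairo error depends on the size of the real tree.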
2 Display tree.png in Jupyter Notebook
- from IPython.display import Image
- Image(filename='tree.png')
Source: http://www.jianshu.com/p/882ee4db4e40