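I couldn't resist after all: I wanted to see how much faster Java would handle the task from the previous post, since compiled languages are usually much faster than scripting languages. Without further ado, here is the code: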
- import java.io.BufferedReader;
- import java.io.BufferedWriter;
- import java.io.FileReader;
- import java.io.FileWriter;
- import java.io.IOException;
- public class Split {
-     public static void main(String[] args) throws IOException {
-         long startTime = System.currentTimeMillis();
-         // try-with-resources flushes and closes both streams, even if an exception is thrown
-         try (BufferedReader reader = new BufferedReader(new FileReader("head_10000000.vcf"), 5000000);
-              BufferedWriter writer = new BufferedWriter(new FileWriter("result.tsv"), 5000000)) {
-             String currentLine;
-             while ((currentLine = reader.readLine()) != null) {
-                 // Skip VCF header lines; testing inside the main loop avoids a
-                 // NullPointerException when the file ends right after a header line
-                 if (currentLine.startsWith("#")) {
-                     continue;
-                 }
-                 // Column 8 (index 7) of a VCF record is the INFO field
-                 String info = currentLine.split("\t")[7];
-                 // Value between ";AF=" and the next ";" (assumes AF is present and is not the first INFO key)
-                 String af = info.split(";AF=")[1].split(";")[0];
-                 writer.write(currentLine + " " + af);
-                 writer.newLine();
-             }
-         }
-         long endTime = System.currentTimeMillis();
-         System.out.println("run time:" + (endTime - startTime) + "ms");
-     }
- }
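The `split(";AF=")` chain in the listing only works when `AF` is present and is not the first key of the INFO field. Below is a minimal sketch of a more defensive extraction, assuming INFO is a semicolon-separated list of `key=value` entries as in the sample records. The names `AfExtractor` and `extractAf` are hypothetical and do not come from the original program.

```java
import java.util.Optional;

class AfExtractor {
    // Hypothetical helper: returns the value of the exact key "AF" in a VCF INFO
    // string, or Optional.empty() when the key is absent.
    static Optional<String> extractAf(String info) {
        for (String entry : info.split(";")) {
            // Match the key exactly so that LDAF=..., AFR_AF=... are not picked up
            if (entry.startsWith("AF=")) {
                return Optional.of(entry.substring("AF=".length()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String info = "AA=C;AN=2184;AC=14;LDAF=0.0065;AF=0.01;AFR_AF=0.03";
        System.out.println(extractAf(info).orElse("NA")); // prints 0.01
    }
}
```

Scanning the entries one by one is probably a little slower than the chained `split` calls, so whether the extra safety is worth it depends on how uniform the input file is.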
Checking the result:
- $ wc -l result.tsv
- 10000000 result.tsv
- $ sed -n '3435534p' result.tsv
- 2 29509274 rs114511873 C A 100 PASS AA=C;AN=2184;AVGPOST=0.9997;VT=SNP;THETA=0.0006;AC=14;SNPSOURCE=LOWCOV;LDAF=0.0065;ERATE=0.0003;RSQ=0.9798;AF=0.01;AFR_AF=0.03 0.01
- $ sed -n '7546563p' result.tsv
- 3 84580386 rs191768644 T C 100 PASS RSQ=0.6088;AA=T;AN=2184;VT=SNP;AVGPOST=0.9991;SNPSOURCE=LOWCOV;AC=1;THETA=0.0007;ERATE=0.0002;LDAF=0.0008;AF=0.0005;AFR_AF=0.0020 0.0005
- $ sed -n '987345p' result.tsv
- 1 74709013 rs185004386 A C 100 PASS AN=2184;LDAF=0.0018;THETA=0.0005;VT=SNP;AA=A;SNPSOURCE=LOWCOV;RSQ=0.7110;ERATE=0.0003;AVGPOST=0.9987;AC=3;AF=0.0014;ASN_AF=0.01 0.0014
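The spot checks above compare the appended last column with the `AF=` entry by eye. The same comparison could be automated for every line. Here is a rough sketch, assuming each output line is the original record followed by a single space and the AF value, as written by the program above. `ResultCheck` and `isConsistent` are hypothetical names.

```java
class ResultCheck {
    // Returns true when the last space-separated token of an output line equals
    // the value that follows ";AF=" in the rest of the line.
    static boolean isConsistent(String outputLine) {
        int lastSpace = outputLine.lastIndexOf(' ');
        if (lastSpace < 0) {
            return false;
        }
        String appended = outputLine.substring(lastSpace + 1);
        String record = outputLine.substring(0, lastSpace);
        // Locate AF the same way the original program does
        int start = record.indexOf(";AF=");
        if (start < 0) {
            return false;
        }
        start += ";AF=".length();
        // The value ends at the next ';', tab or space, or at the end of the record
        int end = start;
        while (end < record.length()) {
            char c = record.charAt(end);
            if (c == ';' || c == '\t' || c == ' ') {
                break;
            }
            end++;
        }
        return record.substring(start, end).equals(appended);
    }

    public static void main(String[] args) {
        String line = "2 29509274 rs114511873 C A 100 PASS AA=C;AN=2184;AF=0.01;AFR_AF=0.03 0.01";
        System.out.println(isConsistent(line)); // prints true
    }
}
```

Applied to every line of `result.tsv` through a `BufferedReader`, a check like this would cover the whole file instead of three sampled lines.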
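We checked the total line count of the file and spot-checked a few lines, and the results are correct. Compared with the speed of the R version from the previous post, this result is astonishing! The gap is enormous!!!
Time(writing the Java code + compiling + running) < Time(running the R script)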
Source: http://www.bubuko.com/infodetail-2436200.html