本文首发于 vivo 互联网技术 微信公众号?
链接: https://mp.weixin.qq.com/s/8f34CaTp--Wz5pTHKA0Xeg https://mp.weixin.qq.com/s/8f34CaTp--Wz5pTHKA0Xeg
作者: vivo 官网商城开发团队
众所周知, Oracle JDK? 是 Java 语言的绝对权威, 很多时候 JDK 与 Java 语言近似一个概念. 但我们始终要保持实事求是的精神, 敢于质疑. 本文记录了一次线上 troubleshoot 实战, 包含问题分析, 解决并提交? Oracle JDK https://www.oracle.com/java ?bug 的核心过程.
一, 背景现象?
总之 就是某系统上线后 CLOSE_WAIT 数量随着时间增加而大量增加, 持续触发多个告警.
二, 分析定位过程
部署了一个节点, 用来复现之前出现的问题.
Step1 问题聚焦
先查看到底是哪些 IP 之间的连接产生了大量 CLOSE_WAIT, 另外系统还会涉及调第三方, 总之要确认连接建立的双方.
执行命令:? ?
netstat -np | grep tcp" grep "CLOSE_WAIT"
结果:?
- (ps:xxx,yyy,zzz 均无含义, 基于信息安全考虑, 屏蔽掉 ip).
- tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:58028 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:51962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 3563 0 10.107.17.xxx:46962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:34608 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:46496 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:50774 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:59904 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:40208 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:41064 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:36994 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 3547 0 10.107.17.xxx:45080 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 6235 0 10.107.17.xxx:60966 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:56178 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 3547 0 10.107.17.xxx:39922 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:43270 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:44472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java
- ........
- ........
- tcp 38 0 10.107.17.xxx:33472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:51976 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:57788 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:35638 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:43778 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:46418 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:49914 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:49258 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:48718 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:51480 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:59816 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:49266 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:50246 yyy.12.230.115:443 CLOSE_WAIT 19819/java
- tcp 38 0 10.107.17.xxx:39324 yyy.12.230.115:443 CLOSE_WAIT 19819/java
总之:?
- yyy.12.230.115
- zzz.202.32.241
- zzz.202.32.241
这个三个 IP 是导火索.
Step2 问题分析
这三个 IP 具体是谁? 具体是请求了哪个接口?
暂时无法直接获知! 最直接的导火索暂时断了线索. 接着从侧面开始查看更多信息,
JVM 信息
外部资源, 线程 什么的都看了, 未发现明显异常
抓包
要抓包获取更多线索了. 对于很久没有碰过 TCP 层, 有些吃力.
得到线索: 发现大量的 RST
那么是什么操作会导致 CLOSE_WAIT 呢? 什么样的连接导致大量 RST 呢 (可参考 RST 通常原因)??
Step3 代码分析定位
运维大佬的协助查询, 得知这三个 IP 是图片 CDN 服务.
至此, 可以定位到具体代码逻辑, 图片 CDN 请求可以排查代码.
仔细分析这部分源码后, 推测因为服务器 发起 URL 请求, 请求不存在, 导致抛出异常, 但是 JDK 中却没有地方关闭 Socket.
- javax.imageio.read(URL)
- /**
- * Returns a <code>BufferedImage</code> as the result of decoding
- * a supplied <code>URL</code> with an <code>ImageReader</code>
- * chosen automatically from among those currently registered. An
- * <code>InputStream</code> is obtained from the <code>URL</code>,
- * which is wrapped in an <code>ImageInputStream</code>. If no
- * registered <code>ImageReader</code> claims to be able to read
- * the resulting stream, <code>null</code> is returned.
- *
- * <p> The current cache settings from <code>getUseCache</code>and
- * <code>getCacheDirectory</code> will be used to control caching in the
- * <code>ImageInputStream</code> that is created.
- *
- * <p> This method does not attempt to locate
- * <code>ImageReader</code>s that can read directly from a
- * <code>URL</code>; that may be accomplished using
- * <code>IIORegistry</code> and <code>ImageReaderSpi</code>.
- *
- * @param input a <code>URL</code> to read from.
- *
- * @return a <code>BufferedImage</code> containing the decoded
- * contents of the input, or <code>null</code>.
- *
- * @exception IllegalArgumentException if <code>input</code> is
- * <code>null</code>.
- * @exception IOException if an error occurs during reading.
- */
- public static BufferedImage read(URL input) throws IOException {
- if (input == null) {
- throw new IllegalArgumentException("input == null!");
- }
- InputStream istream = null;
- try {
- // 此处, 建立 TCP 连接! 并且直接获取流, 因为流数据不存在, 进入 cache 块, 抛出!
- istream = input.openStream();
- } catch (IOException e) {
- throw new IIOException("Can't get input stream from URL!", e);
- }
- ImageInputStream stream = createImageInputStream(istream);
- BufferedImage bi;
- try {
- bi = read(stream);
- if (bi == null) {
- stream.close();
- }
- } finally {
- istream.close();
- }
- return bi;
- }
可以看到 JDK 并没有关闭 ImageIO.read(url) 代码中封装的 Socket 连接! CDN 会请求超时关闭导致服务器处于 CLOSE_WAIT? 限于网络经验有限, 并不能 100% 确认我的想法. 所以模拟下吧.
Step4? 复现与模拟
根据系统业务源码, 快速模拟:
- public static void main(String[] args) throws InterruptedException {
- ExecutorService ex = Executors.newFixedThreadPool(100);
- for (int i = 0; i < 5000; i++) {
- ex.execute(task());
- }
- }
- /**
- * @throws IOException
- * @throws MalformedURLException
- */
- private static Runnable task() {
- return new Runnable() {
- @Override
- public void run() {
- // domain must exists,but file doesnot.
- String vivofsUrl = "https://vivobbs.xx.yy.zz/wiwNWYCFW9ieGbWq/20181129/3a2adfde12cd328d81f965088890eeffff.jpg";
- File file = null;
- BufferedImage image = null;
- try {
- file = File.createTempFile("abc", "jpg");
- URL url1 = new URL(vivofsUrl);
- image = ImageIO.read(url1);
- } catch (Throwable e) {
- e.printStackTrace();
- } finally {
- if (null != file) {
- file.delete();
- }
- if (null != image) {
- image.flush();
- image = null;
- }
- }
- }
- };
- }
抓包
TCP 查看
问题复现!
Step5 沟通后提报 bug
report 给 Oracle.
三, Oracle 沟通
提单之后, Oracle 跟我联系沟通. 截取部分邮件内容, 仅供参考.
已被采纳
四, 疑点与不足
TCP 状态机的流转不够熟悉透彻. 导致一些问题不能从 TCP 状态机分析推理, 知识的全面精通需要不断提高.
来源: http://www.bubuko.com/infodetail-3634089.html