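Sorting Ali Trial (阿里试用) Listings
Background
Honestly, this is an embarrassment to hardcore straight guys everywhere. Yes: last night I was happily working away on a freelance gig (a China Mobile mini program; I freelance, okay?), and a little past 11 p.m. my girlfriend suddenly had a brainwave: "Can you sort this thing for me so I can use it? I want to snag some freebies. Is that technically doable?"
I took a rather resigned look. Ali Trial? What on earth... oh, right, that thing. Just scrape it with a crawler. (I'm a front-end developer, by the way...)
I replied: "No problem, a crawler will do."
Her: "Wow, how long will it take?"
Me: "I'm busy right now, so 1-2 hours."
Her: "Fine, stop what you're doing and get this done for me right now!"
I looked at her face and, with some shame, minimized WeChat DevTools...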
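Page Preview
If you think this is an ad too, you are giving me far too much credit.
Building the Crawler
Search Baidu for "Node.js crawler" and you will find ready-made code everywhere, so I won't analyze it line by line. Here is a snippet from Jianshu (简书), by 埃米莉 Emily: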
- const express = require('express');
- const superagent = require('superagent');
- const cheerio = require('cheerio');
- // express is a function; called with no arguments it returns an express instance, which we assign to App.
- const App = express();
- App.get('/', (req, res, next) => {
-   console.log(req)
-   superagent.get('https://www.v2ex.com/')
-     .end((err, sres) => {
-       // Standard error handling
-       if (err) {
-         return next(err);
-       }
-       // sres.text holds the page's HTML. Passing it to cheerio.load
-       // returns an object that implements the jQuery interface, conventionally named `$`.
-       // Everything after that is plain jQuery.
-       let $ = cheerio.load(sres.text);
-       let items = [];
-       $('.item_title a').each((idx, element) => {
-         let $element = $(element);
-         items.push({
-           title: $element.text(),
-           href: $element.attr('href')
-         });
-       });
-       res.send(items);
-     });
- });
- App.listen(3000, function () {
-   console.log('app is listening at port 3000');
- });
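Well, anyone who uses Node.js knows express; think of superagent as a way to make outbound HTTP requests from Node; and cheerio is, well, jQuery for Node.
First Crawl
Swap the request URL above for https://try.taobao.com/, inspect the page's markup, and find the selector we want:
.tb-try-wd-item-info > .detail. Replace the .item_title a selector above with it, and off we go:
...... I would rather not show the result, because it only contains six items while the page actually displays ten. After digging for a while, I found two problems:
First, the six items I scraped are the recommendations, damn it, not the list below them.
Second, the list below is loaded afterwards through a separate POST request; by the look of it, some framework's SSR is to blame.
So plain scraping won't cut it; time for a different strategy.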
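A quick way to confirm this kind of diagnosis is to count how often the item's class name appears in the raw HTML the server sends, before any script has run. The sketch below is only an illustration: countMarker is a hypothetical helper, and the HTML string is a made-up stand-in, not the real try.taobao.com markup.

```javascript
// Hypothetical helper: count non-overlapping occurrences of `marker` in `html`.
function countMarker(html, marker) {
  if (marker === '') return 0;
  let count = 0;
  let index = html.indexOf(marker);
  while (index !== -1) {
    count += 1;
    index = html.indexOf(marker, index + marker.length);
  }
  return count;
}

// Made-up stand-in for server-sent HTML: two items exist in the markup,
// while the main list container is empty and would be filled in later by a script.
const sampleHtml = `
  <div class="tb-try-wd-item-info"><div class="detail">A</div></div>
  <div class="tb-try-wd-item-info"><div class="detail">B</div></div>
  <div id="list"></div>
`;

const staticCount = countMarker(sampleHtml, 'tb-try-wd-item-info');
console.log(staticCount);
```

If the count from the raw HTML is lower than the number of items visible in the browser, the missing ones are most likely fetched by a later request, which is worth looking for in the Network panel.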
Simulating the POST
OK, since it's a POST, this is easy: dig out the URL and its parameters, then simulate the request with superagent:
- superagent
-   .post(
-     `https://try.taobao.com/api3/call?what=show&page=${payload.page}&pageSize&api=x/search`
-   )
-   .set('content-type', 'application/x-www-form-urlencoded; charset=UTF-8')
-   .end((err, sres) => {
-     // Standard error handling
-     if (err) {
-       return next(err)
-     }
-     const result = JSON.parse(sres.text).result // the returned data tree
-     resolve(result)
-   })
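The content-type value comes from:
Heh, you guessed it: it failed, as shown below:
Of course it did; why would they let just anyone call it? So what next? Research? No, no, no. I just emptied the whole clip at it. It's only a matter of headers like Content-Type, after all!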
- superagent
-   .post(
-     `https://try.taobao.com/api3/call?what=show&page=${payload.page}&pageSize&api=x/search`
-   )
-   .set(
-     'user-agent',
-     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
-   )
-   .set('accept', 'application/json, text/javascript, */*; q=0.01')
-   .set('accept-encoding', 'gzip, deflate, br')
-   .set(
-     'accept-language',
-     'zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7,zh-TW;q=0.6,da;q=0.5'
-   )
-   // .set('content-length', '8')
-   .set('content-type', 'application/x-www-form-urlencoded; charset=UTF-8')
-   .set(
-     'cookie',
-     'your cookie'
-   )
-   .set('origin', 'https://try.taobao.com')
-   .set('referer', 'https://try.taobao.com')
-   .set('x-csrf-token', 'f0b8e7443eb7e')
-   .set('x-requested-with', 'XMLHttpRequest')
-   .end((err, sres) => {
-     // Standard error handling
-     if (err) {
-       return next(err)
-     }
-     const result = JSON.parse(sres.text).result
-     resolve(result)
-   })
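The headers are based on this:
It's just headers, just an origin, just a user agent. Did you think HTTPS alone would stop me?
Note the commented-out .set('content-length', '8') above: I don't know what the server does with it, but adding it makes the request time out......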
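One plausible reading of that timeout, offered as an assumption and not something the article confirms: a content-length of 8 promises the server an 8-byte body, and this request sends none, so the server may simply keep waiting for bytes that never arrive. In a browser the header is computed from the form body. The sketch below shows how a URL-encoded body relates to its byte length; the page=219 body is a hypothetical example, not a value taken from the captured request.

```javascript
// Hypothetical form body; the article does not show what the browser actually sent.
const form = new URLSearchParams({ page: '219' });
const body = form.toString();

// Content-Length counts bytes, not characters, so measure the encoded body.
const contentLength = Buffer.byteLength(body, 'utf8');

console.log(body, contentLength);
```

HTTP clients normally fill in this header themselves when a body is attached, so leaving it out, as the code above does, is the safer choice when no body is sent.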
And so it gave up the goods:
- {
- "pages": {
- "paging": {
- "n": 2182,
- "page": 1,
- "pages": 219
- },
- "items": [
- {
- "shopUserId": "2450112357",
- "title": "凯度高端款嵌入式蒸烤箱",
- "status": 1,
- "totalNum": 1,
- "requestNum": 15530,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "casdon 凯度旗舰店",
- "showId": "2561626",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34530215",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1ycS2eMDqK1RjSZSyXXaxEVXa.jpg",
- "shopItemId": "559771706359",
- "price": 13850
- },
- {
- "shopUserId": "3189770892",
- "title": "皇家美素佳儿老包装 2 段 400g",
- "status": 1,
- "totalNum": 50,
- "requestNum": 2079,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "皇家美素佳儿旗舰店",
- "showId": "2551240",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34396042",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1YrSZaVYqK1RjSZLeXXbXppXa.jpg",
- "shopItemId": "547114874458",
- "price": 189
- },
- {
- "shopUserId": "1077716829",
- "title": "关注店铺优先审水密码幻彩隔离",
- "status": 1,
- "totalNum": 10,
- "requestNum": 6907,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "水密码旗舰店",
- "showId": "2568391",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34784086",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB16_4ChmzqK1RjSZPxXXc4tVXa.jpg",
- "shopItemId": "559005882880",
- "price": 599
- },
- {
- "shopUserId": "725786863",
- "title": "精品皮草派克大衣",
- "status": 1,
- "totalNum": 1,
- "requestNum": 11793,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "美瑞蓓特",
- "showId": "2557886",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34574078",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1zVLMdCrqK1RjSZK9XXXyypXa.jpg",
- "shopItemId": "577418950477",
- "price": 5980
- },
- {
- "shopUserId": "3000840351",
- "title": "保友智能新品 Pofit 电脑椅",
- "status": 1,
- "totalNum": 1,
- "requestNum": 12895,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "保友办公家具旗舰店",
- "showId": "2557100",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34528042",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1bYZEg6TpK1RjSZKPXXa3UpXa.png",
- "shopItemId": "577598687971",
- "price": 5408
- },
- {
- "shopUserId": "791732485",
- "title": "TEK 手持吸尘器 A8",
- "status": 1,
- "totalNum": 1,
- "requestNum": 17195,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "泰怡凯旗舰店",
- "showId": "2552265",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34444014",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1D6bWbhTpK1RjSZFGXXcHqFXa.jpg",
- "shopItemId": "547653053965",
- "price": 5199
- },
- {
- "shopUserId": "3229583972",
- "title": "椰富海南冷炸椰子油食用油 1L",
- "status": 1,
- "totalNum": 20,
- "requestNum": 4451,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "椰富食品专营店",
- "showId": "2561698",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34532250",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1VjLSePDpK1RjSZFrXXa78VXa.jpg",
- "shopItemId": "578653506446",
- "price": 256
- },
- {
- "shopUserId": "855223948",
- "title": "卡西欧立式家用电钢琴 PX770",
- "status": 1,
- "totalNum": 1,
- "requestNum": 16762,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "世纪音缘乐器专营店",
- "showId": "2551326",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34420041",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1CC6aa9zqK1RjSZFpXXakSXXa.jpg",
- "shopItemId": "562405126383",
- "price": 4838
- },
- {
- "shopUserId": "4065939832",
- "title": "关注宝贝送轻奢沙发床",
- "status": 1,
- "totalNum": 1,
- "requestNum": 17436,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "贝兮旗舰店",
- "showId": "2559904",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34532170",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1AzxYegHqK1RjSZFPXXcwapXa.jpg",
- "shopItemId": "577798067313",
- "price": 4399
- },
- {
- "shopUserId": "807974445",
- "title": "森海塞尔 CX6 蓝牙耳机",
- "status": 1,
- "totalNum": 4,
- "requestNum": 22557,
- "acceptNum": 0,
- "reportNum": 0,
- "isApplied": false,
- "shopName": "sennheiser 旗舰店",
- "showId": "2559701",
- "startTime": 1539619200000,
- "endTime": 1540220400000,
- "id": "34532161",
- "type": 1,
- "pic": "//img.alicdn.com/bao/uploaded/TB1HET6d7voK1RjSZFwXXciCFXa.jpg",
- "shopItemId": "564408956766",
- "price": 999
- }
- ]
- }
- }
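Attentive readers will have noticed that I never sent a form body, yet the request still returned the data I needed, with page riding on the query string...... Allow me to seriously question their technical competence here.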
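The snippets above call resolve and next without showing what surrounds them, so here is one way the request might be wrapped, as a sketch only. buildSearchUrl, fetchPage, and createStubClient are hypothetical names. The injected client is assumed to expose the same chainable post / set / end calls that the superagent snippets use, and the JSON shown above is assumed to be the value of result. A stub stands in for the real client so the example runs without touching the network.

```javascript
// Build the search URL with the page number on the query string,
// mirroring the parameters of the request shown above.
// Note: URLSearchParams serializes the empty pageSize as "pageSize=" and
// percent-encodes the "/" in "x/search"; a server would normally decode both
// to the same values, but that is an assumption about this particular API.
function buildSearchUrl(page) {
  const url = new URL('https://try.taobao.com/api3/call');
  url.searchParams.set('what', 'show');
  url.searchParams.set('page', String(page));
  url.searchParams.set('pageSize', '');
  url.searchParams.set('api', 'x/search');
  return url.toString();
}

// Wrap the callback-style request in a Promise and split the result
// into paging info and the item list.
function fetchPage(client, page, headers = {}) {
  return new Promise((resolve, reject) => {
    let request = client.post(buildSearchUrl(page));
    for (const [name, value] of Object.entries(headers)) {
      request = request.set(name, value);
    }
    request.end((err, sres) => {
      if (err) return reject(err);
      try {
        const { paging, items } = JSON.parse(sres.text).result.pages;
        resolve({ paging, items });
      } catch (parseError) {
        reject(parseError);
      }
    });
  });
}

// Stub client for illustration: records calls and replies with canned text.
function createStubClient(responseText) {
  const calls = [];
  const request = {
    set(name, value) {
      calls[calls.length - 1].headers[name] = value;
      return request;
    },
    end(callback) {
      callback(null, { text: responseText });
    },
  };
  return {
    calls,
    post(url) {
      calls.push({ url, headers: {} });
      return request;
    },
  };
}

const stubClient = createStubClient(JSON.stringify({
  result: { pages: { paging: { n: 1, page: 1, pages: 1 }, items: [{ id: '34530215' }] } },
}));

fetchPage(stubClient, 1, { 'x-requested-with': 'XMLHttpRequest' })
  .then(({ paging, items }) => console.log(paging.pages, items.length));
```

With the real library, passing require('superagent') as the client should work the same way, provided the headers object carries the cookie and token values from the request above.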
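Display
With the data in hand, the rest is simple: this one endpoint is really all it takes to build everything else. That's right, remember, I'm a front-end developer.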
- <!DOCTYPE html>
- <html lang="en">
- <head>
-   <meta charset="UTF-8">
-   <meta name="viewport" content="width=device-width, initial-scale=1.0">
-   <meta http-equiv="X-UA-Compatible" content="ie=edge">
-   <title>tb try</title>
-   <style>
-     .warning { color: red; }
-     button { width: 100px; height: 44px; margin-right: 44px; }
-     table { border: 1px solid #d8d8d8; border-collapse: collapse; }
-     tr { border-bottom: 1px solid #d8d8d8; cursor: pointer; }
-     tr:last-child { border: 0; }
-   </style>
- </head>
- <body>
-   <button onclick="postPage()">Next page</button>
-   <span id="currentPage"></span>
-   <table>
-     <tbody>
-       <tr>
-         <th>No. (descending)</th>
-         <th>Win rate</th>
-         <th>Name</th>
-       </tr>
-     </tbody>
-     <tbody id="results"></tbody>
-   </table>
-   <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
-   <script>
-     let currentPage = 0 // current page
-     let allItems = [] // every item loaded so far
-     let currentTime = 0 // timestamp of the last request, used for throttling
-     const loopInterval = 2 // throttle window, in seconds
-     const results = document.querySelector('#results')
-     const currentPageText = document.querySelector('#currentPage')
-     // Re-render the whole table body from the given array
-     const reFullTBody = arr => {
-       let innerHtml = ''
-       arr.forEach((item, i) => {
-         let tr = `<tr onclick="window.open('https://try.taobao.com/item.htm?id=${item.id}')">
-           <td>${i + 1}</td>
-           <td>${item.rate.toFixed(3) + '%'}</td>
-           <td>${item.title}</td>
-         </tr>`
-         if (item.rate > 5) tr = tr.replace('<tr', '<tr class="warning"')
-         innerHtml += tr
-       })
-       currentPageText.innerText = `Current page: ${currentPage}`
-       results.innerHTML = innerHtml // forgive my MVVM habits
-     }
-     const postPage = () => {
-       // Drop the request if it falls inside the throttle window
-       const newTime = new Date().getTime()
-       const shouldBack = newTime - currentTime < loopInterval * 1000
-       if (shouldBack) {
-         alert('Please do not click more than once every ' + loopInterval + ' seconds.')
-         return
-       }
-       currentTime = newTime
-       $.post('/table', { page: currentPage }, res => {
-         if (res.length < 1) {
-           alert('No more trials ending today.')
-           return
-         }
-         // Compute the win rate before sorting, so new items are not compared on an undefined rate
-         res.forEach(item => {
-           item.rate = item.totalNum / item.requestNum * 100
-         })
-         allItems = [...allItems, ...res]
-         allItems.sort((a, b) => b.rate - a.rate)
-         reFullTBody(allItems)
-         currentPage--
-       })
-     }
-     // Start from the last page and walk backwards
-     $.get('/total', res => {
-       currentPage = res.pages
-       postPage()
-     })
-   </script>
- </body>
- </html>
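It looks like this:
See how user-friendly I am: rows are clickable and open the item page, win rates above 5% are shown in red, the current page number is displayed, and you even get a warning if you click too fast....................................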
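The ranking and click-throttling behaviour described above can be pulled out of the page into small pure functions, which makes them easy to check in isolation. This is a sketch, not the code the page ships: rankItems, winRate, and createThrottle are hypothetical names, while the 5% threshold and the 2-second window come from the page script. The sample items reuse totalNum and requestNum values from the response shown earlier.

```javascript
// Win rate in percent: units on offer divided by number of applicants.
function winRate(item) {
  // Assumption: an item with no applicants yet is given a rate of 0,
  // which avoids NaN/Infinity upsetting the sort.
  if (!item.requestNum) return 0;
  return (item.totalNum / item.requestNum) * 100;
}

// Return a new array sorted by win rate, highest first,
// flagging items whose rate exceeds the threshold.
function rankItems(items, threshold = 5) {
  return items
    .map((item) => {
      const rate = winRate(item);
      return { ...item, rate, highlight: rate > threshold };
    })
    .sort((a, b) => b.rate - a.rate);
}

// Return a function that answers true at most once per `intervalMs`.
// `now` is injectable so the behaviour can be tested without waiting.
function createThrottle(intervalMs, now = Date.now) {
  let last = -Infinity;
  return function allow() {
    const current = now();
    if (current - last < intervalMs) return false;
    last = current;
    return true;
  };
}

const ranked = rankItems([
  { title: '凯度高端款嵌入式蒸烤箱', totalNum: 1, requestNum: 15530 },
  { title: '皇家美素佳儿老包装 2 段 400g', totalNum: 50, requestNum: 2079 },
  { title: '关注店铺优先审水密码幻彩隔离', totalNum: 10, requestNum: 6907 },
]);

for (const item of ranked) {
  console.log(item.rate.toFixed(3) + '%', item.title);
}
```

Keeping the rate calculation separate from rendering also guarantees that every item has a rate before the list is sorted.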
That's how handy it is. If you like it, go try it!
Live demo: http://only-u.site:8000/
GitHub: https://github.com/ZweiZhao/Spider
If you find it useful, don't be stingy with a star.
Source: https://www.cnblogs.com/ZweiZhao/p/9798008.html