How to Build a Content-Based Recommender System for The New York Times

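We are helping The New York Times (NYT) build a content-based recommender system, and you can treat it as a very simple worked example of recommender development. Given the articles a user has viewed recently, we recommend new articles worth reading. To do that, we only need to take the text of an article the user has read and recommend articles with similar content.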

Inspecting the data

Below is an excerpt from the first NYT article in the dataset, after we have already processed the text.

"TOKYO - State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]"

The first question to settle is how to vectorize this text, and whether to engineer new features such as parts of speech, n-grams, sentiment scores, or named entities.

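The NLP tunnel clearly goes deep and is worth exploring; you could spend a lot of time just experimenting with existing approaches. But good science usually starts by trying the simplest viable solution, so that later iterations can improve on it.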

In this article, we implement that simplest viable solution.

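Splitting the data

We need to preprocess the raw data: select the features we need from the dataset, shuffle them, and then place them into a training set and a test set.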

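# shuffle reorders several arrays consistently, keeping their rows aligned
from sklearn.utils import shuffle

# move articles to an array
articles = df.body.values
# move article section names to an array
sections = df.section_name.values
# move article web_urls to an array
web_url = df.web_url.values
# shuffle these three arrays together so each article keeps its own section and URL
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)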
# split the shuffled articles into two arrays
n = 10
# one will have all but the last 10 articles -- think of this as your training set/corpus
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]
# the other will have those last 10 articles -- think of this as your test set/corpus
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

Text vectorization

We can choose among several text vectorizers, such as Bag-of-Words (BoW), Tf-Idf, and Word2Vec.

One reason we chose Tf-Idf is that, unlike BoW, it measures the importance of a word not only by its term frequency but also by its inverse document frequency.

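For example, a word like "Obama" may appear only a few times in an article, yet it should receive a higher weight than words such as "a" or "the", which show up in almost every article and convey little information. "Obama" is neither a stop word nor an everyday word, and it is concentrated in a limited set of articles, which indicates that it is highly relevant to the topic of the articles it appears in.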
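
To make the weighting concrete, here is a minimal hand-rolled sketch of Tf-Idf on a toy corpus. The three documents are invented for illustration, and the smoothed idf formula is one common variant chosen as an assumption; a real pipeline would more likely rely on a library vectorizer than on this code.

```python
import math
from collections import Counter

# Toy corpus; these documents are invented for illustration only.
docs = [
    "obama spoke at the white house",
    "the market fell as the bank raised rates",
    "the team won the game in the final minute",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        # Smoothed idf (assumed variant): terms found in fewer documents score higher.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        weights[term] = tf * idf
    return weights


weights = tfidf(tokenized[0])
print(f"the:   {weights['the']:.3f}")
print(f"obama: {weights['obama']:.3f}")
```

Both "the" and "obama" occur once in the first document, so a plain word count treats them identically. Because "the" appears in every document and "obama" in only one, the idf factor gives "obama" the larger weight.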

Similarity metric

There are several options for the similarity metric, for example Jaccard versus cosine.

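Jaccard similarity works by comparing two sets and measuring how many elements they share. Since we have chosen Tf-Idf as the vectorizer, Jaccard similarity does not make sense as an option, because it ignores the weights. It would be more useful if we had chosen BoW vectorization.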
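
For comparison, a minimal sketch of Jaccard similarity on token sets. The helper name `jaccard_similarity` and the two sentences are made up for illustration.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents: define the similarity as 0
    return len(set_a & set_b) / len(union)


a = "obama spoke at the white house".split()
b = "obama left the white house today".split()
score = jaccard_similarity(a, b)
print(score)  # 4 shared tokens out of 8 distinct tokens -> 0.5
```

Each token counts only as present or absent here, so the Tf-Idf weights would be thrown away.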

So we try cosine as the similarity metric.

Because Tf-Idf assigns a weight to every token in every article, we can take the dot product between the token weights of different articles.

If tokens like "Obama" or "White House" have high weights in article A and also in article B, their product yields a larger similarity score than it would if the same tokens had low weights in article B.
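
A minimal sketch of that comparison, with each article stored as a `{token: weight}` dictionary. The weights and the helper name `cosine_similarity` are hypothetical, chosen only to show the effect.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine similarity between two sparse vectors stored as {token: weight}."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[t] * weights_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Tf-Idf weights, invented for illustration.
article_a = {"obama": 0.8, "white house": 0.5, "senate": 0.3}
article_b_high = {"obama": 0.7, "white house": 0.6, "budget": 0.2}
article_b_low = {"obama": 0.1, "white house": 0.1, "budget": 0.9}

high = cosine_similarity(article_a, article_b_high)
low = cosine_similarity(article_a, article_b_low)
print(f"shared tokens weighted high in B: {high:.3f}")
print(f"shared tokens weighted low in B:  {low:.3f}")
```

Dividing by the vector norms keeps long articles from scoring higher purely because of their length. If the Tf-Idf vectors are already L2-normalized, the plain dot product gives the same value.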

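Building the recommender

Using the similarity scores between the article the user has read and all the other articles in the corpus (i.e. the training data), you can now build a function that outputs the top N articles and start recommending them to the user.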

import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n = 5):
    '''This function calculates similarity scores between a document and a corpus
       INPUT: vectorized document corpus, 2D sparse matrix (documents x tokens)
              text document corpus, 1D array
              vectorized user article, sparse row vector (1 x tokens)
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int
       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores between user article and entire corpus, sorted high to low, 1D array
    '''
    # calculate similarity between the corpus (i.e. the "training" data) and the user's article;
    # the dot product equals cosine similarity only when the Tf-Idf rows are L2-normalized
    similarity_scores = np.asarray(X_train_tfidf.dot(test_article.toarray().T)).ravel()
    # get similarity score indices, sorted from highest to lowest score
    sorted_indices = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

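The function performs the following steps:

1. Compute the similarity between the user's article and the corpus;

2. Sort the similarity scores from high to low;

3. Take the top N most similar articles;

4. Get the section names and URLs of those top N articles;

5. Return the top N articles, their section names, their URLs, and the scores.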

Validating the results

Now that we can recommend articles based on what a user is currently reading, let's check whether the results make sense.

# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# .

# user's article's section name (k is the index of the chosen test article)
X_test_sections[k]
# OUTPUT:
# 'U.S.'
# corresponding section names for top n recs
rec_sections
# OUTPUT:
# 'World'
# 'U.S.'
# 'World'
# 'World'
# 'U.S.'
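
One rough way to turn this eyeball check into a number is to compute the share of recommendations filed under the same section as the user's article. This is only a sketch: the helper `section_match_rate` is hypothetical, and section agreement is an assumed proxy for relevance rather than a metric the article defines. The inputs are the section names from the output above, with the spelling normalized to 'U.S.'.

```python
def section_match_rate(user_section, rec_sections):
    """Fraction of recommended articles filed under the user's article section."""
    if len(rec_sections) == 0:
        return 0.0
    matches = sum(1 for section in rec_sections if section == user_section)
    return matches / len(rec_sections)


# Section names from the output above (spelling normalized to 'U.S.').
rate = section_match_rate("U.S.", ["World", "U.S.", "World", "World", "U.S."])
print(rate)  # 2 of the 5 recommendations share the section -> 0.4
```

A low rate is not necessarily a failure: an article in the 'World' section can still cover the same story as one filed under 'U.S.'.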

Source: https://juejin.im/post/5bc96c62f265da0ac55e8305
