A Report on Text Classification Using CNN, RNN and HAN

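Introduction

Hello, world!! I recently joined Jatana.ai https://www.jatana.ai/ as an NLP researcher (intern) and was asked to investigate text classification use cases using deep learning models.

In this article I'll share my experience and what I learned while trying out various neural network architectures.

I'll cover three main algorithms:

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Hierarchical Attention Network (HAN)

Text classification was performed on datasets in Danish, Italian, German, English and Turkish.

Let's get started.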

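About natural language processing (NLP)

One of the natural language processing and supervised machine learning (ML) tasks most widely used across different business problems is "text classification". It is an example of a supervised machine learning task because a labelled dataset, containing text documents and their labels, is used to train the classifier.

The goal of text classification is to automatically assign text documents to one or more predefined categories.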

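Some examples of text classification are:

Understanding audience sentiment from social media

Detecting spam and non-spam email

Automatically tagging customer queries

Categorizing news articles into predefined topics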

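Text classification is a very active research area in both academia and industry. In this article I'll try to present a few different approaches and compare their performance; the implementations are based on Keras https://keras.io/ .

All the source code and experiment results can be found in the jatana_research repository.

An end-to-end text classification pipeline consists of the following components:

Training text: the input text from which our supervised learning model learns to predict the required class.

Feature vector: a vector that contains information describing the characteristics of the input data.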
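
To make the idea of a feature vector concrete, here is a minimal bag-of-words sketch in plain Python. The helper names `build_vocab` and `vectorize` and the two sample documents are made up for illustration and are not part of the jatana_research code; the models in this article feed word indices into an embedding layer rather than using raw counts.

```python
from collections import Counter

def build_vocab(texts):
    """Map each distinct word to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]

docs = ["free offer free prize", "meeting at noon"]  # made-up sample documents
vocab = build_vocab(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vectors)
```

Each document becomes a fixed-length list of numbers, which is the form a classifier can consume.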

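import re
from bs4 import BeautifulSoup

def clean_str(string):
    """Remove backslashes and quote characters, then trim and lowercase."""
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

# df is assumed to be a pandas DataFrame with `message` and `class` columns.
texts = []
labels = []
for i in range(df.message.shape[0]):
    # Strip any HTML markup from the raw message before cleaning it.
    soup = BeautifulSoup(df.message[i], "html.parser")
    texts.append(clean_str(soup.get_text()))
for label in df['class']:
    labels.append(label)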
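
The class labels collected above are strings, while a softmax output layer needs one numeric target per class. The following is a minimal sketch of that conversion in plain Python. It assumes that `macronum`, which the model code uses but which is not defined in the listings shown here, is the sorted list of distinct class labels. The label values and the `one_hot` helper are made up for illustration.

```python
labels = ["billing", "login", "billing", "shipping"]  # made-up class labels

# Assumption: macronum is the sorted list of distinct classes.
macronum = sorted(set(labels))
macro_to_id = {name: idx for idx, name in enumerate(macronum)}

# Integer id for every training example.
label_ids = [macro_to_id[name] for name in labels]

def one_hot(idx, num_classes):
    """Return a list with 1 at position idx and 0 elsewhere."""
    return [1 if i == idx else 0 for i in range(num_classes)]

targets = [one_hot(i, len(macronum)) for i in label_ids]
print(targets)
```

With this layout, `len(macronum)` is the number of output units the final `Dense` layer needs.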

from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense

# embedding_layer and macronum (the class labels) are defined elsewhere.
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# Three convolution + max-pooling stages over the embedded sequence.
l_cov1 = Conv1D(128, 5, activation='relu')(embedded_sequences)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
l_pool2 = MaxPooling1D(5)(l_cov2)
l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2)
l_pool3 = MaxPooling1D(35)(l_cov3)  # global max pooling when 35 timesteps remain
l_flat = Flatten()(l_pool3)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(len(macronum), activation='softmax')(l_dense)
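
The "global max pooling" remark on the last pooling layer can be checked with a little arithmetic. The sketch below traces the sequence length through the layers. It assumes `MAX_SEQUENCE_LENGTH` is 1000 (the value is not stated in the listing) and the Keras defaults of unpadded stride-1 convolutions and non-overlapping pooling. The helper names are hypothetical.

```python
def conv1d_out(length, kernel_size):
    """Length after an unpadded ('valid') 1D convolution with stride 1."""
    return length - kernel_size + 1

def maxpool1d_out(length, pool_size):
    """Length after non-overlapping max pooling (stride equals pool size)."""
    return length // pool_size

length = 1000  # assumed MAX_SEQUENCE_LENGTH
trace = [length]
# (kernel size, pool size) of the three stages in the listing above.
for kernel, pool in [(5, 5), (5, 5), (5, 35)]:
    length = conv1d_out(length, kernel)
    trace.append(length)
    length = maxpool1d_out(length, pool)
    trace.append(length)

print(trace)
```

Under these assumptions exactly 35 timesteps reach the last pooling layer, so a pool size of 35 collapses them into one. A different `MAX_SEQUENCE_LENGTH` would need a different final pool size.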

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000  # keep only the most frequent words
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
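
To show roughly what this tokenization step produces, here is a simplified plain-Python sketch: words are ranked by frequency, rare words beyond the vocabulary limit are dropped, and each sequence is left-padded with zeros to a fixed length. The function names, the whitespace splitting and the sample texts are assumptions for illustration. This is not the Keras implementation, which also filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or lower."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["the cat sat", "the dog sat down", "the end"]  # made-up sample texts
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
data = [pad(s, maxlen=4) for s in sequences]
print(data)
```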

https://arxiv.org/pdf/1506.01057v2.pdf

from keras.layers import Input, LSTM, Bidirectional, Dense
from keras.models import Model

# A single bidirectional LSTM over the embedded sequence, then a softmax classifier.
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
preds = Dense(len(macronum), activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
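
The model above ends in a softmax layer and is trained with categorical cross-entropy. As a small worked example of what that loss measures, the sketch below computes both by hand for one example. The helper functions and the raw scores are made up for illustration and are not taken from the experiments.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class of a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 1.0, 0.1])                   # made-up scores for 3 classes
loss = categorical_crossentropy([1, 0, 0], probs)  # true class is the first one
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1.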

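from keras.preprocessing.text import Tokenizer, text_to_word_sequence
import numpy as np

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
# reviews holds one list of sentences per document.
# data[i, j, k] is the index of word k in sentence j of document i (0 = padding).
data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences):
        if j < MAX_SENTS:
            wordTokens = text_to_word_sequence(sent)
            k = 0
            for word in wordTokens:
                if k < MAX_SENT_LENGTH and tokenizer.word_index[word] < MAX_NB_WORDS:
                    data[i, j, k] = tokenizer.word_index[word]
                    k = k + 1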
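
The loop above fills a three-dimensional array indexed by document, sentence and word. A tiny plain-Python sketch of the same layout, using nested lists and made-up data, shows how truncation and zero-padding play out. The word index, the sample reviews and the small limits are assumptions for illustration.

```python
MAX_SENTS = 2        # sentences kept per document
MAX_SENT_LENGTH = 3  # words kept per sentence

# Made-up word index; "thanks" is deliberately missing.
word_index = {"the": 1, "app": 2, "crashes": 3, "please": 4,
              "help": 5, "me": 6, "now": 7}

# One list of sentences per document.
reviews = [
    ["the app crashes", "please help me now", "thanks"],
    ["help"],
]

# Start from all zeros so unused slots act as padding.
data = [[[0] * MAX_SENT_LENGTH for _ in range(MAX_SENTS)] for _ in reviews]
for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences[:MAX_SENTS]):      # drop extra sentences
        ids = [word_index[w] for w in sent.split() if w in word_index]
        for k, idx in enumerate(ids[:MAX_SENT_LENGTH]):   # drop extra words
            data[i][j][k] = idx

print(data)
```

The third sentence of the first document and the fourth word of its second sentence are cut off, while the short second document is padded with zeros.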

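from keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense
from keras.models import Model

embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH, trainable=True)
# Sentence encoder: a bidirectional LSTM turns one sentence into one vector.
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)
# Document encoder: encode every sentence, then run a second
# bidirectional LSTM over the resulting sentence vectors.
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(len(macronum), activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)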
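
The listing above stacks a sentence-level and a document-level bidirectional LSTM and does not show an explicit attention layer. As an illustration of the attention idea behind the HAN name, here is a simplified plain-Python sketch of attention pooling: each hidden state gets a score, the scores are normalized with softmax, and the weighted sum becomes the summary vector. The function name, the sample states and the context vector are assumptions. A real attention layer also applies a learned projection and bias before the `tanh` and learns the context vector during training.

```python
import math

def attention_pool(hidden_states, context):
    """Weight each hidden state by softmax(tanh(h) . context) and sum them."""
    scores = [sum(math.tanh(x) * c for x, c in zip(h, context))
              for h in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled

states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.3]]  # made-up hidden states, one per word
context = [1.0, 1.0]                           # would be learned in a real model
weights, pooled = attention_pool(states, context)
print(weights, pooled)
```

States that align with the context vector receive larger weights, so they contribute more to the pooled representation.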

Source: https://juejin.im/post/5c6ece35f265da2dec623d7c
