Python Text Analysis: Perplexity Calculation and Coherence Testing

One of the trickier problems when building an LDA model is deciding on the number of topics. Below are implementations of two ways to guide that choice: perplexity and coherence.

Some of the LDA parameters used here need to be adjusted to your own data and use case.

The log_perplexity value that gensim returns is negative: it is the log of the perplexity with the sign flipped, so the actual perplexity is recovered as 2 ** (-log_perplexity) in the code below.
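As a quick illustration of that conversion (assuming an already trained lda_model and a bag-of-words corpus like the ones built below; the numbers are made up):

```python
# Hypothetical values, for illustration only.
log_perp = lda_model.log_perplexity(corpus)  # negative per-word bound, e.g. -8.2
perplexity = 2 ** (-log_perp)                # flip the sign, exponentiate: about 294
```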

```python
import csv
import datetime
import re
import pandas as pd
import numpy as np
import jieba
import matplotlib.pyplot as plt
import jieba.posseg as jp, jieba
import gensim
from snownlp import seg
from snownlp import SnowNLP
from snownlp import sentiment
from gensim import corpora, models
from gensim.models import CoherenceModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import warnings

warnings.filterwarnings("ignore")

# Keep only rows whose 'segment' text is longer than 100 characters.
comment = pd.read_csv(r"good_1", header=0, index_col=False, engine='python', encoding='utf-8')
csv_data = comment[[(len(str(x)) > 100) for x in comment['segment']]]
print(csv_data.shape)

# Build the corpus: each document is the whitespace-separated token list
# from column 7 (the segmented text).
train = []
for i in range(csv_data.shape[0]):
    comment = csv_data.iloc[i, 7].split()
    train.append(comment)

id2word = corpora.Dictionary(train)
corpus = [id2word.doc2bow(sentence) for sentence in train]

# Coherence and perplexity for topic counts 1 to 15.
coherence_values = []
perplexity_values = []
model_list = []
for topic in range(15):
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           num_topics=topic + 1,
                                           id2word=id2word,
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           per_word_topics=True)
    perplexity = pow(2, -lda_model.log_perplexity(corpus))
    print(perplexity, end='   ')
    perplexity_values.append(round(perplexity, 3))
    model_list.append(lda_model)
    coherencemodel = CoherenceModel(model=lda_model, texts=train, dictionary=id2word,
                                    coherence='c_v')
    coherence_values.append(round(coherencemodel.get_coherence(), 3))
```
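Once the loop finishes, perplexity_values and coherence_values hold one score per candidate topic number (1 through 15 here). One simple way to use them, sketched below with illustrative variable names, is to keep the model with the highest c_v coherence and treat perplexity as a secondary check:

```python
# Sketch: pick the candidate topic count with the highest c_v coherence.
best_index = int(np.argmax(coherence_values))
best_num_topics = best_index + 1            # the loop above used num_topics = topic + 1
best_model = model_list[best_index]
print(f"best num_topics = {best_num_topics}, "
      f"coherence = {coherence_values[best_index]}, "
      f"perplexity = {perplexity_values[best_index]}")
```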

Below is one way to visualize the coherence results.
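A simple option is to plot coherence (and perplexity) against the number of topics with matplotlib; the sketch below assumes the coherence_values and perplexity_values lists filled in by the loop above.

```python
# Sketch: coherence and perplexity versus the number of topics.
x = range(1, len(coherence_values) + 1)

fig, ax1 = plt.subplots()
ax1.plot(x, coherence_values, marker='o', color='tab:blue')
ax1.set_xlabel('number of topics')
ax1.set_ylabel('coherence (c_v)', color='tab:blue')

ax2 = ax1.twinx()                       # second y-axis for perplexity
ax2.plot(x, perplexity_values, marker='s', color='tab:red')
ax2.set_ylabel('perplexity', color='tab:red')

plt.title('Coherence and perplexity vs. number of topics')
plt.tight_layout()
plt.show()
```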