Topic Modeling with Gensim (2)

In this post we improve on the model from the previous post by using Mallet's version of the LDA algorithm, and then focus on how to arrive at the optimal number of topics for any large text corpus.

16. Building the LDA Mallet model

So far you have seen Gensim's built-in version of the LDA algorithm. Mallet's version, however, often gives better quality topics.

Gensim provides a wrapper, gensim.models.wrappers.LdaMallet, to run Mallet's LDA from within Gensim itself. You only need to download the zip file, unzip it, and pass the path to the mallet binary in the unzipped directory. See how I have done this below.

# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
[(13,
  [('god', 0.022175351915726671),
   ('christian', 0.017560827817656381),
   ('people', 0.0088794630371958616),
   ('bible', 0.008215251235200895),
   ('word', 0.0077491376899412696),
   ('church', 0.0074112053696280414),
   ('religion', 0.0071198844038407759),
   ('man', 0.0067936049221590383),
   ('faith', 0.0067469935676330757),
   ('love', 0.0064556726018458093)]),
 (1,
  [('organization', 0.10977647987951586),
   ('line', 0.10182379194445974),
   ('write', 0.097397469098389255),
   ('article', 0.082483883409554246),
   ('nntp_post', 0.079894209047330425),
   ('host', 0.069737542931658306),
   ('university', 0.066303010266865026),
   ('reply', 0.02255404338163719),
   ('distribution_world', 0.014362591143681011),
   ('usa', 0.010928058478887726)]),
 (8,
  [('file', 0.02816690014008405),
   ('line', 0.021396171035954908),
   ('problem', 0.013508104862917751),
   ('program', 0.013157894736842105),
   ('read', 0.012607564538723234),
   ('follow', 0.01110666399839904),
   ('number', 0.011056633980388232),
   ('set', 0.010522980454939631),
   ('error', 0.010172770328863986),
   ('write', 0.010039356947501835)]),
 (7,
  [('include', 0.0091670556506405262),
   ('information', 0.0088169700741662776),
   ('national', 0.0085576474249260924),
   ('year', 0.0077667133447435295),
   ('report', 0.0070406099268710129),
   ('university', 0.0070406099268710129),
   ('book', 0.0068979824697889113),
   ('program', 0.0065219646283906432),
   ('group', 0.0058866241377521916),
   ('service', 0.0057180644157460714)]),
 (..truncated..)]

Coherence Score:  0.632431683088

Just by changing the LDA algorithm, we increased the coherence score from 0.53 to 0.63. Not bad!
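If the LdaMallet call above fails, one thing worth ruling out early is the path itself, since the wrapper runs the Mallet binary as an external program. Below is a minimal sketch of such a pre-flight check. The helper name mallet_binary_ok is my own and not part of Gensim, and on Windows the launcher is a .bat file, so the executable test may behave differently there. Note also that, as far as I know, gensim.models.wrappers was removed in Gensim 4.0, so the code in this section assumes a 3.x release.

```python
import os

def mallet_binary_ok(path):
    """Return True if `path` is an existing file that the current user may execute.

    Hypothetical helper for illustration; it is not part of Gensim or Mallet.
    """
    return os.path.isfile(path) and os.access(path, os.X_OK)

mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # placeholder from the article; update this path
if not mallet_binary_ok(mallet_path):
    print('Mallet binary not found or not executable at:', mallet_path)
```

A check like this only confirms that the file is there and runnable. It does not verify that Java is installed or that Mallet itself works.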

17. How to find the optimal number of topics for LDA?

My approach to finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.

Choosing a "k" that marks the end of a rapid growth in topic coherence usually gives meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.

If you see the same keywords repeated across multiple topics, it is probably a sign that "k" is too large.
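The repeated-keyword check can be done by eye, but a small helper makes it quicker when there are many topics. This is only a sketch: the function name repeated_keywords and the min_topics parameter are my own, and the input is assumed to have the shape returned by show_topics(formatted=False), a list of (topic_id, [(word, weight), ...]) pairs. The sample data reuses a few keywords from the output shown earlier, with weights rounded.

```python
from collections import defaultdict

def repeated_keywords(topics, min_topics=2):
    """Return {keyword: sorted topic ids} for keywords found in at least `min_topics` topics."""
    seen = defaultdict(set)
    for topic_id, word_weights in topics:
        for word, _weight in word_weights:
            seen[word].add(topic_id)
    return {word: sorted(ids) for word, ids in seen.items() if len(ids) >= min_topics}

# Toy input: a handful of keywords from topics 1, 8 and 7 above (weights rounded)
sample_topics = [
    (1, [('organization', 0.1098), ('line', 0.1018), ('write', 0.0974), ('university', 0.0663)]),
    (8, [('file', 0.0282), ('line', 0.0214), ('program', 0.0132), ('write', 0.0100)]),
    (7, [('include', 0.0092), ('university', 0.0070), ('program', 0.0065)]),
]
overlap = repeated_keywords(sample_topics)
print(overlap)
```

How much overlap counts as "too much" is a judgment call. A few shared generic words are normal, while many topics sharing most of their top keywords suggests trying a smaller k.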

The compute_coherence_values() function (see below) trains multiple LDA models and returns the models along with their corresponding coherence scores.

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the `dictionary` argument rather than the global id2word
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values",), loc='best')  # one-element tuple, so the whole string is used as the label
plt.show()

Choosing the optimal number of LDA topics

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
Num Topics = 2  has Coherence Value of 0.4451
Num Topics = 8  has Coherence Value of 0.5943
Num Topics = 14  has Coherence Value of 0.6208
Num Topics = 20  has Coherence Value of 0.6438
Num Topics = 26  has Coherence Value of 0.643
Num Topics = 32  has Coherence Value of 0.6478
Num Topics = 38  has Coherence Value of 0.6525

If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest coherence value before flattening out. This is exactly the case here: the score climbs quickly up to 20 topics (0.6438) and changes little after that.

So for further steps I will choose the model with 20 topics.
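The "highest value before flattening out" rule can also be written down as a small heuristic. The sketch below is my own illustration rather than something from Gensim: the function name pick_k_before_plateau and the min_gain threshold of 0.01 are assumptions, and a different threshold can give a different answer. It is applied to the coherence values printed above.

```python
def pick_k_before_plateau(ks, scores, min_gain=0.01):
    """Return the first k whose successor improves coherence by less than `min_gain`.

    `min_gain` is an illustrative threshold, not a value from the article.
    Falls back to the last k if the scores never flatten.
    """
    for i in range(len(scores) - 1):
        if scores[i + 1] - scores[i] < min_gain:
            return ks[i]
    return ks[-1]

ks = list(range(2, 40, 6))  # 2, 8, 14, 20, 26, 32, 38
scores = [0.4451, 0.5943, 0.6208, 0.6438, 0.643, 0.6478, 0.6525]
best_k = pick_k_before_plateau(ks, scores)
print(best_k, ks.index(best_k))
```

With this threshold the heuristic lands on the same 20-topic model, at position 3 in the list. Treat it as a starting point and still read the topics before settling on a k.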

# Select the model and print the topics
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
[(0,
  '0.025*"game" + 0.018*"team" + 0.016*"year" + 0.014*"play" + 0.013*"good" + '
  '0.012*"player" + 0.011*"win" + 0.007*"season" + 0.007*"hockey" + '
  '0.007*"fan"'),
 (1,
  '0.021*"window" + 0.015*"file" + 0.012*"image" + 0.010*"program" + '
  '0.010*"version" + 0.009*"display" + 0.009*"server" + 0.009*"software" + '
  '0.008*"graphic" + 0.008*"application"'),
 (2,
  '0.021*"gun" + 0.019*"state" + 0.016*"law" + 0.010*"people" + 0.008*"case" + '
  '0.008*"crime" + 0.007*"government" + 0.007*"weapon" + 0.007*"police" + '
  '0.006*"firearm"'),
 (3,
  '0.855*"ax" + 0.062*"max" + 0.002*"tm" + 0.002*"qax" + 0.001*"mf" + '
  '0.001*"giz" + 0.001*"_" + 0.001*"ml" + 0.001*"fp" + 0.001*"mr"'),
 (4,
  '0.020*"file" + 0.020*"line" + 0.013*"read" + 0.013*"set" + 0.012*"program" '
  '+ 0.012*"number" + 0.010*"follow" + 0.010*"error" + 0.010*"change" + '
  '0.009*"entry"'),
 (5,
  '0.021*"god" + 0.016*"christian" + 0.008*"religion" + 0.008*"bible" + '
  '0.007*"life" + 0.007*"people" + 0.007*"church" + 0.007*"word" + 0.007*"man" '
  '+ 0.006*"faith"'),
 (..truncated..)]

Those were the topics for the chosen LDA model.

18. Finding the dominant topic in each sentence

One of the practical applications of topic modeling is to determine what topic a given document is about.

To find that, we look for the topic number that has the highest percentage contribution in that document.
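For a single document this comes down to taking a maximum over (topic, proportion) pairs. A minimal sketch, assuming the document's topic distribution is a list of (topic_id, proportion) pairs; the helper name dominant_topic and the numbers in the example are made up for illustration.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, proportion) pair with the largest proportion."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical topic distribution for one document (values invented for illustration)
example = [(0, 0.05), (5, 0.62), (8, 0.21), (13, 0.12)]
topic_id, share = dominant_topic(example)
print(topic_id, share)
```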

The format_topics_sentences() function below nicely aggregates this information into a presentable table.

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Collect one row per document.
    # DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first.
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # Dominant topic = the (topic_num, proportion) pair with the largest proportion
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        # Keywords of the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

Dominant topic for each document

19. Find the most representative document for each topic

Sometimes the topic keywords alone are not enough to make sense of what a topic is about. To help with understanding a topic, you can find the document that the topic contributes to the most and infer the topic by reading that document. Phew!

# Keep the most representative document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)],
                                            axis=0)

# Reset Index
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

Most representative document for each topic

The table above actually has 20 rows, one for each topic. It has the topic number, the keywords, and the most representative document. The Perc_Contribution column is simply the percentage contribution of the topic in the given document.
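The selection behind this table is a per-topic maximum, which can be sketched without pandas. The helper name most_representative and the rows below are made up for illustration; each row is assumed to be a (doc_id, dominant_topic, share) tuple, the same information as in the dominant-topic table.

```python
def most_representative(rows):
    """Map each topic id to the (doc_id, share) of the document where its share is highest."""
    best = {}
    for doc_id, topic_id, share in rows:
        # Keep the first document seen for a topic, then replace it only by a strictly larger share
        if topic_id not in best or share > best[topic_id][1]:
            best[topic_id] = (doc_id, share)
    return best

# Invented rows: (doc_id, dominant_topic, share)
rows = [(0, 5, 0.41), (1, 0, 0.73), (2, 5, 0.66), (3, 0, 0.58), (4, 2, 0.39)]
reps = most_representative(rows)
print(reps)
```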

20. Topic distribution across documents

Finally, we want to understand the volume and distribution of topics in order to judge how widely each one was discussed. The table below exposes that information.

# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()
# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)
# Topic Number and Keywords: one row per topic, indexed by topic number
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']].drop_duplicates().set_index('Dominant_Topic')
# Concatenate Column wise (all three pieces are now aligned on the topic number)
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)
# Change Column names and turn the topic number back into a column
df_dominant_topics.columns = ['Topic_Keywords', 'Num_Documents', 'Perc_Documents']
df_dominant_topics = df_dominant_topics.rename_axis('Dominant_Topic').reset_index()
# Show
df_dominant_topics

Topic volume distribution
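The two numeric columns of this table, the number of documents per topic and their share of the corpus, can be reproduced with a plain counter. This is a sketch under assumptions: the helper name topic_distribution is my own, and the input is just a list of dominant topic ids, one per document, with invented values.

```python
from collections import Counter

def topic_distribution(dominant_topics):
    """Return {topic_id: (num_documents, share_of_documents)} from a list of dominant topic ids."""
    counts = Counter(dominant_topics)
    total = sum(counts.values())
    return {topic: (n, round(n / total, 4)) for topic, n in counts.items()}

# Invented dominant topics for eight documents
dist = topic_distribution([5, 0, 5, 0, 2, 5, 0, 5])
print(dist)
```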

21. Conclusion

We started by understanding what topic modeling can do. We built a basic topic model using Gensim's LDA and visualized the topics using pyLDAvis. Then we built a model with Mallet's LDA implementation. You saw how to find the optimal number of topics using coherence scores and how to reason about which model to choose.

Finally, we saw how to aggregate and present the results to get insights that may be more actionable.

I hope you enjoyed reading this. I would appreciate it if you left your thoughts in the comments section below.

Edit: I see that some of you are getting errors while using LDA Mallet, and I don't have a solution for some of these issues. So I have implemented a workaround and more useful topic model visualizations. I hope you find it helpful. Workaround model: www.machinelearningplus.com/NLP/topic-no…

Link to the previous article: woohoo.apexcloud.com/is-use-a…


Public account: Galaxy No. 1

Contact email: public@space-explore.com

(Please do not reprint without permission)