Tutorial: topic modeling to analyze the EGC conference

EGC is a French-speaking conference on knowledge discovery in databases (KDD). In this notebook we show how to use TOM to infer the latent topics that pervade the corpus of articles published at EGC between 2004 and 2015, using non-negative matrix factorization (NMF). Based on the discovered topics, we use TOM to shed light on interesting facts about the topical structure of the EGC society.

Loading and vectorizing the corpus

We prune words whose absolute frequency in the corpus is less than 4, as well as words whose relative frequency is higher than 80%, so as to keep only the most significant words. We then build the vector space representation of these articles with $tf \cdot idf$ weighting. The result is an $n \times m$ matrix denoted by $A$, where each row represents an article, with $n = 817$ (i.e. the number of articles) and $m = 1738$ (i.e. the number of words).

In [1]:
from tom_lib.structure.corpus import Corpus
from tom_lib.visualization.visualization import Visualization

corpus = Corpus(source_file_path='input/egc_lemmatized.csv',
                language='french',
                vectorization='tfidf',
                max_relative_frequency=0.8,
                min_absolute_frequency=4)
print('corpus size:', corpus.size)
print('vocabulary size:', len(corpus.vocabulary))
corpus size: 817
vocabulary size: 1738
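
The pruning and weighting that `Corpus` performs can be sketched in plain Python (an illustrative re-implementation, not tom_lib's actual code): words whose corpus-wide count falls below the absolute threshold or whose document frequency exceeds the relative threshold are dropped, and the surviving counts are reweighted with $tf \cdot idf$.

```python
import math
from collections import Counter

def build_tfidf(docs, min_abs_freq=4, max_rel_freq=0.8):
    """Prune the vocabulary, then weight each document with tf-idf.

    docs: list of tokenized documents (lists of words).
    Returns the sorted vocabulary and the document-term matrix as lists.
    """
    n = len(docs)
    abs_freq = Counter(w for doc in docs for w in doc)       # corpus-wide counts
    doc_freq = Counter(w for doc in docs for w in set(doc))  # document frequencies
    vocab = sorted(w for w in abs_freq
                   if abs_freq[w] >= min_abs_freq and doc_freq[w] / n <= max_rel_freq)
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for w, tf in Counter(w for w in doc if w in index).items():
            row[index[w]] = tf * math.log(n / doc_freq[w])   # tf * idf
        rows.append(row)
    return vocab, rows
```

On the EGC corpus this procedure yields the $817 \times 1738$ matrix $A$ described above.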

Estimating the optimal number of topics ($k$)

Non-negative matrix factorization approximates $A$, the document-term matrix, in the following way:

$$ A \approx HW $$

where $H$ is a $n \times k$ matrix that describes the documents in terms of topics, and $W$ is a $k \times m$ matrix that describes topics in terms of words. More precisely, the coefficient $h_{i,j}$ defines the importance of topic $j$ in article $i$, and the coefficient $w_{i,j}$ defines the importance of word $j$ in topic $i$.
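
The factorization itself can be sketched with the classic multiplicative update rules of Lee and Seung (a toy illustration of the technique; tom_lib relies on scikit-learn's NMF solver rather than this code):

```python
import numpy as np

def nmf(A, k, iterations=200, eps=1e-9):
    """Factor a non-negative matrix A (n x m) into H (n x k) and W (k x m)
    by minimizing the Frobenius reconstruction error with multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = A.shape
    H = rng.random((n, k))
    W = rng.random((k, m))
    for _ in range(iterations):
        # Multiplicative updates keep H and W non-negative;
        # eps in the denominators avoids division by zero.
        W *= (H.T @ A) / (H.T @ H @ W + eps)
        H *= (A @ W.T) / (H @ W @ W.T + eps)
    return H, W
```

With the notation above, row $i$ of $H$ gives the topic mixture of article $i$, and row $i$ of $W$ gives the word weights of topic $i$.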

Determining an appropriate value of $k$ is critical to ensure a pertinent analysis of the EGC anthology. If $k$ is too small, the discovered topics will be too vague; if $k$ is too large, the discovered topics will be too narrow and may be redundant. To help us with this task, we compute two metrics implemented in TOM: the stability metric proposed by Greene et al. (2014) and the spectral metric proposed by Arun et al. (2010).

In [2]:
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization

topic_model = NonNegativeMatrixFactorization(corpus) 

Weighted Jaccard average stability

The figure below shows this metric for a number of topics varying between 10 and 50 (higher is better).

In [3]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
output_notebook()

p = figure(plot_height=250)
p.line(range(10, 51), topic_model.greene_metric(min_num_topics=10, step=1, max_num_topics=50, top_n_words=10, tao=10), line_width=2)
show(p)
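The stability score can be sketched as the average Jaccard similarity between the top-word rankings produced by different runs of the topic model (an illustrative version of Greene et al.'s idea, not tom_lib's exact implementation): agreement is measured at every depth of the two rankings, so overlap near the top counts more.

```python
def average_jaccard(ranking_a, ranking_b):
    """Average Jaccard similarity over increasing depths of two ranked word lists.

    Stable topics produce nearly identical top-word rankings across runs,
    which pushes this score towards 1.
    """
    max_depth = min(len(ranking_a), len(ranking_b))
    scores = []
    for d in range(1, max_depth + 1):
        top_a, top_b = set(ranking_a[:d]), set(ranking_b[:d])
        scores.append(len(top_a & top_b) / len(top_a | top_b))
    return sum(scores) / len(scores)
```

In the full metric, such scores are averaged over all topic pairs matched between a reference model and models fitted on resampled corpora, for each candidate $k$.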

Symmetric Kullback-Leibler divergence

The figure below shows this metric for a number of topics varying between 10 and 50 (lower is better).

In [4]:
p = figure(plot_height=250)
p.line(range(10, 51), topic_model.arun_metric(min_num_topics=10, max_num_topics=50, iterations=10), line_width=2)
show(p)
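At its core, the Arun et al. measure is a symmetric Kullback-Leibler divergence between two distributions derived from the factorization: the singular values of the topic-word matrix and the topic proportions induced by the document-topic matrix. The divergence itself can be sketched as follows (illustrative; assumes strictly positive inputs):

```python
import math

def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler divergence between two discrete distributions.

    Inputs are normalized to sum to 1 first; values must be strictly positive.
    The result is 0 when the distributions are identical, and grows as they diverge.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)
```

The intuition is that when $k$ matches the corpus structure, the two distributions agree and the divergence reaches a minimum.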

Guided by the two metrics described above, we manually evaluate the quality of the topics identified with $k$ varying between 15 and 20. Ultimately, we judge that the best results are achieved with NMF for $k=15$.

In [5]:
k = 15
topic_model.infer_topics(num_topics=k)

Results

Description of the discovered topics

The table below lists the most relevant words for each of the 15 topics discovered from the articles with NMF. They reveal that the people who form the EGC society are interested in a wide variety of both theoretical and applied issues. For instance, topics 11 and 0 are related to theoretical issues: topic 11 covers papers about model and variable selection, and topic 0 covers papers that propose new or improved learning algorithms. On the other hand, topics 9 and 1 are related to applied issues: topic 9 covers papers about social network analysis, and topic 1 covers papers about Web usage mining.

In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', 500)

d = {'Most relevant words': [', '.join([word for word, weight in topic_model.top_words(i, 10)]) for i in range(k)]}
df = pd.DataFrame(data=d)
df.head(k)
Out[6]:
Most relevant words
0 classification, algorithme, méthode, non, clustering, classe, apprentissage, superviser, avoir, nouveau
1 web, site, page, analyse, sémantique, usage, comportement, contenu, mining, navigation
2 motif, séquentiel, extraction, contrainte, fréquent, extraire, découverte, condensé, donnée, proposer
3 règle, association, extraction, mesure, base, extraire, confiance, indice, associatif, nombre
4 document, xml, annotation, texte, recherche, mots, structure, corpus, textuel, extraction
5 ontologie, alignement, sémantique, concept, domaine, annotation, construction, owl, entre, ressource
6 image, afc, recherche, segmentation, région, objet, satellite, descripteur, base, visuel
7 donnée, flux, base, fouille, visualisation, cube, requête, entrepôt, analyse, pouvoir
8 connaissance, gestion, expert, agent, extraction, outil, acquisition, compétence, processus, métier
9 réseau, graphe, social, communauté, détection, analyse, structure, méthode, lien, sommet
10 carte, topologique, auto, organisatrice, som, cognitif, contrainte, probabiliste, pondération, hiérarchique
11 variable, modèle, sélection, superviser, table, méthode, pondération, apprentissage, naïf, classifieur
12 information, utilisateur, système, recherche, modèle, recommandation, profil, préférence, qualité, avoir
13 séquence, temporel, événement, série, modèle, évènement, spatio, vidéo, intervalle, chronique
14 arbre, décision, résultat, mesure, asymétrique, entropie, évaluation, induction, critère, présenter

In the following, we leverage the discovered topics to highlight interesting particularities about the EGC society. To be able to analyze the topics, supplemented with information about the related papers, we partition the papers into 15 non-overlapping clusters, i.e. one cluster per topic. Each article $i \in [0, n-1]$ is assigned to the cluster $j$ that corresponds to the topic with the highest weight $h_{i,j}$:

\begin{equation} \text{cluster}_i = \underset{j}{\mathrm{argmax}}(h_{i,j}) \label{eq:cluster} \end{equation}
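
With the document-topic matrix in hand, this assignment is simply a row-wise argmax (a sketch with numpy; the toy matrix `H` stands in for the model's actual document-topic matrix):

```python
import numpy as np

# Toy document-topic matrix: 4 articles described over 3 topics.
H = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.4, 0.1]])

# cluster_i = argmax_j h_ij: each article joins its dominant topic's cluster.
clusters = H.argmax(axis=1)
print(clusters)  # [1 0 2 0]
```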

Global topic proportions

In [7]:
p = figure(x_range=[str(_) for _ in range(k)], plot_height=350, x_axis_label='topic', y_axis_label='proportion')
p.vbar(x=[str(_) for _ in range(k)], top=topic_model.topics_frequency(), width=0.7)
show(p)
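
The global proportion of a topic, as plotted above, amounts to the share of articles assigned to its cluster (a sketch assuming the hard assignments defined earlier, not tom_lib's internal computation):

```python
from collections import Counter

def topic_proportions(clusters, num_topics):
    """Share of documents whose dominant topic is j, for each topic j in [0, num_topics)."""
    counts = Counter(clusters)
    n = len(clusters)
    return [counts.get(j, 0) / n for j in range(num_topics)]
```

The proportions sum to 1 by construction, since every article belongs to exactly one cluster.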

Shifting attention, evolving interests

Here we focus on topics 12 (user profiling and recommendation) and 3 (association rule mining). The following figures describe these topics in terms of their respective top 10 words and top 3 documents.

In [8]:
def plot_top_words(topic_id):
    words = [word for word, weight in topic_model.top_words(topic_id, 10)]
    weights = [weight for word, weight in topic_model.top_words(topic_id, 10)]

    p = figure(x_range=words, plot_height=300, plot_width=800, x_axis_label='word', y_axis_label='weight')
    p.vbar(x=words, top=weights, width=0.7)
    show(p)
    
def top_documents_df(topic_id):
    top_docs = topic_model.top_documents(topic_id, 3)
    d = {'Article title': [corpus.title(doc_id) for doc_id, weight in top_docs], 'Year': [int(corpus.date(doc_id)) for doc_id, weight in top_docs]}
    df = pd.DataFrame(data=d)
    return df

Topic #12

Top 10 words
In [9]:
plot_top_words(12)
Top 3 articles
In [10]:
top_documents_df(12).head()
Out[10]:
Article title Year
0 Un modèle de qualité de l'information 2006
1 Apprentissage incrémental des profils dans un système de filtrage d'information 2004
2 Détection des profils à long terme et à court terme dans les réseaux sociaux 2011

Topic #3

Top 10 words
In [11]:
plot_top_words(3)
Top 3 articles
In [12]:
top_documents_df(3).head()
Out[12]:
Article title Year
0 Hiérarchisation des règles d'association en fouille de textes 2005
1 Critère VT100 de sélection des règles d'association 2006
2 Extraction optimisée de Règles d'Association Positives et Négatives (RAPN) 2013

Evolution of the frequencies of topics 3 and 12

The figure below shows the frequency of topics 12 (user profiling and recommendation) and 3 (association rule mining) per year, from 2004 to 2015. The frequency of a topic for a given year is defined as the proportion of the articles published that year that belong to the corresponding cluster. This figure reveals two opposite trends: topic 12 is emerging while topic 3 is fading over time. While apparently no article dealt with topic 12 in 2004, in 2013, 12% of the articles presented at the conference were related to it. In contrast, papers related to association rule mining were most frequent in 2006 (12%), but their frequency dropped to as low as 0.2% in 2014. This illustrates how the attention of the members of the EGC society shifts between topics through time, and shows that the EGC society is evolving and enlarging its scope to incorporate work on novel issues.

In [13]:
p = figure(plot_height=250, x_axis_label='year', y_axis_label='topic frequency')
p.line(range(2004, 2016), [topic_model.topic_frequency(3, date=i) for i in range(2004, 2016)], line_width=2, line_color='blue', legend='topic #3')
p.line(range(2004, 2016), [topic_model.topic_frequency(12, date=i) for i in range(2004, 2016)], line_width=2, line_color='red', legend='topic #12')
show(p)
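
The yearly frequencies plotted above can be sketched from (year, cluster) pairs: for each year, count the share of that year's articles assigned to the topic's cluster (illustrative, not tom_lib's implementation):

```python
def topic_frequency_by_year(years, clusters, topic_id):
    """Proportion of the articles published each year that belong to topic_id's cluster."""
    totals, hits = {}, {}
    for year, cluster in zip(years, clusters):
        totals[year] = totals.get(year, 0) + 1
        if cluster == topic_id:
            hits[year] = hits.get(year, 0) + 1
    return {year: hits.get(year, 0) / total for year, total in sorted(totals.items())}
```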