STM Package Walkthrough Part One

Apr 3, 2020 #topic-modal #stm

library(stm)
library(stmCorrViz)

This is our working data.

## Observations: 13,246
## Variables: 5
## $ documents <chr> "After a week of false statements, lies, and dismissiv…
## $ docname   <chr> "at0800300_1.text", "at0800300_2.text", "at0800300_3.t…
## $ rating    <chr> "Conservative", "Conservative", "Conservative", "Conse…
## $ day       <int> 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, …
## $ blog      <chr> "at", "at", "at", "at", "at", "at", "at", "at", "at", …

3.1. Ingest: Reading and processing text data

# produce word indices and their associated counts
processed <- textProcessor(dat$documents, metadata = dat)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

length(processed$documents) == nrow(dat)

## [1] TRUE

# plot documents, words and tokens removed at various word thresholds
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))

3.2 Prepare: Associate text with metadata

out   <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 10)

## Removing 111851 of 123990 terms (189793 of 2298953 tokens) due to frequency 
## Your corpus now has 13246 documents, 12139 terms and 2109160 tokens.

3.3 Estimate: Estimating the structural topic model

poliblogPrevFit <-
    stm(
        documents = out$documents,
        vocab = out$vocab,
        K = 20,
        prevalence =  ~ rating + s(day),
        max.em.its = 75,
        data = out$meta,
        init.type = "Spectral",
        seed = 8458159
    )

3.4 Evaluate: Model selection and search

not included here (see part 2)

3.6. Visualize: Presenting STM results

plot(poliblogPrevFit, type = "summary", xlim = c(0, .4))

plot(poliblogPrevFit, type = "labels", topics = c(3, 7, 20))

Interactive visual via stmCorrViz package.

# NOT RUN 
stmCorrViz(
    mod = poliblogPrevFit,
    file_out = "stm-interactive-correlation.html",
    documents_raw = dat$documents,
    documents_matrix = out$documents
)