Twitter Wordcloud Pipeline

At this past week’s Archives Unleashed dataton, I jokingly created some wordclouds of my Co-PI’s timelines.

Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline:

Requirements:

Process:

Use twarc to grab some data.

$ twarc timeline lintool > lintool.jsonl

Extract the tweet text.

$ cat lintool.jsonl | jq -r .full_text > lintool_tweet.txt

Remove all the URLs from the tweets.

$ sed -e 's!http[s]\?://\S*!!g' lintool_tweet.txt > lintool.txt

Create a Wordcloud.

$ wordcloud_cli.py --text lintool.txt --imagefile lintool.png

lintool wordcloud


Nota bene

  • Each of these commands have a whole lot of options. Check them out, and experiment.
  • Yes, there is probably a better way to do this, and you could even make it into a one-liner. I pulled this together as a favour to Mat.
  • We were going to initially include wordclouds of collections in AUK, but wordcloud_cli.py doesn’t perform well at scale. Scale being, feeding it txt files of 5G up to 500G of raw text. Maybe one day we’ll revisit it.
Avatar
Nick Ruest
Associate Librarian

Related