Brief description of the report

RedditScore embeddings: text-based ideology estimation with Reddit data

The popularity of social media gives us an opportunity to study the dynamics of political behavior and public opinion. However, many research designs require the ability to measure the ideological content of social media posts, preferably on a fine-grained scale. I propose a new method of text-based ideology estimation that uses a large corpus of Reddit comments from politically oriented subreddits. The core idea of the method is to train a multiclass classifier that predicts which subreddit each comment was posted in. The vectors of predicted probabilities generated by this classifier are then used as document embeddings for arbitrary input texts, such as tweets. These embeddings (a) outperform many existing feature extraction methods (bag-of-words, Word2Vec, Doc2Vec) in supervised tasks, and (b) provide a simple way to obtain unsupervised ideology estimates. Moreover, these embeddings can be used to measure the content of documents on custom scales other than “liberal – conservative”, such as “anti-Trump – pro-Trump” or “pro-life – pro-choice”.
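
The pipeline described above (train a subreddit classifier, then reuse its predicted probabilities as an embedding) can be illustrated with a minimal sketch. The abstract does not say which classifier is used, so this example swaps in a small multinomial Naive Bayes model written with the Python standard library only. The toy comments, the subreddit names `r/left_example` and `r/right_example`, and the helper names `train_nb`, `embed`, and `scale_score` are all hypothetical and serve only to show the shape of the idea.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in corpus: (comment text, subreddit). Invented for illustration only.
TRAIN = [
    ("raise taxes on the rich and expand healthcare", "r/left_example"),
    ("healthcare is a right and unions protect workers", "r/left_example"),
    ("cut taxes and shrink the government", "r/right_example"),
    ("protect gun rights and secure the border", "r/right_example"),
]

def tokenize(text):
    return text.lower().split()

def train_nb(examples):
    """Collect the counts needed for multinomial Naive Bayes (assumed classifier)."""
    class_docs = Counter()            # number of comments per subreddit
    word_counts = defaultdict(Counter)  # token counts per subreddit
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    labels = sorted(class_docs)
    return labels, class_docs, word_counts, vocab

def embed(text, model):
    """Return the vector of predicted subreddit probabilities for `text`."""
    labels, class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    log_scores = []
    for label in labels:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                # add-one smoothed log likelihood
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        log_scores.append(score)
    # Normalize the log scores into probabilities (numerically stable softmax).
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def scale_score(embedding, labels, negative, positive):
    """Hypothetical custom scale: log-odds of one subreddit against another."""
    p_neg = embedding[labels.index(negative)]
    p_pos = embedding[labels.index(positive)]
    return math.log(p_pos) - math.log(p_neg)

model = train_nb(TRAIN)
labels = model[0]

# Any short text, for example a tweet, can be embedded with the trained model.
tweet_embedding = embed("we should cut taxes", model)
tweet_score = scale_score(tweet_embedding, labels, "r/left_example", "r/right_example")
```

Here `tweet_embedding` has one entry per subreddit and sums to one, so it can be passed as a feature vector to a downstream supervised model. `scale_score` is one possible way, assumed here rather than taken from the report, to place a text on a two-ended custom scale by contrasting the probabilities of two chosen subreddits.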