The very first question on each wave of the British Election Study asks respondents “what is the SINGLE MOST important issue facing the country at the present time?” Variants of this question have a long history and are the most widely used measures of which issues voters see as salient.
Because the question is open-ended, it is necessary to code the responses. Across the five waves of the British Election Study, respondents gave 133,917 answers to this question. While many answers are repeated, there are still 28,099 unique responses in total, which would take a substantial amount of time to code manually.
As an example, here are 10 randomly selected responses (and the relevant category):
- Immigration (immigration)
- Jobs (unemployment)
- Funding for cancer drugs (NHS)
- Economy (economy general)
- Defence (international probs)
- economic recovery (economy general)
- unemployment (unemployment)
- Getting the Tories out so they stop ruining everything (politics-neg)
- Immigration (immigration)
- National Health Service (NHS)
To code these responses, we have used techniques from machine learning. These methods use a coded sample of responses to statistically learn which combinations of words are associated with which category.
The training sample comes from previous British Election Studies, which also included coded versions of this question. After reconciling the categories they use, we train a support vector machine algorithm to predict the most important issue codings from the plain text. This algorithm can then assign a category, along with a probability that the category is correct, to each response in the new British Election Study data.
Before running the machine learning algorithm, we also clean the text. This includes removing punctuation and spacing as well as clustering words that are close to each other (in order to combine misspellings with the correct term). See the appendix at the end of this article for more details. We also looked at the most common words that could not be coded accurately and manually added definitions for these to the training data.
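As a rough illustration of the punctuation and spacing step, the sketch below normalizes a single response. It is not the project's code (which is written in R and linked at the end of this article). The function name `clean_response` is hypothetical, and lowercasing is an assumption that the text above does not state.

```python
import re
import string

# Map every ASCII punctuation character to a space, so "jobs/economy" stays two words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_response(text: str) -> str:
    """Lowercase (an assumption), strip punctuation, and collapse extra spacing."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_response("  Immigration,   the ECONOMY!! ")
print(cleaned)
```

Cleaning of this kind means that responses differing only in case, punctuation or spacing end up as the same string before they reach the classifier.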
After applying this algorithm, we have a “most important issue” category and probability for each response. Most of the probabilities are very close to 1 (as shown in the histogram below). This is particularly the case when the original text was common and appeared exactly in the training data (“immigration” or “economy” are good examples).
The quality of the coding declines rapidly once the probability falls below 50%, so we set the category to missing in these cases. Across the five waves, the proportion of missing categories ranges from a low of 11% in wave 5 to a high of 19% in wave 3.
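The cut-off rule can be sketched in a few lines. The example data are invented for illustration and the function name `apply_threshold` is hypothetical. Only the 50% threshold comes from the text above.

```python
from typing import Optional

THRESHOLD = 0.5  # the 50% cut-off described in the text

def apply_threshold(label: str, certainty: float,
                    threshold: float = THRESHOLD) -> Optional[str]:
    """Keep the label, or return None (missing) if certainty is below the threshold."""
    return label if certainty >= threshold else None

# Hypothetical (label, certainty) pairs, invented for illustration only.
coded = [("immigration", 0.99), ("NHS", 0.81),
         ("economy general", 0.42), ("unemployment", 0.97)]

final = [apply_threshold(label, certainty) for label, certainty in coded]
missing_share = sum(label is None for label in final) / len(final)
print(final, missing_share)
```

In this toy example one of the four responses falls below the cut-off, so a quarter of the categories are set to missing.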
Based on this coding, we find the following trends in which issues are cited as the most important across different waves of the British Election Study. Immigration was the most frequently mentioned issue in the pre-election wave in April, but the economy re-emerged as the most important issue in the election wave. Unemployment gradually decreased in salience across the waves.
The label variables in the datasets are in the format miilabelWx, where x is the wave. We also included the probability variables as miilabelcertaintyWx. We encourage users to check the robustness of any findings to excluding some of the lower probability values and to run spot checks on the codings, particularly for less common categories.
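One simple way to run such a robustness check is to recompute category shares at several certainty cut-offs and see whether the ranking of issues changes. The sketch below assumes the label and certainty variables for a wave have already been read into a list of pairs. The rows, the cut-off values and the function name `category_shares` are all hypothetical.

```python
from collections import Counter

def category_shares(rows, cutoff):
    """Share of each category among responses with certainty >= cutoff."""
    kept = [label for label, certainty in rows if certainty >= cutoff]
    if not kept:
        return {}
    counts = Counter(kept)
    return {label: n / len(kept) for label, n in counts.items()}

# Hypothetical rows standing in for (miilabelWx, miilabelcertaintyWx) values.
rows = [("immigration", 0.99), ("immigration", 0.55),
        ("economy general", 0.95), ("NHS", 0.60)]

for cutoff in (0.5, 0.7, 0.9):
    print(cutoff, category_shares(rows, cutoff))
```

If a finding only holds at the loosest cut-off, it is probably driven by the least certain codings and deserves a manual spot check.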
We plan to develop a new “most important issue” codebook and manually check the coding of the responses, but the auto-coded data should be useful for many purposes until then.
We use an algorithm to cluster similarly spelled words so that, for instance, “westminister” and “westminster” are treated as the same word.
The process proceeds iteratively.
- The frequencies of all words within the text are calculated.
- The generalized Levenshtein distance is calculated between all pairs of words and normalized by dividing it by the average of the two words’ lengths.
- The words are then ordered from most to least frequent.
- For each word (starting with the least frequent), we take the pair involving that word that has the smallest distance to another term.
- If this minimum distance is below a threshold, we list the relevant substitution, with the more frequent term replacing the less frequent term.
These steps are repeated until there are no pairs with a distance below the threshold. In practice, this means that similar spellings are combined. It also often means that plurals are combined with the singular e.g. “immigrants” and “immigrant”.
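The steps above can be sketched in Python as follows. This is a simplified illustration, not a translation of the R code linked below. It uses the plain Levenshtein distance with unit costs, applies one substitution per pass before recomputing, and uses a threshold of 0.2, which is an assumption because the text does not state the value used. The function names are hypothetical.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Distance divided by the average of the two words' lengths."""
    return levenshtein(a, b) / ((len(a) + len(b)) / 2)

def cluster_spellings(words, threshold=0.2):
    """Map rarer spellings to the more frequent near-identical spelling.

    threshold=0.2 is an assumed value chosen for this illustration.
    """
    counts = Counter(words)
    replacements = {}
    merged = True
    while merged:  # repeat until no pair is below the threshold
        merged = False
        # Visit words from least to most frequent (ties broken alphabetically).
        for word in sorted(counts, key=lambda w: (counts[w], w)):
            others = [w for w in counts if w != word]
            if not others:
                break
            nearest = min(others, key=lambda w: (normalized_distance(word, w), w))
            if normalized_distance(word, nearest) < threshold:
                # The more frequent term replaces the less frequent one.
                if counts[nearest] >= counts[word]:
                    keep, drop = nearest, word
                else:
                    keep, drop = word, nearest
                counts[keep] += counts.pop(drop)
                replacements[drop] = keep
                merged = True
                break  # frequencies changed, so start a new pass

    def resolve(w):
        # Follow chains such as a -> b -> c to the final spelling.
        while w in replacements:
            w = replacements[w]
        return w

    return {w: resolve(w) for w in replacements}

# Toy word list, invented for illustration only.
words = (["westminster"] * 5 + ["westminister"] + ["immigrant"] * 3
         + ["immigrants"] * 2 + ["economy"] * 4)
result = cluster_spellings(words)
print(result)
```

With this toy input, the misspelling “westminister” is mapped to “westminster” and the plural “immigrants” to “immigrant”, while “economy” is left alone because no other word is close enough to it.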
The code for the clustering algorithm is available at: https://github.com/jon-mellon/mellonMisc/blob/master/R/spellCorrect.R and the full code for the coding is available at https://github.com/jon-mellon/BritishElectionStudyMiiCoding. We are still working on improving the algorithm, so suggestions on the code are welcome.