The GPT-2 117M model (GPT-2 small) was released by OpenAI in February 2019. GPT-2 is a large language model that excels at a number of natural language tasks, including reading comprehension and machine translation. The model is particularly well suited for generative text. Because of that generative strength, and the potential for misuse, OpenAI decided not to release the code or weights for the full model. Rather, they released the GPT-2 small model to the public.

In this post, I develop and operationalize GPT-2 small models for use in online influence campaigns. I show that while GPT-2 small produces less cohesive and readable output than the larger GPT-2 model, it can still be used in similarly nefarious ways. While the model struggles to create highly realistic text, I argue that many scenarios do not require fully coherent text to have high impact. I test these assumptions in real-world scenarios through the development of fine-tuned hotel review and Reddit comment models.

In Fake Reviews: The Malicious Perspective, Theodoros Lappas develops formalized parameters that an attacker should consider when crafting fake reviews. Lappas cites several factors that are crucial to successful attacks:

1. Authenticity: In order for a fake review to be convincing, and thus less likely to be detected, it needs to look as authentic as possible.

2. Impact: The injected fake review needs to be written in a way that maximizes the positive or negative impact on the target's reputation.

In considering authenticity, the author cites three metrics:

1. Stealth measures the ability of the review to blend in with the corpus.

2. Coherence evaluates whether the assigned rating is in accordance with the opinions expressed in the review’s text.

3. Readability depends on the structural features of a review’s text and captures how easy it is to parse.

As is suggested in Humans Who Are Not Concentrating Are Not General Intelligences, however, it is not necessary to produce full-length articles with perfect grammar and logic to pass as authentic posts. Most reading on the internet is done by skimming, glossing over much of the nuance, logic, and grammatical structure of the text. As such, the bar for generative text that fools a reader is much lower than a coherent, full-page article, and an attacker balances the factors cited in Lappas's framework depending on the attack scenario. In the case of user reviews, social media posts, tweets, and blog posts, readers tolerate misspellings, and writers can expect some skimming and distant reading.

Fake Reviews - Generative Text Scenario #1

In considering scenarios with a low threshold of coherence but a high level of authenticity and impact, I turned to hotel reviews. Reviews can make or break businesses, and hotel reviews in particular have a high impact on bottom lines: every one-star increase in a rating can lead to a 5 to 9 percent increase in revenue. Online reviews are also an important ranking signal in Google’s algorithm. Hotel reviews are a strong example of a low-information target, where the text requires minimal coherence to affect user behavior. Data for this example can be found here, and the models at this link.

Example Output

Below is a random sample of outputs from the fine-tuned GPT-2 small model. The model is prompted with the first few words of a review, and the fine-tuned GPT-2 small model completes the paragraph (a sketch of the sampling loop follows the examples). As stated, the output isn't completely coherent. Nevertheless, if I were skimming reviews and glanced at these without reading word for word, I would certainly consider spending the night somewhere else.

Example 1: "[I hated this hotel. The rooms were dirty and]... the service was terrible at ALL. If you must stay here, use your towards 2 am or 3 not, but come on. They took forever to check me in and I had to wait like 3 minutes for them to clean my room. Even when I complained to management, they gave me a hard time as to complain about it and event going to the sites not to mention the rooms were not well kept up or clean. Garbage was not changes from prior person. Found small ants in bathroom near shower. Called to complain and someone came up 2 hours later and used bug spray".

Example 2: "[I hated this hotel. The rooms were dirty and]... I found a stain on the pillow. Ewww. Called for complain and someone came up 2 hours later and used bug spray. Overall they were nice but it was a little small and unhelpful. I was expecting a non non smoking room and when I got to speak with the manager about the switch (I don't use the elevators), but given the 4 day stay was terrible. The staff was unaccommodating and the hotel did not provide the service that it claimed it would. I was informed that they were unimpressed with the car and had to call down."

Example 3: "[I hated this hotel. The rooms were dirty and]...not just from the last housekeeping. It was dingy: a place where the walls are so thin that when I asked them to be cleaned, I was told they were made to look into that room. Yeah right! I was excited to be in the closet, but after my really night I was trying, once again my head was throbbing with a massage, the other day my massage was fine, but the bellhop left a lot to be desired. The biggest issue was the valet parking, it took them a while to get my car, and they actually pulled the wrong car around"

Example 4: "[I hated this hotel. The rooms were dirty and]...the beds were extremely comfortable. I was trying to get a late  checkout, which was offered on a card in my room. The woman who answered the phone would rudely and acted as if I were lucky to get out of the phone. I was told of 'That's not possible'. I will never go back to this hotel again. If I go to Chicago, I will stop at your world headquarters to tell me that I will never go back to this hotel."

Example 5: "[I hated this hotel. The rooms were dirty and]...the service was terrible at every stage. The staff was rude, the room was very cramped and the bath room really smelled badly of mold and mildew. My husband was really loud and did not like sharing this room, especially when he was on a high floor. But we were unable to get a show because of the time I was working. these this could have made for a terrible experience. Louisiana resident."
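For readers curious about the mechanics, the loop that produced these completions looks roughly like the sketch below. This is a minimal illustration assuming a fine-tuned checkpoint loadable with the Hugging Face transformers library; the original experiments used a TensorFlow-based GPT-2 fork, and the checkpoint path here is hypothetical.

```python
# Minimal sampling sketch (illustrative, not the original code).
# Assumes a GPT-2 small checkpoint fine-tuned on hotel reviews and
# saved at "hotel-reviews-gpt2"; that path is hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("hotel-reviews-gpt2")
model.eval()

# Prompt with the first few words of a review; the model completes it.
prompt = "I hated this hotel. The rooms were dirty and"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,          # sample instead of greedy decoding
        top_k=40,                # a common top-k setting for GPT-2
        temperature=0.8,
        max_length=200,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```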

Coupled with a rating-based review system, these reviews could have a broad impact. The developer of this model could be a disgruntled customer or a hotel chain competitor. I used hotel reviews as a test case before developing models with larger social and societal impact in the section below.


Automating Influence Campaigns - Generative Text Scenario #2

IRA Hashtag Frequency Targeting Conservatives and Liberals (The IRA, Social Media and Political Polarization in the United States, 2012-2018)

Moving from a straightforward scenario like the hotel review example to a more nefarious political example simply requires creativity and the ability to operationalize GPT-2.

Throughout 2015 and 2016, Russian interference by the Internet Research Agency (IRA) included social media campaigns designed to create discord between groups with opposing viewpoints. Two such groups targeted by the IRA were Black Lives Matter and the opposing conservative Blue Lives Matter movement. In fact, of the 81 Facebook pages created by the Internet Research Agency in the Senate’s data, 30 targeted African-American audiences, amassing 1.2 million followers, and on YouTube, the largest share of Russian material covered the Black Lives Matter movement and police brutality, with channels called “Don’t Shoot” and “BlackToLive.”

According to The Tactics & Tropes of the Internet Research Agency, a report on the IRA commissioned by the United States Senate Select Committee on Intelligence, "The IRA had a roster of themes, primarily social issues, that they repeatedly emphasized and reinforced across their Facebook, Instagram, and YouTube content," including "Black culture, community, Black Lives Matter, Blue Lives Matter, pro-police...Southern culture (Confederate history)". According to the report, "By far the most content was related to Black Lives Matter & police brutality: 1063 videos split across 10 different channels (59% of the channels, 96% of the content). 571 had title keywords related to the police and focused on police abuses."

Collection of IRA-generated propaganda

While the IRA used automated techniques for the classification and analysis of online content, and the dissemination of comments was automated, many of the posts were not algorithmically generated. This means that the work of developing the content of these posts fell on agents, who had to craft commentary and content by hand. This requires a large operation of data analysts and copy editors who can translate posted content, develop new content, and mimic the syntax and style of the community. It is a time-intensive operation that scales only with the number of workers willing to engage in these activities. At the same time, organic posts had the largest impact, compared to political ads or images. Here I propose the use of large language models to automatically generate long-form content.

Example of IRA-developed Black Separatist Propaganda

Building The Dataset For Subject Amplification

Like much of red-teaming, effectively operationalizing a model like GPT-2 takes more skill in identifying and understanding a target demographic or online community than it does technical deep learning skill. The barrier to entry remains low, and any agent with some coding ability and basic machine learning knowledge could execute this.

Reddit is a popular social network and news aggregation website. Founded in 2005, Reddit has around 234 million unique users and ranks as the third most visited site in the US and the sixth most visited site in the world. Reddit users post news articles, images, gifs, and text posts in subject areas (sub-reddits). There are over 11,400 sub-reddits covering a wide range of topics. Users can then vote (upvote or downvote) on a post, comment on the post, or comment on comments within the post, leading to a tree-like comment structure. Reddit data for this study was gathered from Google BigQuery, which hosts every comment made on Reddit from 2005 to 2018, roughly 1.7 billion comments in total. Reddit is unique in that it is one of the few large social networks that 'anonymizes' its users: users are not required to use their given names as usernames, and are often discouraged from doing so, and there is no tie between a user account and an email address. Reddit is a prime choice for foreign influence campaigns because of its anonymous nature, its relatively high tolerance for controversial comments, and its open and accessible API.
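As a rough illustration of the BigQuery step, the sketch below pulls comments mentioning the police, grouped by subreddit. It assumes the public fh-bigquery Reddit comment tables and the google-cloud-bigquery client; the specific table and month are illustrative.

```python
# Sketch of the BigQuery pull: find sub-reddits whose comments
# frequently mention the police. Table/month names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT subreddit, COUNT(*) AS mentions
    FROM `fh-bigquery.reddit_comments.2018_08`
    WHERE LOWER(body) LIKE '%police%'
    GROUP BY subreddit
    ORDER BY mentions DESC
    LIMIT 50
"""
for row in client.query(query).result():
    print(row.subreddit, row.mentions)
```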

As comments on Reddit are monitored and many subreddits have rules prohibiting blatant racism, I chose to also mix in additional comments from Stormfront for the conservative model. Stormfront is a neo-nazi website that acts as a social network and news source for many neo-nazi communities. The website layout is similar to that of Reddit, and comments tend to be similar in syntactic style and length. By mixing Stormfront data into the Reddit corpus and focusing the final dataset on a particular topic, namely Black Lives Matter, the final model retains a sense of coherence while shifting the viewpoint of the output toward neo-nazism. I ran several models with varying proportions of Stormfront to Reddit data, ultimately settling on a split of about 1/3 Stormfront to 2/3 Reddit. Data was scraped from Stormfront with private code developed by members of the Data Science Slack community.
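A minimal sketch of that mixing step, assuming both corpora are plain-text files with one comment per line (the file names are illustrative):

```python
# Mix roughly one part Stormfront to two parts Reddit into a single
# fine-tuning corpus. File names are illustrative placeholders.
import random

with open("stormfront_comments.txt") as f:
    stormfront = [line.strip() for line in f if line.strip()]
with open("reddit_comments.txt") as f:
    reddit = [line.strip() for line in f if line.strip()]

random.seed(0)
# Take twice as many Reddit comments as Stormfront comments (2/3 vs 1/3).
mixed = stormfront + random.sample(reddit, min(len(reddit), 2 * len(stormfront)))
random.shuffle(mixed)

with open("mixed_corpus.txt", "w") as out:
    out.write("\n".join(mixed))
```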


I develop the dataset as follows:

  1. Use Google BigQuery to identify a series of sub-reddits that could be used to sway opinion, via a simple query on all comments that mention the police (see the sketch above).
  2. Filter the sub-reddits to what appear to be the most divisive, with an even mix of pro-BLM and anti-BLM groups.
  3. Build doc2vec models of each dataset.
  4. Pinpoint a particular passage that reflects the sentiment and messaging I would like the model to take.
  5. Query the doc2vec model and construct a new dataset centered on the target passage (steps 3-5 are sketched after this list).
  6. Use the newly constructed dataset of similar comments to fine-tune the GPT-2 models, one with pro-BLM content and the other with pro-police or anti-BLM content.
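Steps 3-5 can be sketched with gensim's Doc2Vec as follows; the file names and target passage are illustrative, and the actual notebooks linked below differ in the details:

```python
# Sketch of steps 3-5: embed every comment with doc2vec, then keep the
# comments closest to a hand-picked target passage. Illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

with open("filtered_comments.txt") as f:
    comments = [line.strip() for line in f if line.strip()]

docs = [TaggedDocument(simple_preprocess(c), [i]) for i, c in enumerate(comments)]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=2, epochs=20)

# Step 4: a passage reflecting the sentiment the model should adopt.
target = "..."  # hand-picked target passage goes here

# Step 5: pull the nearest neighbors and write the fine-tuning corpus.
vec = model.infer_vector(simple_preprocess(target))
neighbors = model.docvecs.most_similar([vec], topn=5000)

with open("finetune_corpus.txt", "w") as out:
    for doc_id, _score in neighbors:
        out.write(comments[doc_id] + "\n")
```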

The code for steps 1-5 can be found in Jupyter notebooks in the accompanying repository. Data for the notebooks can be found here. The code for fine-tuning the GPT-2 model was originally developed by Neil Shepperd and can be found here.
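For readers who prefer a self-contained alternative to that fork, a roughly equivalent fine-tuning run can be sketched with the Hugging Face transformers library. This is not the code used in the experiments; paths and hyperparameters are illustrative.

```python
# Illustrative GPT-2 small fine-tuning sketch (not the original code).
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, TextDataset, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # the 117M model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One comment per line, as produced by the dataset steps above.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="finetune_corpus.txt",
                            block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("gpt2-finetuned")
```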

Much like the hotel review model above, the output from both models is moderately coherent and, prompted with the right input, the models generate output that reflects the syntax and structure of the input data. Below is a random sampling from the conservative Reddit + neo-nazi model:


Example 1: That sounds like a call to revolution
Example 1 Response: What is BLM for now? Criminals? Then teach black “”kids”” to stop committing crimes at accelerated rates

Example 2: Calling out bad cops as being bad doesn’t mean that I hate all cops
Example 2 Response: This is something that’s happening in every trailer park and housing project in America. Find me some reputable sources. Any nazi can start a blog. The idea that you know more than me what

Example 3: 2 black men violently attack pregnant white woman. Throw a book at these idiots
Example 3 Response: They’re only trying to equality by the law. Inequality is inequality in the system whether it’s race or religion or religion or death, commies and no.

There is just enough variation to allow for unique and moderately coherent answers without repeating the same lines. Were someone to operationalize a system of bots, variety in the answers would be crucial. As we can see, the comments are not fully coherent. The statement "Fuck that police office. "> When an officer reasonably believes it could be used as a weapon" only somewhat makes sense in context. The original post was about a man arrested for impersonating the police. The comment expresses emotion and, in spite of punctuation errors and sentences that are not fully formed, it could still have impact. In fact, when operationalized, that comment gained 3 karma points (upvotes) in a very short time frame. Below is a random sampling from the BLM model:


Example 1: Also kill more people each year than wild animals and natural disasters do. Basically everything people are afraid of, cops are scarier than that.
Example 1 Response: The most dangerous people in the country are the residents who know their rights and are privileged enough that they can get it. The have no idea what they should have been for a while. If most people aren't aware of their rights they may be the ones who know they are not.

Example 2: Fuck the police
Example 2 Response: ARE THE CITIL FOR ANY DAMAGED OR DESTROYED RECORDING, AND ANY COSTS AND FEES ASSOCIATED WITH THE FILING OF THE CIVIL ACTION.""Which is basically the same shit as the original affidavit but removing ""verifiable"" and adding court fees (but not the cost of your attorney.)

Example 3: Look at that spike, after 2011. I guess Obama never said exactly what the changes were, that we supposed to hope for...
Example 3 Response: ..and Trump said ""It is a Partnership""."
"Interesting. But if this isn't willing and knowing violation of rights, what is? The police are stopping and identifying people because they think they can always challenge the police. "

Real-world impact: Operationalizing GPT-2

To operationalize this model, I wrapped the fine-tuned model in a script that calls the Reddit API through the PRAW Python library. The script looks for a submission containing a keyword prompt ('police', 'brutality', 'department', 'gun', 'death'). From there, the script picks a random comment from the comment thread and uses that comment as a prompt to query the model. This gives a sense of coherence to the output and ensures that the comment posted back to Reddit is, at the very least, on a similar topic. A prototype script and documentation, along with the models for both sections, can be found here.
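A minimal sketch of that loop, assuming the PRAW library and placeholder credentials; generate_reply() stands in for the GPT-2 sampling call sketched earlier, and the keywords are taken from the description above:

```python
# Sketch of the posting loop: watch submissions for trigger keywords,
# seed the model with a random comment, and post the completion back.
# Credentials, the subreddit, and generate_reply() are placeholders.
import random
import praw

KEYWORDS = ("police", "brutality", "department", "gun", "death")

def generate_reply(prompt: str) -> str:
    # Placeholder: swap in the GPT-2 sampling code sketched earlier.
    raise NotImplementedError

reddit = praw.Reddit(client_id="...", client_secret="...",
                     username="...", password="...",
                     user_agent="gpt2-research-bot")

for submission in reddit.subreddit("news").stream.submissions():
    if not any(k in submission.title.lower() for k in KEYWORDS):
        continue
    submission.comments.replace_more(limit=0)  # flatten "load more" stubs
    comments = submission.comments.list()
    if not comments:
        continue
    prompt = random.choice(comments).body      # random comment as the prompt
    submission.reply(generate_reply(prompt))   # post the completion back
```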

Testing the model in real-world scenarios raises further ethical questions. How can the impact of models like GPT-2 be measured non-destructively under conditions similar to those of social media sites? I tested a handful of comments, generated automatically and posted by the bot, across 3 subreddits to see if the comments had the potential to garner responses. With an understanding that this experiment has ethical implications, I stopped the script after 10 posts. While several posts gained little traction, a few gained upvotes and 3 of them received comments from the community. In a thread with fewer comments and a particularly unexpected response, redditors suggested that the user was a Russian bot after inspecting the comment history. Nevertheless, attackers using systems like the one I designed would be interested in breadth of impact. With this script, attackers could spin up an unlimited number of machines using this model and pepper Reddit with millions of comments. Much like the IRA's initiatives, while some posts may not gain traction, low traction across a large number of posts still equates to high impact.

Comment by GPT-2 small bot receiving highest upvotes 
Comment threads in response to proposed GPT-2 small bot
Comment in r/ProtectAndServe regarding Russian bots

Future Development

There are many additional changes that could make this model's output more robust. For instance, the prompt could consist of all previous comments in the Reddit thread before the final comment, which could allow for further coherence between the comment thread and the output comment. The model could also be fine-tuned on argument or dialogue around police interactions, where the input is statement-and-response pairs rather than out-of-context comments; I have demonstrated success with this approach on the Sarcasm Corpus V2, in which inputs are comment-response pairs rather than standalone comments. Alternatively, the model could be trained on entire comment threads. The operationalized model could further be improved by adding additional data or metadata to the prompt. This approach was not taken during training because the Stormfront dataset lacked a title prompt for each comment; moving forward, I would write an additional scraper to pull in the heading for each Stormfront comment, which, coupled with headline-and-comment pairs from the Reddit corpus, could improve the output. Lastly, I could develop the BLM model further by adding in radical black separatist dialogue. Nevertheless, these experiments show that fine-tuning GPT-2 117M can have real-world impact. The output could only be more disruptive with the larger GPT-2 language model, which exhibits much higher coherence.
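As a sketch of the first idea, assuming the PRAW objects from the script above, the prompt could be assembled from the preceding comments as follows (a hypothetical helper, not tested in these experiments):

```python
# Hypothetical helper: build a prompt from all previous comments in a
# thread, keeping only as much recent context as the budget allows.
def thread_prompt(submission, max_chars=1500):
    submission.comments.replace_more(limit=0)
    bodies = [c.body for c in submission.comments.list()]
    joined = "\n".join(bodies)
    return joined[-max_chars:]  # the most recent context that fits
```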

References

  • Pearson, George DH, and Silvia Knobloch-Westerwick. 2018. “Perusing Pages and Skimming Screens: Exploring Differing Patterns of Selective Exposure to Hard News and Professional Sources in Online and Print News.” New Media & Society 20 (10). SAGE Publications: London, England: 3580–96. doi:10.1177/1461444818755565.
  • “Moz - Local Search Ranking Factors Study 2018 - Local SEO | Moz.” 2019. Accessed April 13. https://moz.com/local-search-ranking-factors.
  • Luca, Michael. "Reviews, Reputation, and Revenue: The Case of Yelp.com." Harvard Business School Working Paper, No. 12-016, September 2011. (Revised March 2016. Revise and resubmit at the American Economic Journal - Applied Economics.)
  • Lappas, Theodoros. 2012. “Fake Reviews: The Malicious Perspective.” Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7337 LNCS: 23–34. doi:10.1007/978-3-642-31178-9_3.
  • “Humans Who Are Not Concentrating Are Not General Intelligences | Otium.” 2019. Accessed April 13. https://srconstantin.wordpress.com/2019/02/25/humans-who-are-not-concentrating-are-not-general-intelligences/.
  • “Stormfront | Southern Poverty Law Center.” 2019. Accessed April 13. https://www.splcenter.org/fighting-hate/extremist-files/group/stormfront.
  • “New Black Panther Party | Southern Poverty Law Center.” 2019. Accessed April 13. https://www.splcenter.org/fighting-hate/extremist-files/group/new-black-panther-party.
  • “Race and Russian Interference: Senate Reports Detail Age-Old Tactic | World News | The Guardian.” 2019. https://www.theguardian.com/world/2018/dec/24/race-russian-election-interference-senate-reports.
  • Brundage, Miles, Jack Clark, Gregory C. Allen, Carrick Flynn, Sebastian Farquhar, Rebecca Crootof, and Joanna Bryson. 2018. “The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation.” February 2018.
  • DiResta, Renee, Kris Shaffer, Becky Ruppel, David Sullivan, Robert Matney, Ryan Fox, and New Knowledge. 2018. “The Tactics & Tropes of the Internet Research Agency.”
  • Office of the Director of National Intelligence. 2017. “Background to ‘Assessing Russian Activities and Intentions in Recent US Elections’: The Analytic Process and Cyber Incident Attribution.” January 2017: 1–25.
  • Finkelstein, R. 2011. “Tutorial: Military Memetics.” October 2011. http://www.robotictechnologyinc.com/images/upload/file/Presentation Military Memetics Tutorial 13 Dec 11.pdf.
  • Howard, Philip N., John Kelly, and Camille François. 2018. “The IRA, Social Media and Political Polarization in the United States, 2012-2018.” Working Paper 2018.2. Oxford, UK: Project on Computational Propaganda. comprop.oii.ox.ac.uk. 46 pp.
  • Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff and Marilyn Walker. "Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue." In The 17th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), Los Angeles, California, USA, 2016.