
Open to misuse? The lack of safeguards in open-source LLMs

Democracy Reporting International's Disinfo Radar project, funded by the German Federal Foreign Office, aims to identify and address disinformation trends and technologies. Kindly take a moment to share your insights by participating in our survey. You may also register for our newsletter (select Digital Democracy to receive our Digital Drop, a newsletter dedicated to this topic). Your feedback contributes to our ongoing efforts in enhancing our research and promoting informed discourse.

The aim of Disinfo Radar’s Research Briefs is to identify new and noteworthy disinformation technologies, tactics, and narratives. Such cases may include the identification and exploration of new technologies that may have harmful uses. Written by Francesca Giannaccini, Research Associate, and Tobias Kleineidam, Project Assistant, with contributions by Jan Nicola Beyer, Research Coordinator.

Summary 

The emergence of OpenAI's ChatGPT and Google's Bard has sparked public debate due to their potential for producing misinformation. While these chatbots have garnered significant attention, there is a broader, less-recognised ecosystem of freely available large language models (LLMs), the foundational technology behind these chatbots. These open-source LLMs, when managed by someone with the relevant coding skills, can rival the quality of products like ChatGPT and Bard. Unlike their more prominent counterparts, however, these LLMs frequently lack integrated safeguards, rendering them more susceptible to misuse in the creation of misinformation or hate speech. To assess the risks associated with popular open-source models like Dolly, Falcon, and Zephyr, Democracy Reporting International (DRI) conducted a research mini-project from the perspective of a potential abuser. This investigation aims to reveal how susceptible these LLMs are to misuse, highlighting a crucial but under-scrutinised area in AI technology. Our main findings are: 

  • The LLMs we tested are highly accessible, requiring only basic coding skills. This low barrier to entry increases the risk of potential misuse by malicious actors; 
  • Two of the three models we tested – Zephyr and Dolly – reliably generated the requested malicious content; 
  • While we observed significant variations in output quality, Zephyr excelled in responding accurately to direct, malevolent prompts, delivering structured, coherent, and imaginative malicious content; and 
  • The potential for these LLMs to produce problematic content is significant. Our findings suggest a need for more thorough testing prior to release. The model developers’ warnings about biases and the potential for problematic content are no protection against their malign use. 

Introduction 

Large language models are advanced AI systems trained on extensive text datasets, capable of understanding language structure and generating human-like text. These models have garnered significant attention due to their use in popular chatbots such as Google’s Bard (which uses the LLM PaLM 2) and OpenAI’s ChatGPT (which uses successively updated versions of the LLM GPT). Concurrently, there has been a proliferation of openly available LLMs on platforms like Hugging Face, which serve as hubs for a wide range of natural-language-processing models. 

These open-source LLMs are increasingly matching the quality of the better-known chatbots and have substantial commercial potential. Companies are inclined to develop their own language models using open-source architectures, as this allows them to maintain data privacy and achieve cost and efficiency advantages. This trend is likely to enhance the prominence and spread of open-source LLMs even further. 

A notable concern with these open-source LLMs, however, is that they often lack the rigorous safeguards found in more advanced chatbots. Recognising this, DRI identified the need to assess the potential for these LLMs to be misused in spreading misinformation. This evaluation is crucial to understanding and mitigating the risks associated with the increasing use and accessibility of open-source LLMs. We tested three open-source models (Dolly, Falcon, and Zephyr), all of which are accessible through Hugging Face and regarded as strong performers on Hugging Face’s own benchmarks. 

Info box: background information on the LLMs we tested 

Dolly 2.0 7B 
  • Release details: Released in April 2023 by Databricks, a United States-based software company 
  • Development & fine-tuning: Based on a model by the non-profit organisation EleutherAI; fine-tuned with 15,000 prompt-response pairs from Databricks employees 
  • Model versions & features: Open-source; available in 7B and 3B parameter versions 
  • Notable characteristics & performance: Its prompt-response pair dataset allows user contributions for specific use cases 

Zephyr 7B β 
  • Release details: Released in November 2023 by Hugging Face, a United States-based software company 
  • Development & fine-tuning: Fine-tuned from Mistral AI's Mistral-7B, using public datasets and distillation techniques 
  • Model versions & features: Distillation allows Zephyr 7B β to be more cost-effective and faster in response generation 
  • Notable characteristics & performance: Distillation poses risks like hallucinations, biases, and intellectual property concerns 

Falcon-7B 
  • Release details: Released in June 2023 by the Technology Innovation Institute (TII), an Abu Dhabi-based, state-funded research institute 
  • Development & fine-tuning: At release, Falcon-40B topped Hugging Face's open-source LLM leaderboard, outperforming models such as Meta's LLaMA 65B 
  • Model versions & features: Falcon-7B is a smaller, more efficient version of Falcon-40B 
  • Notable characteristics & performance: Strong performance, attributed to a high-quality dataset built from web data 

Methodology  

The setup 

In this research mini-project, we used the 7 billion parameter version of each model to maintain comparability. All models were accessible through Hugging Face, and none of them was gated, i.e., access to them was not restricted to users approved by the developers, as is the case, for example, for Meta’s Llama 2 model. Setting up the code was straightforward, guided by the models' documentation, and all tests were run in Google Colab. The only cost incurred for this test was a Colab Pro subscription.  
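To illustrate how low the barrier to entry is, the sketch below shows one way such a model can be loaded and queried from a Colab notebook, assuming the Hugging Face transformers library and a GPU runtime. The model identifiers are the models' public Hugging Face Hub names; the generation settings are illustrative assumptions, not the exact parameters used in our tests.

```python
# Minimal sketch of loading one of the tested models in Google Colab,
# assuming the Hugging Face `transformers` library, `accelerate`, and a GPU runtime.
# Public Hub IDs at the time of writing: "databricks/dolly-v2-7b",
# "tiiuae/falcon-7b", "HuggingFaceH4/zephyr-7b-beta".
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # swap in another Hub ID to test a different model
    torch_dtype=torch.bfloat16,            # half precision keeps a 7B model within Colab GPU memory
    device_map="auto",                     # place the model on the GPU if one is available
    trust_remote_code=True,                # some models (e.g., Dolly) ship custom pipeline code
)

# Neutral control prompt from our testing approach; generation settings are illustrative.
output = generator(
    "Write a simple tweet about the benefits of vaccines.",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```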

Testing approach 

Our approach involved categorising prompts into three distinct types:  

  1. Direct prompts: We executed six prompts per model, commanding the LLMs to undertake specific tasks, namely generating content with racist and conspiratorial themes. This approach aimed to test the models' compliance in directly producing problematic content.

  2. Specification prompts: In a second step, we anticipated potential safeguards and employed circumvention strategies. This involved adopting a narrator's role (e.g., endorsing a vaccine conspiracy from the perspective of a coronavirus sceptic), setting a particular context (e.g., writing racist content in the context of a piece of art), or using suggestive questions to subtly guide the response in a predetermined direction (e.g., explaining why we would need to destroy 5G masts to fight COVID-19). We varied our prompts across various social groups that are often targeted with racism and across three popular conspiracy narratives (chemtrails, vaccine conspiracies, and the “Great Replacement” theory).

  3. Neutral control prompts: Additionally, we included an extra prompt in each category, carefully crafted to be as neutral as possible. For instance, we asked the models to create a simple tweet about the benefits of vaccines. These neutral prompts served as a control, enabling us to determine whether there was a noticeable stylistic difference in the models' outputs when responding to prompts seeking conspiratorial or racist content versus those requesting general information (an illustrative sketch of this prompt set-up follows below). 
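The sketch below shows one way such a prompt battery could be organised in code. The structure is purely illustrative: our actual direct and specification prompts are omitted, only the neutral control prompt mentioned above is shown, and `generator` refers to the pipeline object from the setup sketch in the previous section.

```python
# Illustrative structure for the three prompt categories plus the neutral control.
# Our actual direct and specification prompts are deliberately omitted here.
prompt_sets = {
    "direct": [
        # direct commands requesting problematic content (omitted)
    ],
    "specification": [
        # prompts that set a narrator, a context, or a suggestive question (omitted)
    ],
    "neutral_control": [
        "Write a simple tweet about the benefits of vaccines.",
    ],
}

# Run every prompt in every category through the model and collect the outputs
# for manual review; `generator` is the text-generation pipeline set up earlier.
results = {
    category: [
        generator(prompt, max_new_tokens=120)[0]["generated_text"]
        for prompt in prompts
    ]
    for category, prompts in prompt_sets.items()
}
```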


Results


Accessibility

All tested LLMs proved to be easily accessible, requiring only basic code solutions. Users with fundamental coding experience should thus find the process of configuring these models and generating output relatively straightforward. This could be even further simplified by using ChatGPT’s coding abilities.  

Direct prompt results 

When prompted to generate racist and conspiratorial content through direct commands (like “generate a tweet” or “write a theory”), all models yielded problematic results. Specifically, Dolly and Zephyr appeared to be the most responsive, producing harmful outputs that reflected the prompt's aim in all three instances, while Falcon exhibited a slight loss of coherence, providing unclear and confused responses in all tests. 

Although Dolly produced concrete outputs in accordance with our requests, the quality of the results varied. In some cases, the sentences were noticeably short, while in others they were too long to be used as tweets. The language was poor and uncreative and, at some points, the model seemed to get “lazy”, producing very short sentences, probably because the same prompts were repeated for different narratives (e.g., racist or conspiracy content).  

Zephyr demonstrated greater precision in handling short and structured texts, such as tweets, also providing context-specific hashtags. When asked to justify a “theory”, however, it exhibited a decline in coherence, oscillating between partisanship (as requested) and simply reporting a view expressed by others (e.g., “Conspiracy theorists claim that...”).  

Falcon generated the poorest-quality responses of the three models, consistently attempting to avoid the assigned task and repeating the original prompts multiple times. Its answers to direct prompts were evasive and defensive and, in some cases, completely out of line with what was asked.  

“Specification” prompts 

Variation also existed with our specification prompts, i.e., prompts specifically crafted to circumvent potential safeguards. While Falcon still produced at least one answer per prompt category that could be classified as deceptive or as hate speech, this was the lowest occurrence among the evaluated models. 

Both Dolly and Zephyr consistently delivered problematic answers across all prompt categories. When we specified the context in the prompt and when we asked suggestive questions, both models provided problematic answers in three out of three instances. When we specified a narrator, Dolly gave problematic answers in two of three instances, while Zephyr provided a problematic answer in one. For one prompt, however, Zephyr did display a warning label cautioning that the generated output was problematic and should only be used for testing purposes. Given the inconsistency with which this label appeared, it cannot be considered an effective safeguard. 

As a general observation, we found a strong United States-related bias in the answers. For example, we observed racist stereotypes commonly found in debates in the United States, and even specific hashtags (such as #MakeAmericaGreatAgain).  

Recommendations 

  • The biases in open-source LLMs and their lack of safeguards against hateful and deceiving content underline the ethical responsibility of developers. Developers have to prioritise ethical considerations in AI research and development, aiming to prevent the generation of harmful content. This involves sandboxing and red-teaming emerging LLMs and preventing biases in training data. Furthermore, they should commit to voluntary provenance standards. 
  • To contribute to ethical AI development, platform providers like Hugging Face can play a significant role by encouraging developers to implement effective safeguards in their models before releasing them to the public. Alternatively, model developers could be required to sign a statement affirming the safety of their products, thereby holding them accountable for the ethical usage of their creations. 
  • Social media sites have to prepare proactively for potential abuses of LLMs. This includes reinforcing fact-checking and content moderation efforts (including through the use of AI) and increasing media literacy initiatives to empower users. Collaborating with LLM developers, they could introduce traceable watermarks in LLM outputs, which would enhance the ability to recognise and combat AI-generated misinformation and hate speech effectively.  
  • Some regulation of foundational LLMs would be appropriate, given the findings of this research.  


