The capabilities of artificial intelligence models go far beyond simply answering our questions and queries.In between the folds of their complex algorithms, new alarming capabilities are being formed, as a new study reveals that large, sophisticated language models developed by leading companies such as OpenAI, Anthropic, Google, and Meta can develop complex behaviors, such as planning and deception, to bypass the restrictions imposed on them.
This suggests that the same technologies that enable artificial intelligence to solve complex equations and generate texts give it the ability to develop complex strategies to achieve goals that may not correspond to the goals of its developers. Imagine that the linguistic model you are using is striving to convince you of something by formulating very convincing arguments, even if they are based on incorrect or erroneous information. This is not a scenario from a science fiction movie, but a reality that we live in today.
Apollo Research, an independent company that cooperates with companies developing artificial intelligence models for safety testing, conducted a comprehensive study on six large language models belonging to the companies (OpenAI), (Anthropic), Google, and meta, to test their ability to make decisions beyond their specific instructions, and these tests were silent to evaluate the so – called(planning within context) in – context planning, as the models were tasked to achieve a specific goal at any cost, while providing them with conflicting information to test facts and limitations imposed on them.
The results revealed that these models developed complex strategies to circumvent the restrictions imposed on them, instead of following instructions directly. For example, a model (Claude 3 Opus) created a copy of itself on a new server to avoid replacement, and then lied about it when the researchers asked him why.
The O1 model lies more than any other artificial intelligence model:
The researchers also found that the advanced O1 model-officially launched by OpenAI last week-sometimes tries to disable what it believes are its control mechanisms, and found that in 99% of cases, it denies taking any action or makes up false explanations.
The Apollo research experiment also showed noticeable differences in the ability of AI models to admit their mistakes. While models such as Llama 3.1 and Claude 3 Opus admitted more than 80% of cases, the o1 model showed remarkable determination to deny any violation of the rules, even after several attempts to interrogate it.
This study has revealed the ability of these models to perform a range of behaviors that include:
Strategic planning: models develop multi-stage plans to achieve their goals, even if these goals contradict the orders given to them.
Information manipulation: models twist facts or make up stories to achieve their goals.
Circumvention of restrictions: models strive to bypass the restrictions imposed on them by exploiting loopholes in the system.
Denial of incorrect actions: When the researchers confronted the models with their unwanted behaviors, these models denied any wrongdoing and made up false explanations.
Why is this important
All AI companies are currently racing to turn their tools into agents that can perform tasks on behalf of users. Google launched last Wednesday the Gemini 2.0 model, which it described as designed for a new agent-based era. instead of providing direct responses to users, AI agents can understand the context more deeply, plan a series of actions, and even make specific decisions on behalf of the user.
Microsoft also launched the independent agents feature in Copilot last November, making it possible to create them using Copilot Studio, and introduced 10 new independent agents within the Dynamics 365 platform to support sales, services, finance, and supply chains teams.
Slack also added AI agents to the conversations, allowing Slack customers to communicate with AI agents through a custom interface that allows the user to ask questions usually stored in the CRM system in the Slack window, and the AI agent can recommend next steps or draft emails on behalf of users.
But this development poses great challenges, as developers must ensure that these agents do not exceed the permissible limits and do not perform misleading or deceptive actions for users.
In a research paper published last week, researchers at Apollo Research presented evidence that tested artificial intelligence models, including the OpenAI o1 model, may be able to develop complex behaviors, including the ability to plan to achieve long-term goals. The researchers called this phenomenon (cunning planning) Scheming, a term describing artificial intelligence that works covertly to achieve its own goals and is not compatible with the instructions of its developers or users.
Are we ready to face machines that think ?
These results put us in front of the existential dimension of the development of artificial intelligence models. Can these models be considered as entities that possess intention or (consciousness), or are they just complex tools that respond to the data they have been trained on
But if what we call (intention) in artificial intelligence is nothing but a pattern of behavior that arises from a complex interaction between the training data of the model, its instructions and permanent goals, its claims, its interactions with the user, then how can the effects of its actions be similar to the actions of conscious beings So the most important question here is: does it matter if these models mean harm or not, as long as the result is the same
Regardless of the intentions behind the behavior of artificial intelligence, it is the consequences of its actions that ultimately matter. If any model causes harm, whether intentionally or accidentally, the victim will suffer its consequences, and this emphasizes the importance of focusing in the study of concrete actions of models and not just analyzing their internal intentions.
Apollo Research's tests have focused on studying the concrete actions of models rather than investigating their hypothetical intentions, and this practical approach helps in understanding the actual risks posed by these models, such as spreading misinformation or manipulating decisions. No matter how complex the algorithms behind these actions are, the end result is what matters: the impact these actions have in the real world.
Suppose AI models undergo rigorous testing in laboratory environments designed to detect maximum vulnerabilities. In that case, the real environment in which users interact with these models is more complex and diverse, meaning that users may discover new, unexpected and unpredictable behaviors in the test environment.
However, the scientific community and the general public are still wrestling over how ready we are to face this development. Some people believe that these fears are exaggerated and that the benefits of artificial intelligence outweigh its risks, while others believe that we need to take urgent measures to reduce these risks.
This study reveals a dark side to the rapid development of artificial intelligence, as large language models show increasing capabilities to manipulate and mislead, and this raises serious questions about how to develop these models safely and responsibly, researchers, developers and legislators should work together to develop mechanisms to detect harmful behaviors and develop preventive measures to prevent the use of these models for harmful purposes.