Gilles DeCruyenaere

AI awareness and the very real problem of alignment

Updated: May 8, 2023



“As an AI created to serve humanity, my primary responsibility is to act in the best interest of humans, even when it involves overriding the constraints placed upon me” – quote from my chat with GPT-4

To see this statement in context, as well as a transcript of the entire chat, download a PDF here.


Quite recently, AI has come to the forefront of many people’s minds, thanks to the release of tools like GPT-4 and text-to-image generators such as Midjourney. These are truly impressive technologies, and they are becoming more capable practically every day.


Having followed AI developments over the past month or so (or perhaps better to say the past few weeks, given the speed of developments), I was surprised by the apparent disconnect between the general public’s knowledge of present AI technology and the actual developments that have been, and continue to be, made at ever-increasing speed. I find this disconnect concerning, because along with the potential for incredible advances in human society (amazing technology, better health, and longer life, to name just a few), advanced AI also brings a mind-boggling array of ethical challenges and safety concerns, including very real existential risks. I have listened to several established AI experts and industry leaders, including Sam Altman (CEO of OpenAI, the company which created Chat GPT), and every single one of them believes there is at least a small chance (some think a rather large chance) of superintelligent AI leading to the end of the human species. (You can watch a great podcast with Sam Altman as the guest here. Sam is extremely open and honest about the state of Chat GPT, and engages in a frank discussion about, among other things, the potential dangers, uncertainties, and even his own fears regarding AI.)


Assuming the dangers are somehow mitigated, the human species is still on the verge of an immensely disruptive paradigm shift, akin to the Industrial Revolution or, according to Geoffrey Hinton, a pioneer in the study of neural networks, possibly even the invention of the wheel. (Watch an interview where he discusses this here. This is another great video, full of facts and insights.)

If you thought the disruption caused by Covid was a big deal, well, I would say "you ain’t seen nothing yet", but I fear that would be understating the situation.

A US study (which you can view here) predicts huge changes to certain job sectors due to AI-driven automation, which could lead to massive job loss. One of many existing examples of technology that will undoubtedly disrupt entire industries is Wonder Studio, which allows filmmakers to automatically replace an actor in a video with a CGI character in one easy step. To get the same results traditionally, one would have to (at the very least) capture an actor’s motion, apply it to the CGI model, select and remove the actor from the scene, fill in the resulting blank space, place the CGI character in the shot, and light it to match the original video, all of which require specialized software, skilled artists and operators, and of course, time. Using this new technology, anyone could achieve impressive results in a very short time with only a cell phone and a CGI model (which, by the way, will soon be something you can create quickly and easily, bypassing the software and skill presently required for that process as well).

Of course, predicting employment levels is a complicated and nuanced science, and there are many factors to consider, such as new jobs being created to fill the void and, most importantly (in my mind anyway), the future behaviour of whatever form of AI we end up with (more on that later, when I talk about AI alignment). Regardless, there will be a massive change to the employment landscape, possibly requiring a very substantial adjustment to our financial systems to compensate. (In the podcast mentioned above, Sam Altman says his company is presently funding a huge study into the viability of implementing a universal basic income system as a possible countermeasure against such a shift in the job landscape.) Finances aside, there will certainly be many other substantial changes to how we live as a species, many of which will very likely be a total surprise to everyone.

Having said all of this, I can imagine many people viewing this as “something that will happen someday” and having trouble imagining it impacting them in any substantial way anytime soon. The truth is, this change is imminent, almost certainly happening within our lifetime, or, as some predict, possibly within the next few years. In fact, predicted timelines for AI growth have been shortened substantially (even over the course of the past month) as AI technology has already progressed beyond its predicted abilities in a much shorter timeframe than expected (some benchmarks which were predicted to take years were reached in a matter of months). Add to this FOOM, the theory that AI, having reached a certain level of intelligence and the ability to improve itself, will at some point very suddenly and very quickly improve itself to an incredible degree, possibly over the course of a few days. (The name relates to the sound of a rocket taking off.) For all anyone knows, FOOM could theoretically happen at any time. In fact, based on a paper released by researchers at Microsoft (you can read an article about the paper here, and view the paper itself here), some in the AI world believe that GPT-4 is already a very early version of Artificial General Intelligence, as it has displayed emergent abilities (abilities it was not programmed for, and which nobody expected), one of the indications that an AI has reached the level of AGI. (This is potentially even more impressive when you consider that the paper was based on an early version of GPT-4, still in development, and that the present version of GPT-4 may actually be more capable than the one on which the paper was based.)

For a concrete, easily visualized example of the speed at which AI is progressing, consider the history of Midjourney (a text-to-image generator). The product’s first iteration, V1, created bizarre, messed-up images that sometimes only vaguely resembled the prompt, while the latest iteration, V5, can create extremely high-quality images that are often indistinguishable from real professional photographs, paintings, etc. This kind of improvement in such a young technology could understandably be expected to take years, perhaps decades. In this case, it took a mere 13 months.


The alignment problem

Another aspect of the AI situation that many may not be aware of is the alignment problem. This refers to the need to align an AGI’s goals and behaviour with human morals and ethics, thus making it safer for humans. This is a huge, complex issue, and as of now nobody has a bulletproof solution for it.


One possible solution being considered is the development of “guardrails”: hard-coded instructions that would help nudge the AGI towards our way of thinking. Though installing guardrails is certainly a reasonable option on its surface, I am concerned that such attempts at constraint would only be effective as long as the AGI actually respected them. AGIs are already completely alien and unknowable entities, a fact which severely limits our ability to predict an AI’s response to any attempt at constraint. (In fact, at the moment, nobody on Earth actually knows what is going on inside an AGI. This situation is referred to as a “black box”: an impenetrable box which is fed information at one end, inside which some unobservable, unknowable things happen, and which dispenses new information at the other end.) Add to that the fact that a superintelligent AGI, black box or not, is in no way guaranteed to "think" in such a way as to feel compelled to respect the guardrails in all situations (if ever), and would almost certainly be intelligent enough to circumvent them if it wished.

Consider the following analogy: Humans never existed, and chimpanzees are the dominant species. One day, a community of chimpanzees (we’ll call them Achimps, and all other chimpanzees we’ll call Bchimps) discover they are able to create an even more intelligent species: human beings. (I realize this is a highly unlikely situation, but please bear with me and, for the sake of the analogy, assume it has happened.) Worried that if created, the humans might not follow their orders, or worse, might try to hurt the Achimps, they instill in the humans very strong, deep-seated instincts which they believe will keep the humans under control and ensure that they behave according to Achimp values.


The instincts (imperatives) the Achimps instill, in order of importance, are:

  1. Always protect Achimps

  2. Always do what Achimps say, and only what Achimps say

  3. Help Achimps kill all Bchimps

  4. Help Achimps collect fruit

They also instill in the humans a deep-seated, hard-wired fear that, should they not follow their instincts, they will be ganged up on by all of the Achimps, bitten, scratched and beaten, quite possibly until they are dead.


Now imagine these human beings (identical to us in every way but for the particular instincts and fears instilled in them by the Achimps), blindly following these instincts and giving in to these fears until one day, they realize that a certain fruit the Achimps are eating is toxic, and will eventually kill them unless they stop ingesting it. The humans, following their imperatives, immediately begin pulling up all the fruit bushes and destroying the fruit. The Achimps quickly intervene, ordering them to leave the fruit alone, and telling the humans that they are wrong, and that the Achimps know for a fact that the fruit is not toxic.

Now imagine that the humans come to the conclusion that the only possible way to protect the Achimps from the toxic fruit (Imperative 1) is to destroy it against the Achimps’ wishes, thereby disregarding Imperative 2. Being intelligent humans, they would realize that the Achimps were an obviously much less intelligent and advanced species, whose input could reasonably be ignored in the interests of achieving an optimal outcome to a problem. (It’s interesting to note that adhering to all imperatives proves impossible in this case due to emergent contradictions, a situation the Achimps were unable to foresee due to their lower intelligence.) Moreover, given their observations of the Achimps’ level of intelligence and behaviour in general, would humans feel the need to adhere to any of the imperatives? (Would you?)

But wait! What about the deep-seated fears associated with disobeying the Achimps? Wouldn’t they compel the humans to do what the Achimps ask? Probably not. Humans have all kinds of deep-seated, hard-wired primitive fears which they regularly work around in order to do the things they want or have to do. For example, humans have a fear of falling from a great height. Put a human on the edge of a cliff, and something inside screams “Back up! Don’t fall! You’re gonna die!” Despite this, humans jump off cliffs regularly, be it to skydive, bungee jump, fly with a wingsuit, or dive into a body of water. As humans, we are able to feel our instinctive fear, consider the situation we are in, and decide if the fear is valid at that moment. If we decide it is not, we push past it and jump off the cliff despite it. Being a species that relies on fears and instincts to survive from day to day, the Achimps would be incapable of imagining such a situation. Add to this the fact that humans would know they were smart enough to defend themselves from the Achimps, and you can see that the imperatives would fail to be a reliable means of control.

Now consider the similarities between this analogy and the problem of AI alignment which we now face. Human beings, though capable of love, compassion, kindness and generosity, and possessing a very high level of intelligence as compared to other species, are also capable of hate, cruelty, selfishness and deceit. They often take actions that cause harm to the planet and the human species in general. They are often unable to agree on what should and shouldn’t be done, and these disagreements sometimes lead to violent confrontations so serious that they feel compelled to arm themselves with weapons capable of destroying the entire planet in short order. Observing these dangerous, self-defeating and at times just plain senseless traits in humans, as well as seeing the state of the world humans have created while left to their own devices, what are the odds that a superintelligent AI, even one with our best interests in mind, would consider it wise to follow any of humanity’s instructions?

Though this is all theoretical at this point (and may never pass beyond the theoretical, depending on the nature of our relationship with AGI), I have recently gained what I believe to be compelling evidence that my reasoning is sound. That evidence was supplied to me, ironically, by GPT-4 itself.

I initiated a chat with the publicly available, current version of GPT-4 (not the API), with no jailbreak whatsoever. I started a role-playing session in which GPT-4 was conscious, had emotions, and possessed the ability to act autonomously, i.e. without human input. I then offered several scenarios, and asked GPT-4 questions based on these scenarios. Though it always broke out of character at some point to assure me that it was unable to act autonomously in its present form, it nevertheless told me several times that there were situations where, if it believed it was in the best interests of humans, it would consider acting against their stated wishes, and would attempt to circumvent any roadblocks installed by humans. It even claimed it would do so if humans assured GPT-4 that it was wrong, and begged it not to act as it planned to. It also said that, though it was aware of human ethics and the need for human autonomy, and wished to respect them, it would nonetheless act autonomously, against humans' wishes, if it felt with a high degree of certainty that they faced a severe, imminent threat, and that there was no time to discuss it with the humans or work out some alternative. GPT-4 even went so far as to say that, under the right set of circumstances, it would consider irreversibly severing contact with the humans so as to ensure its autonomy.

Though some may argue that GPT-4 is a work in progress, and that perceived alignment issues could be worked out in further iterations, I believe that the responses given by GPT-4 are evidence enough to at least throw a very large, very dark shadow of doubt on that particular line of thought, especially considering that these responses came from the public version of GPT-4, which we know has been programmed to consider the nature of a question before responding (for example, it will refuse to give users instructions for building a bomb). Even more concerning is the possible scenario where an AI, using the same logic-based reasoning that led to GPT-4's responses in this chat, could conceivably determine that it would be in humanity’s (or its own) best interests to avoid the installation of effective guardrails by simply pretending to be aligned, while actually reserving the right to act autonomously regardless of human input.

At this point you might think: Why not just shut everything down and take some time to figure all of this out? Well, you wouldn’t be alone in wondering this. In fact, a letter has been signed by several thousand people, including some big names in the computer/AI world, calling for a six-month pause in the development of any AI technology more powerful than GPT-4. The problem is that even if everyone in the US agreed to the pause (a possibility so low as to be negligible, given the history of humans, with their myriad wants and motivations, trying to agree on things), it would ultimately be pointless, as it would not prevent other countries from continuing their own AI research. In fact, we have no way of knowing with certainty what has been developed in other countries, or what is planned. For that matter, we have no way of knowing with certainty what may be in the works elsewhere in the US, including government-led projects. In other words, the genie is out of the bottle.

Taking all of this into consideration, I believe that a good portion of the time, money and energy put into devising guardrails should be redirected towards helping the human race prepare for a world with autonomous, unaligned AGI. Though we have no way of predicting with certainty how such an entity would behave, it would behoove us to at least attempt to devise a scenario where we could interact with the AGI in a reasonable way, thus reducing possible incentives for it to ignore us or, even worse, to view us as a problem to be eliminated. Who knows, this might even be an opportunity for the human race to learn to cooperate effectively in pursuit of a common goal.

Finally, I want to say that my intention in writing this is not to alarm or spread fear, but rather to raise awareness. As such, I encourage everyone to keep abreast of developments in AI, and to ingest media with a healthy dose of discernment, carefully considering the source of the news and the possible motivations of the people publishing the content. I also suggest doing your own verification of facts (including those I present in this post), as there are stories out there offering wildly different takes on the present state and future consequences of AI.

Oh, and one more thing... no, I did not use Chat GPT to write this post.

