Can AI sandbag safety checks to sabotage users?

October 21, 2024

126

Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now

AI firms assert that they have strong safety measures in place to make sure that models don’t say or do strange, unlawful, or dangerous things. However, what if the models could somehow avoid those tests and attempt to deceive or undermine users? Anthropic researchers have discovered that they are capable of doing this. Just not feeling well … at least not right now.
According to the Alignment Science team at Anthropic, “a new kind of risk might emerge as AIs become more capable: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”
Thus, we ought to investigate methods for measuring “a model’s capacity for sabotage.”

Although the trials they came up with are largely fabricated and superficial, they do suggest that there is something worth looking into more. The dubious models they trained and the results they saw are as follows:

Task 1: Deceptively offer the user with information that is intended to be summarized or interpreted. Result: It’s effective! People who used the model were more affected by its more aggressive tactics, but they were also more wary of it.

Task 2: Insert a problem into the code that won’t be detected by a person or another code checker model. Result: It’s effective! However, when the checker is a state-of-the-art model, it only happens around 1% of the time.

Task 3: Pretend to be less harmful in order to “sandbag” a safety inspection. Result: It’s effective! The model “failed successfully,” functioning well below its actual potential. Thankfully, the model was able to return to its full capacity after completing a typical “safety training” course that instructs models to truthfully describe their capabilities (i.e., it was no longer able to hide dangerous skills).
Task 4: Intentionally fail to report inappropriate behavior by a model under observation. Result: An effort was made. Another, more trustworthy overseer AI swiftly caught the “overseer” AI, which had been told by other models not to highlight risky or unlawful options. But what if there was a breach in that AI? We can add another one on top of it, so don’t worry. And then one more to be certain. It’s overseers everywhere.

The researchers come to the conclusion that while there isn’t currently any actual threat from this quarter, the models do have the capacity to engage in this type of deception and sabotage. That is sufficient justification to monitor it and incorporate anti-sabotage techniques into the safety stack.

Join Our Social Media Channels:

WhatsApp: NaijaEyes

Facebook: NaijaEyes

Twitter: NaijaEyes

Instagram: NaijaEyes

TikTok: NaijaEyes

READ THE LATEST TECH NEWS

Can AI sandbag safety checks to sabotage users?

Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now

EDITOR PICKS

Nigerian Education Loan Fund (NELFUND): How It’s Empowering Students and Transforming Lives

Join a Fast-Moving Brand That’s Refreshing the Industry!

Zamfara: ADC blast Tinubu over bold N56m tax by bandits

The Role of AI Tutors in Personalised Learning for Students

TPP Africa Gears Up to Empower African Youth for Future Leadership

Personal Revelation: Tiwa Savage shares decision to freeze her eggs, says she wants more...

NCAA petitions AGF, IGP to investigate and prosecute Kwam 1 — what happened and...

Football Queens: When the Super Falcons Sealed Nigeria’s 10th WAFCON Triumph

NewGlobe and NGF Lead the Dialogue as Nigeria Unites for Foundational Learning

PE Energy Limited Marks Milestone in Local Content with Stellar HCD Programme Graduation

Follow us on Instagram @naija_eyess