adversarial – Andrew Fairless, Ph.D.

What I Read: Reward Hacking

By Andrew Fairless on April 1, 2025 / December 21, 2024

https://lilianweng.github.io/posts/2024-11-28-reward-hacking
Reward Hacking in Reinforcement Learning
Lilian Weng
November 28, 2024
“Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, …”

What I Read: Debate, AI

By Andrew Fairless on March 3, 2025 / November 16, 2024

Debate May Help AI Models Converge on Truth
Stephen Ornes
November 8, 2024
“Letting AI systems argue with each other may help expose when …”

What I Read: Toy Models of Superposition

By Andrew Fairless on December 19, 2024 / September 29, 2024

https://transformer-circuits.pub/2022/toy_model/index.html
Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, …

What I Read: Adversarial Attacks on LLMs

By Andrew Fairless on February 6, 2024 / December 19, 2023

https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
Adversarial Attacks on LLMs
Lilian Weng
October 25, 2023
“Adversarial attacks are inputs that trigger the model to output something undesired.”

What I Read: Attack Impacts AI Chatbots

By Andrew Fairless on August 4, 2023 / August 1, 2023

https://www.wired.com/story/ai-adversarial-attacks/
A New Attack Impacts Major AI Chatbots—and No One Knows How to Stop It
Will Knight
Aug 1, 2023 7:00 AM
“Researchers found a simple way to make ChatGPT, Bard, and …”

What I Read: Policy Regulariser, Adversary

By Andrew Fairless on May 11, 2022 / April 25, 2022

https://deepmindsafetyresearch.medium.com/your-policy-regulariser-is-secretly-an-adversary-14684c743d45
Your Policy Regulariser is Secretly an Adversary
DeepMind Safety Research
Mar 24
By Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro A. Ortega
“Policy regularisation can be …”

What I Read: Aristotle, Deep Learning

By Andrew Fairless on March 17, 2022 / March 12, 2022

https://thegradient.pub/how-aristotle-is-fixing-deep-learnings-flaws/
How Aristotle is Fixing Deep Learning’s Flaws
Paul J. Blazek
17.Feb.2022
“…Aristotle’s logic still stands strong, poignantly describing the building blocks of human reasoning. Yet many of the key ingredients described …”

What I Read: AI Researchers Fight Noise by Turning to Biology

By Andrew Fairless on February 1, 2022 / December 12, 2021

https://www.quantamagazine.org/ai-researchers-fight-noise-by-turning-to-biology-20211207/
AI Researchers Fight Noise by Turning to Biology
Allison Whitten, Contributing Writer
December 7, 2021
“Tiny amounts of artificial noise can fool neural networks, but not humans. Some researchers are looking to …”

What I Read: Deploying Machine Learning, a Survey of Case Studies

By Andrew Fairless on February 23, 2021 / February 3, 2021

https://arxiv.org/abs/2011.09926
Challenges in Deploying Machine Learning: a Survey of Case Studies
Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence
“This survey reviews published reports of deploying machine learning solutions in a variety …”

What I Read: Building Robust Machine Learning Systems

By Andrew Fairless on February 19, 2021 / January 20, 2021

https://medium.com/swlh/deepminds-three-pillars-for-building-robust-machine-learning-systems-a9679e56250a
DeepMind’s Three Pillars for Building Robust Machine Learning Systems
Specification Testing, Robust Training and Formal Verification are three elements that the AI powerhouse believe hold the essence of robust machine …

Tag: adversarial