https://transformer-circuits.pub/2022/toy_model/index.html
Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan,
What I Read: Adversarial Attacks on LLMs
https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
Adversarial Attacks on LLMs
Lilian Weng, October 25, 2023
“Adversarial attacks are inputs that trigger the model to output something undesired.”
What I Read: Policy Regulariser, Adversary
https://deepmindsafetyresearch.medium.com/your-policy-regulariser-is-secretly-an-adversary-14684c743d45
Your Policy Regulariser is Secretly an Adversary
DeepMind Safety Research, Mar 24
By Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro A. Ortega
“Policy regularisation can be
What I Read: Building Robust Machine Learning Systems
https://medium.com/swlh/deepminds-three-pillars-for-building-robust-machine-learning-systems-a9679e56250a
DeepMind’s Three Pillars for Building Robust Machine Learning Systems
Specification Testing, Robust Training and Formal Verification are three elements that the AI powerhouse believe hold the essence of robust machine