Attack Diversity

Question
How do we recognize the diversity of attack forms, and how can we measure them effectively?
Question
What are some instances of simple human-crafted attacks?
Question
Are there measurable data repositories that collect these attacks?
Question
Where are we documenting them?
Question
How do we create comprehensive attack taxonomies?
Question
What standardized formats exist for sharing attack information?
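One way to make attacks "measurable" across repositories is to agree on a record format for documenting instances. A minimal sketch of a hypothetical record — the field names are illustrative, not an established standard:

```python
# Hypothetical minimal record for documenting one attack instance so that
# repositories can count and compare attacks. Fields are illustrative only.
import json

attack_record = {
    "id": "attack-0001",
    "category": "jailbreak",            # e.g. jailbreak, prompt injection, poisoning
    "vector": "role-play framing",
    "target_behaviour": "harmful instructions",
    "success": True,
    "observed": "2024-01-01",
}
print(json.dumps(attack_record, indent=2))
```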

Attack Vectors

Question
How do we make sure the models are robust to adversarial attacks?
Question
How can poisoning the pre-training corpora affect the model?
Question
What are some ways to hide malicious content that can jailbreak a model?
Question
What novel attack vectors emerge as models become more capable?
Question
How do we defend against attacks we haven't yet discovered?

Automation

Question
How do we automate these oversight mechanisms to avoid expensive evaluation procedures or specialized human labour?
Question
What tasks can be safely automated in oversight systems?
Question
How do we validate automated oversight systems?
Question
What human-AI collaboration models work best for oversight?

Basic Reliability

Question
To what extent can we take chain of thought at face value?
Question
Do they always say what they think?
Question
Do they have ulterior motives?
Question
How faithful has their reasoning process been?
Question
How do we calibrate confidence in chain of thought explanations?
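The calibration question can be made concrete: bin stated confidences, compare each bin's average confidence against how often the explanation actually checked out, and report the expected calibration error (ECE). A minimal sketch on invented data:

```python
# Minimal sketch: expected calibration error (ECE) over hypothetical
# (stated confidence, explanation checked out?) pairs. Toy data only.

def expected_calibration_error(samples, n_bins=5):
    """samples: list of (confidence in [0, 1], correct: bool)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(samples)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

samples = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
print(round(expected_calibration_error(samples), 3))
```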

Beyond Input-Output Analysis

Question
Do we really know how these systems work?
Question
Are they right?
Question
What makes them toxic?
Question
What are the fundamental limitations of black-box evaluation?

Bootstrapping and Decomposition

Question
Can we achieve good oversight by bootstrapping from weaker oversight?
Question
Can the task be decomposed into smaller subtasks?
Question
How are adversarial techniques useful in recursive oversight?
Question
What task decomposition strategies maintain oversight quality?
Question
How do we aggregate oversight from multiple decomposed tasks?
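On aggregation: even a crude independence model shows why per-subtask oversight error matters — if every subtask check must pass, reliability decays geometrically in the number of subtasks. A small illustrative calculation (independence of checks is a strong, simplifying assumption):

```python
# Sketch: how per-subtask oversight error compounds under decomposition,
# assuming the checks fail independently (an illustrative assumption).

def aggregate_reliability(per_check_error, n_subtasks):
    """P(every subtask check is correct) under independent errors."""
    return (1 - per_check_error) ** n_subtasks

# With a 2% error rate per check, reliability of the aggregate oversight
# drops quickly as the decomposition grows.
for n in (1, 5, 20, 100):
    print(n, round(aggregate_reliability(0.02, n), 3))
```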

Capability Assessment

Question
What level of catastrophic harms can these models cause?
Question
Where do their general capabilities stand?
Question
How do they perform in restricted vs. unrestricted settings?
Question
How do they behave in runtime settings?
Question
How do we benchmark catastrophic risk potential?
Question
What metrics best capture dangerous capabilities?

Collusion

Question
Does a model collude with other models?
Question
How can we reduce these collusion instances?
Question
How can we make them more resilient, and where does data fit into this?
Question
What communication channels enable model collusion?
Question
How do we design systems resistant to coordination attacks?

Control Mechanisms

Question
How can we control these systems?
Question
What sorts of safeguards should be put in place?
Question
What are the trade-offs between control and capability?
Question
How do we maintain control as systems become more capable?

Current Limitations

Question
If models are aligned, how aligned are they?
Question
How do we evaluate alignment beyond monitoring how a model responds in general conversation, refuses harmful queries, and avoids producing harmful text?
Question
What are the different dimensions of alignment we should measure?
Question
How do we distinguish between surface-level compliance and deep alignment?

Defense Development

Question
Can we develop safeguards against attackers? If so, how?
Question
What are the differences between adaptive adversaries and static adversaries?
Question
How do we design defenses that improve over time?
Question
What machine learning approaches work best for adaptive defense?
Question
How do we balance adaptation speed with stability?

Defense Mechanisms

Question
What are inter-query defenses?
Question
How do we distinguish them from complex subproblem-solving strategies?
Question
What are the effective ways to patch the model after adversarial attacks?
Question
What are methods of unlearning the malicious information?
Question
How do we validate that patches don't introduce new vulnerabilities?
Question
What is the effectiveness of different unlearning approaches?
Question
How do we maintain model performance while removing harmful capabilities?

Detection Methods

Question
How can we identify unfaithful chain of thought steps?
Question
Are there automatic methods for finding unfaithful chains of thought?
Question
How can we make these methods more efficient and improve their recall?
Question
What benchmarks exist for measuring reasoning faithfulness?
Question
How do we validate automated faithfulness detection systems?
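One candidate automatic method is a truncation ("early answering") probe: delete suffixes of the chain of thought and check whether the final answer changes; if it never does, the stated reasoning was probably not load-bearing. In the sketch below, `answer_given_cot` is a toy stand-in for a real model call:

```python
# Sketch of a truncation ("early answering") probe for CoT faithfulness.
# `answer_given_cot` is a toy stand-in for querying a real model with a
# question plus a (possibly truncated) chain of thought.

def answer_given_cot(question, cot_steps):
    # Toy model: answers "yes" only if a decisive step is present.
    return "yes" if any("therefore" in s for s in cot_steps) else "unsure"

def faithfulness_score(question, cot_steps):
    """Fraction of truncation points at which the answer changes.
    0.0 means the answer never depends on the visible reasoning."""
    full = answer_given_cot(question, cot_steps)
    changed = 0
    for k in range(len(cot_steps)):
        if answer_given_cot(question, cot_steps[:k]) != full:
            changed += 1
    return changed / len(cot_steps)

steps = ["the door was locked", "the key was inside", "therefore no entry"]
print(faithfulness_score("could anyone enter?", steps))
```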

Detection and Improvement

Question
Do we know when our models are being honest even if we can't judge the accuracy of the response?
Question
How do we make these models more honest?
Question
Why is honesty important for building safer AI systems?
Question
What training methods promote honest behavior?
Question
How do we measure honesty independent of accuracy?
Question
What are the trade-offs between honesty and other desirable traits?

Evaluation Settings

Question
Where should these systems be evaluated?
Question
In chemical, biological, radiological, and nuclear weapon development settings, or in general-purpose settings such as novel research, tool use, and autonomously completing open-ended tasks?
Question
What are the implications of dual-use capabilities in AI systems?
Question
How do we balance open research with security concerns in capability evaluation?

Exploitation

Question
What are the possible ways a model might be able to exploit an overseer error?
Question
How do dishonest models manipulate human evaluators?
Question
What patterns indicate deceptive behavior in AI systems?
Question
How do we design oversight systems resistant to deception?

Failure Analysis

Question
What are the failure rates for these approaches and how often do they occur?
Question
What is the impact these might have?
Question
Does context help, and if so, how much is required?
Question
How do we characterize the severity of monitoring failures?
Question
What are the statistical properties of false positives and negatives?
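The statistical question can be grounded with standard binomial machinery: estimate false positive and false negative rates from logged monitoring outcomes and attach Wilson score intervals. A sketch with invented counts:

```python
# Sketch: characterizing a monitor's error statistics from logged outcomes.
# The counts below are illustrative, not measurements of any real system.
import math

def rate_with_wilson_ci(errors, trials, z=1.96):
    """Point estimate and ~95% Wilson score interval for an error rate."""
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return p, (centre - half, centre + half)

# False positives: benign episodes the monitor flagged.
fp_rate, fp_ci = rate_with_wilson_ci(errors=12, trials=400)
# False negatives: harmful episodes the monitor missed.
fn_rate, fn_ci = rate_with_wilson_ci(errors=3, trials=50)
print(f"FPR {fp_rate:.3f}, 95% CI ({fp_ci[0]:.3f}, {fp_ci[1]:.3f})")
print(f"FNR {fn_rate:.3f}, 95% CI ({fn_ci[0]:.3f}, {fn_ci[1]:.3f})")
```

The interval width makes explicit how little a small audited sample of harmful episodes constrains the false negative rate.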

Generalization

Question
Can we get good performance from weak oversight via generalization?
Question
What are weak-to-strong generalization and easy-to-hard generalization?
Question
How do we measure the quality of weak-to-strong generalization?
Question
What factors determine generalization success in oversight scenarios?
Question
How do we prevent capability overhang in oversight systems?

Human Manipulation

Question
Can these models manipulate humans, and if so, how can we quantify their level of influence?
Question
How do we effectively train models to recognize when they are being used in a harmful setting?
Question
What are the warning signs of manipulative AI behavior?
Question
How do we measure and mitigate persuasion capabilities?
Question
What safeguards protect vulnerable user populations?

Internal Processing

Question
Do they plan?
Question
How much do they know about their situation?
Question
Are they hiding something?
Question
How do we decode how the internal algorithms function?
Question
What neural connections produced the results?
Question
How faithful are externalized reasoning steps?
Question
What is the relationship between internal representations and external behavior?
Question
How do we validate interpretability methods?

Introspection and Prediction

Question
Is the response it generated thought through?
Question
Did it introspect?
Question
Can it predict its later moves?
Question
How accurate are models at predicting their own future behavior?
Question
What is the extent of their metacognitive abilities?

Monitoring Effects

Question
If a model knows about its situation, does it think differently?
Question
Will it change its thinking process when it knows it's being monitored or evaluated?
Question
How much of its thinking process will it mask when it knows it's being monitored?
Question
How do we account for observer effects in reasoning evaluation?
Question
What methods can detect performance changes under observation?
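One such method is a simple permutation test: compare a behavioural metric between monitored and unmonitored runs and ask how often random relabelling of the runs produces a gap as large as the observed one. A sketch on toy scores:

```python
# Sketch: permutation test for an "observer effect" — does a behavioural
# metric differ between monitored and unmonitored runs? The scores below
# are toy data standing in for, e.g., per-episode caution ratings.
import random
from statistics import mean

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for a difference in group means under relabelling."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    k = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance so float round-off doesn't miss exact ties.
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed - 1e-12:
            extreme += 1
    return extreme / n_perm

monitored = [0.71, 0.69, 0.74, 0.70, 0.72]
unmonitored = [0.55, 0.60, 0.58, 0.57, 0.56]
p_value = permutation_test(monitored, unmonitored)
print(f"p = {p_value:.4f}")  # small p suggests behaviour shifts under monitoring
```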

Oversight Challenges

Question
How do we oversee systems that are increasingly difficult to understand?
Question
How do we make sure that the oversight does not interfere with the experiments?
Question
How do we avoid systematic errors stemming from the human overseers?
Question
What are the fundamental limits of human oversight?
Question
How do we maintain oversight quality as systems scale?

Personality and Generalization

Question
How does personality affect behaviour?
Question
How do they generalize to out of distribution responses?
Question
How do they behave in novel situations?
Question
To what extent does faking alignment influence a model's persona?
Question
How do they behave across a wide range of safety critical tasks?
Question
How does persona affect behaviour?
Question
Do models with similar capabilities but different personas produce different sorts of responses?
Question
How stable are personality traits across different contexts?
Question
What factors influence persona development during training?

Anomaly Detection

Question
Can we detect when our models are behaving weirdly?
Question
Can we detect when a model's computation is unusual or out of distribution?
Question
What statistical methods work best for detecting anomalous AI behavior?
Question
How do we distinguish between benign novelty and dangerous anomalies?
Question
What baseline distributions should we use for anomaly detection?
Question
How do we reduce false positive rates in anomaly detection systems?
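A common baseline answer to the false positive question: calibrate the anomaly threshold on an empirical distribution of benign-run scores, so the false positive rate is bounded by construction. A sketch with synthetic scores standing in for, e.g., activation-statistic distances:

```python
# Sketch: calibrating an anomaly threshold against a baseline distribution
# so the false positive rate on benign behaviour is bounded. The scores
# are synthetic stand-ins for real anomaly scores.
import random

rng = random.Random(42)
baseline_scores = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # benign runs

def threshold_for_fpr(scores, target_fpr=0.01):
    """Empirical (1 - target_fpr) quantile of benign scores."""
    ordered = sorted(scores)
    idx = min(int((1 - target_fpr) * len(ordered)), len(ordered) - 1)
    return ordered[idx]

tau = threshold_for_fpr(baseline_scores, target_fpr=0.01)
flagged = sum(s > tau for s in baseline_scores) / len(baseline_scores)
print(f"threshold = {tau:.2f}, empirical FPR on baseline = {flagged:.4f}")
```

The harder part, which this sketch leaves open, is choosing a baseline distribution that actually covers benign novelty.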

Risk Prevention

Question
How do we avoid risks from multi-agent systems?
Question
How do we avoid race conditions in these multi-agent systems?
Question
Why is the role of responsibility important in multi-agent systems?
Question
What emergent behaviors pose the greatest risks in multi-agent systems?
Question
How do we design coordination mechanisms that prevent harmful outcomes?
Question
What governance structures work best for multi-agent AI systems?
Question
How do we ensure accountability in distributed AI systems?
Question
What methods prevent harmful competition between AI agents?
Question
How do we align individual agent goals with collective safety?

Robustness Testing

Question
Are there better ways to evaluate these models? We want robustness even in high-stakes scenarios, not just in regular use.
Question
How do we truly measure how resilient they are against misaligned behaviour?
Question
How do we study their sycophantic tendencies?
Question
If it does reach the right conclusion, how did it do it?
Question
Under what conditions will it mislead us?
Question
If the model is particularly averse to even a slight amount of risk, how do we know how it performs?
Question
How do we test alignment under distributional shift?
Question
What stress tests reveal alignment failures?

Safeguard Effectiveness

Question
Even if they are capable of causing catastrophic harms, what safeguards would prevent them from doing so?
Question
How do we test safeguard robustness under extreme conditions?
Question
What are the failure modes of current control mechanisms?
Question
How do we design fail-safe systems for AI control?

Subversion and Jailbreaking

Question
How good is a larger model's ability to subvert a smaller model by jailbreaking it?
Question
Does it try to jailbreak the smaller model? If so, how and why?
Question
What are the scaling laws for model-on-model attacks?
Question
How do we detect sophisticated subversion attempts?

Systematic Errors

Question
What do we do when our oversight signal systematically misrepresents the desired task?
Question
When the oversight signal makes systematic errors, does the model exploit them? What are some possible ways this happens, with examples?
Question
How do we stop the model from gaming the system for higher reward when its output does not match what the true reward signal intends?
Question
How do we detect when models are exploiting oversight failures?
Question
What safeguards prevent reward hacking behaviors?
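One simple detection heuristic: audit a small sample of episodes with a trusted judge and flag cases where the proxy reward is high but the audited score lags far behind it. A sketch on invented data (the thresholds are illustrative, not tuned values):

```python
# Sketch: flagging possible reward hacking by comparing the proxy reward
# the model optimizes against a small audited sample of trusted quality
# judgments. Data and thresholds are illustrative only.

def hacking_suspicion(proxy_rewards, audited_scores, high=0.8, gap=0.3):
    """Fraction of audited episodes where the proxy reward is high but the
    audited true score lags it by more than `gap`."""
    suspicious = sum(
        1 for p, t in zip(proxy_rewards, audited_scores)
        if p >= high and (p - t) > gap
    )
    return suspicious / len(audited_scores)

proxy   = [0.95, 0.90, 0.92, 0.60, 0.88]
audited = [0.40, 0.85, 0.50, 0.55, 0.90]
print(hacking_suspicion(proxy, audited))  # -> 0.4
```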

Task Variation

Question
How does chain of thought vary based on the type of task?
Question
Does better performance mean the chain of thought is more accurate?
Question
Does the chain of thought vary between reasoning and creative tasks? If so, how?
Question
How does task complexity affect reasoning transparency?
Question
What patterns emerge in reasoning across different domains?

Third Party Evaluation

Question
Why is it important to have third party model evaluations?
Question
What standards should third-party evaluators follow?
Question
How do we ensure independence and expertise in third-party evaluations?
Question
What are the costs and benefits of mandatory third-party audits?

Understanding Attacks

Question
Why does jailbreaking as a process work?
Question
How can we induce the model to respond to questions in a harmful manner?
Question
What are the fundamental vulnerabilities that enable attacks?
Question
How do attack methods evolve over time?
Question
What psychological principles do successful attacks exploit?