How to Detect AI-Generated Answers in Technical Interviews
The rise of large language models has changed technical hiring in a way most companies have not caught up with. Candidates now have access to tools that can generate plausible code, explain system design tradeoffs, and write articulate technical communication, all in seconds.
This does not make technical assessments useless. It means the assessment process needs an integrity layer that did not exist two years ago.
The signals that matter
AI-generated technical answers have patterns. They are not always obvious to a human reviewer, but they become clear when you look at multiple signals together.
Uniform structure. LLM outputs tend to follow a consistent template: context paragraph, numbered list of considerations, balanced conclusion with caveats. Real human answers are messier. They skip steps, emphasize what they know best, and trail off on points they are less sure about.
Hedging language. AI models are trained to be balanced, so they frequently hedge: "it depends on the use case," "there are tradeoffs to consider," "both approaches have merit." Experienced engineers do hedge sometimes, but they also take strong positions when they have strong opinions.
Polished prose on the first try. When a candidate produces a perfectly structured, grammatically flawless answer with no revisions, pauses, or corrections, that is worth flagging. Real writing, especially under time pressure, involves backspacing, reordering sentences, and fixing typos.
Excessive comments in code. AI-generated code tends to be heavily commented, explaining each block in detail even when the code is straightforward. Most experienced engineers comment sparingly and only where the intent is not obvious from the code itself.
Unprompted edge cases. LLMs often enumerate edge cases that were not asked about, because they are drawing on training data that covers those cases. A candidate who spontaneously lists five edge cases for a simple function may be demonstrating deep knowledge, or may be pasting from a model that was trained to be thorough.
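A rough sense of two of these content signals can be captured with simple heuristics. The sketch below is illustrative only: the hedging-phrase list and the assumption of Python-style `#` comments are stand-ins, not a real detector.

```python
# Illustrative stock hedging phrases; a production detector would use a
# larger, calibrated list and weight phrases by how strongly they signal
# generated text.
HEDGE_PHRASES = [
    "it depends on the use case",
    "there are tradeoffs to consider",
    "both approaches have merit",
]

def hedge_count(answer: str) -> int:
    """Count occurrences of stock hedging phrases in a written answer."""
    text = answer.lower()
    return sum(text.count(phrase) for phrase in HEDGE_PHRASES)

def comment_density(code: str) -> float:
    """Fraction of non-blank lines that are comments (Python-style '#')."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for line in lines if line.startswith("#"))
    return comments / len(lines)

answer = "It depends on the use case. Both approaches have merit."
code = "# Add the numbers\n# Return the sum\nreturn a + b"
print(hedge_count(answer))               # 2
print(round(comment_density(code), 2))   # 0.67
```

Neither number means anything on its own; they only become useful as inputs to a combined score alongside behavioral data.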
Behavioral signals
Beyond the content of the answer, the process of producing it reveals a lot.
Keystroke patterns. A legitimate candidate types in bursts: thinking, typing, pausing, deleting, retyping. An answer that appears all at once (zero keystrokes followed by a paste) is a strong indicator that the content was generated elsewhere.
Typing speed. Sustained typing at 400+ characters per minute is physically difficult. If someone appears to produce content at that rate, the content was likely pre-written or generated.
Timing. If a complex system design question is answered in under 30 seconds, the candidate probably did not think through the answer in real time.
Tab switching. Frequent tab switches during a timed assessment are not conclusive on their own (candidates might be looking up syntax), but combined with other signals, they add context.
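The keystroke and typing-speed signals above can be derived from an event log. Here is a minimal sketch assuming a hypothetical event schema of `(timestamp_sec, kind, char_count)` tuples, where `kind` is either `"key"` or `"paste"`; any real proctoring system would record richer events.

```python
def typing_speed_cpm(events):
    """Sustained characters per minute across keystroke events.
    events: list of (timestamp_sec, kind, char_count) tuples."""
    keys = [e for e in events if e[1] == "key"]
    if len(keys) < 2:
        return 0.0
    duration_sec = keys[-1][0] - keys[0][0]
    chars = sum(e[2] for e in keys)
    return chars / (duration_sec / 60) if duration_sec > 0 else float("inf")

def paste_ratio(events):
    """Fraction of total characters that arrived via paste events.
    A value near 1.0 means the answer appeared essentially all at once."""
    total = sum(e[2] for e in events)
    pasted = sum(e[2] for e in events if e[1] == "paste")
    return pasted / total if total else 0.0

# A session that is two keystrokes followed by one large paste:
events = [(0.0, "key", 1), (1.0, "key", 1), (2.0, "paste", 800)]
print(round(paste_ratio(events), 3))   # ~0.998
```

A high paste ratio combined with a near-zero typing duration is exactly the "zero keystrokes followed by a paste" pattern described above.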
A scoring approach
At Evaluator, we combine these signals into an AI-generation likelihood score between 0.0 and 1.0 for each answer:
- 0.0 to 0.2: Almost certainly human-written. Shows natural typing patterns, personal voice, normal revision behavior.
- 0.2 to 0.4: Likely human. Minor flags that could be explained by fast typing or careful preparation.
- 0.4 to 0.6: Uncertain. Mixed signals. Worth reviewing manually.
- 0.6 to 0.8: Likely AI-generated. Multiple structural and behavioral flags.
- 0.8 to 1.0: Almost certainly AI-generated. Pure paste behavior, uniform structure, no revision history.
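Combining per-signal scores into one likelihood and mapping it to these bands can be sketched as a weighted average. The signal names and weights below are illustrative assumptions, not Evaluator's actual model, which would be calibrated against labeled data.

```python
def ai_likelihood(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-signal scores, each already scaled to [0, 1].
    Signal names and weights here are illustrative placeholders."""
    total_weight = sum(weights.values())
    return sum(signals[name] * weights[name] for name in weights) / total_weight

def band(score: float) -> str:
    """Map a likelihood score to the review bands described above."""
    if score < 0.2:
        return "almost certainly human"
    if score < 0.4:
        return "likely human"
    if score < 0.6:
        return "uncertain: review manually"
    if score < 0.8:
        return "likely AI-generated"
    return "almost certainly AI-generated"

weights = {"structure": 0.3, "behavior": 0.5, "timing": 0.2}
signals = {"structure": 0.8, "behavior": 0.9, "timing": 0.6}
score = ai_likelihood(signals, weights)
print(round(score, 2), "->", band(score))  # 0.81 -> almost certainly AI-generated
```

Weighting behavior most heavily reflects the point above: paste behavior and typing patterns are harder to explain away than content style alone.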
The goal is not to automatically reject anyone above a threshold. The goal is to give the hiring manager enough information to ask better follow-up questions. If a candidate's written assessment scores 0.7 on AI likelihood, that does not mean they cheated. It means you should probe those topics more deeply in a live conversation.
What companies should do
Do not ban AI tools outright. Some companies try to enforce "no AI" policies during assessments. This is nearly impossible to enforce and penalizes honest candidates who disclose their tool use.
Do measure the delta. If you run an async assessment followed by a live technical conversation, you can compare depth of understanding across the two. A candidate who produced a sophisticated system design on paper but cannot explain the reasoning behind it live is a clear red flag.
Invest in multi-signal integrity. No single indicator is conclusive. Keystroke patterns alone are not enough. Content analysis alone is not enough. The combination of behavioral signals, content analysis, and timing data gives you a reliable picture.
Be transparent. Let candidates know that integrity monitoring is part of the process. This is not about surveillance; it is about fairness. The candidate who does their own work should not have to compete on equal footing with someone who outsources the thinking to a model.
The bigger picture
AI tools are not going away. The question is not whether candidates use them, but whether your hiring process can distinguish between someone who understands the material and someone who can prompt an LLM effectively.
Those are different skills. Both might be valuable, but you should know which one you are hiring for.