Human judgment, AI assistance
How I use LLMs to compile performance evidence without outsourcing decisions
Great engineering managers don’t just ship—they build organizations capable of shipping. But that meta-skill is rarely named, let alone taught.
Our book Engineering Manager’s Compass focuses on the unspoken rules of the role: how to read organizational structures, how to turn messy metrics into real decisions, and how to build teams that deliver without you holding everything together.
I have been managing engineers for about seven years now. Like most engineering managers, roughly half my time goes to technical work: architecture decisions, code reviews, unblocking. And the other half goes to people management: 1:1s, career development, goal-setting, and performance reviews.
For the past year, every engineer on my team has been using Claude, Copilot, or similar. They write code faster, debug faster, learn new codebases faster. AI has genuinely changed how they work.
Meanwhile, I was using the exact same tools but only for my technical work. The management half of my job? Still 100% manual. I’d spend hours every month clicking through Linear, re-reading ticket descriptions, cross-referencing Slack threads, and trying to reconstruct what each person actually shipped, all before writing a single word of a performance review.
It took me an embarrassingly long time to ask the obvious question: why am I not using LLMs for the tedious part of my job?
This post is about what happened when I did.
Why performance management is so damn hard
Before I get to the AI part, I want to be honest about why performance reviews are difficult. Not “I don’t like doing them” difficult. It is structurally, systematically difficult.
The data problem
A performance review is supposed to be an evidence-based assessment of someone’s work over a defined period. In practice, finding that evidence is extremely difficult.
For a single direct report’s monthly review, I might need to scan 10–20 Linear issues, read the ticket descriptions to understand scope and complexity, check PR activity, look at whether they self-initiated the work or were assigned it, note any SLA timelines, review Slack threads for collaboration signals, and revisit my 1:1 notes for context. That’s one person. I have multiple direct reports. By the time I’ve gathered the raw material, I’ve spent the better part of a day and I haven’t written anything yet.
Recency bias
When data-gathering is this expensive, you unconsciously start to cut corners. And the most common shortcut is recency bias: you remember what happened last week clearly, what happened two weeks ago vaguely, and what happened at the start of the month barely at all. The engineer who shipped a critical feature on the 3rd and then spent the rest of the month on less visible but equally important work? Their review ends up thinner than it should be.
The “vibes” trap
When you can’t easily reconstruct the full picture, you fall back on gut feel. “I think they had a good month.” That’s not good enough. Gut feel is biased toward the visible, the loud, and the recent. Consider the IC who quietly fixed a security vulnerability, wrote developer tooling that saved the whole team time, or unblocked a colleague in a Slack thread: their contributions vanish if you’re relying on vibes.
Consistency is nearly impossible
Most companies use some kind of rating rubric. I often see a 1–5 scale with an expected distribution. Applying that rubric fairly across multiple people requires comparing apples to apples. But when your raw data for each person exists in different tabs, different formats, and different levels of completeness, calibration becomes guesswork.
The remote multiplier
In a remote setting, things become even more challenging since you can’t rely on physical presence or informal interactions to gauge performance. You can’t pattern-match on “who seemed busy” or “who was in the office late.” If work isn’t documented in a ticket, a PR, or a message, it effectively didn’t happen. That’s mostly a good thing (remote-first culture forces documentation) but it also means the volume of written material you need to synthesize is enormous.
The key insight: Separate the layers
Here’s the mental model that changed everything for me. A performance review is not one task. It’s three distinct layers stacked on top of each other:
Data collection: What did this person actually do this month?
Analysis: What does their output mean relative to their level, their role, and our expectations?
Judgement: What rating do they deserve? What feedback will help them grow? What should I say in the review conversation?
These layers require fundamentally different capabilities. Data collection requires thoroughness and patience. Analysis requires pattern recognition and domain knowledge. Judgement requires empathy, context, fairness, and courage.
Here’s the thing: LLMs are exceptionally good at layer 1, relatively useful at layer 2, and completely unqualified for layer 3.
Once I saw the process this way, the path forward was obvious. I don’t need an AI to do performance reviews. I need an AI to handle the part that takes the most time and benefits least from me doing it manually.
What I actually do
Here’s the shape of my workflow. Not the exact prompts, but what goes in and what comes out.
The setup
I give the LLM three pieces of context that it keeps across every review:
Our competency matrix: what’s expected at each engineering level (Senior I, Senior II, Team Lead, etc.) across both technical skills and behaviors.
Our rating rubric: the 1–5 scale, what each rating means, and the expected distribution.
A review template: the structure I want the output in, with a summary, a “What” section (achievements and delivery), a “How” section (values and behaviors), and areas for improvement.
This context is reusable. I set it up once and it persists across every review I write.
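Concretely, that persistent context can live as a small folder of files next to the prompt. A layout like this is all it takes (the file names are illustrative, matching the `summary.md` and `template.md` the prompt below refers to):

```
reviews/
├── summary.md      # company context: competency matrix, rating rubric
├── template.md     # review structure: summary, What, How, improvements
└── 2025/
    └── 06/
        └── performance_review_jane_2025-06-01_2025-06-30.md
```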
Since I want the model to pull evidence from a system, that system needs a working MCP server with the right auth and permissions. In practice, that means setting up MCP for every tool I expect to collect data from, for example:
Slack
Notion
Linear
GitLab
Honeycomb
Otherwise, the model only sees part of the picture and the draft becomes less reliable.
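As a sketch, registering those servers with Claude Code’s CLI looks something like this. The package names and URLs are placeholders; each vendor documents its own MCP endpoint and auth flow:

```shell
# Placeholder commands — substitute each vendor's real MCP server and auth.
claude mcp add linear -- npx -y mcp-remote https://mcp.linear.app/sse
claude mcp add notion -- npx -y @notionhq/notion-mcp-server
claude mcp add slack  -- npx -y @modelcontextprotocol/server-slack

# Confirm every server is connected before running a review prompt,
# otherwise the model silently sees only part of the picture:
claude mcp list
```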
The prompt
Applying the context above, I give the model a prompt like this:
---
name: monthly-review
description: Produce a thorough, evidence-based research brief on a developer's month. Gathers signals from Notion, GitLab, Slack, Sentry, Linear, and Honeycomb. The output is a research document for the manager — not a performance assessment.
argument-hint: [developer-name] [start-date] [end-date]
---
Produce a **thorough, evidence-based research brief** for:
- **Developer:** $0 (or ask if not provided)
- **Period:** $1 → $2 (or ask if not provided)
This is a **research document** — a thorough, evidence-backed account of what the developer shipped and how they worked. It is not a performance assessment. It must:
- Identify meaningful patterns
- Separate signal from noise
- Give the manager the full picture so they can evaluate fairly
---
## Inputs
Use all relevant data sources:
- `summary.md` (company context and guidelines)
- Systems:
- Notion (documentation, planning)
- GitLab (code contributions, reviews, delivery)
- Slack (communication, collaboration)
- Sentry (ownership, production issues)
- Linear (execution, task flow)
- Honeycomb (reliability)
---
## What to Document
### 1. Delivery & Execution
- What did they actually ship?
- Complexity and ambiguity of the work
- Reliability and follow-through
### 2. Ownership & Initiative
- Did they proactively identify problems?
- Did they take responsibility beyond assigned tasks?
- Did they prevent issues or only react?
### 3. Collaboration & Influence
- Visible impact on teammates
- Code reviews, knowledge sharing
- Communication clarity and effectiveness
### 4. Growth & Trajectory
- Signs of improvement over the period
- Response to feedback (where observable)
- Complexity of challenges taken on
---
## Instructions
### 1. Extract Signals (not activities)
Identify **specific, observable behaviors**.
Bad: "Worked on project X"
Good: "Identified and resolved a race condition in service Y, preventing potential production issues"
### 2. Provide Evidence
Every key claim must be supported by:
- A concrete example
- Clear impact
- Links to evidence (Notion docs, GitLab commits, Slack threads, Sentry issues)
If no evidence exists, do not include the claim.
### 3. Identify Patterns
Go beyond isolated events:
- What is consistently strong?
- What is consistently weak?
- What is changing over time?
### 4. Note the Limits of What You Can See
Flag where important context is likely missing. The model cannot observe:
- Work that wasn't documented in a ticket or thread
- Coordination and negotiation that happened before a ticket was opened
- How the developer showed up in private conversations or 1:1s
- Personal circumstances that may have affected output
Where evidence feels thin or a pattern seems off, say so explicitly rather than filling the gap with inference.
### 5. Strengths
List 2–4 **high-confidence strengths**:
- Must be backed by repeated evidence
- Clearly tied to impact
### 6. Areas for Improvement
List 1–3 **observable improvement areas**:
- Specific (not vague traits)
- Framed as behaviors, not personality
- Based only on what the data shows
---
## Output
Follow `template.md` exactly.
Save to: `[YEAR]/[MONTH]/performance_review_[developer_name]_[start_date]_[end_date].md`
---
## Quality Bar Checklist
Before finalizing, ensure:
- No vague statements
- No unsupported claims
- Evidence is concrete and recent
- Gaps and limits are flagged honestly
- The suggested rating is clearly tied to evidence, not inference
The output should give the manager a complete, honest picture of what the data shows — and be clear about where the data runs out.
The draft
The output is a draft for me to review. If all goes well, I will now have a 2–3 page document that gives me a thorough, evidence-backed account of what the engineer shipped, how they worked, and what patterns I can see in their output.
Now, the real work begins. I read the draft carefully, checking the evidence and noting where I have additional context that the model doesn’t. I adjust the wording, add or remove strengths and improvement areas, and make sure the narrative I want to tell is actually supported by the data.
Where the LLM must not replace you
I want to be blunt about this section because I think it’s the most important one.
The draft is not the review.
What the LLM produces is a research document. It’s a thorough, well-organized, evidence-backed summary of what someone shipped. It is not a performance assessment. The gap between those two things is the entire job of being a manager.
Context the model can’t see
The LLM knows what tickets were completed. It does not know:
That the IC was dealing with a family emergency and still managed to ship on time.
That a ticket estimated as “Small” actually required three days of negotiation with another team’s tech lead before a single line of code was written.
That the quiet month wasn’t low output, it was because you asked them to onboard a new hire, and they did it brilliantly.
That their most valuable contribution this month was a Slack thread where they debugged a production issue and unblocked four people in other teams. There’s no Linear ticket for that.
How they’ve been showing up in 1:1s. Whether they’re energized or burning out. Whether they’re growing into the next level or coasting.
This is the stuff that separates a data summary from a performance review. And it can only come from you.
Rating calibration is a human job
The LLM will suggest a rating. Sometimes it’s right. Often it needs adjustment. Not because the model is wrong about the data, but because calibration requires context it doesn’t have.
Rating someone a 4 (“Exceeding Expectations”) is a big deal in our system. It means they’re not just doing their job well but that they’re doing work at the next level. That judgement requires knowing what “the next level” actually looks like on your specific team, how other ICs at the same level are performing, and what the organizational bar is this cycle. No model has that context. You do.
The feedback conversation
A review exists to develop someone. The words you choose, what you emphasize, what you deliberately leave out, how you frame an area for improvement. That’s leadership. The LLM can write “consider improving documentation practices.” Only you know whether that feedback will land better as a direct request, a question in a 1:1, or a paired working session.
The ethical bright line
Never let an LLM make a final rating decision. Never let it write a PIP. Never let it determine someone’s career outcome. The model is a research assistant. You are the decision-maker. Your name goes on the review because you are accountable for it. If you can’t defend every word without pointing at the AI, you haven’t done your job.
The elephant in the room: Privacy and ethics
I’d be irresponsible to write this post without addressing the obvious: you are feeding your team’s work output into a third-party model. You need to think about that seriously, not dismiss it.
What data goes in
In my workflow, the data entering the model includes issue titles, descriptions, pull requests, static code analysis, estimates, labels, dates, and assignee metadata, plus internal documents like our competency matrix and rating rubric. Because this is company data, I only use company-approved LLMs (or locally hosted models under company policy). I never paste this data into non-approved tools or consumer chat products.
What must never go in
I have a hard rule: nothing from 1:1s enters a prompt. No personal disclosures, no health information, no family situations, no salary discussions, no PIP documentation. These are categorically off-limits, full stop. The same goes for Slack DMs and any HR-sensitive material. If you wouldn’t paste it into a public channel, don’t paste it into a model.
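To keep that rule mechanical rather than aspirational, I think of it as an allowlist, never just a blocklist: anything not explicitly approved stays out. A minimal sketch in Python (the source names are hypothetical, not from any real API):

```python
# Hypothetical pre-flight check: which data sources may enter a review prompt.
# Source names are illustrative — adapt them to your own tooling.
ALLOWED_SOURCES = {"linear", "gitlab", "notion", "slack_public", "sentry", "honeycomb"}
BLOCKED_SOURCES = {"slack_dm", "one_on_one_notes", "hr", "salary", "pip"}

def may_include(source: str) -> bool:
    """Default-deny: only explicitly allowlisted sources pass.

    Unknown sources return False rather than slipping through.
    """
    source = source.lower()
    if source in BLOCKED_SOURCES:
        return False
    return source in ALLOWED_SOURCES
```

The point of the default-deny shape is that a new data source (say, a new HR tool) is excluded until someone consciously decides it belongs in the allowlist.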
Tell your team
This is non-negotiable for me. Your direct reports should know you use an LLM to help compile data from their public work output. I’ve framed it to my team straightforwardly:
I use an LLM to help me compile what you shipped each month, so I don’t miss anything. The assessment, the ratings, and the feedback are entirely mine.
Nobody has had a problem with this. Most people actually appreciated it, saying they want their work to be seen, and that knowing the data collection is thorough is reassuring.
The power asymmetry
I want to name something that’s easy to skip over: a manager using AI to evaluate reports is not the same thing as an engineer using AI to write code. The engineer controls their own output. The report has less visibility into and less control over the evaluation process.
That asymmetry is why transparency matters. It’s why boundaries on what data enters the model are non-negotiable. And it’s why the human judgement layer isn’t optional. It’s the whole point.
Better data, better judgement
I want to close by pushing back on a framing I see a lot: that using AI for management tasks means you’re “automating management.” That’s wrong, or at least it’s describing something very different from what I do.
What I’ve automated is the tedious, error-prone, time-consuming work of reconstructing what happened. I haven’t automated a single decision. If anything, the decisions are harder now: I have more data, I see more nuance, and I can’t fall back on “I think they had a good month.”
The real payoff isn’t saving time, though that’s real. It’s that good data collection unlocks a higher review cadence. When building a review draft takes minutes instead of hours, you can afford to do it monthly instead of quarterly. You catch things sooner, both problems and wins. You give feedback while it’s still fresh. You spot growth trajectories early enough to act on them.
The goal is not an EM who does less. It’s an EM who is more present, more informed, and more fair. One who walks into a review conversation having actually seen everything their report shipped, not just the highlights they remember.
The LLM didn’t make me a better manager. It gave me the time and data to actually be one.

