"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."
— Head of Product, Google Gemini
At Landr, the execution governor is gone. The only bottleneck left is taste. That changes everything about the role — what it demands, what it rewards, and who survives in it.
Open position
PM — Seamless Travel Experience
We hire operators. They ship production code in Cursor or Claude Code. They write their own eval suites in Braintrust. They read a LangSmith trace without asking for help. They define what "good" looks like before the agent ships — not after a user complains. They map every way the agent can go wrong before it goes live: the 2am Hong Kong disruption scenario is a test case, not an edge case. Above all, they have taste — the judgment to know what's worth shipping when capacity is infinite and the courage to kill what isn't. A PM who still needs engineering capacity to test an idea won't last a month here.
The number they own
% of trips completed end-to-end without human intervention.
Current: 89%
Target: 94%
Next milestone: 78% 6-month retention
Five days. No meetings that could be a Loom. No waiting for eng capacity. No PRDs. The cycle is: prototype → evals → ship → review → talk to users. Every week.
Monday
Prototypes the reroute agent v2 in Claude Code, zeroing in on the moment when two options have near-identical preference scores. Builds the tiebreaker logic themselves; a sketch of that tiebreaker follows Friday's entry. Working demo by noon.
Tuesday
Writes 20 evals in Braintrust against last week's failure logs. Three new test cases covering the "traveller paying in a second currency" regression. Evals are the spec: they get written before the fix ships.
Wednesday
Ships the experiment to 10% of traffic. Not a feature flag: a live agent running on real trips. Watches LangSmith traces in real time for the first two hours. Kills one branch that's behaving unexpectedly.
Thursday
Reviews the eval deltas from Wednesday's run. One metric improved 4 points. Another regressed on corporate travellers with connecting flights. Writes 6 new evals covering the regression. Fixes it before Monday.
Friday
Calls 3 travellers who hit the failure mode this week. Not a survey: a real 20-minute conversation. The failure mode becomes a test case. The test case becomes an eval. The eval prevents it next week.
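Monday's tiebreaker is concrete enough to sketch. A minimal version in Python, assuming each reroute option carries a learned preference score, an arrival delay, and a price delta; the dataclass, field names, and tie margin are hypothetical, not Landr's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RerouteOption:
    # Hypothetical shape of a reroute candidate, for illustration only.
    id: str
    preference_score: float  # learned traveller-preference score, 0..1
    arrival_delay_min: int   # minutes later than the original itinerary
    price_delta: float       # extra cost vs. the original booking

TIE_MARGIN = 0.02  # scores this close are treated as a tie

def pick_reroute(options: list[RerouteOption]) -> RerouteOption:
    """Pick the best reroute; break near-ties on delay, then price."""
    best = max(options, key=lambda o: o.preference_score)
    tied = [
        o for o in options
        if best.preference_score - o.preference_score <= TIE_MARGIN
    ]
    # When preference scores are near-identical, prefer the option that
    # disrupts the trip least, then the cheaper one.
    return min(tied, key=lambda o: (o.arrival_delay_min, o.price_delta))
```

The tie branch is the whole point of the prototype: a PRD would have described it, the demo exercises it.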
These aren't inefficiencies that got cut in the name of speed. They are deliberate rejections, each one a judgment that the legacy PM ritual protected bad ideas rather than shipping good ones.
No PRDs. They prototype in Claude Code instead. A working demo in 3 hours costs less in attention than a PRD in 3 days, and it answers the question the PRD never does: does this actually work? Evidence precedes documentation.
No two-week sprints. The sprint cycle was a governor on bad ideas: it rationed capacity, which forced prioritisation, which caught some nonsense before it shipped. Remove the capacity constraint and the ritual becomes pure overhead. Judgment is the governor now.
No PRD → Figma → ticket relay. The PM builds in the codebase directly. Each handoff in the traditional model introduced 20–30% information loss. At Landr, the person who defines the problem is the person who ships the first version of the solution.
No manual QA of agent output. If the PM is spot-checking every booking the agent produces, the evals aren't good enough. That's a sign to fix the eval suite, not to add a human review step. Manual rechecking is an eval failure disguised as diligence.
No feature-count reporting. Features shipped is vanity. A PM who reports "we launched 12 features this quarter" without showing job-completion rate is describing activity, not value. The only number that matters: % of trips completed end-to-end without human intervention.
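That headline metric is a plain ratio, worth pinning down once. A minimal sketch, assuming each trip record flags completion and human intervention; the field names are illustrative:

```python
def end_to_end_rate(trips: list[dict]) -> float:
    """% of trips completed end-to-end without human intervention."""
    if not trips:
        return 0.0
    clean = sum(1 for t in trips if t["completed"] and not t["human_touched"])
    return 100.0 * clean / len(trips)

# 89 of 100 completed trips needed no human: the current 89%.
assert end_to_end_rate(
    [{"completed": True, "human_touched": False}] * 89
    + [{"completed": True, "human_touched": True}] * 11
) == 89.0
```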
What's on this PM's screen on any given Tuesday. Not Jira. Not Confluence. Not a roadmap deck.
Claude Code
Prototyping features directly. First version of any idea.
Cursor
Iterating on agent logic. Reading codebase, writing fixes.
Braintrust
Writing and running eval suites. Evals are the spec.
Arize
Catching hallucinations before travellers do.
LangSmith
Observability on agent traces. Where does it burn budget and fail?
Bolt / v0
Spinning up UI and onboarding demos in an afternoon.
Linear
Issues, not sprints. No velocity theatre.
Loom + phone
User research. Screen recordings over slide decks.
Not on this PM's machine
Jira. Confluence. Roadmap decks.
This is how the PM at Landr works with the agent stack. They don't manage AI features. They delegate to agents, define what failure looks like, own the evals, and review the traces. One real day, start to finish.
The pricing agent shipped a 3% markup experiment on 8% of traffic
No one approved it. It passed all evals. The PM set the threshold — "ship if eval score above 92, traffic below 15%" — and the agent executed. That's the process.
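That standing rule is mechanical enough to write down. A sketch of the gate; the two defaults simply restate the thresholds the PM set, and the function itself is illustrative:

```python
def may_ship(eval_score: float, traffic_pct: float,
             min_score: float = 92.0, max_traffic_pct: float = 15.0) -> bool:
    """The PM's standing rule: ship if the eval score clears the floor
    and the requested traffic slice stays under the cap."""
    return eval_score > min_score and traffic_pct < max_traffic_pct

assert may_ship(eval_score=93.4, traffic_pct=8.0)      # clears both thresholds
assert not may_ship(eval_score=91.0, traffic_pct=8.0)  # fails the score floor
```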
Opened the eval suite in Braintrust
Spotted a regression: the pricing agent's score on "traveller paying in a second currency" had dropped from 94 to 87 on that subset. No human reported this. The eval caught it automatically.
Wrote 4 new evals capturing the failure mode precisely
Not a ticket to engineering. Not a Slack message asking someone to look into it. They wrote the evals in Braintrust — test cases that describe exactly what correct behaviour looks like for dual-currency trips — and committed them directly.
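For concreteness, here is what four such test cases could look like as a Braintrust eval. The `Eval(name, data=..., task=..., scores=...)` entry point and plain-function scorers follow Braintrust's public Python SDK; the project name, cases, task stub, and scorer are placeholders, not Landr's real suite:

```python
from braintrust import Eval

# Four cases pinning down correct behaviour for dual-currency trips.
# Inputs and expectations are illustrative, not Landr's data.
DUAL_CURRENCY_CASES = [
    {"input": {"home": "GBP", "pay": "EUR", "fare": 120.0},
     "expected": {"charged_currency": "EUR"}},
    {"input": {"home": "USD", "pay": "JPY", "fare": 40000.0},
     "expected": {"charged_currency": "JPY"}},
    {"input": {"home": "EUR", "pay": "CHF", "fare": 95.5},
     "expected": {"charged_currency": "CHF"}},
    {"input": {"home": "AUD", "pay": "SGD", "fare": 310.0},
     "expected": {"charged_currency": "SGD"}},
]

def pricing_agent(input):
    """Stand-in for the real pricing agent under test."""
    return {"charged_currency": input["pay"]}

def charged_in_payment_currency(input, output, expected):
    """Score 1 when the agent charged in the traveller's payment currency."""
    return 1.0 if output["charged_currency"] == expected["charged_currency"] else 0.0

Eval(
    "pricing-agent-dual-currency",  # hypothetical project name
    data=lambda: DUAL_CURRENCY_CASES,
    task=pricing_agent,
    scores=[charged_in_payment_currency],
)
```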
Shipped a fix that passed all four new evals
Wrote the prompt correction in Cursor. Ran the eval suite locally. Passed. Deployed to the same 8% of traffic. No handoff. No waiting. The time from "spotted regression" to "fix live" was 3 hours.
Re-ran traffic on the full experiment
With the regression fixed, expanded the pricing experiment from 8% to 22% of traffic. Set a new eval threshold: alert if dual-currency score drops below 90. The agent monitors itself now.
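The expansion plus the new alert floor amounts to a small guardrailed ramp. A sketch; the ramp steps and the 90-point floor restate the numbers in the text, while the function and scores are illustrative:

```python
RAMP = [8, 22, 50, 100]  # traffic percentages, expanded one step at a time

def next_traffic_pct(current_pct: int, dual_currency_score: float,
                     floor: float = 90.0) -> int:
    """Advance the experiment one ramp step, but hold if the
    dual-currency eval score slips below the alert floor."""
    if dual_currency_score < floor:
        return current_pct  # hold: the alert threshold has tripped
    later = [p for p in RAMP if p > current_pct]
    return later[0] if later else current_pct

assert next_traffic_pct(8, 93.0) == 22    # today's expansion to 22%
assert next_traffic_pct(22, 89.5) == 22   # a repeat regression freezes the ramp
```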
Reviewed LangSmith traces. Regression confirmed gone.
Checked 40 traces across the dual-currency subset. All clean. Recorded a three-minute Loom for the engineering team explaining what was found, what was fixed, and what the new eval threshold is. No meeting needed. Tomorrow: expand to 50% of traffic.
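Pulling those traces is scriptable with the langsmith client. A sketch using `Client.list_runs`, which is part of LangSmith's Python SDK; the project name and the "dual-currency" tag are assumptions for illustration:

```python
from itertools import islice
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Sample recent runs from the pricing agent's project and inspect the
# dual-currency subset. Project name and tag are hypothetical.
runs = client.list_runs(project_name="pricing-agent")
dual_currency = (r for r in runs if "dual-currency" in (r.tags or []))

dirty = [r for r in islice(dual_currency, 40) if r.error]
print(f"{len(dirty)} of 40 sampled dual-currency traces had errors")
```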