PM of the future

The PM at Landr is not what you think.

"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."

— Head of Product, Google Gemini

At Landr, the execution governor is gone. The only bottleneck left is taste. That changes everything about the role — what it demands, what it rewards, and who survives in it.


The role

What we're looking for.

Open position

PM — Seamless Travel Experience

We hire operators. They ship production code in Cursor or Claude Code. They write their own eval suites in Braintrust. They read a LangSmith trace without asking for help. They define what "good" looks like before the agent ships — not after a user complains. They map every way the agent can go wrong before it goes live: the 2am Hong Kong disruption scenario is a test case, not an edge case. Above all, they have taste — the judgment to know what's worth shipping when capacity is infinite and the courage to kill what isn't. A PM who still needs engineering capacity to test an idea won't last a month here.

The number they own

% of trips completed end-to-end without human intervention.

Current
89%

Target
94%

Next milestone
78% 6-month retention


Tasks

A week in the life.

Five days. No meetings that could be a Loom. No waiting for eng capacity. No PRDs. The cycle is: prototype → evals → ship → review → talk to users. Every week.

M

Monday

Prototypes the reroute agent v2 in Claude Code — specifically the moment when two options have near-identical preference scores. Builds the tiebreaker logic themselves. Working demo by noon.

T

Tuesday

Writes 20 evals in Braintrust against last week's failure logs. Three new test cases covering the "traveller paying in a second currency" regression. Evals are the spec — they get written before the fix ships.
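
One of those evals, sketched. This assumes Braintrust's Python SDK; the project name, agent stub, and test rows are hypothetical stand-ins, not Landr's actual suite.

```python
from braintrust import Eval

# Hypothetical cases distilled from last week's failure logs: each row pins
# down correct behaviour for a trip quoted in one currency, paid in another.
dual_currency_cases = [
    {
        "input": {"route": "LHR-HKG", "card_currency": "GBP", "quote_currency": "HKD"},
        "expected": {"charged_currency": "GBP", "fx_disclosed": True},
    },
    {
        "input": {"route": "JFK-NRT", "card_currency": "USD", "quote_currency": "JPY"},
        "expected": {"charged_currency": "USD", "fx_disclosed": True},
    },
]

def run_booking_agent(case: dict) -> dict:
    """Stand-in for the real agent under test."""
    return {"charged_currency": case["card_currency"], "fx_disclosed": True}

def currency_handling(input, output, expected):
    """Scores 1.0 only if the right currency was charged and FX was disclosed."""
    return float(
        output["charged_currency"] == expected["charged_currency"]
        and output["fx_disclosed"] == expected["fx_disclosed"]
    )

Eval(
    "booking-agent",  # hypothetical Braintrust project name
    data=lambda: dual_currency_cases,
    task=run_booking_agent,
    scores=[currency_handling],
)
```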

W

Wednesday

Ships the experiment to 10% of traffic. Not a feature flag — a live agent running on real trips. Watches LangSmith traces in real time for the first two hours. Kills one branch that's behaving unexpectedly.
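
A minimal sketch of how a 10% split like that can work: deterministic bucketing on trip ID, so a trip never flips arms mid-journey. The function name and salt are hypothetical.

```python
import hashlib

def in_experiment(trip_id: str, rollout_pct: float, salt: str = "reroute-v2") -> bool:
    """Deterministically bucket a trip into the experiment arm.

    Hashing (salt + trip_id) keeps assignment stable across retries and
    restarts. Raising rollout_pct later only adds trips to the arm; it
    never reshuffles the ones already in it.
    """
    digest = hashlib.sha256(f"{salt}:{trip_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rollout_pct

# 10% of live trips get the v2 reroute agent; everyone else stays on v1.
agent_version = "v2" if in_experiment("trip_8421", 0.10) else "v1"
```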

T

Thursday

Reviews the eval deltas from Wednesday's run. One metric improved 4 points. Another regressed on corporate travellers with connecting flights. Writes 6 new evals covering the regression. Fixes it before Monday.

F

Friday

Calls 3 travellers who hit the failure mode this week. Not a survey — a real 20-minute conversation. The failure mode becomes a test case. The test case becomes an eval. The eval prevents it next week.
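
The failure-to-eval loop, sketched under assumptions. The dataset path and case schema are hypothetical; the point is that a 20-minute call ends as one appended row in next week's eval set.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

EVAL_DATASET = Path("evals/failures.jsonl")  # hypothetical dataset location

def failure_to_eval_case(trip_id: str, transcript: str, expected: dict) -> None:
    """Turn this week's failure mode into next week's regression test.

    The conversation with the traveller yields the expected behaviour;
    the trace yields the input. Appending one JSONL row is the whole
    "failure mode becomes a test case" step.
    """
    case = {
        "input": {"trip_id": trip_id, "transcript": transcript},
        "expected": expected,
        "added": datetime.now(timezone.utc).isoformat(),
        "source": "user-interview",
    }
    EVAL_DATASET.parent.mkdir(parents=True, exist_ok=True)
    with EVAL_DATASET.open("a") as f:
        f.write(json.dumps(case) + "\n")
```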


Habits

What they refuse to do.

These aren't inefficiencies that got cut in the name of speed. They are deliberate rejections. Each one is a verdict that the legacy PM ritual was protecting bad ideas, not shipping good ones.

No PRDs

They prototype in Claude Code instead. A working demo in 3 hours costs less in attention than a PRD in 3 days — and it answers the question the PRD never does: does this actually work? Evidence precedes documentation.

No sprint ceremonies

The two-week sprint cycle was a governor on bad ideas — it rationed capacity, which forced prioritisation, which caught some nonsense before it shipped. Remove the capacity constraint and the ritual becomes pure overhead. Judgment is the governor now.

No handoff drift

No PRD → Figma → ticket relay. The PM builds in the codebase directly. Each handoff in the traditional model introduced 20–30% information loss. At Landr, the person who defines the problem is the person who ships the first version of the solution.

No rechecking agent output manually

If the PM is spot-checking every booking the agent produces, the evals aren't good enough. That's a sign to fix the eval suite, not to add a human review step. Manual rechecking is an eval failure disguised as diligence.

No feature-based success metrics

Features shipped is vanity. A PM who reports "we launched 12 features this quarter" without showing job-completion rate is describing activity, not value. The only number that matters: % of trips completed end-to-end without human intervention.
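
The metric is one function. A minimal sketch, assuming each trip record carries a completion flag and an intervention count; the data model is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trip:
    trip_id: str
    completed: bool            # traveller reached the destination, trip closed
    human_interventions: int   # support touches, manual rebooks, overrides

def autonomous_completion_rate(trips: list[Trip]) -> float:
    """% of trips completed end-to-end without human intervention."""
    if not trips:
        return 0.0
    clean = sum(1 for t in trips if t.completed and t.human_interventions == 0)
    return 100.0 * clean / len(trips)

# 89% today, 94% target: the gap between those two numbers is the roadmap.
```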


Tools

The stack they live in.

What's on this PM's screen on any given Tuesday. Not Jira. Not Confluence. Not a roadmap deck.

Claude Code

Prototyping features directly. First version of any idea.

Cursor

Iterating on agent logic. Reading codebase, writing fixes.

Braintrust

Writing and running eval suites. Evals are the spec.

Arize

Catching hallucinations before travellers do.

LangSmith

Observability on agent traces. Where does it burn budget and fail?

Bolt / v0

Spinning up UI and onboarding demos in an afternoon.

Linear

Issues, not sprints. No velocity theatre.

Loom + phone

User research. Screen recordings over slide decks.

Not on this PM's machine

Jira · Confluence · Product roadmap deck · Sprint board · Quarterly OKR spreadsheet

Processes

A full day with the agents.

This is how the PM at Landr works with the agent stack. They don't manage AI features. They delegate to agents, define what failure looks like, own the evals, and review the traces. One real day, timestamped.

overnight

The pricing agent shipped a 3% markup experiment on 8% of traffic

No one approved it. It passed all evals. The PM set the threshold — "ship if eval score above 92, traffic below 15%" — and the agent executed. That's the process.
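
That threshold as an executable policy, sketched. The function and numbers mirror the example above; nothing here is Landr's real deployment code.

```python
def agent_may_ship(eval_score: float, traffic_pct: float,
                   score_floor: float = 92.0, traffic_cap: float = 15.0) -> bool:
    """The PM sets the thresholds once; the agent checks them on every run.

    "Ship if eval score above 92, traffic below 15%" as a standing policy
    rather than an approval meeting.
    """
    return eval_score > score_floor and traffic_pct < traffic_cap

# Overnight: the markup experiment scored above the floor and asked for 8%.
assert agent_may_ship(eval_score=93.1, traffic_pct=8.0)
```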

9:00am

Opened the eval suite in Braintrust

Spotted a regression: the pricing agent had slipped on "traveller paying in a second currency." The eval score on that subset dropped from 94 to 87. No human reported this. The eval caught it automatically.
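
How a suite catches this with no human in the loop, in sketch form: diff per-subset scores between runs and flag anything that drops past a tolerance. Subset names and the tolerance are hypothetical.

```python
def subset_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 2.0) -> dict[str, tuple[float, float]]:
    """Flag any eval subset whose score dropped more than `tolerance` points.

    This is how a 94 -> 87 drop on "traveller paying in a second currency"
    surfaces without anyone filing a bug.
    """
    return {
        subset: (baseline[subset], score)
        for subset, score in current.items()
        if subset in baseline and baseline[subset] - score > tolerance
    }

last_week = {"dual_currency": 94.0, "corporate_connections": 91.0}
this_run  = {"dual_currency": 87.0, "corporate_connections": 91.5}
print(subset_regressions(last_week, this_run))
# {'dual_currency': (94.0, 87.0)}
```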

10:30am

Wrote 4 new evals capturing the failure mode precisely

Not a ticket to engineering. Not a Slack message asking someone to look into it. They wrote the evals in Braintrust — test cases that describe exactly what correct behaviour looks like for dual-currency trips — and committed them directly.

12:00pm

Shipped a fix that passed all four new evals

Wrote the prompt correction in Cursor. Ran the eval suite locally. Passed. Deployed to the same 8% of traffic. No handoff. No waiting. The time from "spotted regression" to "fix live" was 3 hours.

2:00pm

Expanded traffic on the pricing experiment

With the regression fixed, expanded the pricing experiment from 8% to 22% of traffic. Set a new eval threshold: alert if dual-currency score drops below 90. The agent monitors itself now.
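
What "the agent monitors itself" might look like concretely: a guard that runs after every eval cycle and shrinks traffic back to the last known-good split when the subset breaks the floor. The alert hook and numbers are stand-ins.

```python
def alert(message: str) -> None:
    print(f"[PAGE] {message}")  # stand-in for the real alerting channel

def dual_currency_guard(score: float, traffic_pct: float,
                        floor: float = 90.0, known_good_pct: float = 8.0) -> float:
    """Runs after every eval cycle. Holds the 22% rollout while the
    dual-currency subset stays above the floor; shrinks back to the
    pre-expansion split the moment it doesn't.
    """
    if score < floor:
        alert(f"dual_currency eval at {score:.1f}, below floor {floor:.0f}")
        return known_good_pct
    return traffic_pct

traffic = dual_currency_guard(score=93.4, traffic_pct=22.0)  # holds at 22.0
```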

4:00pm

Reviewed LangSmith traces. Regression confirmed gone.

Checked 40 traces across the dual-currency subset. All clean. Wrote a 3-minute Loom for the engineering team explaining what was found, what was fixed, and what the new eval threshold is. No meeting needed. Tomorrow: expand to 50% of traffic.
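
Pulling those traces programmatically, assuming the LangSmith Python client. The project name and tagging scheme are hypothetical.

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Pull recent non-errored runs for the pricing agent, then narrow to the
# dual-currency subset. Project name and tag are hypothetical.
runs = client.list_runs(project_name="pricing-agent", error=False)
dual_currency = [run for run in runs if "dual_currency" in (run.tags or [])]

print(f"{len(dual_currency)} clean dual-currency traces to review")
```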