3/2025 — present

Aviv Sheriff

Founder · Gaming · Product Director · AI

Commits · last 14 months: 1,407

Concurrent agents: 35 peak

[Chart: activity over time, Mar '25 through Mar '26]

Capability expansion 75%

Share of work shipped that wouldn’t exist without AI — vs 27% industry average per Anthropic’s 2026 Agentic Coding Trends Report.

Industry: 27% · Aviv: 75%

Front-end development

Very High

Usage of AI to design and develop front-end software — landing pages, product dashboards, game clients, interactive demos, mobile web, IDE extension UI.

Landing pages — Goldie, Harmony, Motif
Product dashboards — Harmony, Chess Arena, TechTree
Interactive product demos — Goldie, Breakout
Game clients — Chess Arena Phaser + React clients (arena-beta-blitz, ChessArenaTT_Phaser)
Mobile / responsive web — Harmony, TariffTracker
IDE / VS Code extension UI — Motif VS Code + Cursor extension (closed source)
Admin & ops UIs — Harmony affiliate admin, Chess Arena dashboard, steam_page_analyst prompt admin tool

Marketing

High

Usage of AI across the full ad lifecycle — programmatic campaign management, creative generation, copy, A-B testing, performance diagnosis — plus newsletter and social.

Ad lifecycle management — Harmony custom Meta-API campaign tooling
Creative generation — Harmony AI-generated ad images + visuals
Ad A-B testing — Harmony 17-combo Elo tournament
Ad copy generation — Harmony 4 campaigns, 61 signups
Landing-page copy iteration — Harmony, Goldie, Motif
Newsletter writing & publishing — Daily_Steam_Review live cron pipeline (293 AI-era commits), voice-governed newsletter-writing skill
Social content — twitter-engage skill, voice-governed drafts across the portfolio
Performance diagnosis — Harmony CPA + ad-mix root-cause analysis
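
The Elo tournament mentioned under ad A-B testing can be illustrated with a short sketch. This is not Harmony's implementation: the starting rating of 1000, the K-factor of 32, and the idea that each "match" is a pairwise comparison between two ad combos (winner decided by whichever metric the operator picks) are all assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise result in place. k=32 is an assumed K-factor."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def run_tournament(variants, matches, start: float = 1000.0):
    """Rate every variant from a list of (winner, loser) pairs.

    Returns (variant, rating) tuples sorted from strongest to weakest.
    """
    ratings = {variant: start for variant in variants}
    for winner, loser in matches:
        update_elo(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical combo names; real matches would come from campaign data.
leaderboard = run_tournament(
    ["combo_a", "combo_b", "combo_c"],
    [("combo_a", "combo_b"), ("combo_a", "combo_c"), ("combo_c", "combo_b")],
)
```

Pairwise ratings are useful here because they rank many combos without requiring every combo to receive the same traffic; the total rating mass stays constant, so ratings remain comparable across rounds.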

04 Tooling & Integration

Proficiency: Very High

Definition. The AI stack the operator wields — agent runtime, MCPs, retrieval infrastructure (RAG, vector DB, embeddings, rerankers), model routing — with attention to closed-loop leverage, not ghost installations.

Also Known As: Power user of agentic coding tools · tool/function calling · RAG · embeddings · vector search · MCP servers · LLM APIs

Best Practices

CLI-based agent runtime (Claude Code) alongside Cursor
MCP servers connected to real inputs and outputs
RAG / vector / reranker infrastructure in shipped products
Multi-model routing per task
Closed-loop patterns (write-then-verify; ingest-then-act)

Evidence

MCP roster — 10+ servers permissioned and actively called from agent loops: Granola, Gmail, Google Calendar, Notion, Playwright, Supabase, Vercel, Nanobanana, Typefully, RapidAPI-Twttr, chrome-mcp-server, user-sqlite.
Closed-loop workflows via MCP — meeting-processing skill chains Granola → transcript → wiki → CRM (user-sqlite) → Gmail draft → STATUS.md update in one pipeline.
Hybrid SQL + semantic-search agent — Breakout's ai-marketing-agent registers 10 Gemini function-calling tools including find_similar_games (ChromaDB semantic search + Cohere rerank) and query_database (read-only Postgres SELECTs); agent routes between SQL and vector retrieval inside one streaming chat.
Production RAG with retrieve-then-rerank — ChromaDB top-N retrieval followed by optional Cohere reranking across 4,352 Steam strategy games; comparison-mode UI shows "boosted" and "dropped" results so operator can eval rerank quality live (300-500ms no rerank / 800-1500ms with).
Multi-model evaluation router — Steam page scoring engine routes 6 rubrics across 3 models from 2 providers by per-component strength: OpenAI o3 (copy) / Gemini 2.5 Pro (hero trailer video) / GPT-4.1 (tags + screenshots + capsules). One scorer file per component under a shared llm_router.
Multi-model video pipeline with context caching — TechTree's video flow uploads full gameplay video to Gemini File API, creates a CachedContent, reuses across step-1 feature identification + step-2 per-feature captions; rate-limited ThreadPoolExecutor batches 5-10 concurrent under a 900 req/min guardrail (5× throughput).
Vector DB + embeddings across products — ChromaDB managers in both TechTree's VideoProcessingFlow and Frontend; embeddings feed similar-game discovery, canonical-feature matching, and dedup flows. Two shipped products.
CLI / package distribution — Motif CLI on PyPI (v0.7.9, Apache-2.0); Click-based command group, structured-JSON output contracts, versioned analysis prompt independent of package version, daily update check, motif setup installs companion skill into Cursor + Claude Code.
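
The retrieve-then-rerank flow with a "boosted" / "dropped" comparison can be sketched as below. This is a minimal illustration under stated assumptions: `retrieve` and `rerank` are hypothetical callables standing in for the ChromaDB query and the Cohere rerank call, and "boosted" is taken to mean "entered the top-k only after reranking" while "dropped" means "fell out of the top-k". The source does not define those terms precisely.

```python
def compare_rankings(retrieved: list, reranked: list, top_k: int = 5) -> dict:
    """Diff the top_k window before and after reranking."""
    before = retrieved[:top_k]
    after = reranked[:top_k]
    before_set, after_set = set(before), set(after)
    return {
        "boosted": [doc for doc in after if doc not in before_set],
        "dropped": [doc for doc in before if doc not in after_set],
        "kept": [doc for doc in after if doc in before_set],
    }


def retrieve_then_rerank(query, retrieve, rerank=None, top_n: int = 20, top_k: int = 5):
    """Fetch top_n candidates, optionally rerank, and return (results, diff).

    retrieve(query, n) -> list of doc ids   (hypothetical vector-search wrapper)
    rerank(query, docs) -> reordered list   (hypothetical reranker wrapper)
    diff is None when reranking is skipped.
    """
    candidates = retrieve(query, top_n)
    if rerank is None:
        return candidates[:top_k], None
    reranked = rerank(query, candidates)
    return reranked[:top_k], compare_rankings(candidates, reranked, top_k)


# Toy stand-ins so the sketch runs without any external service.
def fake_retrieve(query, n):
    return ["a", "b", "c", "d", "e", "f"][:n]


def fake_rerank(query, docs):
    return list(reversed(docs))


results, diff = retrieve_then_rerank("4x strategy", fake_retrieve, fake_rerank, top_n=6, top_k=3)
```

Keeping the rerank step optional mirrors the latency tradeoff quoted above: the caller can skip it for the faster path and still get the same result shape.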

05 Context Engineering

Proficiency: Very High

Definition. Deliberately managing what the LLM sees — via hierarchical rule files, custom skills, knowledge structures written for agent consumption, and programmatic hooks.

Also Known As: Context engineering · system prompts · prompt guides · behavioral specifications · AI-readable knowledge base · translating user feedback into prompts

Best Practices

Hierarchical CLAUDE.md files that act as routers
Custom, project-specific skills authored and iterated
Knowledge base designed for agent retrieval
Programmatic hooks that protect the knowledge substrate
Catching context leaks mid-session as a reflex

Evidence

Hierarchical rule files — 6 CLAUDE.md files across the tree: repo root + 5 sub-projects (Harmony, content-strategy, vibe-coding-portfolio, vibe-cli, Motif output artifact). Each scoped to its domain.
Router architecture — root CLAUDE.md (95 lines) is a dispatcher, not a knowledge dump: 9-row workflow-triggers table loads scoped files on demand.
User-written skills — 13 venture-specific SKILL.md files under .cursor/skills/. Zero off-the-shelf. Range from 30-line helpers to 352-line workflow drivers.
Skill iteration after real failure — twitter-engage SKILL.md was re-grounded against 20 real tweets (+187/−69 diff) after producing LinkedIn-flavored drafts.
Knowledge base shaped for agent retrieval — 44-page wiki with mandatory frontmatter (decay_class, weighted connections[]), hop-weighted retrieval protocol (hop 1 weight > 0.5, hop 2 > 0.7, cap 8 pages).
Persistent behavioral memory — 19 files at the user-level memory folder (10 feedback overrides + 5 project-state + index + profile). Some explicitly contradict skill defaults (e.g. Opus-locked subagents overrides a cheaper-model skill recommendation).
Context-leak fence at trust boundaries — private/public repo boundary codified as a catch-yourself guard in vibe-cli/CLAUDE.md line 5.
Root CLAUDE.md is a living artifact — 17 commits from 2025-12-31 through 2026-04-20. Notable iterations: "make wiki retrieval mandatory", "add wiki workflow triggers and architecture layer", "add concurrent agent safety rules", "reorganize repo for context isolation".
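
The hop-weighted retrieval protocol can be read as a bounded graph walk. The sketch below is one possible interpretation, not the wiki's actual code: it assumes the wiki is available as a dict mapping each page to a list of (target, weight) connections, that the 8-page cap includes the seed page, and that stronger connections are taken first when the cap forces a cut.

```python
HOP_THRESHOLDS = {1: 0.5, 2: 0.7}  # thresholds quoted in the protocol
MAX_PAGES = 8                      # cap quoted in the protocol


def hop_weighted_retrieve(wiki: dict, seed: str) -> list:
    """Collect the seed plus connected pages, up to two hops away.

    wiki: page -> list of (target_page, weight) tuples (assumed shape).
    A connection is followed only if its weight exceeds the hop's threshold.
    """
    selected = [seed]
    seen = {seed}
    frontier = [seed]
    for hop in (1, 2):
        threshold = HOP_THRESHOLDS[hop]
        candidates = []
        for page in frontier:
            for target, weight in wiki.get(page, []):
                if weight > threshold and target not in seen:
                    seen.add(target)
                    candidates.append((weight, target))
        # Strongest connections first; name breaks ties deterministically.
        candidates.sort(key=lambda item: (-item[0], item[1]))
        frontier = []
        for _, target in candidates:
            if len(selected) >= MAX_PAGES:
                return selected
            selected.append(target)
            frontier.append(target)
    return selected


# Hypothetical page names and weights.
demo_wiki = {
    "pricing": [("funnel", 0.9), ("legal", 0.4), ("ads", 0.6)],
    "funnel": [("onboarding", 0.8), ("faq", 0.6)],
    "ads": [("creative", 0.75)],
}
pages = hop_weighted_retrieve(demo_wiki, "pricing")
```

The stricter threshold on the second hop keeps distant pages out unless the link is strong, which is what bounds the context the agent loads.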

06 Agent Concurrency

Proficiency: Very High

Definition. Running multiple AI sessions in parallel and deploying multiple agents per session as a cognitive default, not a special technique.

Also Known As: Agentic workflows · multi-agent workflows · orchestration patterns · sub-agents · agent skills · hierarchical orchestrators

Best Practices

Multiple Claude Code / Cursor sessions active in parallel
Routine subagent dispatch from a parent orchestrator
Worktree isolation for parallel-write safety
Long-running background agents while operator continues

Concurrent agents over time · daily hourly-peak, Mar 2025 – Apr 2026

Busiest moment: 35 agents

Per-session max: 33 agents

Total work sessions: 2,324

Evidence

Concurrent sessions — 35 AI sessions simultaneously active at peak (2026-03-10), measured by sweep-line across Claude Code + Cursor session intervals.
AI agents per work session — long-tailed distribution: 62.8% single-agent · 14.4% 2–3 · 8.8% 4–7 · 8.4% 8–13 · 5.6% 14+ (12 sessions, max 33 agents).
Autonomy as throughput — operator frames concurrency as "agent runs unattended once a task is kicked off" — not "multiple windows open." Sets the brief for the autonomous-hooks catalog.
Subagents / agent teams — scoped-prompt specialist subagents dispatched from a parent orchestrator; results merged. The DATA-REFRESH-PLAN in-voice codifies the pattern.
Worktree isolation for parallel-write safety — root CLAUDE.md mandates worktree isolation for 5+ file changes; routine worktree-switch before plan-mode-then-dispatch.
Runtime split across agent clients — Claude Code 54.1% / Cursor 45.9% of messages over the 14-month window; CLI-first + GUI-second as parallel first-class tools.
Model routing for subagents — Opus locked as default for all subagents via durable memory; override rule after a Haiku-downgrade regression.
Structure Lock for subagent output — orchestrator rewrites subagent returns to fit existing sections rather than letting them grow the HTML ad hoc (P51).
Self-measurement loop — motif-cli parses Aviv's own ~/.claude/projects/**/*.jsonl + Cursor archive; the chart above is emitted by scripts/motif-concurrency.py, so the evidence is measured by the tool he ships.
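
The sweep-line measurement of peak concurrency can be sketched in a few lines. This is an illustrative version, not the code in scripts/motif-concurrency.py: it assumes each session has already been reduced to a (start, end) pair of comparable timestamps, and it treats intervals as half-open, so a session ending at time t does not overlap one starting at t.

```python
def peak_concurrency(intervals):
    """Return (peak_count, time_of_peak) for a set of session intervals.

    intervals: iterable of (start, end) pairs with start < end. Any
    comparable timestamps work (numbers, datetime objects).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Sort by time; at equal timestamps closes (-1) sort before opens (+1),
    # which implements the half-open interval assumption.
    events.sort(key=lambda event: (event[0], event[1]))

    active = 0
    peak = 0
    peak_at = None
    for when, delta in events:
        active += delta
        if active > peak:
            peak, peak_at = active, when
    return peak, peak_at


# Toy sessions expressed as minutes since midnight.
sessions = [(0, 10), (1, 2), (1, 5), (3, 4)]
peak, at = peak_concurrency(sessions)
```

Sorting the 2n endpoints once makes the whole measurement O(n log n), which matters when the input is thousands of logged sessions rather than a handful.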

07 Planning

Proficiency: High

Definition. Thinking upfront — with sophistication proportional to the task's scope — before implementation begins. The signal is whether planning produces a reusable artifact the agent and operator can both execute against.

Also Known As: Spec-first discipline · prompt-as-spec · PRDs with AI · plan mode · scoping · rapid prototyping with agentic tools · templated deep research

Best Practices

Write acceptance criteria and constraints before invoking
Use plan mode (Claude / Cursor) before code edits
Prototype in Cursor, Bolt, v0, or Lovable
Iterate the plan itself — multiple revisions, not one-shot
Attach rich context: file paths, references, prior research
Reuse templated research patterns for recurring investigations

Evidence

Spec-first tool chain — superpowers:brainstorming → writing-plans → executing-plans skills loaded as first-class skills in every project session; produces paired spec + plan files before code ships.
Plan / spec files committed before implementation — 16 design specs + 13 matching implementation plans under docs/superpowers/ covering Mar 30 – Apr 17 2026 (multi-child support, plan sharing, scraping pipeline, frontend migration, OAuth, dashboard reactivity, subject-aware planner, questionnaire flow, others). Same-day commits with the matching feature branch.
Plan-file iteration before execution — frontend-migration spec landed, then a review-feedback commit absorbed 5 named design corrections (file paths, scope, replacement sources) before the implementation plan was committed.
Spec revised from real-data learnings — curriculum-scraping-pipeline spec rewritten after initial draft to require stage-by-stage founder review ("No building ahead — each stage may change based on data learnings"), then implementation plan shipped.
Wireframing — visual mockups during UX design — landing-page work generates product mockups as PNG assets before writing copy; interactive React mockup added for stakeholder review before the Platform section was finalized.
Pre-planning research subagents — scoped inventory / recon agents dispatched before design. This Data Refresh Plan sends a Phase 0 subagent to rank the full repo universe before any of the 9 Phase-1 research agents run.
Mid-session plan correction against reality checks — operator redirects the agent's proposed A/B structure with a design principle: "stages are defined by purpose, not methodology." Plan reshaped in-flight.
Plans as reusable artifacts across sessions — a brainstorm spec file from a prior session is resumed by pointing at the saved file; assistant summarizes the 5 open questions, operator picks which thread to continue.
Acceptance criteria + decision tables upfront — every design spec opens with a Decisions table enumerating constraints and rationale before architecture (e.g. frontend-migration spec names 9 decisions with explicit rationale before any implementation description).

08 Control & Verification

Proficiency: High

Definition. The operator's exercised judgment over AI output — overriding, correcting, verifying — plus the systems built so agents can run long without babysitting.

Also Known As: Control over the AI · verification habits · human-in-the-loop escalation · confidence thresholds · guardrails · prompt-injection defenses · safety by design

Best Practices

Overriding AI suggestions with explicit reasoning
Verification-before-completion patterns
Hooks that programmatically enforce constraints
Corrections codified into rules/memory immediately
Model-failure handling with proactive guardrails

Evidence

Verification rules in config — root CLAUDE.md §Verification names four post-write checks: SELECT after CRM insert/update, file-existence check after creation, date-diff after STATUS.md edit, em-dash/arrow scan before finalizing newsletter content.
Pre-completion checklists — root CLAUDE.md §Pre-Completion Checklist mandates STATUS.md / CRM / orphaned-TODO persistence before session close. "Rule of thumb: if the user closes this conversation, can they find every TODO? If not, persist it now."
Persistent behavioral-feedback memory as override layer — ~12 feedback_*.md files at the user-level memory folder, each overriding a specific AI default with named reasoning. Memory index is categorical: domain knowledge lives in the wiki, memory is for preferences + behavioral feedback + external references only.
Correction-to-rule loop with session provenance — 5 of the ~12 feedback files carry an originSessionId: frontmatter field tracing the rule back to the session where the AI's behavior was corrected. User corrects once → rule committed to memory → future sessions start with the override loaded.
Explicit override of AI defaults — memory entry overrides a named skill's recommendation: writing-plans suggests cheaper models for mechanical tasks; feedback rule forces Opus across all subagents.
Pre-mortem artifacts — autonomous-hooks catalog — 337-line spec cataloging 8 hooks (observability, auto-mode tuning, secret scanner, protected-branch guard, Stop verification, SessionStart orientation) with reference implementation, build-vs-buy verdict, success criterion, and explicit cut/defer rationale per principle "start with ≤8 hooks; add only when a real failure justifies it." Spec on main; implementation on a feature branch.
Hooks implemented with tests — 11 hook-related commits on a feature branch (H0 observability, H2 secret scanner + file-deny, H3 protect-branch with a self-redirection guard, H5 Stop verification, H7 SessionStart orientation) with a T15 smoke test at 7/7 pass. Observation window active; not yet 30-day-validated.
Prompts-vs-hooks design thesis — explicit principle in the hooks catalog: prompts achieve ~70-90% compliance (community consensus); hooks achieve 100% because they run outside model reasoning. Paired with a "use native features before building" rule — Auto mode + permissions.deny handle the permission layer; custom hooks only where native coverage is insufficient.
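
The first of the post-write checks listed above (SELECT after a CRM insert) follows a write-then-verify shape that can be sketched with Python's built-in sqlite3 module. The `contacts` table, its columns, and the function name are hypothetical; the source only states that the CRM lives in SQLite and that a SELECT follows each insert or update.

```python
import sqlite3


def insert_contact_verified(conn: sqlite3.Connection, name: str, email: str) -> int:
    """Insert a row, then read it back and fail loudly on any mismatch."""
    cursor = conn.execute(
        "INSERT INTO contacts (name, email) VALUES (?, ?)", (name, email)
    )
    conn.commit()
    row_id = cursor.lastrowid

    # Verification step: never assume the write landed as intended.
    row = conn.execute(
        "SELECT name, email FROM contacts WHERE id = ?", (row_id,)
    ).fetchone()
    if row != (name, email):
        raise RuntimeError(f"post-write check failed for contact id {row_id}: {row!r}")
    return row_id


# In-memory database with a hypothetical schema, so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
new_id = insert_contact_verified(conn, "Ada", "ada@example.com")
```

Raising instead of logging is the point of the pattern: a failed check stops the agent's pipeline before later steps build on a write that never happened.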

Eval design

Proficiency: High

Definition. Designing the evaluation system that decides whether an AI feature is shippable.

Best Practices

Closed-vocabulary failure-mode taxonomy with uncertain fallback
Output-diff harness gating every prompt promotion
Per-output confidence + reasoning on the audit trail
Structured-output contract enforced by unit tests
Multi-stage workflow where each stage's output is independently checkable
Golden datasets benchmarking live output against user intent

Evidence

Closed-vocabulary tagger schema — Harmony's 10-dim taxonomy with explicit uncertain fallback; schema miss breaks the build.
A/B output-diff harness — 50/50 split through parallel tagger pipelines; reports per-side counts and declares winner before a prompt ships.
Multi-stage prompt workflow as eval — Breakout runs chained prompts (MegaTimeline.md → per-game *MegaPrompt.md → EnrichMilestones.md); each stage produces a checkable artifact that the next stage verifies or replaces.
Hybrid SQL + semantic agent eval loop — TechTree chatbot answers grounded in vector+SQL store; behavior tuned via regression iterations against named failure modes.
Per-output confidence + reasoning audit trail — every tagged row carries confidence + reasoning; low-confidence rows are the review queue rather than silent drops.
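
A closed-vocabulary schema with an uncertain fallback and a low-confidence review queue might look like the sketch below. Everything specific in it is assumed: the two dimensions and their labels are invented placeholders (the real taxonomy has 10 dimensions that the source does not list), the 0.7 review threshold is arbitrary, and routing any "uncertain" tag to review is a design guess.

```python
# Hypothetical two-dimension vocabulary; "uncertain" is the explicit fallback.
VOCAB = {
    "subject": {"math", "reading", "science", "uncertain"},
    "format": {"video", "worksheet", "uncertain"},
}
REVIEW_THRESHOLD = 0.7  # assumed cut-off, not taken from the source


def validate(row: dict) -> None:
    """Raise on any schema miss so a bad row fails the build or test run."""
    for dimension, allowed in VOCAB.items():
        value = row["tags"].get(dimension)
        if value not in allowed:
            raise ValueError(f"{dimension}={value!r} is outside the closed vocabulary")
    if not 0.0 <= row["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if not row["reasoning"].strip():
        raise ValueError("reasoning is required for the audit trail")


def triage(rows: list) -> tuple:
    """Split validated rows into (accepted, review_queue); nothing is dropped."""
    accepted, review_queue = [], []
    for row in rows:
        validate(row)
        has_uncertain = "uncertain" in row["tags"].values()
        if has_uncertain or row["confidence"] < REVIEW_THRESHOLD:
            review_queue.append(row)
        else:
            accepted.append(row)
    return accepted, review_queue


rows = [
    {"tags": {"subject": "math", "format": "video"}, "confidence": 0.92, "reasoning": "explicit title"},
    {"tags": {"subject": "uncertain", "format": "video"}, "confidence": 0.85, "reasoning": "mixed topics"},
    {"tags": {"subject": "science", "format": "worksheet"}, "confidence": 0.41, "reasoning": "thin description"},
]
accepted, review_queue = triage(rows)
```

Because every row ends up in exactly one of the two lists, the review queue doubles as a record of where the tagger is weakest.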

What would move this to Very High: a checked-in golden dataset (20–50 hand-labeled items) + eval-diff script comparing current output against it, with a prompt-edit commit that references the eval delta.

Model management

Proficiency: Very High

Definition. Mapping product surfaces to a model-tier strategy and owning cost, latency, and quality tradeoffs.

Best Practices

Per-surface model routing (one model per component, not global)
Per-component cost and latency budgets
Cost escape-hatch wired in as a first-class mode
Cross-provider selection owned at the product layer

Evidence

Per-component routing config, externalized — steam_page_analyst's config/llm_routing.json maps 7 scoring components to model providers (capsule/tags/screenshots/about/json_parsing → GPT-4.1, description → o3, hero_trailer → Gemini 2.5 Pro video). Swapping a model is a config edit, not a code edit.
Multi-model video pipeline — TechTree's VideoProcessingFlow routes full-video reasoning to Gemini 2.5 Pro (context-cached) and per-frame feature extraction to a second-tier model; two-step prompt contract.
Tiered research-to-extraction pipeline — Harmony: Haiku for wide research, Sonnet for narrow structured extraction; tier matched to task difficulty.
Product-layer cross-provider selection — Breakout scorer routes per rubric across OpenAI o3, Gemini 2.5 Pro (video), and GPT-4.1; choice explicit in product docs, not hidden in an SDK call.
Batch-API cost escape-hatch — Anthropic Batch API wired as a documented ~50% cost fallback, invoked intentionally for large offline runs, not as a fire-drill.
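
Externalized per-component routing can be illustrated with a small lookup over a JSON document. The component-to-model mapping below restates the one given for steam_page_analyst; the JSON shape, the function names, and the use of display names rather than real API model identifiers are assumptions, since the actual contents of config/llm_routing.json are not shown.

```python
import json

# Assumed file shape: a flat object from component name to model label.
ROUTING_JSON = """
{
  "capsule": "GPT-4.1",
  "tags": "GPT-4.1",
  "screenshots": "GPT-4.1",
  "about": "GPT-4.1",
  "json_parsing": "GPT-4.1",
  "description": "o3",
  "hero_trailer": "Gemini 2.5 Pro"
}
"""


def load_routing(text: str) -> dict:
    """Parse the routing config and reject empty or non-string entries."""
    routing = json.loads(text)
    for component, model in routing.items():
        if not isinstance(model, str) or not model:
            raise ValueError(f"component {component!r} has no model assigned")
    return routing


def model_for(routing: dict, component: str) -> str:
    """Resolve the model for one scoring component; unknown components fail fast."""
    try:
        return routing[component]
    except KeyError:
        raise KeyError(f"no model configured for component {component!r}") from None


routing = load_routing(ROUTING_JSON)
trailer_model = model_for(routing, "hero_trailer")
```

Failing fast on an unknown component is deliberate in this sketch: a silent default model would hide exactly the kind of routing mistake the config is meant to make visible.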

Metrics & instrumentation

Proficiency: High

Definition. Defining what "working in production" means for a probabilistic system — AI-specific OEC, online quality sampling, and task-level acceptance tracked alongside engagement.

Best Practices

Track AI-specific signals, not just engagement
Sample quality online, not only offline
Instrument task acceptance and abandonment
Persist raw events for retrospective replay
Compare early vs late sessions to track drift
Diagnose funnel breaks to their root cause

Evidence

End-to-end paid-acquisition funnel — Harmony: ad impression → CTR → LP conversion → questionnaire completion → plan-build → affiliate click — each step instrumented across 3 campaigns, 61 families, $198 spend.
CPA tracking with root-cause diagnosis — $1.68 CAC at best; C03's higher CPC ($0.16) traced through to Meta auto-allocating 38% of budget to the less efficient ad variant. Root cause, not aggregate number.
AI-specific OEC baked into the product — Motif CLI tracks AIPM, concurrency, autonomy ratio, and output density as its core metric set, then exposes a Growth Scorecard comparing first-25% vs last-25% of sessions to detect drift. Engagement metrics wouldn't catch autonomy or density regressions.
Real-time telemetry with peak tracking — motif live records per-session peak AIPM, peak concurrency, session duration, and persists bests to records.json so users see personal distribution shift over time.
Optimization-event choice — Meta campaign optimization set to account_created over higher-volume profile_completed — the latter optimizes for tire-kickers; the former optimizes for committed users.
Campaign-level commit log — raw Meta-Ads data persisted for replay + retrospective eval.
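
The first-25% vs last-25% comparison behind the Growth Scorecard can be sketched as a simple quartile difference. This is an assumed reading of that comparison, not Motif's code: it takes one metric value per session in chronological order, uses plain means, and requires at least four sessions so each quartile is non-empty.

```python
from statistics import mean


def quartile_drift(values: list) -> dict:
    """Compare the earliest quarter of sessions with the latest quarter.

    values: one metric reading per session, oldest first.
    Returns the two means and their difference (positive means growth).
    """
    if len(values) < 4:
        raise ValueError("need at least 4 sessions to form quartiles")
    quarter = len(values) // 4
    first = mean(values[:quarter])
    last = mean(values[-quarter:])
    return {"first": first, "last": last, "delta": last - first}


# Hypothetical per-session readings of a single metric.
drift = quartile_drift([1, 2, 3, 4, 5, 6, 7, 8])
```

Comparing the two ends of the history, rather than a running average, makes slow drift visible even when week-to-week changes look like noise.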

Product sense

Proficiency: Very High

Definition. Turning a model affordance into a scoped, shippable product — PRDs with model assumptions, eval plans, kill-criteria.

Best Practices

PRD with model assumptions + eval plans + kill-criteria
Prototypes built personally to validate feasibility
Scope updates when model capability shifts mid-build
Ship in weeks, not quarters

Evidence

Model-affordance-to-product in under a month — Motif: conceived, built, published to PyPI as motif-cli with companion VS Code + Open VSX extension; beta pilot with Lutz Finger (Cornell Tech). Scope stayed tight (Python CLI, 2 dependencies, local-only data); release-notes + branching conventions codified in vibe-cli/CLAUDE.md.
Small-n pilot landing unit economics — Harmony C02: 35 signups at $1.67 CAC, 87.5% questionnaire completion, 16% plan-build rate, 15.4% affiliate click rate; clear diagnosis of which ad variants to keep vs. drop. Product sense + metric discipline in one pilot loop.
Deep-research → product demo in 2 weeks — Goldie: OpenAI o3-deep-research produced a 131KB market analysis that became the product's own demo content; low-code app (Lovable) + hero video (Clueso) shipped alongside research. Background-task polling kept long o3 jobs off the UI thread.
Public post-mortem + knowing when to kill — Breakout: 60+ validation calls including SVP-level at Ubisoft, $1M valuation offer received and declined, followed by a public LinkedIn post-mortem. Product sense here is the kill decision, transparently documented.


Breakout

TechTree

Chess Arena

Goldie

Motif

Harmony