Step 1 - Let's Pop

A pop is the structural spine that a musician group forms an event around. Think of it as a song, but described from the inside out: instead of writing notes, you write what the song wants to be doing at each moment.

The pop has a schema -- a recipe book that says which fields are legal, which are required, and what they mean. Let's look at what we made, and why we made each choice.

Reader Assumptions

This tutorial is for game-dev/content work, not Rust engine work. You will work in YAML specs and scripts (Lua preferred, Rhai fallback).

Quick definitions:

Rally: the surrounding survival/logistics game layer.
Kurzz: a playable music activity inside Rally.
POP Roll: the timeline view of note and frame events in Walker UI.
Jazz Balance: the ensemble state view (roles, step, cam, agent activity).
Runner: the server process that emits SSE events.
Walker: the TypeScript UI client subscribing to those events.

Player action vocabulary used by these tutorials:

play
wait
repeat
answer
support
leave space

No music theory is required. The system is designed around legible intent and recoverable mistakes, not perfect-note scoring.

What We Built

The file is scenarios/kurzz/pop/follow_the_leader.yaml.

The goal was a tutorial song: slow, legible, and shaped around a call-and-response interaction where the AI musicians play a phrase and then explicitly step back to invite the human to echo it.

The Header

apiVersion: net.plantange/v1
kind: simulation.pop
ensemble: follow_the_leader.pop

meta:
  title: "Follow the Leader"
  uuid: jg-pop-follow-the-leader-v1

Every simulation document in this project starts with apiVersion and kind. kind: simulation.pop is what tells the runtime this is a pop spec and not, say, an area or a quest. ensemble is a logical name that the server uses to associate this pop with a musician group. The uuid is a stable identifier so that if we rename or move the file, references to this specific pop don't break.
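
A quick pre-flight check makes the header contract concrete. This is a minimal sketch, not the runtime's actual loader: the function name header_problems is hypothetical, the header is written as a plain Python dict rather than parsed from YAML, and treating meta.uuid as required is our assumption.

```python
# Hypothetical pre-flight check; the real runtime's loading logic is not shown in this tutorial.
header = {
    "apiVersion": "net.plantange/v1",
    "kind": "simulation.pop",
    "ensemble": "follow_the_leader.pop",
    "meta": {"title": "Follow the Leader", "uuid": "jg-pop-follow-the-leader-v1"},
}

def header_problems(doc):
    """Return a list of readable problems; an empty list means the header looks fine."""
    problems = []
    # Every simulation document starts with apiVersion and kind.
    for field in ("apiVersion", "kind"):
        if field not in doc:
            problems.append(f"missing {field}")
    kind = doc.get("kind")
    if kind is not None and kind != "simulation.pop":
        problems.append(f"kind is {kind!r}, expected 'simulation.pop'")
    # Assumption: we treat the stable identifier as required.
    if not doc.get("meta", {}).get("uuid"):
        problems.append("missing meta.uuid")
    return problems

print(header_problems(header))  # []
```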

Tempo and Resolution

tempo: 72.0
steps_per_beat: 4
beats_per_bar: 4

72 BPM is slow by jazz standards -- most jazz is 120-180. At 72, one bar lasts about 3.3 seconds. That gives a human musician enough time to hear a phrase, recognize its shape, and respond to it before the next cycle begins. We are explicitly designing for perceptual lag.

steps_per_beat: 4 means the internal time grid divides each beat into four sub-steps. This is the resolution the engine uses when it decides whether an agent is about to emit. The human doesn't see steps; they matter only to the musicians.
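
The timing numbers above follow directly from the three fields. A small worked calculation, using only the values in the spec (the variable names are ours):

```python
tempo = 72.0          # beats per minute
steps_per_beat = 4
beats_per_bar = 4

seconds_per_beat = 60.0 / tempo
seconds_per_bar = seconds_per_beat * beats_per_bar
seconds_per_step = seconds_per_beat / steps_per_beat
steps_per_bar = steps_per_beat * beats_per_bar

print(f"beat: {seconds_per_beat:.3f} s")   # beat: 0.833 s
print(f"bar:  {seconds_per_bar:.3f} s")    # bar:  3.333 s
print(f"step: {seconds_per_step:.3f} s")   # step: 0.208 s
print(f"steps per bar: {steps_per_bar}")   # steps per bar: 16
```

So the engine's grid ticks roughly every fifth of a second, while the human is listening at the scale of whole bars.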

The Substrate

substrate:
  chord_loop: [Dm7, G7, Cmaj7, Cmaj7]
  key_root: C
  loop_duration_beats: 8.0
  tempo_feel: patient_swing
  tonal_mode: major_swing
  tutorial_mode: true

The substrate is the harmonic and rhythmic context the musicians work inside. chord_loop is the chord progression -- in our case, a two-bar II-V-I.
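
To see how four chord names fit into eight beats, here is a minimal sketch that maps a beat position to a chord. It assumes each entry in chord_loop gets an equal share of loop_duration_beats (two beats each here); the spec excerpt does not state how the engine actually divides the loop, and chord_at is a hypothetical helper.

```python
chord_loop = ["Dm7", "G7", "Cmaj7", "Cmaj7"]
loop_duration_beats = 8.0

def chord_at(beat):
    """Return the chord sounding at a 0-based absolute beat position.

    Assumption: the loop is split evenly across its chord entries.
    """
    beats_per_chord = loop_duration_beats / len(chord_loop)
    position = beat % loop_duration_beats       # wrap around the loop
    return chord_loop[int(position // beats_per_chord)]

for beat in (0, 2, 4, 6, 8):
    print(beat, chord_at(beat))
# 0 Dm7
# 2 G7
# 4 Cmaj7
# 6 Cmaj7
# 8 Dm7
```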

Why II-V-I?

Dm7 → G7 → Cmaj7 is one of the most teachable phrase shapes in jazz. The tension rises on the V chord (G7) and resolves cleanly on the I (Cmaj7). Jazz students typically learn this pattern early because it has a clear emotional arc that a listener can follow even without any music theory knowledge: something wants to go somewhere, and then it arrives. The fourth slot in the loop holds Cmaj7 a second time to give the human space to settle before the next call begins.

tempo_feel: patient_swing and tonal_mode: major_swing are semantic hints. They don't control the musicians directly -- they inform downstream systems (a future arranger, a renderer, a UI label) what the intended feel is.

tutorial_mode: true is a flag we added speculatively. It doesn't do anything yet, but it marks this pop as something that downstream tutorial systems -- ren triggers, Walker UI overlays, quest conditions -- can key on without having to inspect the song structure.

Sections

sections:
  - { id: call, semantic: call, start_s: 0.0, end_s: 13.5 }
  - { id: respond, semantic: respond, start_s: 13.5, end_s: 27.0 }

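The 8-bar form divides into two 4-bar halves by time. At 72 BPM in 4/4, 4 bars last 13.33 seconds, so the 13.5-second boundary falls about 0.17 seconds (a little under one step) after the bar line, and the 27.0-second end falls about 0.33 seconds after the end of bar 8. We treat that rounding difference as negligible for a tutorial.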

semantic is the meaningful label. id is the machine name. Both matter: id is used in rules and arc references; semantic is what the rule evaluator matches on when you write when: { section_semantic: call }.
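
A time-based section lookup can be sketched in a few lines. This is an illustration, not the rule evaluator itself: section_at is a hypothetical helper, and treating each range as half-open (start included, end excluded) is our assumption, chosen so that 13.5 belongs to exactly one section.

```python
sections = [
    {"id": "call", "semantic": "call", "start_s": 0.0, "end_s": 13.5},
    {"id": "respond", "semantic": "respond", "start_s": 13.5, "end_s": 27.0},
]

def section_at(t_s):
    """Return the section covering time t_s, or None if no section covers it.

    Assumption: ranges are half-open, [start_s, end_s).
    """
    for section in sections:
        if section["start_s"] <= t_s < section["end_s"]:
            return section
    return None

print(section_at(13.4)["semantic"])  # call
print(section_at(13.5)["semantic"])  # respond
print(section_at(27.0))              # None
```

The value a rule would match on is the semantic field of the returned section; the id is what arc entries point at.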

The Arc

arc:
  - id: call_entry
    section_id: call
    start_s: 0.0
    note: "intent=LeadPhrase;clarity=Simple;space=Held;phrase=II_V_I"
  - id: respond_entry
    section_id: respond
    start_s: 13.5
    note: "intent=InviteEcho;clarity=Open;space=Offered;phrase=Your_Turn"

The arc is a sequence of annotated moments. Each entry is a landmark in the song's emotional and structural shape. The note is a semicolon-separated set of key=value hints that describe what the song intends at that moment.

These are not instructions to the musicians. They are signals to the rest of the system -- the UI, the ren trigger layer, future arrangers -- about what is happening narratively. intent=InviteEcho is a statement of purpose that a cutscene script or a Walker overlay can read to know when to show the human a "your turn" cue.
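
Because the note is a flat key=value string, a consumer can read it with a few lines of parsing. A minimal sketch, assuming keys and values contain no ";" or "=" characters of their own; parse_note is a hypothetical helper, not part of the runtime.

```python
def parse_note(note):
    """Split 'k=v;k=v' into a dict; empty segments are skipped."""
    hints = {}
    for segment in note.split(";"):
        if not segment.strip():
            continue
        # partition splits on the first "=" only.
        key, _, value = segment.partition("=")
        hints[key.strip()] = value.strip()
    return hints

note = "intent=InviteEcho;clarity=Open;space=Offered;phrase=Your_Turn"
hints = parse_note(note)
print(hints["intent"])  # InviteEcho
print(hints["space"])   # Offered
```

An overlay could then branch on hints["intent"] rather than on section timing.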

Axes

axes:
  - id: phrase_space
    value_kind: ordered_categorical
    options:
      - { value: held, rank: 0 }
      - { value: narrowing, rank: 1 }
      - { value: open, rank: 2 }
      - { value: offered, rank: 3 }
    default: held

  - id: phrase_clarity
    value_kind: ordered_categorical
    options:
      - { value: complex, rank: 0 }
      - { value: moderate, rank: 1 }
      - { value: clear, rank: 2 }
      - { value: simple, rank: 3 }
    default: simple

Axes are the pop's semantic vocabulary. They name the dimensions that matter for this particular song.

phrase_space describes how much melodic space the ensemble is holding versus offering to the human. During call, it is held -- the musicians have the floor. During respond, it becomes offered -- the floor is yours.

phrase_clarity measures how legible the leading phrase is. In a tutorial we want this pegged at simple throughout. In a more advanced song it might drop to complex during an improvised break to signal that the human is on their own.

The rank field on each option is what makes an axis ordered_categorical. Rank allows the engine to evaluate whether a song is monotonically increasing in complexity, or whether an agent is moving toward or away from a target value -- without treating the category names as numbers directly.
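
The two comparisons mentioned above can be sketched using rank alone. Both helpers are hypothetical illustrations of what rank makes possible, not the engine's implementation. The ordering check here allows repeated values (non-decreasing) rather than requiring a strict increase.

```python
# Ranks copied from the phrase_space axis.
phrase_space = {"held": 0, "narrowing": 1, "open": 2, "offered": 3}

def is_non_decreasing(values, ranks):
    """True if the sequence of category names never moves to a lower rank."""
    seq = [ranks[v] for v in values]
    return all(a <= b for a, b in zip(seq, seq[1:]))

def moving_toward(previous, current, target, ranks):
    """True if current is closer in rank to target than previous was."""
    return abs(ranks[target] - ranks[current]) < abs(ranks[target] - ranks[previous])

print(is_non_decreasing(["held", "held", "open", "offered"], phrase_space))  # True
print(is_non_decreasing(["held", "offered", "open"], phrase_space))          # False
print(moving_toward("held", "open", "offered", phrase_space))                # True
```

The category names never get compared as strings or coerced into numbers; only the declared ranks are.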

Rules

rules:
  - when: { section_semantic: call }
    set:
      phrase_space: held
      phrase_clarity: simple
      suggested_pop_tension: 0.30
      silence_policy: lead_active
      tutorial_cue: watch

  - when: { section_semantic: respond }
    set:
      phrase_space: offered
      phrase_clarity: simple
      suggested_pop_tension: 0.20
      silence_policy: space_open
      tutorial_cue: your_turn

Rules are the song's active policy. Each rule watches for a section semantic and sets a collection of guidance fields. These fields enter the shared chem space that all agents read from.
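
Rule evaluation can be pictured as a lookup from section semantic to a bag of guidance fields. The sketch below is an assumption-heavy illustration: guidance_for is a hypothetical name, and letting later matching rules overwrite earlier ones is our choice, not documented engine behavior.

```python
rules = [
    {"when": {"section_semantic": "call"},
     "set": {"phrase_space": "held", "phrase_clarity": "simple",
             "suggested_pop_tension": 0.30, "silence_policy": "lead_active",
             "tutorial_cue": "watch"}},
    {"when": {"section_semantic": "respond"},
     "set": {"phrase_space": "offered", "phrase_clarity": "simple",
             "suggested_pop_tension": 0.20, "silence_policy": "space_open",
             "tutorial_cue": "your_turn"}},
]

def guidance_for(section_semantic):
    """Merge the set blocks of every rule whose condition matches."""
    guidance = {}
    for rule in rules:
        if rule["when"].get("section_semantic") == section_semantic:
            # Assumption: later rules win on conflicting fields.
            guidance.update(rule["set"])
    return guidance

print(guidance_for("respond")["tutorial_cue"])        # your_turn
print(guidance_for("call")["suggested_pop_tension"])  # 0.3
print(guidance_for("bridge"))                         # {}
```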

suggested_pop_tension is a soft target: it tells agents how energized and active the song wants the ensemble to be. It is 0.30 during call and 0.20 during respond. The tension drops when the human gets the floor -- the AI musicians are supposed to get quieter, not louder.

silence_policy is guidance for how agents should treat pauses. lead_active means the leading voice keeps going; space_open means agents should resist filling the space.

tutorial_cue is the field we care most about for the tutorial system. It is a named signal that other parts of the system -- Walker UI overlays, ren triggers, quest conditions -- can watch. When tutorial_cue: your_turn is active, a Walker panel can show a visual prompt; a ren trigger can fire a "the floor is yours" cutscene; a quest can increment a counter. We haven't built any of those yet, but the hook is in the pop spec so that when we do, the song itself drives the tutorial flow, not hardcoded timers in the UI.
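
Since none of those consumers exist yet, the following is only a sketch of the pattern they might share: watch the guidance stream and react when tutorial_cue changes. The timeline data, its (time, guidance) shape, and the cue_changes helper are all hypothetical; a real consumer would read events from the Runner rather than from a list.

```python
def cue_changes(timeline):
    """Yield (time_s, cue) each time tutorial_cue differs from the previous sample."""
    previous = None
    for time_s, guidance in timeline:
        cue = guidance.get("tutorial_cue")
        if cue != previous:
            yield time_s, cue
            previous = cue

# Hypothetical samples of the guidance fields over one pass through the form.
timeline = [
    (0.0, {"tutorial_cue": "watch"}),
    (6.0, {"tutorial_cue": "watch"}),
    (13.5, {"tutorial_cue": "your_turn"}),
    (20.0, {"tutorial_cue": "your_turn"}),
]

print(list(cue_changes(timeline)))
# [(0.0, 'watch'), (13.5, 'your_turn')]
```

A consumer built this way reacts to the song's own state, so changing section timings in the YAML needs no matching change in the UI.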

Step 2 applies the same approach to an incident: named fields in spec, a Lua script watching them, and a stable event contract the Walker layer can subscribe to without coupling to script internals.

The Companion Jazz Spec

The pop spec describes the song's structural and semantic shape. The jazz spec describes the ensemble's musical behavior inside that shape.

venues/tutorial_follow_the_leader.yaml is the companion. It says:

form_length: 8 -- 8 bars, matching the pop
cam_shape: constant { value: 0.3 } -- no arc pressure; no build, no climax
vm_floor: 0.2 -- musicians are active but not competitive
3 musicians at agent_hz: 3.0 -- a small, slow-moving group
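
Because the two files are separate, it is worth checking that the venue's form length and the pop's section timings describe the same span. A small worked calculation, using only values quoted in this tutorial (the variable names are ours):

```python
tempo = 72.0                  # from the pop spec
beats_per_bar = 4             # from the pop spec
form_length_bars = 8          # form_length in the venue spec
pop_end_s = 27.0              # end_s of the last pop section

form_seconds = form_length_bars * beats_per_bar * 60.0 / tempo
drift_s = pop_end_s - form_seconds

print(f"form: {form_seconds:.2f} s, pop: {pop_end_s:.2f} s, drift: {drift_s:.2f} s")
# form: 26.67 s, pop: 27.00 s, drift: 0.33 s
```

The pop's sections run about a third of a second longer than eight bars, which is the rounding introduced when the section boundaries were set to 13.5 and 27.0.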

The two files are linked by naming convention, not by a hard reference. When the session control API is implemented, the Walker UI will present a list of venues; this venue will be selectable and will load both specs together.

What's Next

→ Step 2: Build The Stuck Note Incident

The pop is structurally complete. Step 2 adds the first incident: a musical problem with a clear mechanism, a Lua script, and an event contract the Walker UI can display.

The tutorial_cue guidance field gives us a named hook that the cutscene and quest layers can key on when they are written. The chord loop and tempo were chosen for maximum legibility.

Further content work (deferred; tracked in the plan):

Author the tutorial ren beats (the "watch the band" and "your turn" cutscene moments) in scenarios/kurzz/ren/ and ren_triggers/
Sketch the tutorial quest objective: "complete one full respond section"
Add a second pop spec at a slightly higher difficulty once the tutorial arc is stable

Further code work (tracked in the plan): venue loading and pop spec broadcast, which are the two integration steps that make this YAML file go live in a session.

References

plantangenet/music/docs/MUSIC_IN_PLANTANGENET.md -- how "music" works in plantangenet
plantangenet/music/docs/NEW_POP.md -- how to develop a new pop
scenarios/kurzz/docs/A_new_drummer.md -- lore and voice reference for the Kurzz scenario