Step 1 - Let's Pop
A pop is the structural spine that a musician group forms an event around. Think of it as a song, but described from the inside out: instead of writing notes, you write what the song wants to be doing at each moment.
The pop has a schema -- a recipe book that says which fields are legal, which are required, and what they mean. Let's look at what we made, and why we made each choice.
Reader Assumptions
This tutorial is for game-dev/content work, not Rust engine work. You will work in YAML specs and scripts (Lua preferred, Rhai fallback).
Quick definitions:
- Rally: the surrounding survival/logistics game layer.
- Kurzz: a playable music activity inside Rally.
- POP Roll: the timeline view of note and frame events in Walker UI.
- Jazz Balance: the ensemble state view (roles, step, cam, agent activity).
- Runner: the server process that emits SSE events.
- Walker: the TypeScript UI client subscribing to those events.
Player action vocabulary used by these tutorials:
- play
- wait
- repeat
- answer
- support
- leave space
No music theory is required. The system is designed around legible intent and recoverable mistakes, not perfect-note scoring.
What We Built
The file is scenarios/kurzz/pop/follow_the_leader.yaml.
The goal was a tutorial song: slow, legible, and shaped around a call-and-response interaction where the AI musicians play a phrase and then explicitly step back to invite the human to echo it.
The Header
apiVersion: net.plantange/v1
kind: simulation.pop
ensemble: follow_the_leader.pop
meta:
  title: "Follow the Leader"
  uuid: jg-pop-follow-the-leader-v1
Every simulation document in this project starts with apiVersion and kind.
kind: simulation.pop is what tells the runtime this is a pop spec and not, say,
an area or a quest. ensemble is a logical name that the server uses to associate
this pop with a musician group. The uuid is a stable identifier so that if we
rename or move the file, references to this specific pop don't break.
Tempo and Resolution
tempo: 72.0
steps_per_beat: 4
beats_per_bar: 4
72 BPM is slow by jazz standards -- much of the repertoire sits between 120 and 180 BPM. At 72, one bar of four beats lasts about 3.3 seconds (4 × 60 / 72). That gives a human musician enough time to hear a phrase, recognize its shape, and respond to it before the next cycle begins. We are explicitly designing for perceptual lag.
steps_per_beat: 4 means the internal time grid divides each beat into four
sub-steps. This is the resolution the engine uses when it decides whether an agent
is about to emit. The human doesn't see steps; they matter only to the musicians.
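The timing arithmetic above can be checked with a short sketch. Python is used purely for illustration here (the project's own scripts are Lua or Rhai), and the function names are hypothetical, not part of the engine.

```python
def bar_seconds(tempo_bpm, beats_per_bar):
    """Duration of one bar in seconds at the given tempo."""
    return beats_per_bar * 60.0 / tempo_bpm


def step_seconds(tempo_bpm, steps_per_beat):
    """Duration of one grid sub-step in seconds at the given tempo."""
    return 60.0 / tempo_bpm / steps_per_beat


# Values taken from the spec: tempo 72.0, steps_per_beat 4, beats_per_bar 4.
print(round(bar_seconds(72.0, 4), 2))   # 3.33
print(round(step_seconds(72.0, 4), 3))  # 0.208
```

So each sub-step on the grid is roughly a fifth of a second at this tempo.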
The Substrate
substrate:
  chord_loop: [Dm7, G7, Cmaj7, Cmaj7]
  key_root: C
  loop_duration_beats: 8.0
  tempo_feel: patient_swing
  tonal_mode: major_swing
  tutorial_mode: true
The substrate is the harmonic and rhythmic context the musicians work inside.
chord_loop is the chord progression -- in our case, a two-bar II-V-I.
Why II-V-I?
Dm7 → G7 → Cmaj7 is one of the most teachable phrase shapes in jazz. The tension rises on the V chord (G7) and resolves cleanly on the I (Cmaj7). Jazz students typically learn this pattern early because it has a clear emotional arc that a listener can follow even without any music theory knowledge: something wants to go somewhere, and then it arrives. The fourth slot in the loop holds Cmaj7 a second time to give the human space to settle before the next call begins.
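As a rough illustration of how a chord loop maps onto beats, here is a Python sketch. It assumes the chords in chord_loop divide loop_duration_beats evenly, which the spec does not state, and chord_at_beat is a hypothetical helper, not an engine function.

```python
CHORD_LOOP = ["Dm7", "G7", "Cmaj7", "Cmaj7"]
LOOP_DURATION_BEATS = 8.0


def chord_at_beat(beat, chord_loop=CHORD_LOOP, loop_beats=LOOP_DURATION_BEATS):
    """Return the chord sounding at `beat`, wrapping around the loop.

    Assumption: every chord occupies an equal share of the loop.
    """
    beats_per_chord = loop_beats / len(chord_loop)
    position = beat % loop_beats              # wrap into the current cycle
    index = int(position // beats_per_chord)  # which slot we are in
    return chord_loop[index]


print(chord_at_beat(0.0))  # Dm7
print(chord_at_beat(2.0))  # G7
print(chord_at_beat(6.0))  # Cmaj7
print(chord_at_beat(8.0))  # Dm7 (the loop has wrapped)
```

Under that even-spacing assumption each chord lasts two beats, and the repeated Cmaj7 fills the last quarter of the loop.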
tempo_feel: patient_swing and tonal_mode: major_swing are semantic hints.
They don't control the musicians directly -- they inform downstream systems
(a future arranger, a renderer, a UI label) what the intended feel is.
tutorial_mode: true is a flag we added speculatively. It doesn't do anything
yet, but it marks this pop as something that downstream tutorial systems --
ren triggers, Walker UI overlays, quest conditions -- can key on without
having to inspect the song structure.
Sections
sections:
  - { id: call, semantic: call, start_s: 0.0, end_s: 13.5 }
  - { id: respond, semantic: respond, start_s: 13.5, end_s: 27.0 }
The 8-bar form divides into two 4-bar halves by time. At 72 BPM in 4/4, 4 bars last 13.33 seconds. The section boundaries are rounded up to 13.5 and 27.0, a difference of about 0.17 seconds per half that is small enough not to be audible.
semantic is the meaningful label. id is the machine name. Both matter:
id is used in rules and arc references; semantic is what the rule
evaluator matches on when you write when: { section_semantic: call }.
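A minimal sketch of looking up the active section by time, in Python for illustration only. It assumes half-open intervals (start_s inclusive, end_s exclusive); the spec does not say how a boundary instant is resolved, and section_at is a hypothetical helper.

```python
SECTIONS = [
    {"id": "call", "semantic": "call", "start_s": 0.0, "end_s": 13.5},
    {"id": "respond", "semantic": "respond", "start_s": 13.5, "end_s": 27.0},
]


def section_at(t_s, sections=SECTIONS):
    """Return the section whose [start_s, end_s) interval contains t_s, else None."""
    for section in sections:
        if section["start_s"] <= t_s < section["end_s"]:
            return section
    return None


print(section_at(5.0)["semantic"])   # call
print(section_at(13.5)["semantic"])  # respond
print(section_at(27.0))              # None (past the end of the form)

# Gap between four exact bars at 72 BPM and the rounded 13.5 s boundary.
four_bars_s = 4 * 4 * 60.0 / 72.0
print(round(13.5 - four_bars_s, 2))  # 0.17
```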
The Arc
arc:
  - id: call_entry
    section_id: call
    start_s: 0.0
    note: "intent=LeadPhrase;clarity=Simple;space=Held;phrase=II_V_I"
  - id: respond_entry
    section_id: respond
    start_s: 13.5
    note: "intent=InviteEcho;clarity=Open;space=Offered;phrase=Your_Turn"
The arc is a sequence of annotated moments. Each entry is a landmark in the song's
emotional and structural shape. The note is a semicolon-separated set of key=value
hints that describe what the song intends at that moment.
These are not instructions to the musicians. They are signals to the rest of the
system -- the UI, the ren trigger layer, future arrangers -- about what is happening
narratively. intent=InviteEcho is a statement of purpose that a cutscene script
or a Walker overlay can read to know when to show the human a "your turn" cue.
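A consumer that wants to read these hints has to split the note string first. The sketch below shows one way to do that in Python, for illustration only; parse_arc_note is a hypothetical helper, and rejecting entries without an "=" is an assumption about how strict a reader should be.

```python
def parse_arc_note(note):
    """Split a 'k1=v1;k2=v2' arc note into a dict of hints."""
    hints = {}
    for part in note.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % part)
        hints[key.strip()] = value.strip()
    return hints


hints = parse_arc_note("intent=InviteEcho;clarity=Open;space=Offered;phrase=Your_Turn")
print(hints["intent"])  # InviteEcho
print(hints["phrase"])  # Your_Turn
```

An overlay could then check hints["intent"] == "InviteEcho" rather than matching on the raw string.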
Axes
axes:
  - id: phrase_space
    value_kind: ordered_categorical
    options:
      - { value: held, rank: 0 }
      - { value: narrowing, rank: 1 }
      - { value: open, rank: 2 }
      - { value: offered, rank: 3 }
    default: held
  - id: phrase_clarity
    value_kind: ordered_categorical
    options:
      - { value: complex, rank: 0 }
      - { value: moderate, rank: 1 }
      - { value: clear, rank: 2 }
      - { value: simple, rank: 3 }
    default: simple
Axes are the pop's semantic vocabulary. They name the dimensions that matter for this particular song.
phrase_space describes how much melodic space the ensemble is holding versus
offering to the human. During call, it is held -- the musicians have the floor.
During respond, it becomes offered -- the floor is yours.
phrase_clarity measures how legible the leading phrase is. In a tutorial we
want this pegged at simple throughout. In a more advanced song it might drop to
complex during an improvised break to signal that the human is on their own.
The rank field on each option is what makes an axis ordered_categorical. Rank
allows the engine to evaluate whether a song is monotonically increasing in
complexity, or whether an agent is moving toward or away from a target value --
without treating the category names as numbers directly.
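The two rank-based checks described above can be sketched as follows. This is a Python illustration, not the engine's evaluator; the helper names are hypothetical, and "monotonically increasing" is read here as non-decreasing, which is an assumption.

```python
# Ranks copied from the phrase_space axis options.
PHRASE_SPACE_RANK = {"held": 0, "narrowing": 1, "open": 2, "offered": 3}


def is_non_decreasing(values, ranks):
    """True if the sequence of category names never steps down in rank."""
    ordered = [ranks[v] for v in values]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def moving_toward(previous, current, target, ranks):
    """True if `current` is strictly closer in rank to `target` than `previous` was."""
    return abs(ranks[current] - ranks[target]) < abs(ranks[previous] - ranks[target])


print(is_non_decreasing(["held", "narrowing", "offered"], PHRASE_SPACE_RANK))  # True
print(is_non_decreasing(["open", "held"], PHRASE_SPACE_RANK))                  # False
print(moving_toward("held", "open", "offered", PHRASE_SPACE_RANK))             # True
```

The category names never get compared as strings or numbers; only their ranks do.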
Rules
rules:
  - when: { section_semantic: call }
    set:
      phrase_space: held
      phrase_clarity: simple
      suggested_pop_tension: 0.30
      silence_policy: lead_active
      tutorial_cue: watch
  - when: { section_semantic: respond }
    set:
      phrase_space: offered
      phrase_clarity: simple
      suggested_pop_tension: 0.20
      silence_policy: space_open
      tutorial_cue: your_turn
Rules are the song's active policy. Each rule watches for a section semantic and sets a collection of guidance fields. These fields enter the shared chem space that all agents read from.
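To make the matching step concrete, here is a small Python sketch of a rule lookup. It is an illustration under assumptions, not the engine's evaluator: guidance_for is a hypothetical name, only the section_semantic condition is handled, and letting later matching rules override earlier ones is a guess about precedence.

```python
RULES = [
    {
        "when": {"section_semantic": "call"},
        "set": {
            "phrase_space": "held",
            "phrase_clarity": "simple",
            "suggested_pop_tension": 0.30,
            "silence_policy": "lead_active",
            "tutorial_cue": "watch",
        },
    },
    {
        "when": {"section_semantic": "respond"},
        "set": {
            "phrase_space": "offered",
            "phrase_clarity": "simple",
            "suggested_pop_tension": 0.20,
            "silence_policy": "space_open",
            "tutorial_cue": "your_turn",
        },
    },
]


def guidance_for(section_semantic, rules=RULES):
    """Merge the `set` fields of every rule whose `when` matches the semantic."""
    guidance = {}
    for rule in rules:
        if rule["when"].get("section_semantic") == section_semantic:
            guidance.update(rule["set"])  # assumption: later rules win
    return guidance


print(guidance_for("respond")["tutorial_cue"])        # your_turn
print(guidance_for("call")["suggested_pop_tension"])  # 0.3
print(guidance_for("bridge"))                         # {}
```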
suggested_pop_tension is a soft target: it tells agents how energized and active
the song wants the ensemble to be. 0.30 during the call, 0.20 during the respond.
The tension drops when the human gets the floor -- the AI musicians are supposed to
get quieter, not louder.
silence_policy is guidance for how agents should treat pauses. lead_active
means the leading voice keeps going; space_open means agents should resist
filling the space.
tutorial_cue is the field we care most about for the tutorial system. It is a
named signal that other parts of the system -- Walker UI overlays, ren triggers,
quest conditions -- can watch. When tutorial_cue: your_turn is active, a Walker
panel can show a visual prompt; a ren trigger can fire a "the floor is yours" cutscene;
a quest can increment a counter. We haven't built any of those yet, but the hook
is in the pop spec so that when we do, the song itself drives the tutorial flow,
not hardcoded timers in the UI.
Step 2 applies the same approach to an incident: named fields in spec, a Lua script watching them, and a stable event contract the Walker layer can subscribe to without coupling to script internals.
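None of the tutorial_cue consumers exist yet, so the following is only a sketch of the idea: react when the cue changes rather than on a timer. It is written in Python for illustration; cue_changes is a hypothetical helper, and the list of guidance snapshots stands in for whatever event stream the real system provides.

```python
def cue_changes(snapshots):
    """Yield (index, cue) each time tutorial_cue differs from the previous snapshot."""
    previous = None
    for index, snapshot in enumerate(snapshots):
        cue = snapshot.get("tutorial_cue")
        if cue != previous:
            yield index, cue
            previous = cue


# Hypothetical guidance snapshots, one per evaluation tick.
ticks = [
    {"tutorial_cue": "watch"},
    {"tutorial_cue": "watch"},
    {"tutorial_cue": "your_turn"},
    {"tutorial_cue": "your_turn"},
]
print(list(cue_changes(ticks)))  # [(0, 'watch'), (2, 'your_turn')]
```

A consumer built this way fires once per transition, which is the behavior a "your turn" prompt or a quest counter would want.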
The Companion Jazz Spec
The pop spec describes the song's structural and semantic shape. The jazz spec describes the ensemble's musical behavior inside that shape.
venues/tutorial_follow_the_leader.yaml is the companion. It says:
- form_length: 8 -- 8 bars, matching the pop
- cam_shape: constant { value: 0.3 } -- no arc pressure; no build, no climax
- vm_floor: 0.2 -- musicians are active but not competitive
- 3 musicians at agent_hz: 3.0 -- a small, slow-moving group
The two files are linked by naming convention, not by a hard reference. When the session control API is implemented, a Walker UI will present a list of venues; this venue will be selectable and will load both specs together.
What's Next
→ Step 2: Build The Stuck Note Incident
The pop is structurally complete. Step 2 adds the first incident: a musical problem with a clear mechanism, a Lua script, and an event contract the Walker UI can display.
The tutorial_cue guidance field gives us a named hook that the cutscene and quest
layers can key on when they are written. The chord loop and tempo were chosen for
maximum legibility.
Further content work (deferred; tracked in the plan):
- Author the tutorial ren beats (the "watch the band" and "your turn" cutscene moments) in scenarios/kurzz/ren/ and ren_triggers/
- Sketch the tutorial quest objective: "complete one full respond section"
- Add a second pop spec at a slightly higher difficulty once the tutorial arc is stable
Further code work (tracked in the plan): venue loading and pop spec broadcast, which are the two integration steps that make this YAML file go live in a session.
References
- plantangenet/music/docs/MUSIC_IN_PLANTANGENET.md -- How "music" works in plantangenet
- plantangenet/music/docs/NEW_POP.md -- How to develop a new pop
- scenarios/kurzz/docs/A_new_drummer.md -- lore and voice reference for the Kurzz scenario