Session 12 — Phase 3.2 Complete — String Manipulation & Static Typing
Date: April 2026
Status: Complete ✅
Milestone
Phase 3.2 is complete. Terse handles strings as first-class values throughout the full pipeline — lexer, parser, IR, runtime, LLVM emitter. All seven planned string operations compile to native binaries and produce correct results when executed.
The session also locked in two foundational language decisions that will shape every phase to come: Terse is statically typed, and function call expressions are how Terse calls return-valued functions.
// Test programs verified end-to-end:
greeting = "hello world" // exit code 0 — clean run
size = length("hello world") // exit code 11
joined = concat("hello", "world") // length 10 — "helloworld"
slice = substring("hello world", 6, 11) // length 5 — "world"
found = contains("hello world", "world") // exit code 1 — true
match = equals("hello", "hello") // exit code 1 — true
shouted = replace("hello hello hello", "hello", "hi") // length 8 — "hi hi hi"
Every result above was produced by a real native binary, not the interpreter.
Operations Built
Op #1 — String literals through the pipeline
A string literal like "hello" now travels cleanly through every gate. The lexer was already producing STRING tokens correctly. The parser learned to wrap them in a new StringLiteral AST node when assigned to a variable. The IR compiler emits a new StoreString IR op. The LLVM emitter reuses the existing string interning system (intern_string) to produce global constants and stack pointers.
The assignment greeting = "hello world" compiles to LLVM IR that allocates a stack slot for greeting and stores a pointer to the interned global constant.
Op #2 — length(s)
The first Terse function with a return value. Required teaching the parser about function call expressions — a fundamentally different construct from the existing statement-level function calls.
The parser now recognizes name = identifier(arg) and produces a FunctionCallExpression AST node. The IR compiler dispatches on function name to emit StringLength. A new C runtime function terse_str_length wraps strlen. The LLVM emitter calls it via call i32 @terse_str_length(i8* %x).
This op also introduced the separate runtime file for the LLVM pipeline — terse_runtime_llvm.c — distinct from terse_runtime.h which serves the C emitter.
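A minimal sketch of what the length primitive could look like in terse_runtime_llvm.c, assuming the i32 return type from the declare line maps to a C int. The NULL check is an assumption; the text above does not say how length handles NULL.

```c
#include <string.h>

/* Sketch of the length primitive: a thin wrapper over strlen.
 * Returns int so it lines up with the i32 return type on the LLVM side.
 * ASSUMPTION: NULL is treated as an empty string (length 0). */
int terse_str_length(const char *s) {
    if (s == NULL) {
        return 0;
    }
    return (int)strlen(s);
}
```

The cast narrows size_t to int, which is fine for the short strings exercised here but would need a range check for very large inputs.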
Op #3 — concat(a, b)
Two-argument function call. Required generalizing FunctionCallExpression from a single arg field to a list args, and updating the parser's argument-collection loop to handle commas.
terse_str_concat is the first runtime function that allocates memory — malloc is now part of the LLVM runtime. The function returns a freshly allocated i8* that becomes the result variable's value.
This op also exposed the memory leak that's now formally tracked for Phase 5: every concat call allocates and never frees. Fine for short-lived programs, fatal for long-running AI systems.
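The allocation behavior described above can be sketched as follows. This is an illustration rather than the actual runtime source: treating NULL as an empty string and returning NULL when malloc fails are both assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of concat: allocates one buffer big enough for both inputs plus
 * the terminating NUL. The caller owns the result; in Phase 3.2 nothing
 * ever frees it, which is the leak tracked for Phase 5.
 * ASSUMPTIONS: NULL inputs act as "", and malloc failure returns NULL. */
char *terse_str_concat(const char *a, const char *b) {
    if (a == NULL) a = "";
    if (b == NULL) b = "";

    size_t len_a = strlen(a);
    size_t len_b = strlen(b);

    char *out = malloc(len_a + len_b + 1);
    if (out == NULL) {
        return NULL;
    }

    memcpy(out, a, len_a);
    memcpy(out + len_a, b, len_b + 1); /* +1 also copies b's NUL terminator */
    return out;
}
```

Every call returns a distinct heap pointer, so a program that concatenates in a loop grows its heap on each iteration until it exits.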
Op #4 — substring(s, start, end)
Three-argument function call with two integer arguments. Required teaching the parser to accept NUMBER tokens as function arguments (numbers were previously only valid in math expressions).
terse_str_substring implements defensive clamping semantics that are normative across all Terse backends:
- Negative start is clamped to 0
- end greater than the string length is clamped to the string length
- If start > end, returns an empty string ""
- Always allocates a new null-terminated string
These semantics are documented as language-level guarantees, not implementation details. Future backends (FPGA emitter, alternative compilers) must reproduce them exactly.
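The clamping rules listed above translate almost line for line into C. The sketch below is one way to write them, not the shipped function; NULL handling and the NULL return on allocation failure are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of substring with the clamping rules from the text.
 * The range is half-open: bytes start..end-1 are copied.
 * ASSUMPTIONS: NULL input acts as "", malloc failure returns NULL. */
char *terse_str_substring(const char *s, int start, int end) {
    if (s == NULL) s = "";
    int len = (int)strlen(s);

    if (start < 0) start = 0;   /* negative start clamps to 0 */
    if (end > len) end = len;   /* end past the string clamps to its length */

    /* start > end (after clamping) yields the empty string */
    int count = (start > end) ? 0 : end - start;

    char *out = malloc((size_t)count + 1); /* always a fresh allocation */
    if (out == NULL) {
        return NULL;
    }
    if (count > 0) {
        memcpy(out, s + start, (size_t)count);
    }
    out[count] = '\0';
    return out;
}
```

Because end is clamped before the start > end comparison, a start beyond the end of the string also falls into the empty-string case and never reads out of bounds.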
Op #5 — contains(haystack, needle)
The first Terse function with a boolean return type. This forced the static-typing decision: should booleans be i32 0/1 (Python style), i8 0/1 (C style), or LLVM's native i1 (Rust style)?
Decision: i1. Real boolean type, distinct from integers, enforced by the LLVM type system. The runtime function returns int on the C side because C lacks a portable single-bit type, but the LLVM emitter truncs the result down to i1 at the call site. That truncation is where Terse's type discipline visibly applies.
terse_str_contains wraps strstr. The C standard library has been doing substring search for fifty years; Terse now uses it.
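A sketch of the C side of this op, under the assumption that NULL inputs are treated as empty strings (the source mentions documented null-handling but does not spell it out here). The point worth noticing is that the function returns exactly 0 or 1: an LLVM trunc from i32 to i1 keeps only the lowest bit, so any other "truthy" value such as 2 would truncate to false.

```c
#include <string.h>

/* Sketch of contains: wraps strstr and normalizes the result to 0 or 1.
 * Returning exactly 0/1 keeps the later i32 -> i1 truncation lossless.
 * ASSUMPTION: NULL inputs act as "". */
int terse_str_contains(const char *haystack, const char *needle) {
    if (haystack == NULL) haystack = "";
    if (needle == NULL) needle = "";

    /* strstr returns the haystack itself for an empty needle,
     * so an empty needle is always "found". */
    return strstr(haystack, needle) != NULL ? 1 : 0;
}
```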
Op #6 — equals(a, b)
The second boolean op. Reuses every pattern from contains.
terse_str_equals wraps strcmp with an inversion — strcmp returns 0 when strings are equal (zero meaning "no difference"), which is the opposite of what a boolean named equals should return. The runtime inverts the result so Terse code reads naturally. This is Law III: the runtime works harder so the user doesn't have to remember strcmp's confusing return convention.
match = equals("hello", "hello") // true
mismatch = equals("Hello", "hello") // false (case-sensitive)
Both branches verified end-to-end — true case returned exit code 1, false case returned exit code 0.
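The inversion described above fits in a single expression. A minimal sketch, again assuming NULL inputs behave like empty strings:

```c
#include <string.h>

/* Sketch of equals: strcmp returns 0 for identical strings, so the
 * comparison against 0 flips that into the 1-means-true convention.
 * ASSUMPTION: NULL inputs act as "". */
int terse_str_equals(const char *a, const char *b) {
    if (a == NULL) a = "";
    if (b == NULL) b = "";
    return strcmp(a, b) == 0 ? 1 : 0;
}
```

Comparing against 0 rather than negating with ! makes the intent explicit and guarantees the result is exactly 0 or 1.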
Op #7 — replace(s, old, new)
The most complex runtime function in Phase 3.2. Three-argument function call returning a new string with all occurrences of old replaced by new (matches Python and Rust; JavaScript's first-only String.replace is widely considered a wart).
The C function uses a two-pass algorithm: first pass counts non-overlapping occurrences to compute the exact output buffer size; second pass copies bytes with replacement. This lets us malloc the right size upfront rather than reallocating during the copy.
Edge cases documented: empty old returns a copy unchanged (replacing "nothing" with anything is a no-op), NULL inputs treated as empty strings, always allocates a fresh string even when no replacements happen.
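The two-pass approach and the edge cases above can be sketched like this. It is an illustration of the technique, not the actual runtime code; the parameter is named new_str here only so the sketch also compiles as C++, and the NULL return on allocation failure is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of replace-all using two passes over the input.
 * NULL inputs act as "" and an empty `old` yields an unchanged copy,
 * as described in the text. ASSUMPTION: malloc failure returns NULL. */
char *terse_str_replace(const char *s, const char *old, const char *new_str) {
    if (s == NULL) s = "";
    if (old == NULL) old = "";
    if (new_str == NULL) new_str = "";

    size_t s_len = strlen(s);
    size_t old_len = strlen(old);
    size_t new_len = strlen(new_str);

    /* Pass 1: count non-overlapping matches (none when old is empty). */
    size_t count = 0;
    if (old_len > 0) {
        const char *p = strstr(s, old);
        while (p != NULL) {
            count++;
            p = strstr(p + old_len, old);
        }
    }

    /* Each match swaps old_len bytes for new_len bytes. */
    size_t out_len = s_len - count * old_len + count * new_len;
    char *out = malloc(out_len + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Pass 2: copy unmatched runs, writing new_str at each match. */
    char *dst = out;
    const char *src = s;
    if (old_len > 0) {
        const char *hit;
        while ((hit = strstr(src, old)) != NULL) {
            size_t run = (size_t)(hit - src);
            memcpy(dst, src, run);
            dst += run;
            memcpy(dst, new_str, new_len);
            dst += new_len;
            src = hit + old_len;
        }
    }
    strcpy(dst, src); /* remaining tail plus the NUL terminator */
    return out;
}
```

Both passes advance by old_len after each hit, which is what makes the matches non-overlapping and keeps the size computed in pass 1 consistent with the bytes written in pass 2.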
Architecture Decisions
Path B — Separate Runtime for LLVM
terse_runtime.h (the existing C emitter runtime) is built around static functions and fixed-size structs. It's tightly coupled to the C emitter pipeline and not easily callable from external LLVM IR.
Rather than refactor it, Phase 3.2 introduces a separate runtime file for the LLVM pipeline:
src/transpiler/
  terse_runtime.h — runtime for the C emitter pipeline (unchanged)
  terse_runtime_llvm.c — runtime for the LLVM emitter pipeline (NEW)
The two runtimes solve different problems. terse_runtime.h simulates Terse semantics by holding state in C structs. terse_runtime_llvm.c is a thin shim over libc that implements primitives LLVM IR can't easily express inline. The build for LLVM-compiled programs now compiles the generated IR together with this runtime file: clang output.ll terse_runtime_llvm.c -o output.exe.
When Phase 4 lands and the C emitter is potentially deprecated in favor of LLVM, the runtimes can be unified. For now, separate is appropriate.
Function Call Expressions — Generic Parsing
Instead of making each new function (length, concat, etc.) a keyword in the lexer, the parser uses generic identifier(arg, arg, ...) parsing. Any identifier followed by ( is treated as a function call.
This means adding a new function requires zero lexer changes — the lexer never knows about specific function names. Functions are dispatched by name in the IR compiler and the LLVM emitter. New functions are added by registering them in the dispatch logic, not by extending the keyword set.
This scales to any number of functions without touching the lexer. User-defined functions and built-in functions are syntactically identical, which is the right design for a language.
Static Typing as a Language Commitment
Phase 3.2's boolean return values forced the question: does Terse care about types?
Decision: yes. Terse is a statically typed language. Variables, function arguments, and function return values have known types at compile time. The compiler's role is to enforce type correctness — Law III applies (the compiler works harder so you don't have to).
Type checking enforcement is deferred to Phase 3.4 alongside the standard library. But the infrastructure for static typing lands here, in Phase 3.2:
- IR ops specify what type they produce (StringLength returns int, StringContains returns bool)
- LLVM emit methods use the correct LLVM type per operation (i1 for bool, i32 for int, i8* for string)
- The C runtime returns concrete C types that map cleanly to LLVM types
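The idea behind that infrastructure is a table that says, for each built-in, which Terse type it produces and how that type is spelled in LLVM IR. The real compiler is written in Python, so the C below is only an illustration of the lookup; TerseType, TerseBuiltin, terse_llvm_type, and terse_find_builtin are hypothetical names, not identifiers from the codebase.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: this sketches the type table idea only. */
typedef enum { TERSE_INT, TERSE_BOOL, TERSE_STRING } TerseType;

/* LLVM spelling for each Terse type, as listed in the text. */
const char *terse_llvm_type(TerseType t) {
    switch (t) {
        case TERSE_INT:    return "i32";
        case TERSE_BOOL:   return "i1";
        case TERSE_STRING: return "i8*";
    }
    return NULL;
}

typedef struct {
    const char *name;    /* function name as written in Terse source */
    TerseType   returns; /* type known at compile time */
} TerseBuiltin;

static const TerseBuiltin TERSE_BUILTINS[] = {
    {"length",    TERSE_INT},
    {"concat",    TERSE_STRING},
    {"substring", TERSE_STRING},
    {"contains",  TERSE_BOOL},
    {"equals",    TERSE_BOOL},
    {"replace",   TERSE_STRING},
};

/* Looks a built-in up by name; returns NULL for unknown functions. */
const TerseBuiltin *terse_find_builtin(const char *name) {
    size_t n = sizeof TERSE_BUILTINS / sizeof TERSE_BUILTINS[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(TERSE_BUILTINS[i].name, name) == 0) {
            return &TERSE_BUILTINS[i];
        }
    }
    return NULL;
}
```

With a table like this, a later type checker would only need to compare the looked-up return type against the type expected at the use site.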
The decision rules out a Python-style "everything is an object, types are runtime" path. Terse is heading toward Rust-style discipline where the compiler refuses to produce a binary if types don't match — appropriate for a language meant to host AI systems where correctness, performance, and safety matter.
Documented Function Semantics
Operations that have edge cases (empty inputs, out-of-range indices, etc.) have their semantics documented as language-level guarantees, not implementation details:
- substring's defensive clamping (negative start → 0, end > length → length, start > end → empty)
- replace's all-occurrences, non-overlapping, empty-old-is-no-op behavior
- equals and contains null-handling
- contains's empty-needle returning true (matching strstr/Python/Rust)
When alternative Terse backends arrive (FPGA emitter, alternative compilers), they must reproduce these semantics exactly, not approximately. This is a feature of the language, not the implementation.
Memory Ownership — Heap and Leak
String operations that allocate (concat, substring, replace) currently malloc and never free. For short-lived programs this is genuinely fine — the operating system reclaims everything on exit. For long-running processes (NCI hosting Terse for years) this is fatal.
This is a known issue, formally tracked for Phase 5 (long-running runtime). The fix won't be a single technique — it will likely combine reference counting for short-lived strings, interning for repeated concept references, and possibly a generational collector for general allocations.
For Phase 3.2 the memory leak is acceptable. Capturing it in CONTEXT.md and the docs site means future Lesley (or future contributors) can find the deferred decision in context rather than rediscovering it.
Files Changed
Existing files (modified)
- src/interpreter/lexer.py — added LPAREN and RPAREN token types
- src/interpreter/parser.py — StringLiteral AST node, FunctionCallExpression AST node with comma-separated args list, updated parse_assign to detect string literals and function call expressions
- src/transpiler/ir.py — seven new IR ops: StoreString, StringLength, StringConcat, StringSubstring, StringContains, StringEquals, StringReplace
- src/transpiler/ir_compiler.py — dispatch logic for each of the seven string ops in the FunctionCallExpression branch of AssignStatement
- src/transpiler/llvm_emitter.py — six new emit methods, six new declare lines in the LLVM preamble, six new dispatch dict entries
- src/transpiler/test_llvm_pipeline.py — updated build instructions to include terse_runtime_llvm.c in the clang invocation
New files
- src/transpiler/terse_runtime_llvm.c — the LLVM pipeline's C runtime, ~120 lines containing six string functions
- examples/string_test.trs — test programs exercising each operation
Documentation
- CONTEXT.md updated with the Phase 3.2 design decisions (type system, runtime split, documented semantics, naming conventions)
- Docs site updated: index, syntax reference, concepts, roadmap
Verification Method
Because Terse doesn't have print yet, numeric verification used a temporary hand-edit of output.ll to make main return the value of interest as the program's exit code (echo $?). For boolean returns, this required adding a zext i1 %x to i32 instruction to extend the boolean back up to an integer before returning.
Each operation was verified by:
- Writing a .trs file using the operation
- Generating output.ll via python test_llvm_pipeline.py
- Inspecting the IR output for correct ops
- Inspecting output.ll for correct LLVM IR
- Compiling with clang output.ll terse_runtime_llvm.c -o test.exe
- Hand-editing the ret i32 0 at the end of main to return a meaningful value
- Re-compiling and running, checking exit code
When print arrives in a future session, this verification flow will be replaced with proper output assertions. The current method works and was the path of least resistance for end-to-end testing without a full output mechanism.
Toolchain
- LLVM 22 + Clang 22 via MSYS2 MINGW64 (unchanged from Phase 3)
- C runtime files compiled directly by clang alongside .ll files
- No new dependencies — terse_runtime_llvm.c uses only <string.h> and <stdlib.h>
Related Projects
| Project | Status |
|---|---|
| NCI | Session 32+ — Stage 1 Terse integration confirmed in production |
| Terse | Phase 3.2 complete — strings, static typing, runtime split |
Next Session Goals
Phase 3.2.5 — Lists.
After i1, lists are the second LLVM-level type that is genuinely a "Terse type" rather than something borrowed from libc. The phase introduces:
- A list type in the runtime (pointer + length + capacity)
- list() to create, append(list, item) to grow
- length(list) polymorphic with the existing string length
- at(list, n) for indexed access
- each item in list iteration using the existing each keyword for grammar consistency
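Since this phase is not built yet, the following is only a speculative sketch of what a pointer + length + capacity list could look like in the C runtime. Every name (TerseList, terse_list_new, terse_list_append, terse_list_length, terse_list_at), the void* element type, the doubling growth policy, and the NULL result for out-of-range indices are assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical layout: pointer + length + capacity, as in the plan above. */
typedef struct {
    void **items;    /* element storage; NULL until the first append */
    int    length;   /* number of elements in use */
    int    capacity; /* number of elements allocated */
} TerseList;

/* list(): creates an empty list, or returns NULL if allocation fails. */
TerseList *terse_list_new(void) {
    TerseList *list = malloc(sizeof *list);
    if (list == NULL) {
        return NULL;
    }
    list->items = NULL;
    list->length = 0;
    list->capacity = 0;
    return list;
}

/* append(list, item): returns 1 on success, 0 if growing the buffer failed.
 * ASSUMPTION: capacity starts at 4 and doubles when full. */
int terse_list_append(TerseList *list, void *item) {
    if (list->length == list->capacity) {
        int new_cap = (list->capacity == 0) ? 4 : list->capacity * 2;
        void **grown = realloc(list->items, (size_t)new_cap * sizeof(void *));
        if (grown == NULL) {
            return 0;
        }
        list->items = grown;
        list->capacity = new_cap;
    }
    list->items[list->length] = item;
    list->length++;
    return 1;
}

/* length(list) */
int terse_list_length(const TerseList *list) {
    return list->length;
}

/* at(list, n): ASSUMPTION: out-of-range indices yield NULL. */
void *terse_list_at(const TerseList *list, int n) {
    if (n < 0 || n >= list->length) {
        return NULL;
    }
    return list->items[n];
}
```

Doubling the capacity keeps append amortized constant time, at the cost of holding up to twice the memory actually in use. Whatever out-of-range behavior is eventually chosen would presumably be documented as a language-level guarantee, like substring's clamping.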
Lists unlock Phase 3.3 (file I/O, where read_lines returns a list of strings) and the deferred split operation from Phase 3.2.