Turbocharging BASIC: The Hybrid Assembly Technique That Defined 1980s Programming

Mandelbrot set rendered on the Oric Atmos using hybrid BASIC and 6502 assembly

How bedroom coders squeezed impossible performance from 8-bit home computers — and how I revisited it on an Oric Atmos

The Problem Every 1980s Programmer Knew

If you owned a home computer in the 1980s — a ZX Spectrum, Commodore 64, BBC Micro, Amstrad CPC, or Oric Atmos — you knew BASIC. It was right there when you switched on, cursor blinking, ready for you to type 10 PRINT "HELLO". BASIC was how most people learned to program. It was friendly, forgiving, and immediately accessible.

It was also painfully slow.

BASIC on these machines was interpreted. Every time your program reached line 120 ZR=R2-I2+CR, the interpreter had to parse that text, look up variable names character by character, convert values to and from its internal floating-point format, perform the arithmetic, and store the result, all on a CPU clocked at a few megahertz at most (1 MHz on the Oric) with no floating-point hardware whatsoever. A single multiply of two decimal numbers could take thousands of CPU cycles, with the interpreter burning most of that time just figuring out what you were asking it to do.

For printing text menus or simple games, this was fine. But for anything computationally intensive — 3D graphics, music synthesis, fast arcade action, or fractal rendering — pure BASIC was hopeless.

Assembly language, on the other hand, was blisteringly fast. The same 6502 or Z80 that crawled through interpreted BASIC could execute hand-crafted machine code at full speed, one instruction every few clock cycles. The CPU wasn't slow; the interpreter was.

But assembly was hard. There were no comfortable IDEs, no debuggers worth the name, and a misplaced byte could crash the machine with no error message — just a frozen screen or a reboot. For most hobbyist programmers, writing an entire application in assembly was impractical.

So the community developed a pragmatic solution: write the program in BASIC, but replace the performance-critical inner loop with a small assembly routine. Let BASIC handle the easy parts — user interface, file I/O, coordinate calculations, screen setup — and let assembly handle the tight numerical core where the CPU spends 90% of its time.

This hybrid technique became one of the defining patterns of 1980s home computing. Magazine type-in listings regularly included blocks of DATA statements containing machine code bytes, loaded into memory with a POKE loop and called from BASIC with USR() or CALL. It was how games got their smooth scrolling, how demos achieved their impossible effects, and how ambitious programs like fractal renderers became feasible on hardware that, on paper, had no business running them.

A Concrete Example: Mandelbrot on the Oric Atmos

To explore this technique hands-on, I implemented a Mandelbrot set renderer on the Oric Atmos — a 1983 home computer built around the 6502 CPU running at 1 MHz, with 48KB of RAM and a 240×200 monochrome high-resolution display.

The Mandelbrot set is a perfect test case. The algorithm is simple to express:

For each pixel (px, py) on screen:
    Map to complex coordinate c = cr + ci*i
    Set z = 0
    Repeat up to MAXITER times:
        z = z*z + c
        If |z|² ≥ 4, the point escaped — stop
    If the point never escaped, it's in the set — draw it

The outer loops (scanning pixels, mapping coordinates, drawing) are straightforward. But the inner iteration, the z = z*z + c loop, is where all the computation lives. Each iteration requires multiple floating-point multiplies and adds. Each pixel might need up to 20 iterations. A 240×200 display has 48,000 pixels (24,000 once symmetry is exploited). That's potentially 480,000 multiply-heavy iterations.

Pure BASIC: Correct but Glacial

My first version was pure Oric BASIC:

10 HIRES
20 MI=20
30 SX=3.5/239:SY=2.5/199
40 FOR PY=99 TO 0 STEP -1
50 CI=PY*SY-1.25
60 FOR PX=0 TO 239
70 CR=PX*SX-2.5
80 ZR=0:ZI=0:IT=0
90 R2=ZR*ZR:I2=ZI*ZI
100 IF R2+I2>4 OR IT>=MI THEN 140
110 ZI=2*ZR*ZI+CI
120 ZR=R2-I2+CR
130 IT=IT+1:GOTO 90
140 IF IT>=MI THEN CURSET PX,PY,1:CURSET PX,199-PY,1
150 NEXT PX
160 NEXT PY
170 GET Z$
180 TEXT

Eighteen lines. Clean, readable, mathematically transparent. It exploits the Mandelbrot set's mirror symmetry about the real axis, drawing the top and bottom halves simultaneously to halve the work.

It produces a correct image. It just takes approximately 5–6 hours to finish.

Lines 90–130 are the bottleneck. The BASIC interpreter executes this tight loop hundreds of thousands of times, and each pass involves parsing variable names, performing floating-point arithmetic through software routines, evaluating compound boolean expressions, and performing GOTO line lookups. The interpreter overhead dwarfs the actual mathematical work.

The Hybrid Approach: BASIC + 269 Bytes of Assembly

The hybrid version keeps BASIC for everything except the inner iteration. A 269-byte 6502 assembly routine handles the z = z*z + c loop, and BASIC calls it for each pixel:

10 REM ** MANDELBROT ASM **
15 HIMEM #97FF
20 HIRES
25 FOR I=0 TO 268:READ B:POKE #9800+I,B:NEXT
30 MI=20:SX=3.5/239:SY=2.5/199
35 POKE #9A04,MI
40 FOR PY=99 TO 0 STEP -1
50 V=INT((PY*SY-1.25)*256):IF V<0 THEN V=V+65536
55 DOKE #9A02,V
60 FOR PX=0 TO 239
70 V=INT((PX*SX-2.5)*256):IF V<0 THEN V=V+65536
75 DOKE #9A00,V
80 CALL #9800
90 IT=PEEK(#9A05)
100 IF IT>=MI THEN CURSET PX,PY,1:CURSET PX,199-PY,1
110 NEXT PX
120 NEXT PY
130 GET Z$:TEXT
500 DATA 162,0,181,80,72,232,224,16,208,248,169,0,133,80,...

The structure is recognisable as the same algorithm. The key differences:

  • Line 15: HIMEM #97FF tells BASIC not to use memory above $97FF, reserving space for our machine code at $9800.
  • Line 25: A POKE loop reads 269 bytes from DATA statements and writes them into memory at $9800–$990C. This is the assembly routine, stored as raw byte values.
  • Lines 50–75: Coordinates are converted from BASIC's floating-point to signed 8.8 fixed-point format and written to a parameter block at $9A00 using DOKE (16-bit poke).
  • Line 80: CALL #9800 transfers execution to the assembly routine. It runs the iteration loop at full machine speed and returns.
  • Line 90: PEEK(#9A05) reads the iteration count result back from the parameter block.

The DATA statements at line 500+ are the machine code — the same bytes a 6502 CPU would execute, just expressed as decimal numbers that BASIC can read.
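To make that concrete, here is a minimal Python sketch of how assembled bytes could be turned into DATA lines. The helper name `bytes_to_data_lines`, the line-number step and the values-per-line width are my own illustrative assumptions, not taken from the project's build script. The sample bytes are the first ten values of the DATA line shown above.

```python
def bytes_to_data_lines(code, first_line=500, step=10, per_line=10):
    """Format machine-code bytes as numbered BASIC DATA statements."""
    lines = []
    for i in range(0, len(code), per_line):
        chunk = code[i:i + per_line]
        number = first_line + (i // per_line) * step
        # Each byte becomes a plain decimal value that READ can consume.
        lines.append(f"{number} DATA " + ",".join(str(b) for b in chunk))
    return lines

# First ten bytes of the routine, as listed in the DATA line above.
sample = bytes([162, 0, 181, 80, 72, 232, 224, 16, 208, 248])

for line in bytes_to_data_lines(sample, per_line=5):
    print(line)
```

With five values per line, the ten sample bytes come out as two statements numbered 500 and 510.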

Render time: approximately 35–40 minutes, a speedup of roughly 10×.

How It Works: Fixed-Point Arithmetic on a CPU with No Multiply Instruction

The 6502 CPU has no multiply instruction and no floating-point support. It can add, subtract, and shift 8-bit values. That's essentially it. So how do you compute z*z + c with complex numbers?

Signed 8.8 Fixed-Point

Instead of floating-point, the assembly routine uses signed 8.8 fixed-point representation. Each number is stored as a 16-bit value where the high byte is the signed integer part and the low byte is the fractional part:

Value   Hex     High byte (integer)   Low byte (fraction)
1.0     $0100   $01 = 1               $00 = 0/256
0.5     $0080   $00 = 0               $80 = 128/256
-2.5    $FD80   $FD = -3              $80 = 128/256
4.0     $0400   $04 = 4               $00 = 0/256

The precision is 1/256 ≈ 0.0039. One pixel step along the real axis is 3.5/239 ≈ 0.0146, so each pixel spans about 3.75 representable values. That is adequate for a recognisable Mandelbrot image at this resolution, though the limited precision means fine boundary detail is lost compared to floating-point.
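The encoding is easy to check off the machine. The following Python sketch mirrors what BASIC lines 50 and 70 do (multiply by 256, take INT, add 65536 to negative values); the function names `to_fixed88` and `from_fixed88` are mine, and the sketch does no range checking.

```python
import math

def to_fixed88(x):
    """Encode a real number as a 16-bit word holding signed 8.8."""
    v = math.floor(x * 256)   # floor, like BASIC's INT()
    if v < 0:
        v += 65536            # two's complement, as in IF V<0 THEN V=V+65536
    return v

def from_fixed88(word):
    """Decode a 16-bit 8.8 word back to a float."""
    if word >= 0x8000:        # top bit set means negative
        word -= 65536
    return word / 256

for x in (1.0, 0.5, -2.5, 4.0):
    print(f"{x:5} -> ${to_fixed88(x):04X}")
```

Note that -2.5 encodes with a high byte of $FD (-3) and a low byte of $80 (+0.5): the fraction always counts upward from the integer part, even for negative values.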

16×16 Multiply via Shift-and-Add

To multiply two 8.8 values, we need a 16-bit × 16-bit → 32-bit unsigned multiply, then take the middle two bytes of the result as the 8.8 product. The 6502 has no multiply instruction, so we implement one using the shift-and-add algorithm — essentially long multiplication in binary:

Place the multiplier in the bottom half of a 32-bit register.
Clear the top half.
Repeat 16 times:
    If the lowest bit is 1, add the multiplicand to the top half.
    Shift the entire 32-bit register right by one.
The 32-bit register now contains the product.

In 6502 assembly, this is a tight loop using four zero-page bytes as the 32-bit register, with ROR (rotate right) instructions chaining the shift across all four bytes, and the carry flag from the addition naturally feeding into the rotation:

MUL_LOOP:
    LDA PROD_0       ; Test low bit of multiplier
    LSR A             ; Shift into carry
    BCC MUL_NOADD     ; Skip add if bit was 0
    CLC
    LDA PROD_2        ; Add multiplicand to upper half
    ADC M_AL
    STA PROD_2
    LDA PROD_3
    ADC M_AH
    STA PROD_3
MUL_NOADD:
    ROR PROD_3        ; Shift entire 32-bit register right
    ROR PROD_2        ; Carry from add feeds into MSB
    ROR PROD_1
    ROR PROD_0
    DEX               ; 16 iterations
    BNE MUL_LOOP

There's a beautiful detail here: the carry flag does double duty. When the addition overflows, the carry bit is set. The immediately following ROR PROD_3 rotates that carry into the top of the register, correctly preserving the overflow bit. When no addition happens (the BCC path), the carry is 0, and the ROR shifts in a zero. The hardware carry flag is doing what would require explicit logic in a higher-level language.
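For readers who don't speak 6502, here is the same shift-and-add idea as a Python sketch. It models the four PROD bytes as one 32-bit integer and the carry flag as an explicit variable; the function name is my own, and this is a model of the technique rather than a transcription of the full routine.

```python
def mul16_shift_add(multiplier, multiplicand):
    """Unsigned 16x16 -> 32-bit multiply, processing one bit per pass."""
    prod = multiplier & 0xFFFF        # low half = multiplier, high half = 0
    for _ in range(16):
        carry = 0                     # the BCC path leaves carry clear
        if prod & 1:                  # low bit set: add into the top half
            high = (prod >> 16) + (multiplicand & 0xFFFF)
            carry = high >> 16        # 17th bit of the sum = 6502 carry flag
            prod = ((high & 0xFFFF) << 16) | (prod & 0xFFFF)
        # ROR chain: shift all 32 bits right, carry enters at the top
        prod = (carry << 31) | (prod >> 1)
    return prod

print(hex(mul16_shift_add(0x0180, 0x0200)))   # 1.5 * 2.0 as raw 8.8 words
```

The explicit `carry` variable is exactly the "explicit logic" the hardware flag saves: without it, a sum that overflows 16 bits would silently lose its top bit before the shift.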

For signed multiplication, the routine takes the absolute value of both operands, performs the unsigned multiply, then negates the result if the signs differed (detected by XOR-ing the high bytes).
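A minimal Python sketch of that sign handling follows. It uses Python's built-in `*` as a stand-in for the unsigned shift-and-add loop, and the names `to_signed16` and `mul88_signed` are assumptions of mine. Whether the real routine negates the 32-bit product or only the extracted 16 bits is not stated above; this sketch negates after extracting the middle bytes.

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement value."""
    return word - 0x10000 if word & 0x8000 else word

def mul88_signed(a, b):
    """Multiply two signed 8.8 words (0..65535) and return an 8.8 word."""
    negative = ((a ^ b) & 0x8000) != 0      # sign bits differ -> negative result
    mag_a = abs(to_signed16(a))
    mag_b = abs(to_signed16(b))
    product32 = mag_a * mag_b               # stand-in for the unsigned multiply
    middle = (product32 >> 8) & 0xFFFF      # bytes 1 and 2 of the 32-bit product
    if negative:
        middle = (-middle) & 0xFFFF         # two's-complement negate
    return middle

print(f"${mul88_signed(0x0180, 0xFE00):04X}")   # 1.5 * -2.0
```

Because the magnitude is truncated before the sign is restored, this variant rounds toward zero for both positive and negative products.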

The Iteration Loop

The complete iteration loop performs three multiplies per iteration (ZR×ZR, ZI×ZI, ZR×ZI), plus additions, a subtraction, a comparison against the escape radius, and an iteration count check. The Y register is used as the iteration counter — it's never touched by the multiply subroutine, so it's naturally preserved across calls. The total routine is 269 bytes.
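The shape of that loop can be sketched in Python using integers scaled by 256. The order of operations here follows the BASIC listing (lines 90–130) rather than the assembly source, which isn't reproduced in this article, so treat it as a model of the arithmetic and not a byte-accurate port. The helper names are mine.

```python
def mul88(a, b):
    """Signed 8.8 multiply on plain ints (values scaled by 256)."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * ((abs(a) * abs(b)) >> 8)

def iterate(cr, ci, max_iter):
    """Return the escape iteration, or max_iter if the point never escapes."""
    zr = zi = 0
    four = 4 * 256                     # escape threshold |z|^2 = 4.0 in 8.8
    for count in range(max_iter):
        r2 = mul88(zr, zr)             # multiply 1: ZR*ZR
        i2 = mul88(zi, zi)             # multiply 2: ZI*ZI
        if r2 + i2 > four:
            return count
        zi = 2 * mul88(zr, zi) + ci    # multiply 3: ZR*ZI, then doubled
        zr = r2 - i2 + cr
    return max_iter

print(iterate(0, 0, 20))               # c = 0 stays bounded
print(iterate(2 * 256, 2 * 256, 20))   # c = 2+2i escapes almost immediately
```

Because the escape test runs before each update, |zr| and |zi| stay below 2 whenever the update executes, which keeps every intermediate value comfortably inside the signed 8.8 range.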

The Interface: How BASIC and Assembly Communicate

The communication between BASIC and assembly uses a simple shared-memory interface — a technique that was standard practice in the 1980s:

$9A00-$9A01: CR (real part of c, signed 8.8, little-endian)
$9A02-$9A03: CI (imaginary part of c, signed 8.8, little-endian)
$9A04:       MAXITER (byte)
$9A05:       RESULT (iteration count, byte)

BASIC writes the parameters with DOKE (16-bit write) and POKE (8-bit write), calls the routine with CALL #9800, and reads the result with PEEK. The assembly routine reads the parameters, runs the iteration, writes the result, and returns with RTS.

This is conceptually the same contract modern programs follow when calling library functions (pass parameters, execute, read results), just implemented with fixed memory addresses instead of registers and stack frames.
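The byte layout of the parameter block can be reproduced with Python's `struct` module, which is a handy way to sanity-check what DOKE and POKE leave in memory. This is a sketch for illustration; `pack_params` and `read_result` are hypothetical names, not part of the project's toolchain.

```python
import struct

PARAM_BASE = 0x9A00   # start of the parameter block described above

def pack_params(cr, ci, max_iter):
    """Build the six bytes found at $9A00-$9A05 before the CALL."""
    # '<' little-endian, 'h' signed 16-bit, 'B' unsigned byte.
    # RESULT is initialised to 0 here; the routine overwrites it.
    return struct.pack("<hhBB", cr, ci, max_iter, 0)

def read_result(block):
    """Equivalent of PEEK(#9A05): the byte at offset 5."""
    return block[5]

# cr = -2.5 and ci = -1.25, already scaled by 256
block = pack_params(-640, -320, 20)
for offset, value in enumerate(block):
    print(f"${PARAM_BASE + offset:04X}: ${value:02X}")
```

The low byte of each 16-bit value lands at the lower address, which is what "little-endian" means in the table above and what DOKE produces on the Oric.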

One practical detail: the assembly routine borrows 16 bytes of zero page ($50–$5F) for its workspace. Zero page on the 6502 is the first 256 bytes of memory, which can be accessed with shorter, faster instructions. Since BASIC's interpreter also uses zero page, the routine saves these bytes to the stack on entry and restores them on exit — a courtesy that prevents corrupting BASIC's internal state.
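The save loop is visible in the very first bytes of the DATA line shown earlier. The Python sketch below decodes those ten bytes with a deliberately tiny opcode table covering only the six instructions involved. It is not a general disassembler, and the function name is my own.

```python
# Only the six opcodes that occur in these ten bytes.
OPCODES = {
    0xA2: ("LDX", "imm"),
    0xB5: ("LDA", "zpx"),
    0x48: ("PHA", "imp"),
    0xE8: ("INX", "imp"),
    0xE0: ("CPX", "imm"),
    0xD0: ("BNE", "rel"),
}

def disassemble(code, origin):
    """Decode bytes into mnemonic strings, resolving branch targets."""
    out, pc = [], 0
    while pc < len(code):
        name, mode = OPCODES[code[pc]]
        if mode == "imp":                 # one-byte instruction
            out.append(name)
            pc += 1
            continue
        operand = code[pc + 1]
        pc += 2
        if mode == "imm":
            out.append(f"{name} #{operand}")
        elif mode == "zpx":
            out.append(f"{name} ${operand:02X},X")
        else:                             # relative: signed offset from next instruction
            offset = operand - 256 if operand >= 128 else operand
            out.append(f"{name} ${origin + pc + offset:04X}")
    return out

first_bytes = [162, 0, 181, 80, 72, 232, 224, 16, 208, 248]
for line in disassemble(first_bytes, 0x9800):
    print(line)
```

Read top to bottom, the decoded loop loads each of the 16 bytes from $50 upward, pushes it on the stack, and branches back until X reaches 16.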

Where the Time Goes

With the assembly inner loop, the per-iteration time drops from thousands of interpreted BASIC operations to roughly 500–1000 machine cycles. But the overall speedup is "only" 10× rather than the 50–100× you might expect from native code, because BASIC still handles the per-pixel overhead:

  • Floating-point coordinate conversion: INT((PX*SX-2.5)*256) involves multiple floating-point operations through BASIC's software math routines.
  • DOKE and POKE: Writing parameters to the shared memory block.
  • CALL overhead: The BASIC interpreter parsing and dispatching the CALL statement.
  • PEEK and comparison: Reading the result and evaluating IF IT>=MI.
  • CURSET: The ROM routine for plotting individual pixels.

This per-pixel BASIC overhead is roughly 5,000–10,000 cycles — comparable to the assembly iteration cost for many pixels. The inner loop is no longer the bottleneck; the BASIC per-pixel wrapper is.

This is a characteristic of the hybrid approach: you get a dramatic speedup on the targeted hot path, but you're ultimately limited by the speed of the remaining BASIC code. To go further, you'd move the entire pixel loop into assembly, writing directly to screen memory — which was exactly the next step serious 1980s programmers would take.
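This ceiling is Amdahl's law in action, and a few lines of Python show why the result lands near 10×. The 90% figure is the rule of thumb quoted earlier in this article, not a measurement of this program, and the 50–100× range is the native-code speedup mentioned above.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: only the hot fraction of the run time gets faster."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)

for s in (50, 100):
    total = overall_speedup(0.9, s)
    print(f"inner loop {s}x faster -> whole program {total:.1f}x faster")
```

Under those assumptions the whole program speeds up by about 8.5× and 9.2× respectively. Even an infinitely fast inner loop could not beat 10×, because the remaining 10% of the original run time is untouched BASIC.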

The Toolchain: Then and Now

In the 1980s, a programmer would typically write assembly by hand using pencil and paper or a simple on-screen monitor program, manually looking up opcodes from a reference card, calculating branch offsets by counting bytes, and converting the final machine code to decimal values for DATA statements.

For this project, I built a small toolchain in Python:

  • asm6502.py: A minimal two-pass 6502 assembler supporting labels, constants, and the instruction subset needed for the Mandelbrot routine. It handles the classic two-pass assembly problem: pass 1 collects label addresses, pass 2 emits bytes with resolved references.
  • build_mandel_asm.py: A build script that assembles the .s source file, generates a BASIC program with the machine code embedded as DATA statements, and converts it to a .tap tape image for loading into the emulator.
  • bas2tap.py: A BASIC tokenizer that converts human-readable .bas text files into the Oric's tokenized tape format.

This is a modernised version of the 1980s workflow, replacing pencil-and-paper assembly with a proper assembler, but producing the same artifact: a BASIC program with DATA statements that any Oric user could type in and run.

Lessons and Pitfalls

The HIRES trap: The Oric's HIRES command clears the memory region at $9800–$9FFF (the alternate character set area). If you load your machine code there before calling HIRES, it gets wiped. The solution is simple — call HIRES first — but this kind of platform-specific memory layout gotcha was a constant source of bugs in 1980s development. Every machine had its own minefield of reserved addresses, hardware registers, and ROM routines that silently clobbered your carefully placed code.

Assembler sizing bugs: The assembler initially had a classic two-pass bug: on pass 1, it evaluated instruction sizes without the full label table, causing absolute addresses ($9A00+) to be misidentified as zero-page addresses (2 bytes instead of 3). This shifted all subsequent label positions, producing code with corrupt branch targets and subroutine addresses. The assembled bytes looked plausible but the program crashed mysteriously. In the 1980s, without a proper assembler, this class of bug was caught by hand-counting byte offsets — tedious and error-prone.
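The sizing bug is easy to reproduce in miniature. The Python sketch below is a hypothetical three-instruction program, not code from asm6502.py: it runs pass 1 twice, once guessing that an unknown symbol is a zero-page address and once assuming it is absolute, and shows how the label after it moves.

```python
ORIGIN = 0x9800

# Hypothetical program: (label, mnemonic, operand symbol)
PROGRAM = [
    (None,   "LDA", "PARAM_CR"),   # absolute operand, defined after the code
    ("LOOP", "STA", "ZP_TMP"),     # zero-page operand
    (None,   "RTS", None),
]
EARLY = {"ZP_TMP": 0x50}           # symbols known when pass 1 starts
LATE = {"PARAM_CR": 0x9A00}        # symbols only known after pass 1

def size(operand, symbols, unknown_is_absolute):
    """Instruction length in bytes for this simplified instruction set."""
    if operand is None:
        return 1                                # implied addressing, e.g. RTS
    value = symbols.get(operand)
    if value is None:                           # forward reference on pass 1
        return 3 if unknown_is_absolute else 2  # the buggy variant guesses zero page
    return 2 if value < 0x100 else 3

def pass1_labels(unknown_is_absolute):
    """Walk the program once, recording the address of each label."""
    labels, pc = {}, ORIGIN
    for label, _, operand in PROGRAM:
        if label:
            labels[label] = pc
        pc += size(operand, EARLY, unknown_is_absolute)
    return labels

def pass2_size():
    """Total bytes emitted once every symbol is known."""
    symbols = {**EARLY, **LATE}
    return sum(size(op, symbols, True) for _, _, op in PROGRAM)

print(f"buggy LOOP = ${pass1_labels(False)['LOOP']:04X}")
print(f"fixed LOOP = ${pass1_labels(True)['LOOP']:04X}")
print(f"bytes emitted on pass 2 = {pass2_size()}")
```

In the buggy run LOOP is recorded one byte too early, so every branch or JSR that targets it lands in the middle of the previous instruction. Assuming the larger encoding for anything unresolved on pass 1 keeps both passes in agreement.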

Fixed-point precision trade-offs: The 8.8 format gives only 256 levels of fractional precision. For the Mandelbrot set, this means the fine filaments and smaller bulbs along the boundary are lost — they're below the resolution of the number format. A 16.16 format would give much better images but would require 32-bit arithmetic throughout, roughly quadrupling the multiply routine's size and execution time. The 8.8 format was a deliberate trade-off: good enough for a recognisable image at 240×200, fast enough to complete in under an hour.

The Bigger Picture

The hybrid BASIC+assembly technique wasn't just a performance hack — it was a bridge. It let programmers work primarily in a language they understood (BASIC), dipping into assembly only for the critical section, and learning machine-level concepts incrementally. Many professional developers who built the software industry through the 1990s and 2000s learned assembly language exactly this way: by writing small routines to speed up their BASIC programs, gradually taking on more until they were comfortable writing entire programs in assembly.

The same fundamental insight — profile your program, identify the hot loop, optimise that specific part — remains the foundation of performance engineering today. Modern developers use SIMD intrinsics in C, inline assembly in systems code, GPU compute shaders for parallel workloads, or native extensions in scripting languages. The tools change, but the pattern is the same one that bedroom programmers discovered on their Orics and Spectrums forty years ago.

The Mandelbrot set rendered on the Oric — recognisable, correct, 240×200 pixels, from 269 bytes of hand-crafted machine code collaborating with 18 lines of BASIC — is a small monument to that pragmatic, creative approach to programming. The machine was slow. The programmer was clever. And between them, they made it work.


The complete source code, assembler, and build toolchain are available on GitHub. The program runs on the Oric Atmos (or the Oricutron emulator) and produces a complete Mandelbrot set rendering in approximately 35–40 minutes — down from 5–6 hours in pure BASIC. This project was developed with assistance from Claude Code (Anthropic), which contributed to the assembler toolchain, fixed-point arithmetic implementation, and debugging throughout the development process.