The Frankenstein composer needs distributed compute to make inference on merged LLMs practical. If you have an idle Mac, Linux box, or NVIDIA GPU sitting around, you can plug it into a running pool with one command. Your machine becomes one segment of a pipeline-parallel inference chain: it holds a slice of the model's layers in RAM only, computes activations for incoming requests, and forwards results to the next node.
When you join the pool, your machine runs rpc-server from llama.cpp on port 50052. When an orchestrator (whoever runs the master llama-server) decides your machine should handle layers 16–24 of an 8B model, those layer weights are streamed to you over TCP at load time and held in your RAM. Per-token traffic is small (just activations, KB-scale).
Things this means concretely:
curl -fsSL https://charenix.com/Frankenstein/join_pool.sh | bash
The script:
1. Detects your OS and accelerator (CUDA / Metal / CPU only)
2. Clones llama.cpp and builds rpc-server with the right backend (-DGGML_RPC=ON -DGGML_CUDA=ON or -DGGML_METAL=ON)
3. Writes a supervisor loop (/tmp/supervised_rpc.sh) that auto-restarts rpc-server if it crashes
4. On macOS, installs a launchd agent so the supervisor survives reboot
5. On Linux, runs the supervisor in a detached setsid session (add to systemd or crontab @reboot if you want persistence)
6. Prints your Tailscale IP for you to share
The build step takes 5–10 minutes on a Mac mini and 3–5 minutes on a workstation. After that, your machine is a worker.
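For a sense of what the supervisor in step 3 does, here is a minimal sketch of a restart loop. It is not the script that join_pool.sh writes: the binary path, the bind address, the restart delay, and the `supervise` function name are all assumptions made for illustration. The `-H` and `-p` flags are rpc-server's host and port options.

```shell
#!/usr/bin/env bash
# Hypothetical restart loop; the real /tmp/supervised_rpc.sh may differ.

RPC_BIN="${RPC_BIN:-$HOME/llama.cpp/build/bin/rpc-server}"  # assumed build path
BIND_ADDR="${BIND_ADDR:-0.0.0.0}"      # assumed; narrow this to your Tailscale IP if you prefer
LOG="${LOG:-/tmp/rpc_supervised.log}"
RESTART_DELAY="${RESTART_DELAY:-2}"    # seconds to wait before restarting (assumed)

# supervise MAX CMD...: run CMD and start it again every time it exits.
# MAX=0 restarts forever; a positive MAX stops after that many runs.
supervise() {
  local max="$1"; shift
  local runs=0 status
  while [ "$max" -eq 0 ] || [ "$runs" -lt "$max" ]; do
    status=0
    "$@" >>"$LOG" 2>&1 || status=$?
    runs=$((runs + 1))
    echo "$(date '+%F %T') exited with status $status (run $runs)" >>"$LOG"
    sleep "$RESTART_DELAY"
  done
}

# Usage (runs until killed):
#   supervise 0 "$RPC_BIN" -H "$BIND_ADDR" -p 50052
```

The bounded `MAX` argument exists only so the loop can be tried out safely with a throwaway command before pointing it at rpc-server.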
After install finishes, the script prints something like:
================================================================
You are now a Frankenstein compute pool worker.
accelerator : Metal
rpc address : 100.121.29.3:50052
logs : /tmp/rpc_supervised.log
================================================================
Send that 100.x.x.x:50052 line to the pool orchestrator. They add it to the master's --rpc list and your machine starts receiving work on the next model load.
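On the orchestrator side, adding workers to the --rpc list might look like the sketch below. It assumes worker addresses are kept one per line in a file; `workers.txt`, the `rpc_list` helper, and the model filename are hypothetical names, not part of the project. The `--rpc` option of llama-server takes a comma-separated list of host:port pairs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a file of worker addresses (one host:port per
# line) into the comma-separated value passed to llama-server's --rpc option.
rpc_list() {
  # Drop blank lines and lines starting with '#', then join with commas.
  grep -Ev '^[[:space:]]*(#|$)' "$1" | paste -sd, -
}

# Usage on the orchestrator (model path and worker file are placeholders):
#   llama-server -m ./merged-8b.gguf -ngl 99 --rpc "$(rpc_list workers.txt)"
```

Keeping the list in a file means a worker can be added or removed by editing one line before the next model load.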
The Frankenstein composer (charenix.com/Frankenstein) lets anyone compose custom LLMs by merging two existing models. But composed models still cost real GPU/CPU time to actually serve. A 70B-class merge is interesting in theory but useless if it takes 30 seconds per token on a single CPU.
Pipeline-parallel inference splits the model's layers across N machines, and each machine runs only its slice. The latency cost is one TCP round-trip per boundary between slices; in exchange, throughput multiplies as nodes are added. On a 4-node Tailscale pool, an 8B merge runs ~10x faster than the same model on the strongest individual node in the pool.
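To make the layer split concrete, here is a small sketch that divides a model's blocks into contiguous slices, one per worker. The even split and the `split_layers` name are assumptions for illustration; the actual placement is decided by the orchestrator and may weight workers by available memory instead.

```shell
#!/usr/bin/env bash
# split_layers TOTAL WORKERS: print one "workerN first-last" line per worker.
# Each worker gets a contiguous slice; earlier workers absorb any remainder.
# Illustrative only: WORKERS must be at least 1.
split_layers() {
  local total="$1" workers="$2"
  local base=$((total / workers)) extra=$((total % workers))
  local start=0 i size
  for ((i = 0; i < workers; i++)); do
    size=$((base + (i < extra ? 1 : 0)))
    echo "worker$i $start-$((start + size - 1))"
    start=$((start + size))
  done
}

# Example: a 32-block model on 4 workers
#   split_layers 32 4
#   worker0 0-7
#   worker1 8-15
#   worker2 16-23
#   worker3 24-31
```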
The interesting structural observation: building this on top of llama.cpp's rpc-server means contributors don't need ML expertise. They run a binary. The orchestrator handles layer placement, batching, and model swaps. This is the same separation as BOINC / SETI@home / Folding@home twenty years ago — workers contribute cycles, the project lead defines the problem.
When the orchestrator's llama-server receives a prompt:
1. Tokenizer runs on the orchestrator (fast, local)
2. Embedding layer runs on whichever worker owns layer 0
3. Each transformer block runs on whichever worker owns that block
4. Between blocks, activations are sent over TCP to the next worker
5. Final lm_head runs on whichever worker owns the last block
6. Logits come back to orchestrator, next token sampled, repeat
The KV-cache for each worker's layers stays on that worker. This means warm-cache requests are nearly as fast as if the model were local — only the per-token activation round-trips are added.
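A rough way to reason about the added round-trips is the back-of-the-envelope sketch below. It assumes every hand-off costs exactly one round-trip and ignores transfer size and compute time, so treat the result as an upper bound set by the network alone, not a measurement. The function names and the example numbers are hypothetical.

```shell
#!/usr/bin/env bash
# added_ms_per_token HOPS RTT_MS: network time added to each token, assuming
# one round-trip per hop (integer milliseconds).
added_ms_per_token() {
  echo $(( $1 * $2 ))
}

# max_tokens_per_sec ADDED_MS: tokens/s ceiling if network time were the only
# cost (integer division).
max_tokens_per_sec() {
  echo $(( 1000 / $1 ))
}

# Example with made-up numbers: 4 hops at 5 ms each
#   added_ms_per_token 4 5    # 20
#   max_tokens_per_sec 20     # 50
```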
To leave the pool:
# macOS
launchctl unload ~/Library/LaunchAgents/com.frankenstein.rpc-worker.plist
rm ~/Library/LaunchAgents/com.frankenstein.rpc-worker.plist
# Linux (also run these on macOS after the steps above)
pkill -f supervised_rpc.sh
pkill -f rpc-server
Your machine's RAM is freed, and the orchestrator's next health check will see that you're gone.
Issues and PRs are welcome at github.com/norika1207-lab/frankenstein-skeleton.