Case Study | Mobile + Backend + Custom LoRA | Hummus Development LLC

Papi AI.

A cross-platform AI dating wingman that ships on a real iPhone, runs on Vercel-hosted FastAPI, and is being progressively de-risked from third-party LLMs by training a custom Qwen2.5-7B LoRA on a 16 GB GPU at home.

Stack · React Native / FastAPI / LoRA / GGUF
Status · v2.9 in peers' hands via TestFlight
Deploy · Vercel + EAS Update + Local Ollama
Role · Solo founder / engineer
v2.9 · Mobile Version Live
7B · Param LoRA Base
31 tok/s · Local Inference
3 · Distinct Layers
01

The Goal

Papi started as a single-file dating-app helper and has grown into a three-layer product: a mobile app users actually open on their phone, a backend that runs on Vercel and Groq, and an in-progress custom fine-tune, trained and served on a personal RTX 4060 Ti, that will eventually replace the third-party LLM.

The interesting engineering isn't "Papi answers Tinder messages." It's three layers that have to ship together — every release means a coordinated mobile + backend + (eventually) model update — and the long-term goal is to own the entire stack so the inference cost tends toward zero and the data never leaves my hardware.

02

Three Layers, Tightly Versioned

Layer 01 · Mobile

papi-ai-app

React Native on Expo SDK 54, distributed to peers via the EAS Update preview channel — no App Store review for iteration. Tap-to-fullscreen chat-image modal, voice + screenshot reply, per-match memory.

Expo 54 · EAS Update · Reanimated 4 · RN Worklets
Layer 02 · Backend

papi-ai-web

FastAPI on Vercel, backed by Groq Llama 3.3 70B (chat) and whisper-large-v3 (multilingual voice). Strict version-tagged rollouts; the health endpoint hardcodes the version string so a stale deployment lights up red immediately.

Vercel · FastAPI · Groq · Whisper
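
A minimal sketch of that version-pinned health check, assuming FastAPI and a /health route; the route name and constant are illustrative, not the actual papi-ai-web code. The point is that the version string is a literal baked into each deployment rather than something a stale build could still report correctly.

# health.py - minimal sketch of the version-pinned health check; the route name
# and constant are illustrative, not the actual papi-ai-web endpoint.
from fastapi import FastAPI

app = FastAPI()

# Hardcoded on purpose and bumped in the same commit as each release tag,
# so an old deployment can never report the new version by accident.
BUILD_VERSION = "2.9.0"

@app.get("/health")
def health() -> dict:
    # The mobile app compares this value to the version it expects;
    # a mismatch is what renders the "stale deployment" red state.
    return {"status": "ok", "version": BUILD_VERSION}
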
Layer 03 · Local Model

papi-finetune

Qwen2.5-7B LoRA fine-tune lab. WSL Ubuntu + uv-managed Python 3.11 + torch 2.11+cu130 + Unsloth + xformers. Seed-extract → synthetic-corpus → train → GGUF export → drop into local Ollama gateway.

Unsloth · PEFT · TRL · GGUF
03

Architecture

Today the app calls Vercel; tomorrow that same call routes through a personal Tailscale-fronted Ollama gateway running the fine-tuned model. The mobile and backend layers don't need to know which model is on the other end.

                       [ User on iPhone ]
                               │
                      React Native (Expo)
                               │
                 HTTPS / WebSocket / multipart
                               │
                    Vercel Edge / Functions
                     FastAPI (papi-ai-web)
                               │
        ┌─────────────────┬────┴──────────────┬───────────────────┐
        │                 │                   │                   │
   Groq Cloud       Local Ollama         Whisper-V3            SQLite
  Llama 3.3 70B   Qwen2.5-7B + LoRA    (multilingual)          memory
     (today)        (in progress)     voice transcribe
                          │
                     routed via
                  ┌───────┴───────┐
                  │ Tailscale     │
                  │ 100.89.111.87 │
                  │ :8089 Bearer  │
                  └───────────────┘
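
How that swap can stay invisible to the layers above, sketched in Python: both Groq and an Ollama gateway speak an OpenAI-style chat-completions API, so switching models is a base URL plus model name change. The env var names, gateway token handling, and local model tag below are assumptions, not the real papi-ai-web configuration.

# model_router.py - illustrative sketch; env var names, the gateway token, and the
# local model tag are assumptions, not the actual papi-ai-web configuration.
import os
import httpx

# Today: Groq cloud. Tomorrow: the Tailscale-fronted Ollama gateway, which serves
# the fine-tuned model behind an OpenAI-compatible /chat/completions endpoint.
USE_LOCAL = os.getenv("PAPI_USE_LOCAL_MODEL", "0") == "1"

BASE_URL = (
    "http://100.89.111.87:8089/v1" if USE_LOCAL
    else "https://api.groq.com/openai/v1"
)
MODEL = "papi-qwen2.5-7b-lora" if USE_LOCAL else "llama-3.3-70b-versatile"
API_KEY = os.getenv("PAPI_GATEWAY_TOKEN" if USE_LOCAL else "GROQ_API_KEY", "")

def chat(messages: list[dict]) -> str:
    """Send a chat turn; the caller never learns which model answered."""
    resp = httpx.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": MODEL, "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
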
04

The Local Fine-Tune Lab

GPU spec

NVIDIA RTX 4060 Ti · 16 GB VRAM

Workstation card, single GPU, running in Karim's tower at home. Enough VRAM for a Qwen2.5-7B QLoRA run without offload, with headroom to swap between inference and training.
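
Back-of-envelope on why that fits, with assumed numbers rather than a measured profile: the 4-bit base weights dominate, and QLoRA only carries gradients and optimizer state for the small adapter, so most of the 16 GB is left for activations and the KV cache.

# Rough QLoRA memory estimate for a 7B base; adapter size is an assumption, not a profile.
base_params    = 7.6e9    # Qwen2.5-7B is roughly 7.6B parameters
adapter_params = 40e6     # assumed LoRA size at a modest rank across the linear layers

weights_4bit = base_params * 0.5 / 2**30     # ~3.5 GiB of 4-bit weights
adapter_bf16 = adapter_params * 2 / 2**30    # ~0.07 GiB of trainable weights
optimizer    = adapter_params * 8 / 2**30    # ~0.30 GiB of Adam state (adapter only)

print(f"{weights_4bit + adapter_bf16 + optimizer:.1f} GiB before activations")  # ~3.9 GiB
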

The fine-tune lab is a five-script pipeline:

1. Seed extraction
2. Synthetic corpus generation (from a local model)
3. LoRA training
4. GGUF export
5. Load into the local Ollama gateway
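
A trimmed sketch of what the train step looks like on Unsloth's QLoRA path; the hyperparameters, dataset path, and output names are placeholders rather than the repo's actual config.

# train_lora.py - illustrative Unsloth QLoRA setup; hyperparameters, paths, and
# the dataset name are placeholders, not the repo's actual config.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 4-bit base model: fits comfortably on a 16 GB RTX 4060 Ti
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these weights are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Synthetic corpus produced by the earlier pipeline stages
dataset = load_dataset("json", data_files="synthetic_corpus.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        output_dir="outputs",
        logging_steps=10,
    ),
)
trainer.train()

# Export to GGUF so the model can be served from the local Ollama gateway
model.save_pretrained_gguf("papi-gguf", tokenizer, quantization_method="q4_k_m")
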

Discipline

"Own AI" includes the data layer. Synthesizing training data via someone else's API still leaves a critical dependency upstream. This pipeline draws its synthetic corpus from a local model so the entire training-and-serving loop is self-hosted from day one.

05

The Hard Bugs

iPhone HEIC + Vercel's 4.5 MB body cap

iPhones photograph in HEIC. Vercel Functions cap multipart bodies at 4.5 MB. A modern iPhone HEIC plus form metadata = 5–8 MB easily. The mobile client now handles both the format flip (HEIC → JPEG re-encode) and a 1600 px resize before upload — and the error strings on the server differentiate which layer failed so the next bug doesn't take an hour to bisect.
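
On the server side, a sketch of what those differentiated error strings can look like; the route, prefixes, and limits shown here are illustrative, not the production endpoint.

# upload_guard.py - illustrative upload checks with layer-specific error strings;
# the route, prefixes, and limits shown here are not the production endpoint.
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

# Vercel Functions reject multipart bodies above ~4.5 MB, so anything near the
# limit that still reaches us is already suspicious.
VERCEL_BODY_LIMIT = int(4.5 * 1024 * 1024)

@app.post("/screenshot-reply")
async def screenshot_reply(image: UploadFile = File(...)):
    data = await image.read()

    # Distinct error strings per failure layer, so a bug report says which fix applies.
    if image.content_type in ("image/heic", "image/heif"):
        raise HTTPException(415, "CLIENT_ENCODE: got HEIC; re-encode to JPEG before upload")
    if len(data) > VERCEL_BODY_LIMIT:
        raise HTTPException(413, "CLIENT_RESIZE: body exceeds the 4.5 MB platform cap; resize to 1600 px")

    return {"ok": True, "bytes": len(data)}
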

Whisper Turbo's English-only failure mode

Initial voice transcription used whisper-large-v3-turbo. Speed was great; multilingual quality was abysmal — non-English voice notes came back as confident-sounding nonsense. Switched to whisper-large-v3 with explicit language hint + a domain-specific prompt bias. Real bug, real fix, permanent note.
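
Sketched with the Groq SDK's transcription call; the prompt text and language value below are illustrative, not the production strings.

# transcribe.py - hedged sketch of the whisper-large-v3 call; the prompt text and
# language value are illustrative.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def transcribe_voice_note(path: str, language: str) -> str:
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(path, f.read()),
            model="whisper-large-v3",   # not the -turbo variant: multilingual quality matters
            language=language,          # explicit hint, e.g. "es", instead of auto-detect
            prompt="Casual dating-app voice note; informal slang is expected.",
        )
    return result.text
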

Reanimated 4 + missing Worklets dep

Upgrading to Reanimated 4 silently requires react-native-worklets as a peer dependency. Missing it surfaces as a misleading "unable to resolve the native app" error in Expo Go that never mentions worklets. The fix is one line in package.json; the diagnosis is where the time went.

06

What's Next

Next Case Study

RoofRoof.solutions

Live roofing-lead marketplace — Stripe in production, Twilio toll-free, Supabase + RLS, Meta Pixel + CAPI dedup, restoreSiteDeploy incident write-up.