Projects
HandClaw
One Slack = Multiple AI Coding Agents
Connect Claude Code, Codex, and OpenCode to Slack. Code from anywhere, on any device. Work from your phone. Monitor agent progress. Switch agents by renaming channels.
Key Features
- One workspace = Multiple agents
- Code from your phone, or even your watch
- Walk away, let agents work
- Supports persistent plan/build mode switching
- Notifies users when coding tasks are completed
HandClaw vs OpenClaw
| Feature | HandClaw | OpenClaw |
|---|---|---|
| Switch plan/build mode | ✓ (`!code switch plan/build`) | ✗ |
| Early stop code CLI | ✓ (`!stop`) | ✗ |
| Project management via channels | ✓ (just rename the channel) | ✗ (requires installing acpx plus complex config) |
| ACP support | ✓ Easy (rename the channel) | ✗ Complex |
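The table above is also the full interface. An illustrative channel session (channel names are hypothetical; `!code switch plan/build` and `!stop` are the commands listed above):

```
# rename the channel (e.g. to "claude-myapp") to route it to a different agent
!code switch plan    # plan mode: discuss the change before touching code
!code switch build   # build mode: let the agent start writing code
!stop                # stop the running code CLI early
```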
TeraXLang
A Triton extension for LLMs, as fast as FlashAttention
A CUDA kernel-specific DSL built on top of Triton that achieves SOTA GPU kernel performance on both Hopper (H100) and Blackwell (B200) architectures.
Why TeraXLang?
- What optimizations does Triton already perform?
- Why do so many DSLs claim they can easily outperform Triton?
- What if we added a few extra APIs that trade some of Triton's generality for superior performance?
Key Features
- Minimal Extensions: Adds only essential methods to Triton (smem, tmem, mbar, TMA operations); a baseline Triton sketch follows this list
- Warp-level Primitives: Efficient warpgroup synchronization and reduction
- TMA Support: Hardware-accelerated tensor memory operations
- Multi-Architecture: Optimized for both Hopper and Blackwell GPUs
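For context, here is what a plain Triton kernel, the layer TXL extends, looks like: a minimal tiled-matmul sketch using only upstream Triton APIs. None of the TXL-specific smem/tmem/mbar/TMA handles appear here, since their exact API is not shown in this overview, and the sketch assumes M, N, K are divisible by the block sizes so boundary masks are omitted.

```python
import triton
import triton.language as tl

# Plain Triton tiled matmul: the baseline programming model that TXL builds on.
# Assumes M % BLOCK_M == 0, N % BLOCK_N == 0, K % BLOCK_K == 0 (no masks).
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))  # tensor-core MMA on one tile
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))
```

Judging from the kernel names in the tables below (e.g. `hopper_txl_ws_persistent`, `hopper_txl_ws_fa3`), the TXL variants layer warp-specialized scheduling, TMA loads, and mbarrier synchronization on top of this model.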
Performance
Matmul (H100 80GB HBM3)
M=8192, N=8192, K=1024
| Kernel | TFLOPS |
|---|---|
| cuBLAS | 710.4 |
| TXL (hopper_txl_ws_persistent) | 697.7 (~2% slower) |
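As a sanity check, a minimal sketch of how a TFLOPS figure like the cuBLAS row is obtained, assuming `triton.testing.do_bench` for timing and a float16 `torch.matmul` as the cuBLAS path; the Python entry point for the TXL kernel is not shown here, so only the baseline is measured.

```python
import torch
import triton

# Problem size from the table above.
M, N, K = 8192, 8192, 1024
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)

ms = triton.testing.do_bench(lambda: torch.matmul(a, b))  # cuBLAS GEMM, runtime in ms
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12               # a GEMM costs 2*M*N*K FLOPs
print(f"cuBLAS: {tflops:.1f} TFLOPS")
```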
Flash Attention (H100 80GB HBM3)
batch=16, heads=32, seq_len=16384, head_dim=128
| Kernel | TFLOPS |
|---|---|
| FlashAttention3 | 640 |
| TXL (hopper_txl_ws_fa3) | 676.26 (~6% faster) |
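For reference, the arithmetic behind the attention TFLOPS column, assuming the usual non-causal accounting of two GEMMs (QK^T and PV), i.e. 4 * batch * heads * seq_len^2 * head_dim FLOPs:

```python
# FLOP accounting for the attention benchmark above (non-causal convention assumed).
B, H, S, D = 16, 32, 16384, 128
flops = 4 * B * H * S * S * D            # ≈ 7.04e13 FLOPs total
ms_at_676 = flops / 676.26e12 * 1e3      # implied runtime at 676.26 TFLOPS ≈ 104 ms
print(f"{flops / 1e12:.1f} TFLOP, ~{ms_at_676:.0f} ms at 676.26 TFLOPS")
```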
MLA Decoding (H100 80GB HBM3)
| Kernel | Time (ms) | TFLOPS |
|---|---|---|
| HuggingFace MLA | 2.03 | 592 |
| TXL MLA | 2.22 | 754 |
NSA Prefill (H100 80GB HBM3)
| Kernel | Time (us) | TFLOPS |
|---|---|---|
| FlashNSA | 235 | 248.4 |
| TXL NSA | 219 | 266.4 |