Architecture: Transformer, MoE, Linear Attention, FFN
Components: Tokenizer, Norm, RoPE, KV Cache
Model Family: BERT, GPT, DeepSeek
Infra: Quantization, PyTorch, FlashAttention
Training: Epoch vs Batch, Emergence, Optimizer
Misc: Agent