Transformer-based models have advanced 3D scene reconstruction, but the quadratic cost of their global attention limits scalability to large scenes. We introduce the Local View Transformer (LVT), which replaces global attention with locality-aware attention over neighboring views, conditioned on relative camera geometry. LVT decodes directly into 3D Gaussian splats with view-dependent color and opacity for high-fidelity rendering. This design enables scalable, single-pass reconstruction of large, high-resolution scenes.
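
To make the locality idea concrete, the following is a minimal sketch of attention restricted to neighboring views. It is not the LVT architecture: it assumes one feature vector per view, selects neighbors by Euclidean distance between camera centers, and omits the conditioning on relative camera geometry altogether. The names `nearest_views`, `local_view_attention`, and `num_neighbors` are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_views(cam_centers, num_neighbors):
    """Return a (V, K) array holding, for each view, the indices of its K nearest views.

    Assumption: proximity is measured by camera-center distance. A view is at
    distance 0 from itself, so it is selected first unless cameras coincide.
    """
    diff = cam_centers[:, None, :] - cam_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # (V, V) pairwise distances
    return np.argsort(dist, axis=1)[:, :num_neighbors]


def local_view_attention(q, k, v, nbr_idx):
    """Scaled dot-product attention in which each view attends only to its neighbors.

    q, k, v: (V, D) arrays, one token per view (a simplification).
    nbr_idx: (V, K) neighbor indices, e.g. from nearest_views.
    The attention cost is O(V * K * D) rather than O(V * V * D).
    """
    k_nbr = k[nbr_idx]                                # (V, K, D) gathered keys
    v_nbr = v[nbr_idx]                                # (V, K, D) gathered values
    scores = np.einsum("vd,vkd->vk", q, k_nbr) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("vk,vkd->vd", weights, v_nbr)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Five hypothetical cameras placed along the x-axis.
    centers = np.array([[0.0, 0, 0], [1.0, 0, 0], [3.0, 0, 0], [6.0, 0, 0], [10.0, 0, 0]])
    feats = rng.standard_normal((5, 8))
    idx = nearest_views(centers, num_neighbors=2)
    out = local_view_attention(feats, feats, feats, idx)
    print(idx.shape, out.shape)
```

With a fixed neighborhood size, the attention cost in this sketch grows linearly with the number of views, which is the scalability argument the abstract makes. The pairwise distance computation used here to pick neighbors is itself quadratic in the number of views and would need a spatial index in a large-scene setting.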