Large language and vision-language models have greatly advanced automated chest X-ray report generation (RRG), yet current evaluation practices remain largely text-based and detached from image evidence. Traditional machine-translation metrics cannot determine whether generated findings are clinically correct or visually grounded, limiting their suitability for medical applications. This study introduces a comprehensive, image-aware evaluation framework that integrates the VICCA (Visual Interpretation and Comprehension of Chest X-ray Anomalies) protocol with the domain-specific semantic metric MCSE (Medical Corpus Similarity Evaluation). VICCA combines visual grounding and text-guided image generation to assess visual-textual consistency, while MCSE measures semantic and factual fidelity through clinically meaningful entities, negations, and modifiers. Together, they provide a unified, semi-reference-free assessment of pathology-level accuracy, semantic coherence, and visual consistency. Five representative RRG models (R2Gen, M2Trans, CXR-RePaiR, RGRG, and MedGemma) are benchmarked on 2461 MIMIC-CXR studies using a standardized pipeline. The results reveal systematic trade-offs: models with high pathology agreement often generate semantically weak or visually inconsistent reports, whereas textually fluent models may lack proper image grounding. By integrating clinical semantics and visual reliability within a single multimodal framework, VICCA establishes a robust paradigm for evaluating the trustworthiness and interpretability of AI-generated radiology reports.