As artificial intelligence (AI) becomes increasingly central to healthcare, the need for explainable and trustworthy models is paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to improve both the semantic alignment between report text and image content and the localization accuracy of pathologies in AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency between text and image features. Our approach significantly outperforms existing methods in pathology localization, achieving an 8% improvement in Intersection over Union (IoU) score. It also surpasses state-of-the-art methods in CXR text-to-image generation, with a 1% gain in similarity metrics. Additionally, the integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more reliable and transparent AI in medical imaging.
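The abstract does not define the two scores formally. As a rough illustration only, the sketch below computes a localization score as box-level IoU between a grounded and a reference pathology region, and a semantic-consistency score as cosine similarity between feature embeddings of the original and diffusion-generated images. The function names, box format, and the choice of cosine similarity are hypothetical assumptions for this sketch, not the paper's actual formulation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine_similarity(feat_a, feat_b):
    """Cosine similarity between two feature vectors."""
    a, b = np.asarray(feat_a, dtype=float), np.asarray(feat_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dual_score(pred_box, ref_box, original_feat, generated_feat):
    """Return (localization score, semantic-consistency score) for one report phrase.

    Hypothetical sketch: localization from phrase-grounding boxes, consistency from
    features of the original CXR vs. the text-conditioned diffusion output.
    """
    localization = iou(pred_box, ref_box)
    consistency = cosine_similarity(original_feat, generated_feat)
    return localization, consistency
```

In practice, the per-phrase scores would be aggregated over all findings in a report to flag low-confidence outputs for expert review; the aggregation strategy here is left unspecified, as the abstract does not describe it.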