Recent advancements in Augmented Reality (AR) and Vision-Language Models (VLMs) have significantly improved interactive user experiences, particularly in training and educational environments. However, for precision-demanding tasks such as manual assembly instruction, these technologies still face inherent limitations in understanding fine-grained details. In particular, existing VLMs often struggle to accurately comprehend complex scenes and precisely localize objects, limiting their potential in AR-based training environments. To address this challenge, we introduce a novel dataset specifically developed for AR training tasks, sourced from a diverse collection of multimodal data, including LEGO instruction manuals. This dataset serves as the foundation for a series of vision-language tasks that simulate real-life AR training scenarios, such as scene understanding, object detection, and state detection. These tasks are designed to push the boundaries of current VLMs, offering a rigorous benchmark for evaluating their ability to handle fine-grained assembly instructions within AR environments. Our findings demonstrate that even leading VLMs struggle with the challenges posed by our dataset. For instance, GPT-4V, a state-of-the-art commercial model, achieves an F1-score of only 40.54% on state detection, underscoring the need for continued research and dataset development. These results reveal critical gaps in current models' ability to handle detailed vision-language tasks and suggest the importance of more robust datasets and benchmarks to guide future advancements. Ultimately, this work lays the groundwork for integrating VLMs into AR environments, highlighting areas where improvements are necessary and proposing new strategies for overcoming these limitations. By pushing the boundaries of multimodal learning systems, this research opens the door to more effective and intelligent AR training assistants, driving progress in industrial assembly and beyond.