AI Workflow Generator - Automating Assembly Instructions

During the ETH Exploration Lab, I developed an AI-powered tool at Bossard AG that automatically converts PDF assembly instructions into digital workflows for smart manufacturing stations.

The Problem

Bossard’s Smart Stations help factory workers by displaying step-by-step assembly instructions digitally. However, most companies already have assembly instructions as PDFs. Converting these PDFs into the digital format manually is extremely time-consuming.

Smart Station displaying digital assembly instructions

My Solution

I built an automated conversion pipeline that takes a PDF and outputs structured digital workflows ready for the Smart Stations. The system:

  1. Extracts text instructions using multimodal AI models that understand document layout
  2. Detects and extracts images from PDFs using computer vision models like YOLO
  3. Matches images to instructions so each step shows the correct picture
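
To make step 3 concrete, here is a minimal sketch of a purely spatial baseline: each instruction is paired with the nearest detected image on the same page. This is not the matching used in the project, only an illustration of the idea. The types Box, Instruction and DetectedImage and the function match_images are hypothetical names introduced for this example, and the bounding boxes are assumed to come from the extraction and detection steps.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box on a PDF page (hypothetical layout type)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


@dataclass
class Instruction:
    text: str
    box: Box


@dataclass
class DetectedImage:
    image_id: str
    box: Box


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes."""
    ax, ay = a.center()
    bx, by = b.center()
    return hypot(ax - bx, ay - by)


def match_images(
    instructions: List[Instruction], images: List[DetectedImage]
) -> Dict[str, Optional[str]]:
    """Map each instruction text to the id of the nearest image on its page.

    Instructions on a page without any detected image map to None.
    """
    matches: Dict[str, Optional[str]] = {}
    for ins in instructions:
        same_page = [img for img in images if img.box.page == ins.box.page]
        if not same_page:
            matches[ins.text] = None
            continue
        nearest = min(same_page, key=lambda img: center_distance(ins.box, img.box))
        matches[ins.text] = nearest.image_id
    return matches
```

A distance-only baseline like this ignores what the text and the picture actually show, so it is best read as a starting point rather than a complete matcher.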

The key insight was that context awareness is essential. Understanding how instructions and images relate, both spatially and semantically, was crucial for accurate matching and even for instruction extraction itself. A single multimodal model with spatial awareness worked considerably better than combining the outputs of separate models, although it is less flexible.
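
One way to give a single model this kind of spatial context is to pass the positions of text blocks and image regions alongside the text in one request. The sketch below only builds such a prompt and does not call any model. The function build_layout_prompt, the field names (id, page, bbox, text) and the prompt wording are assumptions made for illustration, not the format used in the project.

```python
import json
from typing import Dict, List


def build_layout_prompt(text_blocks: List[Dict], image_regions: List[Dict]) -> str:
    """Serialize page layout into one prompt so that text and image
    positions are visible to the model at the same time."""
    layout = {
        "text_blocks": [
            {"id": b["id"], "page": b["page"], "bbox": b["bbox"], "text": b["text"]}
            for b in text_blocks
        ],
        "image_regions": [
            {"id": r["id"], "page": r["page"], "bbox": r["bbox"]}
            for r in image_regions
        ],
    }
    header = (
        "You are given the layout of a PDF assembly instruction.\n"
        "Bounding boxes are [x0, y0, x1, y1] in page coordinates.\n"
        "Return a JSON list of steps, each with 'instruction' and 'image_id' "
        "(or null if no image belongs to the step)."
    )
    # A blank line separates the fixed instructions from the layout data.
    return header + "\n\n" + json.dumps(layout, indent=2)
```

Because instructions and image regions arrive in the same request, the model can use position and content together instead of reconciling the outputs of separate tools afterwards.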

What I Learned

Real-world complexity: PDFs from different companies vary wildly in quality and structure. Building something that works on clean test data is one thing; making it robust for real customer documents is much harder.

User feedback is crucial: I built a web application and tested it with actual Bossard customers. Their feedback showed both the potential and limitations, helping prioritize what to improve.

Video mode prototype: I also prototyped a feature where assembly experts record themselves while explaining the process verbally. The AI extracts instructions from their speech and captures key frames, which is useful for documenting knowledge that doesn’t exist in written form. I see huge potential in this feature for many customers.
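
As a rough sketch of how speech and key frames can be tied together, the example below assumes the transcript is available as timestamped segments with start, end and text fields (the shape of the segments in Whisper's verbose JSON output). Picking the midpoint of each segment as the key frame is a heuristic chosen for this example, not necessarily what the prototype does, and key_frames_from_segments is a hypothetical helper.

```python
from typing import Dict, List


def key_frames_from_segments(segments: List[Dict], fps: float) -> List[Dict]:
    """Turn timestamped transcript segments into steps with a key frame each.

    The key frame is taken at the midpoint of the spoken segment
    (an assumed heuristic for illustration).
    """
    steps = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip segments that contain no spoken text
        midpoint = (seg["start"] + seg["end"]) / 2
        steps.append(
            {
                "instruction": text,
                "timestamp_s": midpoint,
                # Index of the video frame at the midpoint, given the frame rate.
                "frame_index": int(midpoint * fps),
            }
        )
    return steps
```

The resulting frame indices can then be used to grab still images from the recording with any video library.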

Technologies Used

  • Python for the entire pipeline
  • GPT-4o/Whisper API for instruction extraction
  • Computer vision and document-parsing tools (YOLO, Docling, Mistral) for image detection and document layout understanding
  • Web application for customer testing and feedback

Impact

The tool successfully automates a previously manual, time-intensive process. While it doesn’t work perfectly on all PDFs (quality and structure matter), it demonstrates that AI can significantly reduce the barrier to adopting smart manufacturing technology.

This project was my individual research component of the Exploration Lab and became a stepping stone in understanding how to apply AI to real industrial challenges.