DOCX to MDX Converter
Overview
The DOCX Converter is a Python script that automatically converts Microsoft Word documents (using the standard RMC Word Document Template) into MDX format ready for the RMC Software Documentation site. This tool saves significant time when you have existing documentation in Word format that needs to be added to the website.
Use the DOCX converter when:
- You have an existing RMC document in Word format
- The document follows the standard RMC Word Document Template
- You want to quickly create MDX files without manual conversion
This script only works with documents that follow the standard RMC Word Document Template.
If your Word document doesn't use the RMC template, the converter won't work correctly. The script relies on specific styles and formatting conventions from the template to identify figures, tables, equations, and document structure.
What the Converter Does
The DOCX Converter automatically handles:
✅ Document Structure
- Extracts headings and creates separate MDX files for each chapter
- Maintains proper heading hierarchy
- Preserves formatting (bold, italic, lists)
✅ Figures
- Extracts all images from the Word document
- Saves them as PNG files in the correct folder
- Converts figure captions to MDX format
- Creates <Figure> components with automatic numbering
✅ Tables
- Extracts all tables
- Converts to <TableVertical> or <TableHorizontal> components
- Preserves table formatting and structure
- Maintains table captions
If a table contains merged cells, the conversion may not be seamless. The script adds a danger admonition to alert contributors that they should manually verify the proper table structure and make corrections as needed.
⚠️ Equations
- Detects mathematical equations in the Word document
- Cannot automatically convert equations to LaTeX format
- Inserts admonitions where equations are detected:
- For standalone equations: admonition placed at the equation's location
- For inline equations: admonition placed above the paragraph containing them
- Contributors must manually add equations using the <Equation> component
⚠️ Citations
- Identifies in-text citations
- Creates <Citation> components
- Links to bibliography file
The script sometimes removes author names and years that directly precede citations. For example:
- Source text: "Foster and Fell (2024)"
- May convert to: <Citation citationKey="FosterFell2024" />
- Should be: Foster and Fell (2024) <Citation citationKey="FosterFell2024" />
Contributors should verify that citation text is properly preserved during conversion.
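One way to spot this pattern during review is to scan the generated MDX for <Citation> tags that are not directly preceded by a year in parentheses. The sketch below is not part of the converter. The function name find_bare_citations and the "(YYYY)" heuristic are assumptions made for illustration, and a citation that legitimately stands alone will also be flagged.

```python
import re

# Matches the tag form shown above, e.g. <Citation citationKey="FosterFell2024" />
CITATION_TAG = re.compile(r'<Citation\s+citationKey="([^"]+)"\s*/>')
# Heuristic (assumption): preserved citation text ends with a year such as "(2024)"
YEAR_BEFORE = re.compile(r'\(\d{4}[a-z]?\)\s*$')

def find_bare_citations(mdx_text):
    """Return citation keys whose tag is not directly preceded by "(YYYY)"."""
    bare = []
    for match in CITATION_TAG.finditer(mdx_text):
        # Only the text immediately before the tag matters for this check
        preceding = mdx_text[max(0, match.start() - 20):match.start()]
        if not YEAR_BEFORE.search(preceding):
            bare.append(match.group(1))
    return bare
```

Each key returned is a candidate to compare against the original Word document; the check does not decide on its own whether text was actually removed.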
✅ Document Metadata
- Extracts title, authors, date, abstract
- Creates front matter for MDX files
Result: The converter significantly reduces conversion time by automatically handling document structure, figures, tables, and basic formatting. While the output requires review and some manual refinement, it saves hours of work compared to manual conversion.
Prerequisites
Before using the DOCX converter, ensure you have:
1. Python Installation
Version Required: Python 3.8 or higher
Check if you have Python:
python --version
or
python3 --version
If you don't have Python:
- Download from python.org
- During installation, check "Add Python to PATH"
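If the version output is unclear, the same check can be made from inside Python. This is a minimal sketch; the helper name meets_minimum is hypothetical, and only the 3.8 minimum comes from the requirement above.

```python
import sys

MINIMUM_VERSION = (3, 8)  # requirement stated above

def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if a (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum

status = "OK" if meets_minimum() else "is too old for the converter"
print("Python", sys.version.split()[0], status)
```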
2. Word Document Using RMC Template
Your Word document must:
- Follow the modern RMC Word Document Template (e.g., Blue, Red, Green, or Yellow themes)
- Use correct style names (RMC_Figure, RMC_Table, etc.)
- Have properly formatted citations and references
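A quick way to see which styles a document defines is to read its style table directly. A .docx file is a ZIP archive whose word/styles.xml part lists the defined styles. The sketch below uses only the standard library and is not how the converter itself checks the template; the helper names are hypothetical, and the default required names are just the two examples given above. A style that is defined is not necessarily applied to any content.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/styles.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def list_style_names(docx_file):
    """Return the set of style names defined in a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    names = set()
    for style in root.iter(f"{{{W_NS}}}style"):
        name = style.find(f"{{{W_NS}}}name")
        if name is not None:
            names.add(name.get(f"{{{W_NS}}}val"))
    return names

def missing_styles(docx_file, required=("RMC_Figure", "RMC_Table")):
    """Return the required style names that the document does not define."""
    return sorted(set(required) - list_style_names(docx_file))
```

Calling missing_styles("your-document.docx") returns an empty list when every required name is defined.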
3. Bibliography File Created
Before running the converter:
- Create a bib.json file with all document references
- Place it in the appropriate location in static/bibliographies/
- Format according to Project Structure
Why this is required:
The converter needs the bibliography file to properly link in-text citations. If citations exist in the Word document but not in bib.json, the conversion will have errors.
Setup Instructions
Step 1: Navigate to Converter Folder
Open Terminal in VS Code (Ctrl + Shift + `) and navigate to the converter:
cd docx_converter
Step 2: Create Virtual Environment (Recommended)
A virtual environment keeps the converter's Python packages separate from your system Python.
Virtual environments prevent conflicts between different Python projects. The converter requires specific package versions that might differ from other Python tools you use.
Important: Create the virtual environment ONLY in the docx_converter/ folder, NOT in the root of the project.
Create the virtual environment:
Use one of the following commands to create the virtual environment (which one depends on your Python installation):
python -m venv venv
or
python3 -m venv venv
or
py -m venv venv
Activate the virtual environment:
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
When activated successfully:
You'll see (venv) appear at the beginning of your terminal prompt:
(venv) C:\GitHub\RMC-Software-Documentation\docx_converter>
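If the prompt does not show (venv), or you are unsure which interpreter is in use, Python can report it. This minimal sketch relies on standard behavior: inside a venv, sys.prefix differs from sys.base_prefix. The function name is hypothetical.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtual_environment())
print("Interpreter:", sys.executable)
```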
Step 3: Install Dependencies
With the virtual environment activated, install required packages:
pip install -r requirements.txt
This installs:
- python-docx - For reading Word documents
- Other dependencies needed for conversion
Installation time: ~30 seconds to 1 minute
Conversion Workflow
Overview of the Process
1. Prepare Word Document & Bibliography
↓
2. Configure Converter Settings (main.py)
↓
3. Run in Development Mode (test)
↓
4. Review Generated Files
↓
5. Assess Output & Note Issues
↓
6. Run in Production Mode (final)
↓
7. Make Manual Refinements
↓
8. Test and Commit
Detailed Conversion Steps
Step 1: Prepare Your Files
A. Place Word Document
Place your .docx file in the static/source-documents/ folder following the same folder structure used for docs, figures, and bibliographies.
Example structure:
static/source-documents/desktop-applications/your-software/users-guide/v1.0/your-document.docx
This organizational structure keeps all related files (source documents, MDX files, figures, and bibliographies) aligned across the project.
B. Create Bibliography File
Before conversion, create bib.json with all references:
Location example:
static/bibliographies/desktop-applications/your-software/users-guide/v1.0/bib.json
See Project Structure for bibliography format and examples.
Step 2: Configure the Converter
Open docx_converter/main.py in VS Code and configure these settings:
A. Set Environment Mode
# Line ~20
ENVIRONMENT = "development" # Start with development to test
Environment modes:
- development - Test mode
  - Safe to experiment
  - Outputs to temporary location
  - Won't overwrite existing docs
  - Always start here!
- production - Final mode
  - Outputs directly to docs folder
  - Will overwrite existing files
  - Only use after testing in development
B. Configure Paths
Find and set these variables in main.py:
Figure Path (FIGSRC):
# Path used in <Figure> component src attributes
# Figure filenames will be appended (e.g., "figure-1.png")
FIGSRC = "figures/desktop-applications/your-software/users-guide/v1.0"
This path appears in the generated MDX files and must match where figures are stored in static/. Always use forward slashes /.
Navigation Component Settings:
# NAVLINK: URL destination for the back arrow navigation
NAVLINK = "/desktop-applications/your-software"
# NAVTITLE: Display text shown in the navigation link
NAVTITLE = "User's Guide"
# NAVDOC: Document identifier for version selector
# Must match the key in versionList.json
NAVDOC = "desktop-applications/your-software/users-guide"
These configure the NavContainer component at the top of each page, providing navigation back to parent pages and version selection.
Development Environment Paths:
# DOCX_PATH: Location of source Word document
DOCX_PATH = r"C:\path\to\your-document.docx"
# BIB_PATH: Location of bibliography file
BIB_PATH = r"C:\GitHub\RMC-Software-Documentation\static\bibliographies\desktop-applications\your-software\users-guide\v1.0\bib.json"
# FIGURES_DIR: Temporary output directory for extracted figures (testing)
FIGURES_DIR = r"C:\temp\conversion-test\figures"
# MDX_DIR: Temporary output directory for generated MDX files (testing)
MDX_DIR = r"C:\temp\conversion-test\mdx"
Production Environment Paths:
# DOCX_PATH: Location of source Word document (typically same as development)
DOCX_PATH = r"C:\path\to\your-document.docx"
# BIB_PATH: Location of bibliography file (typically same as development)
BIB_PATH = r"C:\GitHub\RMC-Software-Documentation\static\bibliographies\desktop-applications\your-software\users-guide\v1.0\bib.json"
# FIGURES_DIR: Final output directory for figures
FIGURES_DIR = r"C:\GitHub\RMC-Software-Documentation\static\figures\desktop-applications\your-software\users-guide\v1.0"
# MDX_DIR: Final output directory for MDX files
MDX_DIR = r"C:\GitHub\RMC-Software-Documentation\docs\desktop-applications\your-software\users-guide\v1.0"
Path Configuration Tips:
- DOCX_PATH and BIB_PATH are typically the same for both environments
- FIGURES_DIR and MDX_DIR differ: development uses temporary locations, production uses final project locations
- Use absolute paths (full paths starting from drive letter on Windows)
- Forward slashes / work on all platforms and are recommended
- Backslashes \ work on Windows but should be in raw strings (prefix with r)
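The slash and raw-string tips can be checked with pathlib. PureWindowsPath parses Windows-style paths on any operating system, so this small illustration runs anywhere. The paths are the temporary example locations used above.

```python
from pathlib import PureWindowsPath

raw_backslashes = PureWindowsPath(r"C:\temp\conversion-test\mdx")
forward_slashes = PureWindowsPath("C:/temp/conversion-test/mdx")

# Both spellings describe the same location
print(raw_backslashes == forward_slashes)   # True

# A drive letter plus a root makes the path absolute
print(forward_slashes.is_absolute())        # True
print(PureWindowsPath("conversion-test/mdx").is_absolute())  # False

# Without the r prefix, "\t" is read as a tab character
print("C:\temp" == r"C:\temp")              # False
```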
Incorrect paths in production mode can overwrite existing documentation!
Always:
- Start with ENVIRONMENT = "development"
- Test thoroughly
- Then switch to ENVIRONMENT = "production"
- Double-check production paths before running
Step 3: Run Development Mode (Test)
A. Ensure Settings Are Correct
✓ ENVIRONMENT = "development"
✓ All paths point to correct locations
✓ Bibliography file exists and is complete
✓ Virtual environment is activated (you see (venv) in terminal)
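Part of this checklist can be automated before each run. The sketch below is a hypothetical helper, not something main.py is known to contain. It takes the same values you set for ENVIRONMENT, DOCX_PATH, and BIB_PATH and returns a list of problems it finds.

```python
from pathlib import Path

VALID_ENVIRONMENTS = ("development", "production")

def preflight_problems(environment, docx_path, bib_path):
    """Return human-readable problems with the settings (empty list if none)."""
    problems = []
    if environment not in VALID_ENVIRONMENTS:
        problems.append(
            f"ENVIRONMENT must be one of {VALID_ENVIRONMENTS}, got {environment!r}"
        )
    docx = Path(docx_path)
    if docx.suffix.lower() != ".docx":
        problems.append(f"DOCX_PATH does not end in .docx: {docx}")
    elif not docx.is_file():
        problems.append(f"DOCX_PATH does not exist: {docx}")
    if not Path(bib_path).is_file():
        problems.append(f"BIB_PATH does not exist: {bib_path}")
    return problems
```

An empty list means these basic checks passed. It does not confirm that the bibliography is complete.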
B. Run the Converter
From the docx_converter/ folder:
python main.py
or
python3 main.py
or
py main.py
C. Respond to Prompts
The script will ask:
Prompt 1:
Are the input and output variables set correctly? (Y/N):
- Type Y if paths are correct
- Type N to exit and fix paths
Prompt 2:
Regenerate figures in the Docusaurus project directory? (Y/N):
- Type Y to extract all images from Word (first run)
- Type N if images already exist and you don't want to overwrite
D. Monitor Progress
You'll see output like:
Processing DOCX file...
Extracting figures... (15 found)
Extracting tables... (8 found)
Processing equations... (12 found)
Parsing citations... (23 found)
Writing MDX files...
- 00-document-info.mdx
- 00-version-history.mdx
- 01-preface.mdx
- 02-introduction.mdx
...
Conversion complete!
Conversion time: 10 seconds to 2 minutes depending on document size
Step 4: Review Generated Files
A. Check MDX Files
Navigate to your output directory (specified in MDX_DIR).
Review each file for:
Basic Structure:
✓ Headings are correct
✓ Text formatting preserved (bold, italic, lists)
✓ Figure components look right
Tables (pay special attention):
✓ Table components render properly
✓ Check for danger admonitions marking tables with merged cells
✓ Verify table structure is correct, especially for complex tables
Equations (require manual work):
✓ Look for equation admonitions placed by the converter
- Standalone equations: admonition at equation location
- Inline equations: admonition above paragraph
✓ Plan to manually add LaTeX equations using <Equation> component
✓ Note which sections have equations needing conversion
Citations (verify carefully):
✓ Citation components are present
✓ Author names and years are preserved (not removed)
- Should be: "Foster and Fell (2024) <Citation citationKey="FosterFell2024" />"
- Not: "<Citation citationKey="FosterFell2024" />" (missing text)
✓ Citation keys match entries in bib.json
✓ Citations link correctly to bibliography
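Checking keys by eye is error-prone in long documents. The sketch below collects every citationKey used in an MDX file and reports those missing from a set of known keys. It is not part of the converter, and the function name is hypothetical. Because the bib.json layout is described in Project Structure rather than here, the known keys are passed in by the caller instead of being read from the file.

```python
import re

# Captures the key from tags such as <Citation citationKey="Smith2020" />
CITATION_KEY = re.compile(r'<Citation\s+citationKey="([^"]+)"')

def unknown_citation_keys(mdx_text, known_keys):
    """Return citation keys used in the MDX text that are not in known_keys."""
    used = set(CITATION_KEY.findall(mdx_text))
    return sorted(used - set(known_keys))
```

For example, unknown_citation_keys(text, {"FosterFell2024"}) lists any other key the text cites.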
B. Check Extracted Images
Navigate to your figures directory (specified in FIGURES_DIR).
Verify:
✓ All images extracted
✓ Images are clear (not blurry or pixelated)
✓ Filenames are reasonable
✓ File sizes are appropriate (< 500KB each ideally)
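The file-size check can be scripted. This minimal sketch lists PNG files in a figures directory that reach the 500 KB guideline. The function name is hypothetical, and treating 500 KB as 500 * 1024 bytes is an assumption.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 500 * 1024  # the "< 500KB" guideline above

def oversized_figures(figures_dir, limit=SIZE_LIMIT_BYTES):
    """Return (filename, size_in_bytes) for each PNG at or above the limit."""
    results = []
    for png in sorted(Path(figures_dir).glob("*.png")):
        size = png.stat().st_size
        if size >= limit:
            results.append((png.name, size))
    return results
```

Pass the same directory you set in FIGURES_DIR. The script does not judge whether an image is blurry or pixelated; that still requires opening the files.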
C. Test Locally
With the dev server running (npm start from project root):
- Navigate to your converted document in the browser
- Check all pages render correctly
- Verify figure numbering works
- Test cross-references
- Check citations link to bibliography
Common issues to look for:
- Missing images (check image paths)
- Broken references (check unique keys)
- Formatting problems (may need manual fixes)
- Tables with merged cells flagged by danger admonitions
- Equation admonitions requiring manual LaTeX conversion
- Missing citation text (author names/years removed by converter)
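Missing images can be caught before opening the browser. The sketch below reads the src attribute from each <Figure> tag and checks that a matching file exists. It assumes, based on the FIGSRC description above, that src values resolve relative to the project's static/ folder. The function name is hypothetical.

```python
import re
from pathlib import Path

# Captures the src value from tags such as <Figure figKey="fig-1" src="..." />
FIGURE_SRC = re.compile(r'<Figure\b[^>]*?\bsrc="([^"]+)"')

def missing_figure_files(mdx_text, static_dir):
    """Return src values whose image file is not found under static_dir."""
    missing = []
    for src in FIGURE_SRC.findall(mdx_text):
        # Drop any leading slash so the path joins under static_dir
        if not (Path(static_dir) / src.lstrip("/")).is_file():
            missing.append(src)
    return missing
```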
Step 5: Assess Development Output
After reviewing the development output, assess whether the conversion worked well enough to proceed to production.
Determine If You're Ready for Production
The conversion is ready for production mode if:
✓ Document structure is generally correct (headings, paragraphs)
✓ Figures extracted successfully
✓ Tables converted (even if flagged with admonitions)
✓ Citations are present (even if missing some text)
✓ No critical errors or missing content
You may need to adjust the Word document and re-run development mode if:
✗ Major structural problems (chapters missing, wrong order)
✗ Most figures didn't extract
✗ Converter crashed or produced errors
✗ Bibliography file missing or incorrect
Note What Needs Manual Fixing Later
Don't make manual edits to the development output files! Instead, make a note of issues to fix after the production run:
Equations (will always need work):
- Count how many equation admonitions were created
- Note which sections have the most equations
- Plan time for LaTeX conversion after production run
Tables with merged cells:
- Identify which tables have danger admonitions
- Reference page numbers in original Word document
- Plan to fix these tables after production run
Citations:
- Scan a few citations to see if author names/years are missing
- If it's a pattern, expect to fix all citations after production run
- Verify bib.json has all needed entries
Other issues:
- Note any major formatting problems
- Identify any missing or incorrect content
- Document any unexpected converter behavior
Development mode is for testing and assessment only. Once you confirm the conversion generally works, proceed to production mode. All manual refinements happen after the production run, not during development testing.
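Counting admonitions by hand is tedious for long documents. The sketch below tallies admonition openers by type in MDX text. It assumes the converter writes standard Docusaurus admonitions that open with a line such as :::danger. The type used for equation admonitions is not stated in this guide, so check one generated file first. The function name is hypothetical.

```python
import re
from collections import Counter

# An opener is ":::" followed directly by a type name; a closing ":::" has no name
ADMONITION_OPEN = re.compile(r'^[ \t]*:::(\w+)', re.MULTILINE)

def count_admonitions(mdx_text):
    """Return a Counter mapping admonition type to number of occurrences."""
    return Counter(ADMONITION_OPEN.findall(mdx_text))
```

Running this over each generated file gives a rough estimate of how much manual work to plan for after the production run.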
Step 6: Run Production Mode (Final)
Once you're satisfied with the development output:
A. Update Settings
In main.py:
# Change from development to production
ENVIRONMENT = "production"
B. Verify Production Paths
Critical: Double-check these paths!
# Ensure these point to final locations
FIGURES_DIR = r"C:\GitHub\RMC-Software-Documentation\static\figures\desktop-applications\your-software\users-guide\v1.0"
MDX_DIR = r"C:\GitHub\RMC-Software-Documentation\docs\desktop-applications\your-software\users-guide\v1.0"
Verify the folders exist:
- Create docs/ folder structure if needed
- Create static/figures/ folder structure if needed
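The folders can also be created from Python, which avoids typing long nested paths by hand. This is a minimal sketch; the function name is hypothetical, and whether the converter creates missing folders itself is not stated here.

```python
from pathlib import Path

def ensure_output_dirs(*directories):
    """Create each directory (with parents) if missing; return those created."""
    created = []
    for directory in directories:
        path = Path(directory)
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    return created
```

Pass the same values you set for FIGURES_DIR and MDX_DIR. Folders that already exist are left untouched.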
C. Backup Existing Files (If Applicable)
If converting a new version and old files exist:
# Create backup of existing documentation
cp -r docs/path/to/old-version docs/path/to/old-version-backup
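The cp command above is available on Mac/Linux shells. As a cross-platform alternative, the same backup can be made with Python's standard library. This is a sketch under the same naming convention (a -backup suffix next to the original folder); the function name is hypothetical.

```python
import shutil
from pathlib import Path

def backup_directory(source_dir, suffix="-backup"):
    """Copy source_dir to a sibling folder named <name><suffix>; return its path."""
    source = Path(source_dir)
    target = source.with_name(source.name + suffix)
    # copytree raises FileExistsError if the backup folder already exists,
    # so an earlier backup is never silently overwritten
    shutil.copytree(source, target)
    return target
```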
D. Run Production Conversion
python main.py
Respond to prompts:
- Confirm variables are correct: Y
- Regenerate figures: Y (for final run)
E. Final Review
After production conversion:
- Verify files are in correct location
  - Check docs/ folder
  - Check static/figures/ folder
- Test with dev server
  # From project root
  npm start
- Check all functionality
  - Navigation works
  - All pages load
  - Images display
  - Cross-references work
  - Citations link correctly
Step 7: Make Manual Refinements
Now that production files are in place, make the manual adjustments you identified in Step 5.
Required Manual Fixes
Equations (always required):
✓ Locate all equation admonitions placed by the converter
✓ Open original Word document for equation reference
✓ Convert Word equations to LaTeX format
✓ Replace admonitions with <Equation> components
✓ Test equation rendering in browser
✓ See React Components for equation examples
Tables with merged cells:
✓ Find danger admonitions marking problematic tables
✓ Open original Word document for table reference
✓ Manually verify and fix table structure in MDX
✓ Adjust <TableVertical> or <TableHorizontal> components
✓ Remove danger admonitions after fixing
✓ Test table rendering in browser
Citations (verify all):
✓ Check every citation for missing author names/years
✓ Add back any text removed by converter
✓ Format should be: "Author (Year) <Citation citationKey="..." />"
✓ Verify citation keys match bib.json entries
✓ Test that citations link correctly to bibliography
Optional Refinements
Formatting adjustments:
✓ Adjust line breaks and spacing
✓ Add missing bold/italic formatting
✓ Fine-tune list formatting
Component refinements:
✓ Set figure widths appropriately
✓ Adjust table column widths
✓ Fine-tune caption text
✓ Update component props as needed
Content corrections:
✓ Correct special characters
✓ Adjust formatting for code blocks
✓ Update outdated information
Step 8: Test and Commit
A. Test with Development Server
After completing all manual refinements, test your converted documentation thoroughly:
npm start
This starts the development server at http://localhost:3000. Verify:
✓ All pages load correctly
✓ Navigation works properly
✓ Images display correctly
✓ Equations render properly
✓ Tables are formatted correctly
✓ Citations link to bibliography
✓ Cross-references work
✓ No console errors appear
Press Ctrl+C to stop the development server when testing is complete.
Contributors do NOT need to run npm run build before committing. The development server (npm start) is sufficient for testing. Site administrators will handle building and deploying the site to production.
If you encounter any issues while testing locally, contact the repository administrator rather than attempting to troubleshoot build processes.
B. Commit and Push Changes
Using GitHub Desktop
Commit Changes:
- Open GitHub Desktop
- Review the changed files in the left sidebar
- Ensure all relevant files are checked:
  - New MDX files in docs/
  - Extracted figures in static/figures/
  - Source Word document in static/source-documents/
  - Bibliography file in static/bibliographies/
- Write a commit message (e.g., "Add converted documentation for [Software Name] v1.0")
- Click Commit to main
Push to Repository:
- Click Push origin at the top of the window
Using Git (Command Line)
Commit Changes:
git add docs/your-new-files/
git add static/figures/your-new-images/
git add static/source-documents/your-document.docx
git add static/bibliographies/your-path/bib.json
git commit -m "Add converted documentation for [Software Name] v1.0"
Push to Repository:
git push origin main
Once pushed, site administrators will review, build, and deploy your changes to the live site.
Understanding the Converter's Output
What You'll Get
After conversion, you'll have:
docs/desktop-applications/your-software/users-guide/v1.0/
├── 00-document-info.mdx # Metadata and document info
├── 00-version-history.mdx # Version history table
├── 01-preface.mdx # Preface chapter
├── 02-introduction.mdx # Introduction chapter
├── 03-methodology.mdx # Methodology chapter
└── ... # Additional chapters
static/figures/desktop-applications/your-software/users-guide/v1.0/
├── figure-1.png # Extracted figure images
├── figure-2.png
├── figure-3.png
└── ...
MDX File Structure
Each generated MDX file will include:
Front matter:
---
title: Chapter Title
---
Component imports:
import Figure from '@site/src/components/Figure';
import TableVertical from '@site/src/components/TableVertical';
import Equation from '@site/src/components/Equation';
import Citation from '@site/src/components/Citation';
Content with components:
# Chapter Title
Regular paragraph text with **bold** and _italic_ formatting.
<Figure figKey="fig-1" src="/figures/desktop-applications/your-software/users-guide/v1.0/figure-1.png" alt="Description" caption="Figure caption text" />
More content with <Citation citationKey="Smith2020" /> references.