
Docx Converter

Overview

The docx_converter Python script automates the conversion of Microsoft Word (.docx) documents that follow the standard RMC Word Document Template into modular MDX files suitable for Docusaurus or other React-based documentation systems. It extracts and processes figures, tables, equations, citations, and document structure, and outputs clean, component-rich MDX files and their associated assets. This makes it efficient to convert existing Word documents into the RMC Software Documentation site format.

warning

This script will not work if the DOCX file does not follow the standard RMC Word Document Template.
Ensure your document is formatted correctly before running the script.

File Structure

Below is a typical file structure for the docx_to_mdx script, with a brief description of each file’s purpose.

docx_to_mdx/

├── main.py # Entry point; orchestrates the DOCX to MDX conversion process.
├── utils/ # Folder housing all utility and helper files for main.py
│   ├── __init__.py # Marks the directory as a Python package
│   ├── constants.py # Contains style mappings and global constants for parsing DOCX.
│   ├── helpers.py # Utility functions for formatting, file operations, JSX/MDX generation, and user prompts.
│   ├── figures.py # Handles extraction of figures and their captions from DOCX, saves images, and manages figure references.
│   ├── tables.py # Extracts tables from DOCX, formats them for MDX, and manages table references.
│   ├── citations.py # Parses citations and bibliography entries, formats them for MDX.
│   ├── equations.py # Extracts and formats equations and equation references for MDX.
│   ├── docx_processor.py # Main DOCX parsing logic; splits document into sections, identifies elements, and coordinates extraction.
│   └── mdx_writer.py # Writes the processed content (sections, figures, tables, etc.) to MDX files.
├── venv/ # Virtual environment folder
├── README.md # Documentation for the script, usage instructions, and requirements.
└── requirements.txt # Dependencies required to run the script

File Descriptions

  • main.py
    The main script to run. Handles user prompts, sets up input/output paths, and calls the appropriate modules to process the DOCX and generate MDX files and assets.

  • constants.py
    Stores style names, mappings, and other constants used to identify and process elements in the DOCX file (e.g., which styles correspond to headings, figures, tables, etc.).

  • helpers.py
    Provides utility functions for formatting text, handling JSX/MDX output, file and directory operations, user confirmations, and other shared logic.

  • figures.py
    Contains logic for detecting figures in the DOCX, extracting and saving figure images, parsing and cleaning captions, and generating MDX-compatible figure components and references.

  • tables.py
    Handles extraction of tables from the DOCX, formatting them as Markdown or MDX components, and managing table references.

  • citations.py
    Parses in-text citations and bibliography entries, formats them for MDX, and links them to the bibliography if provided.

  • equations.py
    Extracts equations and equation references from the DOCX, formats them for MDX, and generates appropriate components.

  • docx_processor.py
    The core parser that reads the DOCX file, identifies document structure (sections, headings, paragraphs), and coordinates the extraction of figures, tables, equations, and citations.

  • mdx_writer.py
    Responsible for writing the processed content to MDX files, organizing sections, and ensuring correct references and imports for assets.

  • README.md
    Provides an overview, usage instructions, requirements, and other documentation for users of the script.
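
To make the role of constants.py more concrete, here is a minimal sketch of a style-to-element mapping with a lookup helper. Only the "RMC_Figure" style name appears in this guide. The other style names, the STYLE_MAP dictionary, and the classify_style function are hypothetical placeholders, not taken from the actual script.

```python
# Hypothetical style mapping in the spirit of constants.py.
# Only "RMC_Figure" is named in this guide; every other entry is an assumption.
STYLE_MAP = {
    "Heading 1": "heading",
    "Heading 2": "heading",
    "RMC_Figure": "figure",
    "RMC_Table_Caption": "table_caption",  # assumed style name
    "RMC_Equation": "equation",            # assumed style name
}


def classify_style(style_name: str) -> str:
    """Return the element kind for a paragraph style, defaulting to 'paragraph'."""
    return STYLE_MAP.get(style_name, "paragraph")


if __name__ == "__main__":
    for name in ("RMC_Figure", "Heading 1", "Normal"):
        print(name, "->", classify_style(name))
```

Keeping the mapping in one dictionary means a change to the Word template's style names would only need to be made in one place.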


What It Does

  1. Configuration and Setup

    • The user specifies input and output paths for the DOCX file, bibliography, figures directory, and MDX output directory in main.py.
    • The script checks for required dependencies and prompts the user to confirm whether the input variables are set correctly and whether figures should be regenerated within the figures folder of the Docusaurus project.
  2. DOCX Parsing

    • The script loads the DOCX file using python-docx and parses the document structure, identifying sections, headings, paragraphs, figures, tables, equations, and citations based on style mappings defined in constants.py.
  3. Figure Extraction

    • Figures are detected by their style (e.g., "RMC_Figure").
    • Each figure image is extracted and saved as a PNG file in the specified figures directory.
    • Figure captions are parsed, cleaned, and associated with the correct image.
  4. Table Extraction

    • Tables are identified and converted into Markdown or MDX table components.
    • Table captions and references are handled and linked appropriately.
  5. Equation and Citation Handling

    • Inline equations and references are detected and formatted for MDX.
    • Citations are parsed and linked to the provided bibliography file.
  6. Section and Navigation Structure

    • The script builds a modular section structure, splitting the document into logical MDX files based on headings.
    • Navigation metadata can be generated for Docusaurus sidebars.
  7. MDX File Generation

    • Each section, figure, table, and equation is written to an MDX file in the output directory.
    • MDX files include custom React components for figures, tables, and references, ready for use in Docusaurus.
  8. Asset Management

    • All extracted images are saved to the figures directory.
    • Relative paths are set so MDX files can reference images correctly.
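
The section-splitting idea in steps 2, 6, and 7 can be sketched as follows. This is an illustration only: the "Heading 1" style name, the numbered file-name scheme, and the slugify and split_into_sections functions are assumptions, and the real script may split and name files differently. The input is a list of (style, text) pairs so the sketch runs without a DOCX file.

```python
import re
from typing import Dict, List, Tuple


def slugify(title: str) -> str:
    """Turn a heading into a lowercase, hyphenated file stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def split_into_sections(paragraphs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (style, text) pairs into sections keyed by an MDX file name.

    A new section starts at each paragraph styled "Heading 1" (assumed name).
    Text that appears before the first heading is stored under "00-preface.mdx".
    """
    sections: Dict[str, List[str]] = {}
    current = None
    count = 0
    for style, text in paragraphs:
        if style == "Heading 1":
            count += 1
            current = f"{count:02d}-{slugify(text)}.mdx"
            sections[current] = [f"# {text}"]
        else:
            if current is None:
                current = "00-preface.mdx"
                sections[current] = []
            sections[current].append(text)
    return sections


if __name__ == "__main__":
    sample = [
        ("Heading 1", "Overview"),
        ("Normal", "Intro text."),
        ("Heading 1", "Model Setup"),
        ("Normal", "Details."),
    ]
    for file_name, lines in split_into_sections(sample).items():
        print(file_name, lines)
```

With python-docx installed, the input pairs could be built from a real document with `[(p.style.name, p.text) for p in Document(path).paragraphs]`.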

Output

tip

The script generates the necessary MDX files and figures to allow the Docusaurus project to build the document. When successful, the document will be in draft form and ready for review following minor edits.

Outputs include:

  • MDX Files:
    Modular .mdx files for each section or chapter of the original DOCX, placed in your specified output directory (e.g., docs/ or docs/generated/).
  • Figures:
    Extracted figure images as .png files, saved in the designated figures directory (e.g., static/img/figures/).
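
As an illustration of where the figure files come from: a .docx file is a ZIP package, and embedded images are stored under word/media/ inside it. The sketch below copies those files into a figures directory using only the standard library. The extract_media function is hypothetical. It does not convert images to PNG or match them to captions, both of which the actual script handles in its own way.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_media(docx_path: str, figures_dir: str) -> List[Path]:
    """Copy every file under word/media/ in the DOCX package into figures_dir."""
    out_dir = Path(figures_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    with zipfile.ZipFile(docx_path) as package:
        for name in sorted(package.namelist()):
            # Skip directory entries; keep only the embedded media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(package.read(name))
                saved.append(target)
    return saved
```

Calling `extract_media("report.docx", "static/img/figures")` (both paths are placeholders) would return the list of files written.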

Usage

  1. Install Python

    • Ensure you have Python 3.8+ installed.
  2. Create a Virtual Environment (Optional but Recommended) and Install Dependencies

    • Create a virtual environment in the docx_converter/ folder by running one of the following commands (depending on how Python is configured on your system):

      python -m venv venv

      python3 -m venv venv

      py -m venv venv
    • Activate the virtual environment (the command below is for Windows; on macOS or Linux, run source venv/bin/activate instead):

      venv\Scripts\activate
    • Install the required dependencies:

      pip install -r requirements.txt
  3. Set the Environment for the conversion

    • The script supports practice conversions in a development environment so you can test the procedure and confirm that the script behaves as expected.

    • To set the environment, open main.py and set the ENVIRONMENT variable to either development or production.

    • development will allow you to test the script without affecting the main documentation.

    • production will run the script as intended for the final documentation.

    ENVIRONMENT = "development"  # Change to "production" for final runs
  4. Create the necessary bib.json file for the document you are converting

    • All references contained within the DOCX must be stored in the appropriate bib.json file prior to conversion.
    • Additional information on the bib.json file is provided in the Bibliographies section of this guide.
  5. Assign Input and Output Paths

    • Open main.py in the docx_converter/ folder and set the following variables:

      • FIGSRC: File path of figures relative to the Docusaurus project root (e.g., static/img/figures/software/v1.0/figures)

      • NAVLINK, NAVTITLE, NAVDOC: Navigation settings for the NavContainer component

      • DOCX_PATH: Path to your input .docx file

      • BIB_PATH: Path to your bibliography JSON file

      • FIGURES_DIR: Output directory for extracted figures (e.g., ../static/img/figures)

      • MDX_DIR: Output directory for generated MDX files (e.g., ../docs/generated)

  6. Run the Script

    • Open a new terminal using Ctrl + Shift + ` (backtick) or by clicking "Terminal -> New Terminal".

    • Navigate to the docx_converter folder in the terminal: cd docx_converter

    • Activate your virtual environment if it is not already activated:

      venv\Scripts\activate

      When activated, you will see (venv) in the terminal before your terminal location.

    • From the docx_converter folder, run one of the following commands (depending on how Python is configured on your system):

      python main.py

      python3 main.py

      py main.py
    • The script will prompt you to confirm the following:

      • Whether the input and output variables are set correctly (Y or N). If you choose N, the script will exit.
      • Whether to regenerate figures in the Docusaurus project directory (Y or N). If all figures from the DOCX are already present in the figures directory, you can choose not to regenerate them (N). If you choose to regenerate figures (Y), the script will overwrite existing images in the figures directory.
  7. Check Outputs

    • Find your generated .mdx files in the output directory you specified (e.g., docs/generated/).
    • Find extracted figure images in the figures directory (e.g., static/img/figures/).
    • Confirm the output files match your expectations (formatting, content, React components, etc.)
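
The two Y/N confirmations described in step 6 could be implemented along these lines. This is a sketch under assumptions: the confirm and run_prompts functions, the prompt wording, and the returned dictionary are hypothetical and are not the script's actual code. The `ask` parameter defaults to the built-in input function and exists only so the logic can be exercised without a terminal.

```python
from typing import Callable, Dict


def confirm(question: str, ask: Callable[[str], str] = input) -> bool:
    """Ask a Y/N question until a valid answer is given; return True for Y."""
    while True:
        answer = ask(f"{question} (Y/N): ").strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        print("Please answer Y or N.")


def run_prompts(ask: Callable[[str], str] = input) -> Dict[str, bool]:
    """Exit if the variables are not confirmed; otherwise record the figure choice."""
    if not confirm("Are the input and output variables set correctly?", ask):
        raise SystemExit("Update the variables in main.py and run again.")
    return {"regenerate_figures": confirm("Regenerate figures?", ask)}


if __name__ == "__main__":
    print(run_prompts())
```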

Once you have confirmed that the outputs are correct in the development environment, repeat steps 3-5 with the ENVIRONMENT variable set to production, then run the script again to convert the DOCX into the Docusaurus project.

danger

Ensure the production file paths are correct. If set incorrectly, existing MDX docs within the Docusaurus project can be overridden.
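
One way to guard against accidental overwrites is to check, before a production run, which planned output files already exist in the target directory. The find_overwrites helper below is a hypothetical sketch and is not part of the script. The directory and file names in the usage example are placeholders.

```python
from pathlib import Path
from typing import Iterable, List


def find_overwrites(mdx_dir: str, planned_names: Iterable[str]) -> List[str]:
    """Return the planned file names that already exist in mdx_dir."""
    target = Path(mdx_dir)
    return sorted(name for name in planned_names if (target / name).is_file())


if __name__ == "__main__":
    # Placeholder values; substitute your own MDX_DIR and expected file names.
    existing = find_overwrites("../docs/generated", ["01-overview.mdx", "02-setup.mdx"])
    if existing:
        print("These files would be overwritten:", ", ".join(existing))
    else:
        print("No existing MDX files would be overwritten.")
```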

  8. Integrate with Docusaurus

    • The generated MDX files and images are now ready for use in the Docusaurus site.
info

This Python script is a preprocessing tool. It is not run by Docusaurus at runtime, but before you build or serve your site. For troubleshooting, check the console output and review the generated files for formatting or extraction issues.