Docx Converter
Overview
The docx_converter Python script is designed to automate the conversion of Microsoft Word (.docx) documents using the standard RMC Word Document Template into modular MDX files, suitable for use with Docusaurus or other React-based documentation systems. It extracts and processes figures, tables, equations, citations, and document structure, outputting clean, component-rich MDX files and associated assets. This script allows for efficient conversion of existing Word documents into the RMC Software Documentation site format.
This script will not work if the DOCX file does not follow the standard RMC Word Document Template.
Ensure your document is formatted correctly before running the script.
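One way to catch a non-conforming document before conversion is to confirm that the template's paragraph styles are defined in the file. The sketch below is not part of the script; it reads word/styles.xml directly from the DOCX archive using only the standard library. The function names and the REQUIRED_STYLES set are assumptions for illustration. RMC_Figure is the only style name this guide mentions, so extend the set to match your template.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical set of required template styles. Only "RMC_Figure" is named
# in this guide; add the other styles your template defines.
REQUIRED_STYLES = {"RMC_Figure"}


def list_style_names(docx_file):
    """Return the set of style names defined in a DOCX file.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    names = set()
    for style in root.iter(f"{{{W_NS}}}style"):
        name = style.find(f"{{{W_NS}}}name")
        if name is not None:
            names.add(name.get(f"{{{W_NS}}}val"))
    return names


def missing_template_styles(docx_file, required=REQUIRED_STYLES):
    """Return the required style names that the document does not define."""
    return set(required) - list_style_names(docx_file)
```

A style being defined does not guarantee that the document applies it correctly, so treat an empty result as a necessary check rather than proof that the document follows the template.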
File Structure
Below is a typical file structure for the docx_to_mdx script, with a brief description of each file’s purpose.
docx_to_mdx/
│
├── main.py # Entry point; orchestrates the DOCX to MDX conversion process.
├── utils/ # Folder housing all utility and helper files for main.py
│   ├── __init__.py # Marks the directory as a Python package
│   ├── constants.py # Contains style mappings and global constants for parsing DOCX.
│   ├── helpers.py # Utility functions for formatting, file operations, JSX/MDX generation, and user prompts.
│   ├── figures.py # Handles extraction of figures and their captions from DOCX, saves images, and manages figure references.
│   ├── tables.py # Extracts tables from DOCX, formats them for MDX, and manages table references.
│   ├── citations.py # Parses citations and bibliography entries, formats them for MDX.
│   ├── equations.py # Extracts and formats equations and equation references for MDX.
│   ├── docx_processor.py # Main DOCX parsing logic; splits document into sections, identifies elements, and coordinates extraction.
│   └── mdx_writer.py # Writes the processed content (sections, figures, tables, etc.) to MDX files.
├── venv/ # Virtual environment folder
├── README.md # Documentation for the script, usage instructions, and requirements.
└── requirements.txt # Dependencies required to run the script
File Descriptions
- main.py
  The main script to run. Handles user prompts, sets up input/output paths, and calls the appropriate modules to process the DOCX and generate MDX files and assets.
- constants.py
  Stores style names, mappings, and other constants used to identify and process elements in the DOCX file (e.g., which styles correspond to headings, figures, tables, etc.).
- helpers.py
  Provides utility functions for formatting text, handling JSX/MDX output, file and directory operations, user confirmations, and other shared logic.
- figures.py
  Contains logic for detecting figures in the DOCX, extracting and saving figure images, parsing and cleaning captions, and generating MDX-compatible figure components and references.
- tables.py
  Handles extraction of tables from the DOCX, formatting them as Markdown or MDX components, and managing table references.
- citations.py
  Parses in-text citations and bibliography entries, formats them for MDX, and links them to the bibliography if provided.
- equations.py
  Extracts equations and equation references from the DOCX, formats them for MDX, and generates appropriate components.
- docx_processor.py
  The core parser that reads the DOCX file, identifies document structure (sections, headings, paragraphs), and coordinates the extraction of figures, tables, equations, and citations.
- mdx_writer.py
  Responsible for writing the processed content to MDX files, organizing sections, and ensuring correct references and imports for assets.
- README.md
  Provides an overview, usage instructions, requirements, and other documentation for users of the script.
What It Does
- Configuration and Setup
  - The user specifies input and output paths for the DOCX file, bibliography, figures directory, and MDX output directory in main.py.
  - The script checks for required dependencies and prompts the user to confirm whether the input variables are set correctly and whether figures should be regenerated within the figures folder of the Docusaurus project.
- DOCX Parsing
  - The script loads the DOCX file using python-docx and parses the document structure, identifying sections, headings, paragraphs, figures, tables, equations, and citations based on style mappings defined in constants.py.
- Figure Extraction
  - Figures are detected by their style (e.g., "RMC_Figure").
  - Each figure image is extracted and saved as a PNG file in the specified figures directory.
  - Figure captions are parsed, cleaned, and associated with the correct image.
- Table Extraction
  - Tables are identified and converted into Markdown or MDX table components.
  - Table captions and references are handled and linked appropriately.
- Equation and Citation Handling
  - Inline equations and references are detected and formatted for MDX.
  - Citations are parsed and linked to the provided bibliography file.
- Section and Navigation Structure
  - The script builds a modular section structure, splitting the document into logical MDX files based on headings.
  - Navigation metadata can be generated for Docusaurus sidebars.
- MDX File Generation
  - Each section, figure, table, and equation is written to an MDX file in the output directory.
  - MDX files include custom React components for figures, tables, and references, ready for use in Docusaurus.
- Asset Management
  - All extracted images are saved to the figures directory.
  - Relative paths are set so MDX files can reference images correctly.
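The style-based parsing and section splitting described above can be sketched as follows. This is a simplified illustration, not the script's actual implementation: STYLE_MAP, classify, and split_into_sections are hypothetical names, and apart from RMC_Figure the style names are placeholders for whatever constants.py defines. Paragraphs are passed in as (style name, text) pairs so the example runs without a DOCX file.

```python
from collections import namedtuple

# Hypothetical style-to-element mapping. "RMC_Figure" appears in this guide;
# the other style names are placeholders for the mappings in constants.py.
STYLE_MAP = {
    "Heading 1": "heading",
    "RMC_Figure": "figure",
    "RMC_Table_Caption": "table_caption",
}

Element = namedtuple("Element", ["kind", "text"])


def classify(style_name):
    """Map a paragraph style name to an element kind; default to 'paragraph'."""
    return STYLE_MAP.get(style_name, "paragraph")


def split_into_sections(paragraphs):
    """Group (style_name, text) pairs into sections, one per heading.

    Returns a list of (title, elements) tuples. Content that appears before
    the first heading is collected under the title None.
    """
    sections = []
    title, elements = None, []
    for style_name, text in paragraphs:
        kind = classify(style_name)
        if kind == "heading":
            # Close the section in progress before starting a new one.
            if title is not None or elements:
                sections.append((title, elements))
            title, elements = text, []
        else:
            elements.append(Element(kind, text))
    if title is not None or elements:
        sections.append((title, elements))
    return sections
```

With python-docx, the input pairs could be built from a document as [(p.style.name, p.text) for p in Document(path).paragraphs]. Each resulting section would then correspond to one generated MDX file.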
Output
The script generates the necessary MDX files and figures to allow the Docusaurus project to build the document. When successful, the document will be in draft form and ready for review following minor edits.
Outputs include:
- MDX Files: Modular .mdx files for each section or chapter of the original DOCX, placed in your specified output directory (e.g., docs/ or docs/generated/).
- Figures: Extracted figure images as .png files, saved in the designated figures directory (e.g., static/img/figures/).
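To review what a run produced, the output directories can be listed with a few lines of Python. The helper below is an illustrative sketch, not part of the script; inventory_outputs is a hypothetical name, and the directories passed to it should be the MDX and figures directories you configured.

```python
from pathlib import Path


def inventory_outputs(mdx_dir, figures_dir):
    """Return sorted relative paths of generated .mdx files and .png figures."""
    mdx_root, fig_root = Path(mdx_dir), Path(figures_dir)
    mdx_files = sorted(
        p.relative_to(mdx_root).as_posix() for p in mdx_root.rglob("*.mdx")
    )
    png_files = sorted(
        p.relative_to(fig_root).as_posix() for p in fig_root.rglob("*.png")
    )
    return {"mdx": mdx_files, "figures": png_files}
```

For example, inventory_outputs("docs/generated", "static/img/figures") returns a dictionary with one list per output type, which can be compared against the sections and figures you expect from the source document.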
Usage
- Install Python
  - Ensure you have Python 3.8+ installed.
- Create a Virtual Environment (Optional but Recommended) and Install Dependencies
  - Create a virtual environment in the docx_converter/ folder by running one of the following commands (depending on how Python is configured on your system):
    python -m venv venv
    python3 -m venv venv
    py -m venv venv
  - Activate the virtual environment (Windows command shown; on macOS or Linux, use source venv/bin/activate):
    venv\Scripts\activate
  - Install the required dependencies:
    pip install -r requirements.txt
Set the Environment for the conversion
-
The script is set up to allow for practice conversions in a development environment to test the procedure and ensure that the script is acting as expected.
-
To set the environment, open
main.py
and set theENVIRONMENT
variable to eitherdevelopment
orproduction
. -
development
will allow you to test the script without affecting the main documentation. -
production
will run the script as intended for the final documentation.
ENVIRONMENT = "development" # Change to "production" for final runs
-
-
Create the necessary
bib.json
file for the document you are converting- All references contained within the DOCX must be stored in the appropriate
bib.json
file prior to conversion - Additional information on the
bib.json
file is provided in the Bibliographies section of this guide
- All references contained within the DOCX must be stored in the appropriate
-
Assign Input and Output Paths
-
Open
main.py
in thedocx_converter/
folder and set the following variables:-
FIGSRC
: File path of figures relative to the Docusaurus project root. (e.g.,static/img/figures/software/v1.0/figures
) -
NAVLINK
,NAVTITLE
,NAVDOC
: Navigation settings for the NavContainer component -
DOCX_PATH
: Path to your input.docx
file -
BIB_PATH
: Path to your bibliography JSON file -
FIGURES_DIR
: Output directory for extracted figures (e.g.,../static/img/figures
) -
MDX_DIR
: Output directory for generated MDX files (e.g.,../docs/generated
)
-
-
-
Run the Script
-
Open a new terminal using
`Ctrl + Shift + `
or clicking "Terminal -> New Terminal" -
Navigate to the docx_converter folder in the Terminal
cd docx_converter
-
Activate your virtual environment if it is not already activated:
venv\Scripts\activate
When activated, you will see
(venv)
in the terminal before your terminal location. -
From the
docx_converter
folder, run one of the following commands (depending on how Python is configured on your system):python main.py
python3 main.py
py main.py -
The script will prompt you to confirm the following:
- Whether the input and output variables are set correctly (Y or N). If you choose N, the script will exit.
- Whether to regenerate figures in the Docusaurus project directory (Y or N). If all figures from the DOCX are already present in the figures directory, you can choose not to regenerate them (N). If you choose to regenerate figures (Y), the script will overwrite existing images in the figures directory.
-
-
Check Outputs
- Find your generated
.mdx
files in the output directory you specified (e.g.,docs/generated/
). - Find extracted figure images in the figures directory (e.g.,
static/img/figures/
). - Confirm the output files match your expectations (formatting, content, React components, etc.)
- Find your generated
Once you have confirmed the outputs are correct within the development environment, repeat steps 3-5 with the ENVIRONMENT variable set to production to convert the DOCX into the Docusaurus project.
Ensure the production file paths are correct. If set incorrectly, existing MDX docs within the Docusaurus project can be overwritten.
- Integrate with Docusaurus
  - The generated MDX files and images are now ready for use in the Docusaurus site.
This Python script is a preprocessing tool. It is not run by Docusaurus at runtime, but before you build or serve your site. For troubleshooting, check the console output and review the generated files for formatting or extraction issues.
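Because incorrect paths can overwrite existing files, it may help to validate the settings before running a conversion. The following is a minimal sketch of such a check, assuming the ENVIRONMENT, DOCX_PATH, and BIB_PATH values from main.py are passed in as arguments. check_config is a hypothetical helper and is not part of the script.

```python
import json
from pathlib import Path

VALID_ENVIRONMENTS = ("development", "production")


def check_config(environment, docx_path, bib_path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if environment not in VALID_ENVIRONMENTS:
        problems.append(
            f"ENVIRONMENT must be one of {VALID_ENVIRONMENTS}, got {environment!r}"
        )

    docx = Path(docx_path)
    if docx.suffix.lower() != ".docx":
        problems.append(f"DOCX_PATH does not end in .docx: {docx}")
    elif not docx.is_file():
        problems.append(f"DOCX_PATH not found: {docx}")

    bib = Path(bib_path)
    if not bib.is_file():
        problems.append(f"BIB_PATH not found: {bib}")
    else:
        # The bibliography must at least parse as JSON before conversion.
        try:
            json.loads(bib.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"BIB_PATH is not valid JSON: {exc}")
    return problems
```

Printing the returned list and stopping when it is non-empty gives an early, readable failure instead of a partial conversion.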