How to Convert Docx to HTML

My #awesome Twitter followers implored me to post this weekend. So here goes!

For my current side project I needed to allow a user to upload a docx file and then continue editing it from their WordPress dashboard.

I chose docx because it is an open standard based on XML. I figured that might make things somewhat easier.

A simple Word document

Unzip Docx file

Many modern formats these days are compressed directories containing XML files. Because XML files tend to get bloated, as we will see, text compression is important but also very effective.

Docx looks like this when unzipped.

For our purposes the important file is /word/document.xml. That’s where the textual content of the document is stored.
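
Here is a minimal sketch of pulling that file out, assuming PHP's bundled zip extension (ZipArchive) is available. The function name readDocumentXml is hypothetical and not part of the original code. It reads the entry straight from the archive, so nothing needs to be unpacked to disk.

```php
<?php
// Hypothetical helper: returns the raw XML of word/document.xml, or null on failure.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive();

    // open() returns true on success and an integer error code otherwise
    if ($zip->open($docxPath) !== true) {
        return null;
    }

    // Entry names inside the archive carry no leading slash
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? null : $xml;
}
```

Calling readDocumentXml('test.docx') would then hand back the XML as a string, ready for whichever XML parser you prefer.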

Once unpacked, this is what you can expect. I’ve formatted it just a bit…


<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" ...
MISC ATTRIBUTES HERE... GENERAL WASTE OF SPACE>
   <w:body>
      <w:p w14:paraId="69CF3FA0" w14:textId="19D656BA" w:rsidR="008609A1" w:rsidRDefault="008609A1" w:rsidP="00C93000">
         <w:pPr>
            <w:pStyle w:val="Heading1" />
         </w:pPr>
         <w:bookmarkStart w:id="0" w:name="_GoBack" />
         <w:r>
            <w:t>Hello World</w:t>
         </w:r>
      </w:p>
      <w:bookmarkEnd w:id="0" />
      <w:p w14:paraId="462EF7B7" w14:textId="33FA0674" w:rsidR="004C1088" w:rsidRDefault="004C1088">
         <w:r>
            <w:t xml:space="preserve">This is a </w:t>
         </w:r>
         <w:r w:rsidRPr="004C1088">
            <w:rPr>
               <w:b />
            </w:rPr>
            <w:t xml:space="preserve">very short </w:t>
         </w:r>
         <w:r>
            <w:t xml:space="preserve">paragraph. It only contains </w:t>
         </w:r>
         <w:r w:rsidRPr="004C1088">
            <w:rPr>
               <w:i />
            </w:rPr>
            <w:t xml:space="preserve">three </w:t>
         </w:r>
         <w:r>
            <w:t xml:space="preserve">sentenses. This is the </w:t>
         </w:r>
         <w:r w:rsidRPr="004C1088">
            <w:rPr>
               <w:u w:val="single" />
            </w:rPr>
            <w:t>third sentence</w:t>
         </w:r>
         <w:r>
            <w:t>.</w:t>
         </w:r>
      </w:p>
      <w:sectPr w:rsidR="004C1088" w:rsidSect="001F0D6D">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
      </w:sectPr>
   </w:body>
</w:document>

As you can see, it is well-formed XML; I wouldn’t expect anything less from Microsoft. With a little trial and error, i.e., adding bold and italics at random and seeing what changed in the XML, I was able to figure out how it’s formatted.

  • Paragraphs are enclosed in w:p tags.
  • Groups of words with formatting are wrapped with w:r.
  • The text itself is wrapped with w:t.
  • Headings are set with w:pStyle w:val="Heading?".
    • The ? stands for a numeral representing the heading level [1-6].
  • A word group containing a self-closing w:b tag is bold.
  • A word group containing a self-closing w:i tag is italic.
  • A word group containing a self-closing w:u tag is underlined.
    • The type of underline is defined with w:val; w:val="single" gives a single underline.
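
The rules above can be sketched in PHP with DOMDocument and DOMXPath. This is only an illustration under a few assumptions: the helper name extractParagraphs is made up, it reads only the first w:t in each run, and it treats any w:b, w:i or w:u element as "on" without looking at its w:val.

```php
<?php
// Hypothetical helper: turns document.xml into a list of paragraphs, each with
// a heading level (0 = plain paragraph) and its runs with formatting flags.
function extractParagraphs(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    // Namespace URI taken from the w:document element
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:body/w:p') as $p) {
        // e.g. w:val="Heading1" gives level 1
        $style = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);
        $level = preg_match('/^Heading([1-6])$/', $style, $m) ? (int) $m[1] : 0;

        $runs = [];
        foreach ($xpath->query('w:r', $p) as $r) {
            $runs[] = [
                'text'      => $xpath->evaluate('string(w:t)', $r),
                'bold'      => $xpath->evaluate('boolean(w:rPr/w:b)', $r),
                'italic'    => $xpath->evaluate('boolean(w:rPr/w:i)', $r),
                'underline' => $xpath->evaluate('boolean(w:rPr/w:u)', $r),
            ];
        }
        $paragraphs[] = ['level' => $level, 'runs' => $runs];
    }
    return $paragraphs;
}
```

The result is a plain array, which keeps the XML parsing separate from whatever HTML generation comes next.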

There’s a whole lot more in here, including typeface, font size, etc. For my purposes I wanted to keep basic formatting, and only basic formatting, so this code does not take typeface or font size into account. But it should be a good start if you’d like to do that.

I didn’t start out trying to re-invent the wheel but couldn’t find a good solution.

DOCX to HTML Free is already out there, but it couldn’t handle files of any significant size. There were existing PHP classes that could, but those solutions were costly… So I decided to donate some of my time to the community.

Solution #1

The initial idea was to loop through all w:r tags and wrap each one in the formatting tags it contains. That did work; it created semantic HTML that included all formatting in the document. The problem was that as docx files get bigger, the XML gets messier. It is still well-formed, but I noticed entire sentences broken up into single words, each one wrapped in its own bold tag. I found several instances of single spaces wrapped in multiple formatting tags. Basically, everything I hated in WYSIWYG HTML generators.
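
To make the bloat concrete, here is a rough sketch of that first approach. It assumes each run has already been reduced to an array of text plus boolean flags. The helper name wrapEachRun and the choice of strong, em and u as output tags are my assumptions, not the original code.

```php
<?php
// Hypothetical helper: wraps every run in its own formatting tags (solution #1).
// Each run looks like ['text' => string, 'bold' => bool, 'italic' => bool, 'underline' => bool].
function wrapEachRun(array $runs): string
{
    $tags = ['bold' => 'strong', 'italic' => 'em', 'underline' => 'u'];
    $html = '';

    foreach ($runs as $run) {
        $piece = htmlspecialchars($run['text']);
        foreach ($tags as $flag => $tag) {
            if (!empty($run[$flag])) {
                $piece = "<$tag>$piece</$tag>";
            }
        }
        $html .= $piece;
    }
    return $html;
}

// One bold phrase that Word has split into three runs
$runs = [
    ['text' => 'very',  'bold' => true],
    ['text' => ' ',     'bold' => true],
    ['text' => 'short', 'bold' => true],
];
echo wrapEachRun($runs), "\n";
// prints: <strong>very</strong><strong> </strong><strong>short</strong>
```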

The quick and dirty solution #2.

For solution #2 I keep track of where text first gets assigned a formatting tag, mark it as open, and only close it when a run no longer has that formatting tag.
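
One way to read that description, sketched under the same assumptions as before (runs as arrays of text plus boolean flags; the helper name trackOpenTags and the strong, em and u tags are hypothetical):

```php
<?php
// Hypothetical helper: opens a tag when a format first appears and closes it
// only when a later run no longer carries that format (solution #2).
function trackOpenTags(array $runs): string
{
    $tags = ['bold' => 'strong', 'italic' => 'em', 'underline' => 'u'];
    $open = [];   // formats with an unclosed tag, in the order they were opened
    $html = '';

    foreach ($runs as $run) {
        // Close formats this run no longer has, most recently opened first
        foreach (array_reverse($open) as $flag) {
            if (empty($run[$flag])) {
                $html .= '</' . $tags[$flag] . '>';
                $open = array_values(array_diff($open, [$flag]));
            }
        }
        // Open formats that start with this run
        foreach ($tags as $flag => $tag) {
            if (!empty($run[$flag]) && !in_array($flag, $open, true)) {
                $html .= "<$tag>";
                $open[] = $flag;
            }
        }
        $html .= htmlspecialchars($run['text']);
    }

    // Close whatever is still open at the end of the paragraph
    foreach (array_reverse($open) as $flag) {
        $html .= '</' . $tags[$flag] . '>';
    }
    return $html;
}

echo trackOpenTags([
    ['text' => 'very',  'bold' => true],
    ['text' => ' ',     'bold' => true],
    ['text' => 'short', 'bold' => true],
]), "\n";
// prints: <strong>very short</strong>
```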

This solution almost works. It produces non-semantic (improperly nested) HTML in the rare cases where formatting tags overlap each other.

Take a sentence like this one, where part of the sentence is bold, overlapping part of it that is italic.

You end up with code like this:


    <strong>In a sentence like this one where part of the sentence is <em>bold overlapping</strong> part of it that is italic.</em>

BIG NO NO

I call this the quick and dirty solution because I didn’t go the extra mile. People usually don’t format their text like the example above; it’s mostly academic. So I kept that solution. But to make sure my code is semantic, I ran it through simplexml_import_dom()->asXML() to fix cases of non-semantic HTML. The only issue is that it truncates the formatting where the nesting breaks down. Since this will be a rare case, I’ll ignore it for now.
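
A sketch of what such a repair step might look like. The post does not say how the DOM was built, so loading the markup with DOMDocument::loadHTML() is an assumption, as are the helper name repairMarkup and the wrapper div.

```php
<?php
// Hypothetical helper: lets libxml's lenient HTML parser re-balance the tags,
// then serializes the resulting tree as well-formed XML.
function repairMarkup(string $html): string
{
    // Misnested tags raise parser warnings; collect them instead of printing
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The wrapper div gives us a single node to pull back out afterwards
    $dom->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $div = $dom->getElementsByTagName('div')->item(0);
    return (string) simplexml_import_dom($div)->asXML();
}

echo repairMarkup('<strong>one <em>two</strong> three</em>'), "\n";
```

The parser decides where the stray closing tag belongs, which is why some formatting can be lost in the repaired output.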

At some point I’ll revisit this solution and work out the logic so that the code is well-formed, not bulked up, semantic AND can take rare formatting cases into account.

And now… For the Code:


    24 thoughts on “How to Convert Docx to HTML”

      1. Hmmm, maybe it’s a server setting. I had to mangle some things to get the zip functionality working on the server I was using. You can try replacing the zip library I’m using with something PHP native. Also, you may need to fiddle with the permissions on the wp-content dir, as I use that as a tmp folder.

    1. At some point I should clean this up a bit. I thought it may be useful so others don’t have to work from scratch… but I really was developing it for a very specific use.

      I’m happy you found a solution.

    2. The solution was pretty simple: CKEditor has a “Paste from Word” option, and it does the job pretty well. It does all the formatting, even adds tables… I know it’s off topic here. :)
      But it’s a good solution in some cases :)
      Cheers!

    3. I got this code pretty much working except that $goodHTML always ends up being pretty much blank except for .
      $text ends up as simply . I know my input .docx file contains data as I output the raw xml before processing and it looks ok. Any ideas as to why all the text is getting stripped out?

      1. Hey Phil,

        Hmmm, I’d test that by dumping the $text at the end. See what it comes up with.

        If that comes up blank, troubleshoot the XML reader. It may be that my understanding of the structure of docx XML is lacking a key aspect of the file you’re trying to unpack.

        Please comment if you find out what was causing the issue!

    4. This is really neat! One question. I am getting this error “parser error : Extra content at the end of the document” at the point at which the reader actually reads the .docx file.

      Any ideas?

      Thanks in advance!

    5. I thought I should give you a heads up. I used your code as inspiration for a basic Docx2ePub converter. The code is still a mess, I only spent a little time hashing it together, and I’m not really a PHP guru either.
      I extended your code to include a slightly better handling of semantics of formatting, adding images and tables as well.
      (In Docx, what happens inside a w:r, stays there, making the conversion of the formatting a little easier.)
      https://github.com/Grandt/PHPDocxToEPub

      1. Thanks A!

        I’m glad you found it useful. That’s why I put it out there :)

        I’m planning on revisiting that. My PHP skills have improved since, and I’d like to take another crack at it. I’ll keep you posted.

    6. Hi Jack, and thanks for this code here. It seems I need to write a docx2TXT and HTML converter, this is helpful, but not quite perfect… ;)
      I ran into a French word file. Guess what the headings are called? Titre[1-6] :D LOL
      (these style names are connected together in styles.xml)

    7. Hi there, thanks for the code. I was wondering how do I target a test.docx on the root to read in the document.xml? What is the $targetDir in this case?

      Thanks

      1. docx files are compressed folders. The document.xml is under the “word” folder when unzipped. What I do is set up a tmp folder, unzip the file, read the document.xml then delete the unzipped folder.

        1. Hi there, thanks for the quick response. I figured that’s probably what you did, so I modified some code to create a temp.xml for each file in a directory so I could loop through files and add them to a database or search them, etc. Basically I read in the filename, open the zip file, then save a temp.xml file for reading purposes. Your code then follows thereafter.

          $filename = "tmp/" . $filename;
          if (!$filename || !file_exists($filename)) return false;

          $zip = zip_open($filename);

          // zip_open() returns an integer error code on failure
          if (!$zip || is_numeric($zip)) return false;

          $content = ""; // initialize before appending to it

          while ($zip_entry = zip_read($zip)) {

              if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

              // only the main document part is needed
              if (zip_entry_name($zip_entry) == "word/document.xml") {
                  $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
              }

              zip_entry_close($zip_entry);
          } // end while

          zip_close($zip);

          file_put_contents('temp.xml', $content);

          $xmlFile = "temp.xml";

          /////YOUR CODE STARTS HERE
          $reader = new XMLReader;
          $reader->open($xmlFile);

          The only other question I had was that I have accents in my Word documents that are getting converted to ? characters instead of HTML entities. Do you know how to get an accented character to come out as HTML when the goodHTML is printed out? Examples of special characters are: Aarón Sánjovani
