How to Extract XML from ZUGFERD PDF using PDFextract: A Comprehensive Guide

In this article, we will walk through how to use PDFextract, a command-line tool, to extract XML data from ZUGFERD invoices and address some common questions that users may have regarding its usage. Whether you are working with a few invoices or processing hundreds of them across multiple workstations, PDFextract can be a useful tool for automating ZUGFERD data extraction.

https://www.verypdf.com/app/pdf-extract-tool/index.html

https://www.verypdf.com/app/pdf-extract-tool/try-and-buy.html#buy

What is ZUGFERD?

ZUGFeRD (Zentraler User Guide des Forums elektronische Rechnung Deutschland), written ZUGFERD throughout this article, is a German standard for electronic invoices. It is a hybrid format: a human-readable PDF (PDF/A-3) carries the same invoice as a machine-readable XML file embedded inside the document. Businesses exchange a single file, and accounting software reads the embedded XML for automated processing.

Step 1: Using PDFextract to Extract XML from a ZUGFERD PDF

PDFextract extracts various content from a PDF file, including text, fonts, and XML data. For ZUGFERD invoices it can extract the embedded XML data, but it has no built-in option to extract only the XML: it pulls all available content from the document.

How to Run PDFextract from the Command Line

Here’s the command to extract content from a ZUGFERD PDF file:

pdfextract.exe -outfolder C:\Path\to\output_folder C:\Path\to\input_ZUGFERD_invoice.pdf

This extracts all content from the PDF into the output folder, including the XML, text, fonts, and other embedded files. If you want only the XML data, you will need a custom version of PDFextract that supports this feature.
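Until an XML-only build is available, one possible workaround is to let PDFextract write everything and then pick the XML files out of the output folder. The following is a minimal C++17 sketch of that idea, not part of PDFextract itself. It assumes the extracted invoice data is saved with an .xml extension, which you should verify against your own output; the function names hasXmlExtension and findExtractedXml are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// True if the path ends in ".xml", compared case-insensitively.
// Assumption: PDFextract saves the embedded invoice data with this extension.
bool hasXmlExtension(const fs::path& p) {
    std::string ext = p.extension().string();
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext == ".xml";
}

// Collects every regular .xml file below outFolder, sorted by path.
// Returns an empty list if outFolder is not an existing directory.
std::vector<fs::path> findExtractedXml(const fs::path& outFolder) {
    std::vector<fs::path> xmlFiles;
    if (!fs::is_directory(outFolder)) {
        return xmlFiles;
    }
    for (const auto& entry : fs::recursive_directory_iterator(outFolder)) {
        if (entry.is_regular_file() && hasXmlExtension(entry.path())) {
            xmlFiles.push_back(entry.path());
        }
    }
    std::sort(xmlFiles.begin(), xmlFiles.end());
    return xmlFiles;
}
```

Calling findExtractedXml on the folder passed to -outfolder after PDFextract has finished returns the candidate XML files; the remaining extracted files can then be ignored or deleted by your own workflow.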

Custom Solution for XML-only Extraction

If extracting only the XML is critical for your workflow, the VeryPDF team offers to create a custom version of the tool that extracts just the XML data. Contact the VeryPDF Support Team if you are interested in this custom build, which would extract the XML data files without any additional files.

Step 2: Running PDFextract in Quiet Mode

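To run PDFextract in quiet mode, you can append > nul to the command. This suppresses the application's standard output, so nothing is printed while it runs. To discard error messages as well, add 2>&1 after > nul. Note that redirection only silences the output; it does not hide the command-line window itself. Step 3 shows how to launch the tool without a visible console window.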

Example Command for Quiet Mode

pdfextract.exe -outfolder D:\Downloads\1 D:\Downloads\EN16931_Einfach.pdf > nul
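If the command is assembled by a program rather than typed by hand, paths that contain spaces need to be quoted. The sketch below builds the quiet-mode command line as a string intended for cmd.exe (for example through std::system or a batch file), since the > nul redirection is a shell feature. The helper names quoteArg and buildQuietCommand are hypothetical, the 2>&1 suffix is an addition that also discards error output, and the quoting assumes PDFextract parses its arguments with the usual Microsoft C runtime rules.

```cpp
#include <cstddef>
#include <string>

// Wraps an argument in double quotes so a path with spaces stays one argument.
// Trailing backslashes are doubled so the closing quote is not read as escaped
// (assumes the standard Microsoft C runtime argument parsing rules).
std::string quoteArg(const std::string& arg) {
    std::size_t trailing = 0;
    while (trailing < arg.size() && arg[arg.size() - 1 - trailing] == '\\') {
        ++trailing;
    }
    return "\"" + arg + std::string(trailing, '\\') + "\"";
}

// Builds the quiet-mode command using only the -outfolder option shown above.
// "> nul" discards standard output and "2>&1" sends error output the same way.
std::string buildQuietCommand(const std::string& outFolder,
                              const std::string& inputPdf) {
    return "pdfextract.exe -outfolder " + quoteArg(outFolder) + " " +
           quoteArg(inputPdf) + " > nul 2>&1";
}
```

For example, buildQuietCommand("D:\\Downloads\\1", "D:\\Downloads\\EN16931_Einfach.pdf") yields the example command above with both paths quoted and error output discarded as well.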

Step 3: Automating PDF Extraction Without a Console Window

If you prefer to automate the process of extracting XML from a ZUGFERD PDF without showing the console window, you can use C++ code to execute the command. The example code provided below demonstrates how to use the Windows API to run PDFextract in the background and capture its output.

C++ Code Example to Run PDFextract

#include <windows.h>
#include <iostream>
#include <string>

int main() {
    // Command to run. The Unicode version of CreateProcess can modify this
    // buffer, so keep it in a writable array instead of casting away const.
    char command[] = "pdfextract.exe -outfolder D:\\Downloads\\1 D:\\Downloads\\EN16931_Einfach.pdf";

    // Create necessary structures (ANSI variants, matching the char command line)
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;
    SECURITY_ATTRIBUTES sa;

    // Zero out the structures
    ZeroMemory(&si, sizeof(si));
    ZeroMemory(&pi, sizeof(pi));
    ZeroMemory(&sa, sizeof(sa));

    // Set the SECURITY_ATTRIBUTES for pipe
    sa.nLength = sizeof(sa);
    sa.bInheritHandle = TRUE; // Allow handles to be inherited

    // Create a pipe for capturing output
    HANDLE hStdOutRead, hStdOutWrite;
    if (!CreatePipe(&hStdOutRead, &hStdOutWrite, &sa, 0)) {
        std::cerr << "CreatePipe failed with error: " << GetLastError() << std::endl;
        return 1;
    }

    // The child only needs the write end; do not let it inherit the read end
    SetHandleInformation(hStdOutRead, HANDLE_FLAG_INHERIT, 0);

    // Set up the STARTUPINFO structure
    si.cb = sizeof(si);
    si.dwFlags = STARTF_USESTDHANDLES;
    si.hStdInput = GetStdHandle(STD_INPUT_HANDLE);
    si.hStdOutput = hStdOutWrite;  // Redirect standard output to pipe
    si.hStdError = hStdOutWrite;   // Redirect standard error to pipe

    // Create the process
    if (!CreateProcessA(
            NULL,                 // Application name (NULL uses command line)
            command,              // Command line (writable buffer)
            NULL,                 // Process security attributes
            NULL,                 // Thread security attributes
            TRUE,                 // Inherit handles
            CREATE_NO_WINDOW,     // Do not create a console window for the child
            NULL,                 // Environment variables
            NULL,                 // Current directory
            &si,                  // Startup information
            &pi                   // Process information
        )) {
        DWORD err = GetLastError();
        CloseHandle(hStdOutRead);   // Avoid leaking the pipe handles on failure
        CloseHandle(hStdOutWrite);
        std::cerr << "CreateProcess failed with error: " << err << std::endl;
        return 1;
    }

    // Close the write end of the pipe, we only need to read from it
    CloseHandle(hStdOutWrite);

    // Read the output from the pipe
    DWORD dwRead;
    CHAR chBuf[4096];
    std::string output;

    while (true) {
        if (!ReadFile(hStdOutRead, chBuf, sizeof(chBuf), &dwRead, NULL) || dwRead == 0)
            break;
        output.append(chBuf, dwRead);  // Append exactly the bytes that were read
    }

    // Print the captured output
    std::cout << "Captured Output:\n" << output << std::endl;

    // Wait for the process to finish
    WaitForSingleObject(pi.hProcess, INFINITE);

    // Clean up
    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    CloseHandle(hStdOutRead);

    return 0;
}

This code launches PDFextract as a child process, captures whatever it writes to standard output and standard error, and waits for the process to complete.
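The example above handles a single hard-coded invoice. When processing many invoices, it may help to first plan one extraction job per PDF and then run the launcher once per job. The portable C++17 sketch below only does the planning step. The ExtractionJob struct, the function names, and the layout of one output subfolder per invoice are all hypothetical choices, not something PDFextract prescribes, and whether PDFextract creates a missing output folder on its own is not confirmed here.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// One planned PDFextract run: which invoice to read and where to write.
struct ExtractionJob {
    fs::path inputPdf;
    fs::path outFolder;
};

// Hypothetical layout: <outputRoot>/<PDF file name without its extension>.
fs::path outputFolderFor(const fs::path& outputRoot, const fs::path& inputPdf) {
    return outputRoot / inputPdf.stem();
}

// Lists the .pdf files directly inside inputDir (case-insensitive match) and
// pairs each with its own output folder. Returns no jobs if inputDir is missing.
std::vector<ExtractionJob> planJobs(const fs::path& inputDir,
                                    const fs::path& outputRoot) {
    std::vector<ExtractionJob> jobs;
    if (!fs::is_directory(inputDir)) {
        return jobs;
    }
    for (const auto& entry : fs::directory_iterator(inputDir)) {
        if (!entry.is_regular_file()) continue;
        std::string ext = entry.path().extension().string();
        std::transform(ext.begin(), ext.end(), ext.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (ext == ".pdf") {
            jobs.push_back({entry.path(), outputFolderFor(outputRoot, entry.path())});
        }
    }
    // Sort for a stable, repeatable processing order.
    std::sort(jobs.begin(), jobs.end(),
              [](const ExtractionJob& a, const ExtractionJob& b) {
                  return a.inputPdf < b.inputPdf;
              });
    return jobs;
}
```

Each job's inputPdf and outFolder can then be passed to the -outfolder command shown earlier. Creating the output folder beforehand with fs::create_directories is a cautious default.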

Step 4: Licensing for Multiple Workstations

For organizations needing to process ZUGFERD invoices across multiple workstations, there are two main licensing options available:

Server License: This option costs USD 299.95 per server. This license allows you to install PDFextract on one server. You will need to buy a separate server license for each server that you intend to run the software on.
Developer License: This option costs USD 1499.95 per developer and grants more flexibility, especially if you are developing software that integrates with PDFextract.

For processing on 1-5 workstations or servers, the Server License is the most cost-effective choice: five Server Licenses total USD 1,499.75, just under the price of one Developer License. However, if you need to integrate PDFextract into custom workflows or plan to run it on more than 5 servers (six Server Licenses would cost USD 1,799.70), the Developer License may be more suitable.
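The break-even point can be checked with a few lines of arithmetic. The sketch below uses the two prices quoted above, stored in US cents to avoid floating-point rounding. It assumes, as the comparison above implies, that a single Developer License would cover all of your servers; confirm the actual terms with VeryPDF before relying on this. The names are hypothetical.

```cpp
#include <cstdint>

// Prices in US cents, taken from the figures quoted in this article.
constexpr std::int64_t kServerLicenseCents = 29995;      // USD 299.95 per server
constexpr std::int64_t kDeveloperLicenseCents = 149995;  // USD 1499.95 per developer

// Total cost of buying one Server License for each server.
constexpr std::int64_t serverLicenseTotalCents(std::int64_t servers) {
    return servers * kServerLicenseCents;
}

// True if per-server licensing costs less than one Developer License.
// Assumption: one Developer License would cover every server.
constexpr bool serverLicensesCheaper(std::int64_t servers) {
    return serverLicenseTotalCents(servers) < kDeveloperLicenseCents;
}

// Compile-time check of the break-even point between 5 and 6 servers.
static_assert(serverLicensesCheaper(5), "5 x 299.95 = 1499.75 is below 1499.95");
static_assert(!serverLicensesCheaper(6), "6 x 299.95 = 1799.70 is above 1499.95");
```

Because the functions are constexpr, the static_assert lines verify the break-even point at compile time.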

Conclusion

PDFextract is a powerful tool for extracting XML data from ZUGFERD PDF invoices, and with the right configuration, you can automate the process without displaying a console window. Although the tool does not natively support extracting only XML data, a custom version can be developed to meet this need. If you're processing invoices on multiple workstations, choosing the appropriate license can help ensure you get the best value for your organization.

For more details on purchasing and licensing, visit VeryPDF's Buy Page: https://www.verypdf.com/app/pdf-extract-tool/try-and-buy.html#buy