How to batch process all PDF files in a folder and subfolders using PDF to Any Converter and OCR to Any Converter?

I have a question that applies to both packages: can only a single file be added at a time?

I'm looking for something that can process folders and subfolders of PDFs and export XMLs with the coordinates of the OCR layer text. The HTML output does have coordinates, so it would be close, but without batch processing the package doesn't offer the functionality I require. Have I missed something obvious? Can folders and subfolders be added, or is there a hot folder?

I tried some previously OCRed PDFs in OCR to Any Converter, as it appears to have a hot folder capability. I see that it has to re-OCR each page rather than reuse the existing OCR text (and your software is an improvement on the previous OCR!), but there is no XML output. The closest is HTML, but the result isn't really HTML, just text with no HTML markup.

I look forward to hearing from you.

Customer
-----------------------


Thanks for your message. If you want to extract text and positions from the OCRed text layer of a PDF file, you can download the "PDF to HTML Converter Command Line" software from the following web page and try it:

https://www.verypdf.com/app/pdf-to-html-converter/try-and-buy.html#cmd

After you download and unzip it to a folder, you can run any of the following command lines to convert one PDF file to one HTML file with X/Y positions:

   pdf2html.exe C:\in.pdf C:\out.htm
   pdf2html.exe -r 150 C:\in.pdf C:\out.htm
   pdf2html.exe -imgformat 1 C:\in.pdf C:\out.htm
   pdf2html.exe -noimg C:\in.pdf C:\out.htm
   pdf2html.exe -onehtm C:\in.pdf C:\out.htm
   pdf2html.exe -oneword C:\in.pdf C:\out.htm
   pdf2html.exe -homeurl "http://www.verypdf.com" C:\in.pdf C:\out.htm
   pdf2html.exe -yoffset 20 C:\in.pdf C:\out.htm
   pdf2html.exe -notextinbody -notextinmeta C:\in.pdf C:\out.htm
  
You can use a command line loop to process all PDF files in a folder and its subfolders. Here's an example command using a for loop in Windows:

for /r "C:\input_folder" %%f in (*.pdf) do pdf2html.exe -onehtm "%%f" "C:\output_folder\%%~nf.htm"

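This command processes every PDF file in "C:\input_folder" and its subdirectories. For each file found, it runs pdf2html.exe with the full path of the PDF (%%f) as input and an .htm file in "C:\output_folder" as output, where %%~nf expands to the PDF's file name without its extension. Because only the file name is kept, PDFs with the same name in different subfolders are written to the same output file; to avoid that, use "%%~dpnf.htm" as the output path, which writes each HTML file next to its source PDF.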

Make sure to replace "C:\input_folder" with the path to your input folder containing the PDF files, and "C:\output_folder" with the path to your desired output folder for the HTML files.

If you run this command directly at the command prompt instead of from a batch file, replace %%f with %f (and %%~nf with %~nf).
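
For larger folder trees, the same loop can be sketched in a scripting language. The following is a minimal Python sketch, not part of the VeryPDF software: it assumes pdf2html.exe is on the PATH, reuses the -onehtm option from the example above, and uses hypothetical helper names (build_commands, run_all). It mirrors the input folder structure under the output folder, so PDFs with the same file name in different subfolders do not overwrite each other.

```python
from pathlib import Path
import subprocess


def build_commands(input_root, output_root, exe="pdf2html.exe", options=("-onehtm",)):
    """Return one argument list per PDF found under input_root (recursively).

    exe and options are assumptions taken from the examples in this post.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    commands = []
    for pdf in sorted(input_root.rglob("*.pdf")):
        # Keep the path relative to the input root, so a\report.pdf and
        # b\report.pdf map to different output files.
        target = (output_root / pdf.relative_to(input_root)).with_suffix(".htm")
        commands.append([exe, *options, str(pdf), str(target)])
    return commands


def run_all(commands):
    """Create each output folder, then run the commands one by one."""
    for cmd in commands:
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        # check=False: keep going even if one file fails to convert.
        subprocess.run(cmd, check=False)
```

Calling build_commands(r"C:\input_folder", r"C:\output_folder") only builds the list, so it can be printed and reviewed before run_all is called on a machine where the converter is installed.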

✅ "VeryPDF PDF to Any Converter Command Line" can be downloaded from the following web page:

https://www.verypdf.com/app/pdf-to-any-converter/try-and-buy.html#buy-cmd

Here are PDF to Any Converter command lines that convert a single PDF file at a time:

    pdf2any.exe -$ XXXXXXXXXXXXXXXX "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.html"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe -htmlmode 1 "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe -f 1 -l 10 -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe -htmlimgnotext -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe -htmlimgtwo -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.docx"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.doc"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xls"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.pptx"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.rtf"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.txt"
    pdf2any.exe -layout "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.txt"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.tif"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.jpg"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.png"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.tga"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.gif"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.bmp"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.ico"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.ps"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.eps"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.pdf"
    pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.emf"

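You can use the "for" command to process all PDF files in a folder or, with /r, in a folder and its subfolders. The examples below are written for the command prompt; in a batch file, replace %F with %%F.

Batch process examples: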
    for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "out_%~nF.doc"
    for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "C:\test\%~nF.txt"
    for %F in (D:\temp\*.pdf) do pdf2any.exe -skip "%F" "C:\test\%~nF.rtf"
    for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "out_%~nF.png"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "C:\test\%~nF.xls"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.html"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.ps"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.eps"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.tif"
    for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.jpg"
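
The variable modifiers in these loops decide where each output file lands, so it helps to see what they expand to. The sketch below reproduces the common ones in Python for a given Windows path; it is an illustration only, and for_modifiers is a hypothetical helper, not part of cmd.exe or of the VeryPDF software.

```python
from pathlib import PureWindowsPath


def for_modifiers(path):
    """Show what common cmd.exe 'for' variable modifiers expand to for one file."""
    p = PureWindowsPath(path)
    folder = str(p.parent)
    if not folder.endswith("\\"):
        folder += "\\"  # %~dpF always ends with a backslash
    return {
        "%F": str(p),                      # the path as found by the loop
        "%~nF": p.stem,                    # file name without extension
        "%~xF": p.suffix,                  # extension, including the dot
        "%~dpF": folder,                   # drive and folder
        "%~dpnF": str(p.with_suffix("")),  # drive, folder and name, no extension
    }


if __name__ == "__main__":
    for key, value in for_modifiers(r"D:\temp\sub\report.pdf").items():
        print(f"{key:8} {value}")
```

For D:\temp\sub\report.pdf, "%~nF" gives report and "%~dpnF" gives D:\temp\sub\report. That is why "C:\test\%~nF.xls" collects every output in one folder, while "%~dpnF.html" writes each output next to its source PDF.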

✅ "VeryPDF OCR to Any Converter Command Line" can be downloaded from the following web page:

https://www.verypdf.com/app/ocr-to-any-converter-cmd/try-and-buy.html#buy

Here are some sample command lines for the "VeryPDF OCR to Any Converter Command Line" software:

   ocr2any.exe C:\in.pdf C:\out.txt
   ocr2any.exe -firstpage 1 -lastpage 1 C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -res 300 C:\in.pdf C:\out.txt
   ocr2any.exe -ownerpwd 123 -userpwd 456 C:\in.pdf C:\out.txt
   ocr2any.exe -layout C:\in.pdf C:\out.txt
   ocr2any.exe -layout2 C:\in.pdf C:\out.txt
   ocr2any.exe -table C:\in.pdf C:\out.txt
   ocr2any.exe -pdf2table C:\in.pdf C:\out.txt
   ocr2any.exe -noc C:\in.pdf C:\out.txt
   ocr2any.exe C:\in.tif C:\out.txt
   ocr2any.exe C:\in.jpg C:\out.txt
   ocr2any.exe C:\in.bmp C:\out.txt
   ocr2any.exe C:\in.png C:\out.txt
   ocr2any.exe -ocr -lang eng C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -lang eng+kor C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -lang eng+jpn C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -lang deu C:\in.pdf C:\out.txt
   ocr2any.exe -lang deu C:\in.tif C:\out.txt
   ocr2any.exe -text "PageText %PageNumber% of %PageCount%" C:\in.pdf C:\out.txt
   ocr2any.exe -subject "subject" C:\in.pdf C:\out.pdf
   ocr2any.exe -ownerpwdout 123 -keylen 2 -encryption 3900 C:\in.pdf C:\out.pdf
   ocr2any.exe -subject "subject" -title "title" C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang eng -ocrmode 0 C:\in.pdf C:\out.txt
   ocr2any.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang ita -ocrmode 1 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang nld -ocrmode 1 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang spa -ocrmode 1 C:\in.pdf C:\out.pdf
   ocr2any.exe -bitcount 24 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
   ocr2any.exe -bitcount 8 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
   ocr2any.exe -ocrmode 4 -ocr C:\in.tif C:\out.pdf
   ocr2any.exe -ocrmode 3 -threshold 200 -ocr C:\in.tif C:\out.pdf
   ocr2any.exe -ocrmode 4 -rotate 90 -ocr C:\in.tif C:\out.pdf
   ocr2any.exe -ocr -lang jpn -ocrmode 4 -bitcount 24 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang chi_sim -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang chi_tra -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang chi_sim+eng -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
   ocr2any.exe -ocr -lang chi_sim+deu -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
   ocr2any.exe -delblankpages D:\test.pdf D:\out.pdf
   ocr2any.exe -delblankpages -linewidth 8 D:\test.pdf D:\out.pdf
   ocr2any.exe -delblankpages -specklesize 20 D:\test.pdf D:\out.pdf

Use Enhanced OCR options:
   ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.rtf
   ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.doc
   ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.xls
   ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.rtf
   ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.doc
   ocr2any.exe -ocr2 -ocr2excelmode 0 C:\in.pdf C:\out.xls
   ocr2any.exe -ocr2 -ocr2excelmode 1 C:\in.pdf C:\out.xls
   ocr2any.exe -ocr2 -ocr2excelmode 2 C:\in.pdf C:\out.xls
   ocr2any.exe -ocr2 C:\in.pdf C:\out.doc
   ocr2any.exe -ocr2 C:\in.pdf C:\out.rtf
   ocr2any.exe -ocr2 C:\in.png C:\out.xls
   ocr2any.exe -ocr2 C:\in.tif C:\out.csv
   ocr2any.exe -ocr2 C:\in.bmp C:\out.txt
   ocr2any.exe -ocr2 C:\in.gif C:\out.htm
   ocr2any.exe -ocr2 C:\in.pdf C:\out.html
   ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.html
   ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc
   ocr2any.exe -ocr2 C:\in.pdf C:\out.rtf
   ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.doc
   ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.xls
   ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.txt
   ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.txt
   ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.rtf
   ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.rtf
   ocr2any.exe -ocr2 C:\in.pdf C:\text.pdf
   ocr2any.exe -ocr2 C:\in.tif C:\out.pdf
   ocr2any.exe -ocr2 C:\in.png C:\out.pdf
   ocr2any.exe -ocr2 C:\in.jpg C:\out.pdf
   ocr2any.exe -ocr2 C:\in.tif C:\out.doc
   ocr2any.exe -ocr2 C:\in.tif C:\out.rtf
   ocr2any.exe -ocr2 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 C:\in.tif C:\out.xls
   ocr2any.exe -ocr2 -ocr2autorotate C:\in.tif C:\out.pdf
   ocr2any.exe -ocr2 -ocr2autorotate C:\in.tif C:\out.doc
   ocr2any.exe -ocr2 -outputformat 1 C:\in.tif C:\out.rtf
   ocr2any.exe -ocr2 -outputformat 2 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 3 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 6 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 7 C:\in.tif C:\out.xls
   ocr2any.exe -ocr2 -outputformat 8 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 9 C:\in.tif C:\out.doc
   ocr2any.exe -ocr2 -outputformat 13 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 14 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -outputformat 15 C:\in.tif C:\out.html
   ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8888 C:\in.tif C:\out.pdf
   ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8889 C:\in.tif C:\out.txt
   ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8890 C:\in.tif C:\out.html
   ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8891 C:\in.tif C:\out.csv

Process image files with Deskew, Despeckle, Noise Removal and Black Border Removal options:
   ocr2any.exe -imageopt C:\in.tif C:\out.tif
   ocr2any.exe -imageopt -rotate 45 C:\in.png C:\out.tif
   ocr2any.exe -imageopt -rotate 90 C:\in.png C:\out.tif
   ocr2any.exe -imageopt -threshold 0 C:\in.tif C:\out.bmp
   ocr2any.exe -threshold 240 C:\in.tif C:\out.bmp
   ocr2any.exe -dither 0 C:\in.bmp C:\out.png
   ocr2any.exe -dither 7 C:\in.bmp C:\out.png
   ocr2any.exe -imageopt -resizewidth 800 -resizeheight 600 C:\in.gif C:\out.tga
   ocr2any.exe -imageopt -flip C:\in.png C:\out.gif
   ocr2any.exe -imageopt -mirror C:\in.tif C:\out.pcx
   ocr2any.exe -imageopt C:\in.bmp C:\out.tif

You can use the "for" command to process all PDF files in a folder and its subfolders. The examples below are written for the command prompt; in a batch file, replace %F with %%F.

The following command line will OCR all PDF files in the D:\temp\ folder to text files:
   for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr -lang deu "%F" "%~dpnF.txt"

The following command line will OCR all PDF files in the D:\temp\ folder and its subfolders to text files:
   for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr "%F" "%~dpnF.txt"

The following command line will OCR all PDF files in the D:\temp\ folder and write the text files to the C:\test folder:
   for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr "%F" "C:\test\%~nF.txt"

The following command lines use the Enhanced OCR options:
   for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang deu "%F" "%~dpnF.txt"
   for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang eng "%F" "%~dpnF.doc"
   for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 "%F" "%~dpnF.doc"
   for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 -ocr2autorotate "%F" "%~dpnF.xls"
   for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr2 "%F" "%~dpnF.rtf"
   for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 "%F" "C:\test\%~nF.html"

Wildcards can also be passed directly to ocr2any.exe:
   ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.html
   ocr2any.exe -ocr2 -ocr2excelmode 0 D:\temp\*.pdf D:\temp\*.xls
   ocr2any.exe -ocr2 D:\temp\*.png D:\temp\*.rtf
   ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.csv
   ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc
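
OCR over a large folder tree can take a long time, so it can help to make the batch resumable. The following is a minimal Python sketch under stated assumptions: ocr2any.exe is on the PATH, the -ocr option is taken from the examples above, output files are written beside the source PDFs (as with "%~dpnF.txt"), and pending_jobs and run_jobs are hypothetical helper names, not part of the VeryPDF software. PDFs whose output file already exists are skipped.

```python
from pathlib import Path
import subprocess


def pending_jobs(input_root, out_ext=".txt"):
    """List (pdf, target) pairs under input_root whose target does not exist yet."""
    jobs = []
    for pdf in sorted(Path(input_root).rglob("*.pdf")):
        target = pdf.with_suffix(out_ext)  # same folder and name, new extension
        if not target.exists():
            jobs.append((pdf, target))
    return jobs


def run_jobs(jobs, exe="ocr2any.exe", options=("-ocr",)):
    """Run the converter once per pending PDF (exe and options are assumptions)."""
    for pdf, target in jobs:
        # check=False: a failure on one file does not stop the rest of the batch.
        subprocess.run([exe, *options, str(pdf), str(target)], check=False)
```

Running run_jobs(pending_jobs(r"D:\temp")) again after an interruption only processes the remaining files. An output file left incomplete by an interrupted run would still count as done, so it should be deleted before restarting.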

We recommend that you download the command line software from our website and try it. The command line version gives you more options and more flexible control over the conversion.

VeryPDF
