I have a question that applies to both packages: can only a single file be added at a time?
I'm looking for something that can process folders and subfolders of PDFs and export XML files with the coordinates of the OCR-layer text. The HTML output does have coordinates, so it would be close, but without batch processing the package doesn't offer the functionality I require. Have I missed something obvious? Can folders and subfolders be added, or is there a hot folder?
I tried previously OCRed PDFs in OCR to Any Converter, as it appears to have a hot folder capability. I see that it has to re-OCR the page rather than identify the previously OCRed text (and your software is an improvement on the previous OCR!), but there is no XML output. The closest is HTML, but the output isn't really HTML, just text with no HTML markup.
I look forward to hearing from you.
Customer
-----------------------
Thanks for your message. If you want to extract text and its positions from the OCRed text layer of a PDF file, you can download the "PDF to HTML Converter Command Line" software from the following web page and try it:
https://www.verypdf.com/app/pdf-to-html-converter/try-and-buy.html#cmd
After you download it and unzip it to a folder, you can run the following command lines to convert one PDF file to one HTML file with X/Y positions:
pdf2html.exe C:\in.pdf C:\out.htm
pdf2html.exe -r 150 C:\in.pdf C:\out.htm
pdf2html.exe -imgformat 1 C:\in.pdf C:\out.htm
pdf2html.exe -noimg C:\in.pdf C:\out.htm
pdf2html.exe -onehtm C:\in.pdf C:\out.htm
pdf2html.exe -oneword C:\in.pdf C:\out.htm
pdf2html.exe -homeurl "http://www.verypdf.com" C:\in.pdf C:\out.htm
pdf2html.exe -yoffset 20 C:\in.pdf C:\out.htm
pdf2html.exe -notextinbody -notextinmeta C:\in.pdf C:\out.htm
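Since the request was for XML with coordinates, the positioned HTML could in principle be post-processed into XML. The following Python sketch is only an illustration under a strong assumption: that the HTML places text in elements with inline styles of the form "left:...px; top:...px". This message does not confirm that markup, so please inspect a real output file first. The XML element name "page" and the "text" element with "x" and "y" attributes are hypothetical names chosen for the example.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# ASSUMPTION: text sits in elements with inline "left:...px; top:...px" styles.
# The lookbehind avoids matching properties such as "margin-left".
_POS = re.compile(r"(?<![\w-])(left|top)\s*:\s*(-?\d+(?:\.\d+)?)")


class PositionedText(HTMLParser):
    """Collect (x, y, text) for text inside elements that carry left/top styles."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._stack = []  # open elements as (tag, (x, y) or None)

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        found = dict(_POS.findall(style))
        has_both = {"left", "top"} <= found.keys()
        pos = (found["left"], found["top"]) if has_both else None
        self._stack.append((tag, pos))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags such as <br>.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        # Use the nearest enclosing element that has a position.
        pos = next((p for _, p in reversed(self._stack) if p), None)
        if text and pos:
            self.items.append((pos[0], pos[1], text))


def html_to_xml(html_text):
    """Return an XML string with one <text x=".." y=".."> element per text run."""
    parser = PositionedText()
    parser.feed(html_text)
    parser.close()
    page = ET.Element("page")  # hypothetical element name
    for x, y, text in parser.items:
        node = ET.SubElement(page, "text", x=x, y=y)
        node.text = text
    return ET.tostring(page, encoding="unicode")
```

Text that has no positioned ancestor is skipped, so page chrome without coordinates does not end up in the XML.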
You can use a command line loop to process all PDF files in a folder and its subfolders. Here's an example command using a for loop in Windows:
for /r "C:\input_folder" %%f in (*.pdf) do pdf2html.exe -onehtm "%%f" "C:\output_folder\%%~nf.htm"
This command will process all PDF files in the "C:\input_folder" directory and its subdirectories. For each PDF file found, it will execute the pdf2html.exe command with the appropriate input and output filenames.
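Make sure to replace "C:\input_folder" with the path to your input folder containing the PDF files, and "C:\output_folder" with the path to your desired output folder for the HTML files. Note that the output name is built from %%~nf (the file name without its extension), so PDF files that share a name in different subfolders will write to the same HTML file in "C:\output_folder".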
If you're running this command directly from the command prompt instead of a batch file, you need to replace %%f with %f.
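If the subfolder layout needs to be preserved in the output folder, a small script can build the command lines instead of the for loop. Below is a minimal Python sketch, not part of the VeryPDF software: the function name build_commands and the mirrored output layout are choices made for this example, and only the pdf2html.exe name and the -onehtm option come from the examples above. It only builds the commands and does not run anything.

```python
import tempfile
from pathlib import Path


def build_commands(input_root, output_root, exe="pdf2html.exe", options=("-onehtm",)):
    """Return one command (as an argument list) per PDF found under input_root."""
    input_root = Path(input_root)
    output_root = Path(output_root)
    commands = []
    for pdf in sorted(input_root.rglob("*")):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        # Keep the subfolder part of the path so equal file names do not collide.
        relative = pdf.relative_to(input_root)
        target = (output_root / relative).with_suffix(".htm")
        commands.append([exe, *options, str(pdf), str(target)])
    return commands


if __name__ == "__main__":
    # Dry run on a throwaway folder tree; nothing is converted here.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "input_folder"
        (root / "a").mkdir(parents=True)
        (root / "b").mkdir()
        (root / "a" / "report.pdf").touch()
        (root / "b" / "report.pdf").touch()
        for cmd in build_commands(root, Path(tmp) / "output_folder"):
            print(" ".join(cmd))
```

To run the conversions for real, each argument list could be passed to subprocess.run. It is not stated here whether the converter creates missing output folders, so creating each target's parent folder first (Path.mkdir with parents=True, exist_ok=True) would be a cautious choice.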
✅ "VeryPDF PDF to Any Converter Command Line" can be downloaded from the following web page:
https://www.verypdf.com/app/pdf-to-any-converter/try-and-buy.html#buy-cmd
Here are PDF to Any Converter command lines that convert a single PDF file at a time:
pdf2any.exe -$ XXXXXXXXXXXXXXXX "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.html"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe -htmlmode 1 "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe -f 1 -l 10 -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe -htmlimgnotext -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe -htmlimgtwo -htmlimgpos -htmlformat "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xml"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.docx"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.doc"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.xls"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.pptx"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.rtf"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.txt"
pdf2any.exe -layout "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.txt"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.tif"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.jpg"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.png"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.tga"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.gif"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.bmp"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.ico"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.ps"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.eps"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.pdf"
pdf2any.exe "D:\VeryPDF\in.pdf" "D:\VeryPDF\out.emf"
You can use the "for" command to process all PDF files in a folder and its subfolders.
Batch processing examples (as typed at the command prompt; in a batch file, write %%F instead of %F):
for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "out_%~nF.doc"
for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "C:\test\%~nF.txt"
for %F in (D:\temp\*.pdf) do pdf2any.exe -skip "%F" "C:\test\%~nF.rtf"
for %F in (D:\temp\*.pdf) do pdf2any.exe "%F" "out_%~nF.png"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "C:\test\%~nF.xls"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.html"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.ps"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.eps"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.tif"
for /r D:\temp %F in (*.pdf) do pdf2any.exe "%F" "%~dpnF.jpg"
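The loops above rely on the for-variable modifiers %~nF (file name without extension) and %~dpnF (drive, folder and file name without extension). As an illustration only, this Python sketch mimics what cmd.exe substitutes for them; the function name expand_modifiers is made up for the example.

```python
from pathlib import PureWindowsPath


def expand_modifiers(path):
    """Mimic what cmd.exe substitutes for %~nF, %~xF, %~dpF and %~dpnF."""
    p = PureWindowsPath(path)
    folder = str(p.parent)
    if not folder.endswith("\\"):
        folder += "\\"  # %~dpF always ends with a backslash
    return {
        "%~nF": p.stem,                    # file name without extension
        "%~xF": p.suffix,                  # extension, including the dot
        "%~dpF": folder,                   # drive and folder
        "%~dpnF": str(p.with_suffix("")),  # drive, folder and name, no extension
    }


if __name__ == "__main__":
    for key, value in expand_modifiers(r"D:\temp\sub\report.pdf").items():
        print(key, "->", value)
```

This shows why "%~dpnF.html" writes each output next to its source PDF, while "C:\test\%~nF.xls" puts every output into one folder, where files with the same name would overwrite each other.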
✅ "VeryPDF OCR to Any Converter Command Line" can be downloaded from the following web page:
https://www.verypdf.com/app/ocr-to-any-converter-cmd/try-and-buy.html#buy
Here are some sample command lines for the "VeryPDF OCR to Any Converter Command Line" software:
ocr2any.exe C:\in.pdf C:\out.txt
ocr2any.exe -firstpage 1 -lastpage 1 C:\in.pdf C:\out.txt
ocr2any.exe -ocr -res 300 C:\in.pdf C:\out.txt
ocr2any.exe -ownerpwd 123 -userpwd 456 C:\in.pdf C:\out.txt
ocr2any.exe -layout C:\in.pdf C:\out.txt
ocr2any.exe -layout2 C:\in.pdf C:\out.txt
ocr2any.exe -table C:\in.pdf C:\out.txt
ocr2any.exe -pdf2table C:\in.pdf C:\out.txt
ocr2any.exe -noc C:\in.pdf C:\out.txt
ocr2any.exe C:\in.tif C:\out.txt
ocr2any.exe C:\in.jpg C:\out.txt
ocr2any.exe C:\in.bmp C:\out.txt
ocr2any.exe C:\in.png C:\out.txt
ocr2any.exe -ocr -lang eng C:\in.pdf C:\out.txt
ocr2any.exe -ocr -lang eng+kor C:\in.pdf C:\out.txt
ocr2any.exe -ocr -lang eng+jpn C:\in.pdf C:\out.txt
ocr2any.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
ocr2any.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
ocr2any.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
ocr2any.exe -ocr -lang deu C:\in.pdf C:\out.txt
ocr2any.exe -lang deu C:\in.tif C:\out.txt
ocr2any.exe -text "PageText %PageNumber% of %PageCount%" C:\in.pdf C:\out.txt
ocr2any.exe -subject "subject" C:\in.pdf C:\out.pdf
ocr2any.exe -ownerpwdout 123 -keylen 2 -encryption 3900 C:\in.pdf C:\out.pdf
ocr2any.exe -subject "subject" -title "title" C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 0 C:\in.pdf C:\out.txt
ocr2any.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang ita -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang nld -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang spa -ocrmode 1 C:\in.pdf C:\out.pdf
ocr2any.exe -bitcount 24 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
ocr2any.exe -bitcount 8 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
ocr2any.exe -ocrmode 4 -ocr C:\in.tif C:\out.pdf
ocr2any.exe -ocrmode 3 -threshold 200 -ocr C:\in.tif C:\out.pdf
ocr2any.exe -ocrmode 4 -rotate 90 -ocr C:\in.tif C:\out.pdf
ocr2any.exe -ocr -lang jpn -ocrmode 4 -bitcount 24 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang chi_sim -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang chi_tra -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang chi_sim+eng -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
ocr2any.exe -ocr -lang chi_sim+deu -ocrmode 4 -threshold 240 -res 200 C:\in.pdf C:\out.pdf
ocr2any.exe -delblankpages D:\test.pdf D:\out.pdf
ocr2any.exe -delblankpages -linewidth 8 D:\test.pdf D:\out.pdf
ocr2any.exe -delblankpages -specklesize 20 D:\test.pdf D:\out.pdf
Use Enhanced OCR options:
ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.rtf
ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.doc
ocr2any.exe -ocr2 -ocr2aor C:\in.tif C:\out.xls
ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.rtf
ocr2any.exe -ocr2 -ocr2aor C:\in.pdf C:\out.doc
ocr2any.exe -ocr2 -ocr2excelmode 0 C:\in.pdf C:\out.xls
ocr2any.exe -ocr2 -ocr2excelmode 1 C:\in.pdf C:\out.xls
ocr2any.exe -ocr2 -ocr2excelmode 2 C:\in.pdf C:\out.xls
ocr2any.exe -ocr2 C:\in.pdf C:\out.doc
ocr2any.exe -ocr2 C:\in.pdf C:\out.rtf
ocr2any.exe -ocr2 C:\in.png C:\out.xls
ocr2any.exe -ocr2 C:\in.tif C:\out.csv
ocr2any.exe -ocr2 C:\in.bmp C:\out.txt
ocr2any.exe -ocr2 C:\in.gif C:\out.htm
ocr2any.exe -ocr2 C:\in.pdf C:\out.html
ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.html
ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc
ocr2any.exe -ocr2 C:\in.pdf C:\out.rtf
ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.doc
ocr2any.exe -ocr2 -lang deu C:\in.pdf C:\out.xls
ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.txt
ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.txt
ocr2any.exe -ocr2 -dumpcharpos C:\in.pdf C:\out.rtf
ocr2any.exe -ocr2 -dumpwordpos C:\in.pdf C:\out.rtf
ocr2any.exe -ocr2 C:\in.pdf C:\text.pdf
ocr2any.exe -ocr2 C:\in.tif C:\out.pdf
ocr2any.exe -ocr2 C:\in.png C:\out.pdf
ocr2any.exe -ocr2 C:\in.jpg C:\out.pdf
ocr2any.exe -ocr2 C:\in.tif C:\out.doc
ocr2any.exe -ocr2 C:\in.tif C:\out.rtf
ocr2any.exe -ocr2 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 C:\in.tif C:\out.xls
ocr2any.exe -ocr2 -ocr2autorotate C:\in.tif C:\out.pdf
ocr2any.exe -ocr2 -ocr2autorotate C:\in.tif C:\out.doc
ocr2any.exe -ocr2 -outputformat 1 C:\in.tif C:\out.rtf
ocr2any.exe -ocr2 -outputformat 2 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 3 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 6 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 7 C:\in.tif C:\out.xls
ocr2any.exe -ocr2 -outputformat 8 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 9 C:\in.tif C:\out.doc
ocr2any.exe -ocr2 -outputformat 13 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 14 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -outputformat 15 C:\in.tif C:\out.html
ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8888 C:\in.tif C:\out.pdf
ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8889 C:\in.tif C:\out.txt
ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8890 C:\in.tif C:\out.html
ocr2any.exe -ocr2 -dumpcharpos -dumpwordpos -outputformat 8891 C:\in.tif C:\out.csv
Process image files with Deskew, Despeckle and Noise Removal, and Black Border Removal options:
ocr2any.exe -imageopt C:\in.tif C:\out.tif
ocr2any.exe -imageopt -rotate 45 C:\in.png C:\out.tif
ocr2any.exe -imageopt -rotate 90 C:\in.png C:\out.tif
ocr2any.exe -imageopt -threshold 0 C:\in.tif C:\out.bmp
ocr2any.exe -threshold 240 C:\in.tif C:\out.bmp
ocr2any.exe -dither 0 C:\in.bmp C:\out.png
ocr2any.exe -dither 7 C:\in.bmp C:\out.png
ocr2any.exe -imageopt -resizewidth 800 -resizeheight 600 C:\in.gif C:\out.tga
ocr2any.exe -imageopt -flip C:\in.png C:\out.gif
ocr2any.exe -imageopt -mirror C:\in.tif C:\out.pcx
ocr2any.exe -imageopt C:\in.bmp C:\out.tif
You can use the "for" command to process all PDF files in a folder and its subfolders, for example:
The following command line will OCR all PDF files in the D:\temp\ folder to text files:
for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr -lang deu "%F" "%~dpnF.txt"
The following command line will OCR all PDF files in the D:\temp\ folder and its subdirectories to text files:
for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr "%F" "%~dpnF.txt"
The following command line will OCR all PDF files in the D:\temp\ folder and write the text files to the C:\test folder:
for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr "%F" "C:\test\%~nF.txt"
The following command lines use the Enhanced OCR options:
for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang deu "%F" "%~dpnF.txt"
for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 -lang eng "%F" "%~dpnF.doc"
for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 "%F" "%~dpnF.doc"
for %F in (D:\temp\*.tif) do ocr2any.exe -ocr2 -ocr2autorotate "%F" "%~dpnF.xls"
for /r D:\temp %F in (*.pdf) do ocr2any.exe -ocr2 "%F" "%~dpnF.rtf"
for %F in (D:\temp\*.pdf) do ocr2any.exe -ocr2 "%F" "C:\test\%~nF.html"
ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.html
ocr2any.exe -ocr2 -ocr2excelmode 0 D:\temp\*.pdf D:\temp\*.xls
ocr2any.exe -ocr2 D:\temp\*.png D:\temp\*.rtf
ocr2any.exe -ocr2 D:\temp\*.tif D:\temp\*.csv
ocr2any.exe -ocr2 D:\temp\*.pdf D:\temp\*.doc
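When many files are processed unattended, it helps to know which ones failed. The following Python sketch runs a list of command lines one by one and collects the failures. It assumes the converter reports failure through a non-zero exit code, which this message does not confirm, and the function name run_all is made up for the example.

```python
import subprocess
import sys


def run_all(commands):
    """Run each command in turn; return (command, return code) for every failure."""
    failures = []
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:  # for example, the executable was not found
            print(f"could not start {cmd[0]}: {exc}", file=sys.stderr)
            failures.append((cmd, None))
            continue
        # ASSUMPTION: a non-zero exit code means the conversion failed.
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
    return failures
```

Each command is an argument list, for example ["ocr2any.exe", "-ocr2", r"D:\temp\a.pdf", r"D:\temp\a.html"]. A return code of None in the result means the program could not be started at all.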
We recommend that you download the command line software from our website and try it. The command line version gives you more options and more flexible control.
VeryPDF