How to Extract Text from PPTX, DOCX & PDF using Python

Different methods to extract text using Tika, Textract, and python-pptx

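1) Using Tika

from tika import parser

# from_file() takes the same call for every supported format
parsed = parser.from_file("path.pptx")  # PPTX
parsed = parser.from_file("path.docx")  # DOCX
parsed = parser.from_file("path.pdf")   # PDF

# parsed is a dict; it now holds the result for the last file parsed
print(parsed["metadata"])  # document metadata
print(parsed["content"])   # extracted text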
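The text in parsed["content"] usually contains many blank lines, and the value can be None when Tika finds no extractable text (for example, a scanned PDF). A minimal sketch of a cleanup step is shown below; the helper name clean_content is hypothetical and not part of the tika package.

```python
def clean_content(parsed):
    """Return the text from a Tika result dict with blank lines removed.

    `parsed` is assumed to be the dict returned by tika's parser.from_file().
    clean_content is an illustrative helper, not a tika API.
    """
    # "content" may be missing or None when no text was extracted.
    content = parsed.get("content") or ""
    # Strip whitespace around each line, then drop the lines left empty.
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)
```

With tika installed, it could be used as print(clean_content(parser.from_file("path.pdf"))).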

 

2) Using Textract

import textract

# process() returns bytes, so decode before printing
text = textract.process("path.pptx")
print(text.decode("utf-8"))

 

3) Using python-pptx

from pptx import Presentation

local_pptxFileList = ["path.pptx"]

for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            # only shapes with a text frame carry text
            if shape.has_text_frame:
                print(shape.text)
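If the text is needed for further processing rather than printing, the same loop can collect it per slide. The sketch below is one possible arrangement; the function names collect_slide_text and extract_pptx_text are hypothetical, and skipping whitespace-only shapes is an added assumption.

```python
def collect_slide_text(slides):
    """Return one list of strings per slide, in slide order.

    `slides` is any iterable of slide objects exposing `.shapes`,
    such as Presentation(path).slides from python-pptx.
    """
    result = []
    for slide in slides:
        # Keep only shapes that have a text frame and non-blank text.
        texts = [shape.text for shape in slide.shapes
                 if shape.has_text_frame and shape.text.strip()]
        result.append(texts)
    return result


def extract_pptx_text(path):
    """Open a .pptx file and return its text grouped by slide."""
    # Imported here so collect_slide_text() works without python-pptx installed.
    from pptx import Presentation
    return collect_slide_text(Presentation(path).slides)
```

Calling extract_pptx_text("path.pptx") would then return a list such as one sub-list of strings for each slide, which is easier to index or join than printed output.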


