How to Extract Text from PPTX, DOCX & PDF using Python
Different methods to extract text: Tika, textract, and python-pptx
1) Using Tika
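# Needs a Java runtime: tika-python starts a local Tika server
from tika import parser

# Pick the line that matches your file type
parsed = parser.from_file("path.pptx")  # PPTX
parsed = parser.from_file("path.docx")  # DOCX
parsed = parser.from_file("path.pdf")   # PDF

# from_file() returns a dict with "metadata" and "content" keys
print(parsed["metadata"])
print(parsed["content"])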
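One thing to watch for: "content" can come back as None when Tika finds no extractable text, for example in a scanned, image-only PDF. Here is a minimal sketch of a guard for that case. The names content_or_empty and extract_with_tika are made up for this example and are not part of tika.

```python
def content_or_empty(parsed):
    """Return the extracted text from a Tika result dict, or "" if there is none."""
    # "content" may be missing or None when Tika finds no text in the file
    content = parsed.get("content") or ""
    return content.strip()


def extract_with_tika(path):
    """Hypothetical wrapper: parse one file with Tika and return clean text."""
    # Imported here so content_or_empty() can be used without tika installed
    from tika import parser

    return content_or_empty(parser.from_file(path))
```

Calling extract_with_tika("path.pdf") then always gives back a string, so later code can call string methods on it without checking for None first.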
2) Using Textract
import textract

# process() returns bytes, so decode before printing
text = textract.process("path.pptx")
print(text.decode("utf-8"))
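Since textract hands the result back as bytes, a small helper can decode it and tidy up the whitespace in one go. This is only a sketch: bytes_to_text is a hypothetical helper name, and UTF-8 is assumed as the encoding.

```python
def bytes_to_text(raw, encoding="utf-8"):
    """Decode textract's byte output and drop blank lines."""
    # errors="replace" avoids a crash on the odd undecodable byte
    text = raw.decode(encoding, errors="replace")
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Usage would look like bytes_to_text(textract.process("path.pptx")).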
3) Using python-pptx
from pptx import Presentation

local_pptxFileList = ["path.pptx"]
for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            # Only shapes with a text frame carry text
            if shape.has_text_frame:
                print(shape.text)
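Printing is fine for a quick check, but it is often handier to keep the text grouped by slide. The loop above could be sketched as a function like this. The names text_by_slide and extract_pptx are hypothetical, and the code reads shape.text_frame.text, which is python-pptx's attribute for the full text of a text frame.

```python
def text_by_slide(presentation):
    """Return one list of text strings per slide, in slide order."""
    result = []
    for slide in presentation.slides:
        texts = []
        for shape in slide.shapes:
            # Skip pictures, charts and other shapes that cannot hold text
            if not shape.has_text_frame:
                continue
            text = shape.text_frame.text.strip()
            if text:
                texts.append(text)
        result.append(texts)
    return result


def extract_pptx(path):
    """Hypothetical wrapper: open a .pptx file and collect its text."""
    # Imported here so text_by_slide() can be tried out without python-pptx
    from pptx import Presentation

    return text_by_slide(Presentation(path))
```

With this, extract_pptx("path.pptx")[0] would hold the text of the first slide. Slides without any text show up as empty lists, so the positions still line up with the slide order.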