How to Extract Text from PPTX, DOCX & PDF Using Python

Different methods to extract text using Python

1) Using Tika

from tika import parser

# The same call handles each format; use whichever path applies.
parsed = parser.from_file("path.pptx")  # PPTX
parsed = parser.from_file("path.docx")  # DOCX
parsed = parser.from_file("path.pdf")   # PDF

print(parsed["metadata"])  # document metadata
print(parsed["content"])   # extracted text
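Tika can return None for "content" when a file has no extractable text (a scanned PDF, for example), and the text it does return often carries runs of blank lines. Here is a minimal sketch of a cleanup step, assuming that behaviour. The helper names clean_tika_content and extract_text are made up for illustration and are not part of Tika.

```python
def clean_tika_content(parsed):
    """Return the text from a Tika result dict with blank lines removed.

    Falls back to an empty string when "content" is missing or None.
    """
    content = parsed.get("content") or ""
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_text(path):
    """Parse a file with Tika and return its cleaned text (hypothetical helper)."""
    # Imported here so clean_tika_content can be used without Tika installed
    from tika import parser
    return clean_tika_content(parser.from_file(path))
```

Calling extract_text("path.pdf") would then give a plain string that is safe to print or search, even when Tika found no text.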

2) Using Textract

import textract

# process() returns bytes, so decode before printing
text = textract.process("path.pptx")
print(text.decode("utf-8"))

3) Using python-pptx

from pptx import Presentation

local_pptxFileList = ["path.pptx"]

for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            # Skip pictures, charts, and other shapes that hold no text
            if shape.has_text_frame:
                print(shape.text)
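Printing each shape as it is found loses track of which slide the text came from. Here is a minimal sketch that groups the text by slide number instead. The function names collect_slide_text and collect_from_file are hypothetical, and the sketch only reads shapes that have a text frame, so text inside tables or grouped shapes is not covered.

```python
def collect_slide_text(presentation):
    """Map 1-based slide numbers to the list of non-empty texts on each slide."""
    result = {}
    for number, slide in enumerate(presentation.slides, start=1):
        texts = []
        for shape in slide.shapes:
            # Only shapes with a text frame carry text; skip empty ones too
            if shape.has_text_frame:
                text = shape.text_frame.text.strip()
                if text:
                    texts.append(text)
        result[number] = texts
    return result


def collect_from_file(path):
    """Open a .pptx file and collect its text per slide (hypothetical helper)."""
    # Imported here so collect_slide_text can be used without python-pptx installed
    from pptx import Presentation
    return collect_slide_text(Presentation(path))
```

With this, collect_from_file("path.pptx") returns a dictionary, so the text of a single slide can be looked up by its number.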
