How to Extract Text from PPTX, DOCX & PDF Using Python
Different methods to extract text

1) Using Tika

    from tika import parser

    # Use the call that matches your file type; the API is the same for all three.
    parsed = parser.from_file("path.pptx")  # PPTX
    parsed = parser.from_file("path.docx")  # DOCX
    parsed = parser.from_file("path.pdf")   # PDF

    print(parsed["metadata"])  # document metadata as a dict
    print(parsed["content"])   # extracted text

2) Using Textract

    import textract

    text = textract.process("path.pptx")  # returns bytes
    print(text.decode("utf-8"))

3) Using python-pptx

    from pptx import Presentation

    local_pptxFileList = ["path.pptx"]
    for i in local_pptxFileList:
        ppt = Presentation(i)
        for slide in ppt.slides:
            for shape in slide.shapes:
                # Only shapes with a text frame carry text.
                if shape.has_text_frame:
                    print(shape.text)
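If you need to handle mixed file types in one script, the methods above can be combined behind a single function. What follows is only a minimal sketch: the names METHOD_BY_SUFFIX, pick_method and extract_text are hypothetical, and the choice of python-pptx for .pptx and Tika for .docx and .pdf is an assumption made for illustration, not a recommendation from the libraries themselves. The library imports are done lazily so that only the package actually needed has to be installed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to one of the methods shown above.
METHOD_BY_SUFFIX = {
    ".pptx": "python-pptx",
    ".docx": "tika",
    ".pdf": "tika",
}


def pick_method(path):
    """Return the extractor name for a path; raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in METHOD_BY_SUFFIX:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return METHOD_BY_SUFFIX[suffix]


def extract_text(path):
    """Extract text from a .pptx, .docx or .pdf file as a single string."""
    method = pick_method(path)

    if method == "python-pptx":
        from pptx import Presentation  # imported only when needed

        ppt = Presentation(path)
        lines = [
            shape.text
            for slide in ppt.slides
            for shape in slide.shapes
            if shape.has_text_frame
        ]
        return "\n".join(lines)

    from tika import parser  # imported only when needed

    parsed = parser.from_file(path)
    # "content" can be None when no text was found in the file.
    return parsed["content"] or ""
```

Calling extract_text("path.pptx") would then return the slide text joined by newlines, while extract_text("path.pdf") would return whatever Tika extracts. An unsupported extension raises ValueError before any library is imported.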