Insight and perspectives

Showing posts from September, 2020

How to Extract Text from PPTX, DOC & PDF Using Python

- September 16, 2020

Different methods to extract text

1) Using Tika

    from tika import parser

    parsed = parser.from_file("path.pptx")  # PPTX
    parsed = parser.from_file("path.docx")  # DOCX
    parsed = parser.from_file("path.pdf")   # PDF
    print(parsed["metadata"])
    print(parsed["content"])

2) Using Textract

    import textract

    text = textract.process("path.pptx")  # returns bytes
    print(text.decode("utf-8"))

3) Using python-pptx

    from pptx import Presentation

    local_pptxFileList = ["path.pptx"]
    for i in local_pptxFileList:
        ppt = Presentation(i)
        for slide in ppt.slides:
            for shape in slide.shapes:
                # Only shapes with a text frame carry text
                if shape.has_text_frame:
                    print(shape.text)
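If none of the three libraries can be installed, a rough fallback for PPTX is to read the file directly, since a .pptx is a ZIP archive of XML parts. The sketch below is not one of the methods from this post. It uses only the standard library, and the function names `extract_pptx_text` and `slide_number` are made up for illustration. It assumes slide text lives in DrawingML `<a:t>` elements under `ppt/slides/`, and it orders slides by the number in the file name, which may differ from the order shown in the presentation.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs in a slide are stored in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

SLIDE_PATTERN = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_number(name):
    """Return the numeric part of a slide part name such as ppt/slides/slide3.xml."""
    return int(SLIDE_PATTERN.fullmatch(name).group(1))


def extract_pptx_text(path):
    """Return one string per slide, ordered by slide file number.

    `path` may be a file path or a binary file-like object.
    This is a hypothetical helper, not part of any library.
    """
    texts = []
    with zipfile.ZipFile(path) as archive:
        slide_names = [n for n in archive.namelist() if SLIDE_PATTERN.fullmatch(n)]
        for name in sorted(slide_names, key=slide_number):
            root = ET.fromstring(archive.read(name))
            # Collect every non-empty text run on the slide.
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            texts.append(" ".join(runs))
    return texts
```

Calling `extract_pptx_text("path.pptx")` returns a list of strings, one per slide. Text stored outside the slide parts, such as speaker notes, is not covered by this sketch, so the libraries above remain the better choice when they are available.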

How to Load Data (JSON and TSV Formats) into PostgresDB Using Python

- September 16, 2020

    # Import libraries
    import psycopg2
    import json
    import os
    import glob
    import pandas as pd
    import subprocess
    from sqlalchemy import create_engine

    # Establish the connection to PostgresDB
    conn = psycopg2.connect(user="xxxx",
                            password="xxxxx",
                            host="127.0.0.1",
                            port="5432",
                            database="DB")

    filenames = glob.glob('path/*.json')
    df = []
    for file in filenames:
        # Pick the reader that matches the file format:
        data = pd.read_json(file, lines=True)   # JSON (one object per line)
        # data = pd.read_csv(file, sep='\t')    # TSV
    ...
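The snippet above is cut off before the rows reach the database, so the following is only a guess at one possible load step, not the post's own code. It parses JSON Lines or TSV files with the standard library and prepares an INSERT statement with the `%s` placeholders that psycopg2 expects. The helper names `read_rows`, `build_insert` and `rows_to_params`, as well as the table name `items` in the usage note, are hypothetical.

```python
import csv
import json


def read_rows(path):
    """Parse a JSON Lines (.json) or tab-separated (.tsv) file into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        if path.endswith(".json"):
            # One JSON object per line, like pd.read_json(file, lines=True).
            return [json.loads(line) for line in handle if line.strip()]
        # Anything else is treated as TSV with a header row; values stay strings.
        return list(csv.DictReader(handle, delimiter="\t"))


def build_insert(table, columns):
    """Build an INSERT statement with psycopg2-style %s placeholders.

    The table and column names are interpolated as-is, so they must come
    from trusted code, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def rows_to_params(rows, columns):
    """Turn row dicts into tuples in column order; missing keys become None."""
    return [tuple(row.get(col) for col in columns) for row in rows]
```

With the `conn` object from the snippet, the pieces could be combined as `cur = conn.cursor()`, then `cur.executemany(build_insert("items", cols), rows_to_params(rows, cols))`, then `conn.commit()`. This assumes the target table already exists with matching columns.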