Posts

Showing posts from September, 2020

How to Extract Text from PPTX, DOC & PDF using Python

Different methods to extract text

1) Using Tika

from tika import parser

parsed = parser.from_file("path.pptx")  # PPTX
parsed = parser.from_file("path.docx")  # DOCX
parsed = parser.from_file("path.pdf")   # PDF
print(parsed["metadata"])
print(parsed["content"])

2) Using Textract

import textract

text = textract.process("path.pptx")  # returns bytes
print(text)

3) Using python-pptx

from pptx import Presentation

local_pptxFileList = ["path.pptx"]
for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                print(shape.text)
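For cases where none of the libraries above can be installed, a PPTX file can also be read with the standard library alone, since it is a ZIP archive of XML parts. The following is only a minimal sketch and not one of the post's methods: the function name `extract_pptx_text` is hypothetical, and it assumes slides are stored as `ppt/slides/slideN.xml` and orders them by file number, which usually but not always matches the presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace of the text-run elements (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def extract_pptx_text(path):
    """Return one string per slide, ordered by slide file number.

    Hypothetical helper: `path` may be a file path or a file-like object.
    """
    slide_name = re.compile(r"ppt/slides/slide(\d+)\.xml$")
    texts = []
    with zipfile.ZipFile(path) as archive:
        # Collect (number, name) pairs so slide10 sorts after slide2.
        slides = []
        for name in archive.namelist():
            match = slide_name.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            # Every visible text run lives in an <a:t> element.
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            texts.append(" ".join(runs))
    return texts
```

Calling `extract_pptx_text("path.pptx")` returns a list of strings, one per slide. Unlike python-pptx, this sketch does not distinguish shapes, tables, or notes; it simply joins every text run found on a slide.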

How to Load Data (JSON and TSV Formats) into PostgresDB using Python

# Import libraries
import psycopg2
import json
import os
import glob
import pandas as pd
import subprocess
from sqlalchemy import create_engine

# Establish the connection to PostgresDB
conn = psycopg2.connect(user="xxxx",
                        password="xxxxx",
                        host="127.0.0.1",
                        port="5432",
                        database="DB")

filenames = glob.glob('path/*.json')
df = []
for file in filenames:
    data = pd.read_json(file, lines=True)  # For JSON (one object per line)
    data = pd.read_csv(file, sep='\t')     # For TSV (use instead of the line above)
...
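The excerpt above is cut off before the load step, so the following is only a hedged, standard-library sketch of the general idea (parse JSON Lines or TSV into rows, then insert them with a parameterized statement), not the post's own code. It uses `sqlite3` as a stand-in so it runs without a database server; with psycopg2 the placeholder would be `%s` rather than `?`. The helper names `read_json_lines`, `read_tsv`, and `load_rows` are hypothetical.

```python
import csv
import io
import json
import sqlite3


def read_json_lines(text):
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def load_rows(conn, table, rows):
    """Insert dict rows into `table`; columns come from the first row's keys.

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source. Values are passed as bound parameters.
    """
    if not rows:
        return 0
    columns = list(rows[0].keys())
    col_list = ", ".join('"{}"'.format(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)  # psycopg2 uses %s here
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, col_list))
    conn.executemany(
        'INSERT INTO "{}" ({}) VALUES ({})'.format(table, col_list, placeholders),
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)
```

A possible usage: open `conn = sqlite3.connect(":memory:")`, parse a file's contents with `read_json_lines` or `read_tsv`, and pass the result to `load_rows(conn, "items", rows)`. Note that TSV values arrive as strings, while JSON values keep their types.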