Skip to main content
Warning: You are using the test version of PyPI. This is a pre-production deployment of Warehouse. Changes made here affect the production instance of TestPyPI (testpypi.python.org).
Help us improve Python packaging - Donate today!

Extract flat data and load it as relational data

Project Description

W-Drive Extractor (wextractor)

The W-Drive Extractor (or the wextractor), named after the home of the shared list of contracts in the City of Pittsburgh, is an attempt to extract and standardize data from spreadsheets, .csvs, and other files for a relational destination.

Current status: alpha in development

Using the W-Drive-Extractor

Getting Started

W-Drive Extractor has some external dependencies, which can be installed via pip. It is recommended that you use a virtualenv to manage these.

The W-Drive-Extractor is an object-oriented application. In order to use it, you must first extract data from its original source using an Extractor’s extract method (the ExcelExtractor is currently the only supported example). Once the data is extracted, it can then be loaded back into some other datasource using a Loader’s load method. (only PostgresLoader has been implemented thus far). For a more detailed example on how this works, check out the sample usage at the bottom of this file.

TODO Features:

  • Add cli support
  • Change loader and extractor methods to use kwargs
  • Add better exception messaging for the load and extract methods

Developing W-Drive-Extractor

To get started with development, see the getting started section above.

Extractors

The Extractor base class is an interface for implementing data extraction from different sources. It requires taking in a target which can be a file or URL and two optional params. Headers is the title of the columns that will ultimately be extracted from your store, and dtypes is a list of native python types that each column should have.

Current Implementations:
  • Excel (.xls, .xlsx)
  • Comma-Separated Values (.csv)
TODO implementations:
  • Generic Text Files (.txt)
  • Postgres
  • MS Access

Loaders

The Loader base class is an interface for implementing data loading into new sources. It requires connection parameters (a python dictionary of connection params) and optional schema. The goal is for a single input source (spreadsheet, denormalized table, etc.) to be split into many tables.

Current Implementations:
  • Postgres [with relationships and simple deduplication!]
TODO Implementations:
  • Simple key/value cache (Memcached/Redis)
  • Other relational data stores

Tests

Tests are located in the test directory. To run the tests, run

PYTHONPATH=. nosetests test/

from inside the root directory. For more coverage information, run

PYTHONPATH=. nosetests test/ -vs --with-coverage --cover-package=wextractor --cover-erase

Sample Usage

Below is an example of extracting data from Excel and loading it into a local postgres database with defined relationships. NOTE: This implementation is still fragile and likely to be dependent on the fact that to_relations is the last table in the list below.

import datetime

from wextractor.extractors import ExcelExtractor
from wextractor.loaders import PostgresLoader

one_sheet = ExcelExtractor(
    'files/one sheet contract list.xlsx',
    dtypes=[
        unicode, unicode, unicode, int, unicode,
        unicode, datetime.datetime, int, unicode, unicode,
        unicode, unicode, unicode, unicode, unicode,
        unicode, unicode, unicode, unicode
    ]
)
data = one_sheet.extract()

loader = PostgresLoader(
    {'database': 'w_drive', 'user': 'bensmithgall', 'host': 'localhost'},
    [{
        'table_name': 'contract',
        'to_relations': [],
        'from_relations': ['company'],
        'pkey': None,
        'columns': (
            ('description', 'TEXT'),
            ('notes', 'TEXT'),
            ('contract_number', 'VARCHAR(255)'),
            ('county', 'VARCHAR(255)'),
            ('type_of_contract', 'VARCHAR(255)'),
            ('pa', 'VARCHAR(255)'),
            ('expiration', 'TIMESTAMP'),
            ('spec_number', 'VARCHAR(255)'),
            ('controller_number', 'INTEGER'),
            ('commcode', 'INTEGER')
        )
    },
    {
        'table_name': 'company_contact',
        'to_relations': [],
        'from_relations': ['company'],
        'pkey': None,
        'columns': (
            ('contact_name', 'VARCHAR(255)'),
            ('address_1', 'VARCHAR(255)'),
            ('address_2', 'VARCHAR(255)'),
            ('phone_number', 'VARCHAR(255)'),
            ('email', 'VARCHAR(255)'),
            ('fax_number', 'VARCHAR(255)'),
            ('fin', 'VARCHAR(255)'),
        )
    },
    {
        'table_name': 'company',
        'to_relations': ['company_contact', 'contract'],
        'from_relations': [],
        'pkey': None,
        'columns': (
            ('company', 'VARCHAR(255)'),
            ('bus_type', 'VARCHAR(255)'),
        )
    }]
)

loader.load(data, True)
Release History

Release History

This version
History Node

0.1.a1

History Node

0.1.dev1

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
wextractor-0.1.a1.tar.gz (15.6 kB) Copy SHA256 Checksum SHA256 Source Feb 20, 2015

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting