Skip to main content
Warning: You are using the test version of PyPI. This is a pre-production deployment of Warehouse. Changes made here affect the production instance of TestPyPI (testpypi.python.org).
Help us improve Python packaging - Donate today!

Extract structured information from web pages

Project Description

Extract structured information from web pages.

For now, extruct only extracts HTML microdata from web pages.

Installation

pip install extruct

Usage

>>> from pprint import pprint
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pprint(data)
{'items': [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The house I found.',
                           'work': 'http://www.example.com/images/house.jpeg'},
            'type': 'http://n.whatwg.org/work'},
           {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                           'title': u'The mailbox.',
                           'work': 'http://www.example.com/images/mailbox.jpeg'},
            'type': 'http://n.whatwg.org/work'}]}

REST API service

extruct also ships with a REST API service to test its output from URLs.

Dependencies

Usage

python -m extruct.service

launches an HTTP server listening on port 10005.

Methods supported

/extruct/<URL>
method = GET


/extruct/batch
method = POST
params:
    urls - a list of URLs separted by newlines
    urlsfile - a file with one URL per line

E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412

will output something like this:

{
   "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
   "status":"ok",
   "microdata":{
      "items":[
         {
            "type":"http://schema.org/Product",
            "properties":{
               "name":"Susket",
               "color":[
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412",
                  "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412"
               ],
               "brand":"http://www.sarenza.com/i-love-shoes",
               "aggregateRating":{
                  "type":"http://schema.org/AggregateRating",
                  "properties":{
                     "description":"Soyez le premier \u00e0 donner votre avis"
                  }
               },
               "offers":{
                  "type":"http://schema.org/AggregateOffer",
                  "properties":{
                     "lowPrice":"59,00 \u20ac",
                     "price":"A partir de\r\n                  59,00 \u20ac",
                     "priceCurrency":"EUR",
                     "highPrice":"59,00 \u20ac",
                     "availability":"http://schema.org/InStock"
                  }
               },
               "size":[
                  "36 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "37 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "38 - Epuis\u00e9 - \u00catre alert\u00e9",
                  "39 - Derni\u00e8re paire !",
                  "40",
                  "41",
                  "42 - Derni\u00e8re paire !"
               ],
               "image":[
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045",
                  "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045",
                  "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045",
                  "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747"
               ],
               "description":""
            }
         }
      ]
   }
}

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

Versioning

Use bumpversion to conveniently change project version:

bumpversion patch  # 0.0.0 -> 0.0.1
bumpversion minor  # 0.0.1 -> 0.1.0
bumpversion major  # 0.1.0 -> 1.0.0
Release History

Release History

This version
History Node

0.0.0

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
extruct-0.0.0.tar.gz (5.4 kB) Copy SHA256 Checksum SHA256 Source Oct 26, 2015

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting