Web Data Extraction Library Written in Python
Wextracto is a toolkit for command-line web data extraction.
$ pip install wextracto
Kicking the Tyres
$ echo -e "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps" > entry_points.txt
$ wex "http://www.ebay.com/robots.txt"
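The entry_points.txt file written above is an INI-style file: a [wex] group containing one entry named sitemaps whose value is a "module:attribute" reference. The following is a minimal sketch of how such a file can be read and how a reference of that form can be resolved, using only the Python standard library. It illustrates the notation only and is not Wextracto's own loading code; the helper names parse_entry_points, split_reference and resolve are hypothetical.

```python
import configparser
import importlib

# Same content as the entry_points.txt created by the echo command.
ENTRY_POINTS_TXT = "[wex]\nsitemaps=wex.sitemaps:urls_from_sitemaps\n"


def parse_entry_points(text):
    """Return {group: {name: reference}} for an entry_points.txt string."""
    # Only "=" separates name and value, so the ":" inside the
    # reference is kept as part of the value.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return {group: dict(parser[group]) for group in parser.sections()}


def split_reference(ref):
    """Split 'package.module:attr' into (module name, attribute path)."""
    module_name, _, attr_path = ref.partition(":")
    return module_name.strip(), attr_path.strip()


def resolve(ref):
    """Import the module part and walk the attribute path (hypothetical helper)."""
    module_name, attr_path = split_reference(ref)
    obj = importlib.import_module(module_name)
    if attr_path:
        for name in attr_path.split("."):
            obj = getattr(obj, name)
    return obj


if __name__ == "__main__":
    groups = parse_entry_points(ENTRY_POINTS_TXT)
    for group, entries in groups.items():
        for name, ref in entries.items():
            print(group, name, split_reference(ref))
```

With wextracto installed, resolve("wex.sitemaps:urls_from_sitemaps") would import wex.sitemaps and return the urls_from_sitemaps object named in the entry; the sketch itself does not import wex, so it runs without the package.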
The documentation can be found here:
- Add support for reading WARC response format
- Fix bug in handling of invalid numeric character references
- Allow utf-8 in HTTP headers (only applies to PY2)
- Fix bug in HTTP decode caused by magic bytes handling.
- Add magic_bytes to Response for more reliable wex.http:decode behaviour.
- Re-worked encoding for HTML to pre-parse
- Better proxy support
- Now we flatten labels and values.
- href and src become href_url and src_url.
- Some API changes and a switch to “tab-separated JSON”.
- Uploaded sdist to PyPI for “pip install wextracto” simplicity.
- Initial release as open source