Extract the main article content (and optionally comments) from a web page
Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.
For more information on our approach check out:
The build requires numpy, lxml, and a recent version of Cython, so first make sure they are installed, then install Dragnet:
pip install numpy
pip install --upgrade cython
pip install lxml
pip install dragnet
Depending on your use case, we provide two separate models: one extracts just the main article content, the other extracts the content together with any user-generated comments. Each model implements an analyze method that takes an HTML string and returns the content string.
import requests
from dragnet import content_extractor, content_comments_extractor

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = content_extractor.analyze(r.content)

# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
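Because both models share the same analyze interface, either one can be passed to the same helper. The following is a minimal sketch of running an extractor over several pages. It is not part of Dragnet: `extract_all` and `TagStrippingStub` are hypothetical names introduced here for illustration, the stub only mimics the interface so the example runs without Dragnet installed, and catching every exception is an assumption, since the exceptions analyze may raise are not documented above.

```python
import re


class TagStrippingStub:
    """Hypothetical stand-in for a Dragnet model.

    It only mimics the analyze(html) -> str interface; it does no real
    content extraction.
    """

    def analyze(self, html):
        if not html:
            raise ValueError("empty document")
        # Crude tag removal, for demonstration only.
        return re.sub(r"<[^>]+>", "", html).strip()


def extract_all(extractor, pages):
    """Run extractor.analyze over a {name: html} mapping.

    Returns a {name: content} dict. A page whose extraction raises is
    recorded as None so one bad page does not abort the whole batch.
    """
    results = {}
    for name, html in pages.items():
        try:
            results[name] = extractor.analyze(html)
        except Exception:  # assumption: treat any failure as "no content"
            results[name] = None
    return results


pages = {"good": "<p>hi</p>", "empty": ""}
print(extract_all(TagStrippingStub(), pages))
```

With Dragnet installed, content_extractor or content_comments_extractor could be passed in place of the stub.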