Dark Mode

Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

vinta/haul

Repository files navigation

Haul

Find thumbnails and original images from URL or HTML file.

Demo

Hauler on Heroku

Installation

on Ubuntu

$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul

on Mac OS X

$ pip install haul

Fail to install haul? It is probably caused by lxml.

Usage

Find images from img src, a href and even background-image:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)

print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""

Find original (or bigger size) images with extend=True:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)

print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
# bigger size, extended from above urls
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""

Advanced Usage

Custom finder / extender pipeline:

from haul import Haul
from haul.compat import str


def img_data_src_finder(pipeline_index,
soup,
finder_image_urls=[],
*args, **kwargs):
"""
Find image URL in 's data-src attribute
"""

now_finder_image_urls = []

for img in soup.find_all('img'):
src = img.get('data-src', None)
if src:
src = str(src)
now_finder_image_urls.append(src)

output = {}
output['finder_image_urls'] = finder_image_urls + now_finder_image_urls

return output

MY_FINDER_PIPELINE = (
'haul.finders.pipeline.html.img_src_finder',
'haul.finders.pipeline.css.background_image_finder',
img_data_src_finder,
)

GOOGLE_SITES_EXTENDER_PIEPLINE = (
'haul.extenders.pipeline.google.blogspot_s1600_extender',
'haul.extenders.pipeline.google.ggpht_s1600_extender',
'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)

url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
finder_pipeline=MY_FINDER_PIPELINE,
extender_pipeline=GOOGLE_SITES_EXTENDER_PIEPLINE)
result = h.find_images(url, extend=True)

Run Tests

$ python setup.py test

About

An Extensible Image Crawler

Resources

Readme

License

MIT license

Stars

Watchers

Forks

Packages

Contributors

Languages