Intro
Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project.
This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.
Goose will try to extract the following information:
- Main text of an article
- Main image of article
- Any YouTube/Vimeo movies embedded in article
- Meta Description
- Meta tags
The Python version was rewritten by:
- Xavier Grangier
Licensing
If you find Goose useful or run into issues, please drop me a line. I'd love to hear how you're using it or which features should be improved.
Goose is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.
Setup
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
Take it for a spin
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
Configuration
There are two ways to pass configuration to Goose. The first is to pass a Configuration() object; the second is to pass a configuration dict.
For instance, if you want to change the user agent used by Goose, just pass:
>>> g = Goose({'browser_user_agent': 'Mozilla'})
Switching parsers
Goose can be used with either the lxml HTML parser or the lxml soup parser; by default the HTML parser is used. If you want to use the soup parser, pass it in the configuration dict:
>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
Goose is now language aware
For example, scraping a Spanish content page with correct meta language tags:
>>> from goose import Goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'
Some pages don't have correct meta language tags; you can force the language using the configuration:
>>> from goose import Goose
>>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '
Passing {'use_meta_language': False, 'target_language':'es'} forces the target language to Spanish.
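As noted in the Configuration section, settings can also be supplied via a Configuration() object instead of a dict. A minimal sketch, assuming the attribute names mirror the dict keys used so far (browser_user_agent, use_meta_language, target_language):

```python
# Sketch: building a Configuration() object instead of passing a dict.
# The attribute names below are assumed to mirror the dict keys shown
# in the examples above.
from goose import Goose
from goose.configuration import Configuration

config = Configuration()
config.browser_user_agent = 'Mozilla'
config.use_meta_language = False
config.target_language = 'es'

g = Goose(config)
```

This is convenient when the same settings are reused across several Goose instances.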
Video extraction
>>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
>>> from goose import Goose
>>> g = Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x...>]
>>> article.movies[0].src
'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
>>> article.movies[0].embed_code
''
>>> article.movies[0].embed_type
'iframe'
>>> article.movies[0].width
'476'
>>> article.movies[0].height
'357'
Goose in Chinese
Some users want to use Goose for Chinese content. Chinese word segmentation is much harder to deal with than that of Western languages, so Chinese needs a dedicated stop-words analyser that must be passed to the config object.
>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。
梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。
一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有
Goose in Arabic
In order to use Goose in Arabic you have to use the StopWordsArabic class.
>>> from goose import Goose
>>> from goose.text import StopWordsArabic
>>> url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل
Goose in Korean
In order to use Goose in Korean you have to use the StopWordsKorean class.
>>> from goose import Goose
>>> from goose.text import StopWordsKorean
>>> url='http://news.donga.com/3/all/20131023/58406128/1'
>>> g = Goose({'stopwords_class':StopWordsKorean})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
경기도 용인에 자리 잡은 민간 시험인증 전문기업 (주)디지털이엠씨(www.digitalemc.com).
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다.
그는 전기전자·무선통신·자동차 전장품 분야에
Known issues
There are some issues with Unicode URLs.
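One possible workaround (an assumption on my part, not an official fix) is to percent-encode the non-ASCII characters yourself before handing the URL to Goose:

```python
# Hypothetical workaround for non-ASCII URLs: percent-encode the URL
# before passing it to Goose. `quote` lives in different modules on
# Python 2 and Python 3, hence the try/except.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

url = u'http://example.com/caf\xe9'  # hypothetical URL with a non-ASCII char
safe_url = quote(url.encode('utf-8'), safe=':/')
print(safe_url)  # http://example.com/caf%C3%A9
```

The encoded URL can then be passed to g.extract(url=safe_url) as usual.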
Cookie handling: some websites require cookie handling. At the moment the only workaround is to use raw_html extraction. For instance:
>>> import urllib2
>>> import goose
>>> url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs t'
TODO
- HTML5 video tag extraction