Python Libraries for Extracting the Main Content of News Web Pages

GeneralNewsExtractor

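GeneralNewsExtractor (GNE) is a general-purpose module for extracting the main content of news pages. Given the HTML of a news page, it returns the body text, title, author, publication time, the URLs of images in the body, and the source code of the tag that contains the body. GNE performs very well on hundreds of Chinese news sites, including Toutiao, NetEase News, Gamersky, Guancha, ifeng.com, Tencent News, ReadHub, and Sina News, with accuracy approaching 100%.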

Project URL

https://github.com/GeneralNewsExtractor/GeneralNewsExtractor.git

Installation

# Install with pip
pip install --upgrade gne

Usage

from gne import GeneralNewsExtractor

html = '''HTML source of the rendered page'''

extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)

{"title": "xxxx", "publish_time": "2019-09-10 11:12:13", "author": "yyy", "content": "zzzz", "images": ["/xxx.jpg", "/yyy.png"]}

For more usage details, see the GNE documentation.
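The sample output above lists image paths such as `/xxx.jpg` that are relative to the site root. A minimal sketch of turning them into absolute URLs with the standard library follows. The helper name `absolutize_images` and the page URL `https://news.example.com/a/1.html` are assumptions made for illustration and are not part of GNE.

```python
from urllib.parse import urljoin


def absolutize_images(result: dict, page_url: str) -> dict:
    """Return a copy of a GNE-style result with image paths resolved against page_url."""
    fixed = dict(result)  # shallow copy so the original result is left untouched
    fixed["images"] = [urljoin(page_url, src) for src in result.get("images", [])]
    return fixed


# Same shape as the sample output shown above.
sample = {
    "title": "xxxx",
    "publish_time": "2019-09-10 11:12:13",
    "author": "yyy",
    "content": "zzzz",
    "images": ["/xxx.jpg", "/yyy.png"],
}

# Hypothetical URL of the page the HTML was fetched from.
resolved = absolutize_images(sample, "https://news.example.com/a/1.html")
print(resolved["images"])
# ['https://news.example.com/xxx.jpg', 'https://news.example.com/yyy.png']
```

Because `urljoin` leaves already-absolute URLs unchanged, the same helper can be applied whether a site emits relative or absolute image paths.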

Newspaper4k

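Newspaper4k is a fork of Codelucas's well-known Newspaper3k, which has not been updated since September 2020. The fork's original goal was to keep the project alive while adding new features and fixing bugs. Since version 0.9.3, a number of new features and improvements have made Newspaper4k a solid tool for scraping and curating articles. To ease migration, all classes and methods of the original project are retained, with the new functionality built on top of them. API calls from the original project still work as expected, so users already familiar with newspaper3k should feel at home with Newspaper4k.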

Project URL

https://github.com/AndyTheFactory/newspaper4k.git

Installation

# Requires Python 3.8+
pip install newspaper4k lxml_html_clean

Usage

import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')
# HTML that has already been fetched can also be passed in
# article = newspaper.article(url=url, input_html=html)

print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
# https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []
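The `article.authors` output above contains page residue ('Minute Read', 'Published', 'Am Edt', 'Sun October') next to the real byline. One possible post-processing step is sketched below. It is a heuristic written for this example, not a Newspaper4k feature, and both the `NOISE_TOKENS` vocabulary and the `clean_authors` helper are hypothetical names that would need tuning for other sites.

```python
import re

# Hypothetical noise vocabulary: words that show up in bylines as page residue.
NOISE_TOKENS = {
    "minute", "read", "published", "updated",
    "am", "pm", "edt", "est", "utc", "gmt",
    "mon", "tue", "wed", "thu", "fri", "sat", "sun",
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def clean_authors(authors):
    """Keep only entries that contain at least one token outside NOISE_TOKENS."""
    kept = []
    for name in authors:
        # \w+ also matches non-ASCII letters, so names in other scripts survive.
        tokens = re.findall(r"\w+", name.lower())
        if tokens and not all(token in NOISE_TOKENS for token in tokens):
            kept.append(name)
    return kept


raw = ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
print(clean_authors(raw))
# ['Hannah Brewitt']
```

An entry is dropped only when every one of its tokens is noise, so a real name that happens to contain a word like "May" is still kept.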

Enabling NLP

# Install the Natural Language Toolkit (NLTK)
pip install nltk

import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

# Run NLP to populate the keywords and summary
article.nlp()

print(article.keywords)
# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']

print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime

Parsing Chinese web pages

# jieba, a Chinese word segmentation library
pip install jieba

import newspaper

article = newspaper.article('https://news.sina.com.cn/w/2024-10-14/doc-incspmqw8123928.shtml', language='zh')

print(article.text)
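The call above hard-codes `language='zh'`. When a crawler handles both Chinese and English pages, the language code could be chosen from a text sample such as the page title. The sketch below is one simple way to do that with the standard library. The helper `guess_language`, its 0.3 threshold, and the two-language assumption are all illustrative choices and are not part of Newspaper4k.

```python
def guess_language(text: str, threshold: float = 0.3) -> str:
    """Return 'zh' if CJK ideographs make up at least `threshold` of the letters, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"  # nothing to judge by, fall back to English
    # U+4E00..U+9FFF is the CJK Unified Ideographs block.
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(letters) >= threshold else "en"


print(guess_language("美国总统选举最新消息"))      # zh
print(guess_language("NFL Week 8: how to watch"))  # en
```

The returned code can then be passed as the `language` argument shown in the snippet above.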

Copyright notice:
Author: lrbmike
Link: https://blog.liurb.org/2024/10/16/python_gne/
Source: 大卷学长
Copyright belongs to the author. Do not reproduce without permission.
