Scraping the News

Author

marten walk

Testing the Google News API for extracting news articles

Google News

Imports

from gnews import GNews
import newspaper
import base64
import requests
import pandas as pd

Initialise the API with German settings and a 7-day period

gnews = GNews(
    language='de', # language of the news articles
    country='DE', # country of the news articles
    period='7d' # time period of the news articles (last 7 days)
)

Run a test with the keywords "Ukraine EU"

news = gnews.get_news(
    'Ukraine EU' # search query
)
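For orientation, the query above can be thought of as a keyword search against the Google News RSS feed, narrowed by language, country and time window. The sketch below builds such a feed URL using only the standard library. The endpoint, the `hl`/`gl`/`ceid` parameter names and the `when:7d` operator are assumptions about the feed's URL scheme, and `build_feed_url` is a hypothetical helper, not part of GNews.

```python
from urllib.parse import urlencode

# Assumed base URL of the Google News RSS search feed (not verified in this post).
RSS_SEARCH_URL = "https://news.google.com/rss/search"


def build_feed_url(query: str, language: str = "de", country: str = "DE", period: str = "7d") -> str:
    """Hypothetical helper: build an RSS search URL for a keyword query."""
    params = {
        "q": f"{query} when:{period}",    # keywords plus the assumed time-window operator
        "hl": language,                   # interface language
        "gl": country,                    # country edition
        "ceid": f"{country}:{language}",  # combined edition identifier
    }
    return f"{RSS_SEARCH_URL}?{urlencode(params)}"


feed_url = build_feed_url("Ukraine EU")
print(feed_url)
```

If these assumptions hold, this would explain why the search can be narrowed by region and language but otherwise accepts only keywords.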

Display the output using pandas (first 10 of 33 articles)

df_news = pd.DataFrame(news)
# Expand the publisher object into two columns (URL and title)
df_pub = df_news['publisher'].apply(pd.Series)
df_news[['pub_url', 'pub_title']] = df_pub
df_news.drop(columns=['publisher'], inplace=True)
df_news.head(10)
| | title | description | published date | url | pub_url | pub_title |
|---|---|---|---|---|---|---|
| 0 | Brüssel - EU-Staaten geben grünes Licht für ne... | Brüssel - EU-Staaten geben grünes Licht für ne... | Tue, 06 Aug 2024 14:28:01 GMT | https://news.google.com/rss/articles/CBMirgFBV... | https://www.deutschlandfunk.de | Deutschlandfunk |
| 1 | Ukrainekrieg: Tausende Menschen fliehen aus ru... | Ukrainekrieg: Tausende Menschen fliehen aus ru... | Mon, 05 Aug 2024 09:42:20 GMT | https://news.google.com/rss/articles/CBMicEFVX... | https://www.zeit.de | zeit.de |
| 2 | Ukraine-Liveticker: Medienberichte: Youtube in... | Ukraine-Liveticker: Medienberichte: Youtube in... | Thu, 01 Aug 2024 18:59:17 GMT | https://news.google.com/rss/articles/CBMixAFBV... | https://www.faz.net | FAZ - Frankfurter Allgemeine Zeitung |
| 3 | Getreideexport aus der Ukraine läuft auf Hocht... | Getreideexport aus der Ukraine läuft auf Hocht... | Sat, 03 Aug 2024 03:15:22 GMT | https://news.google.com/rss/articles/CBMingFBV... | https://www.agrarheute.com | agrarheute.com |
| 4 | Putins Krieg in der Ukraine: Das sind die Entw... | Putins Krieg in der Ukraine: Das sind die Entw... | Tue, 06 Aug 2024 15:42:00 GMT | https://news.google.com/rss/articles/CBMivAFBV... | https://www.suedkurier.de | SÜDKURIER Online |
| 5 | Ukraine: Hoffnung auf weitere EU-Agrarhilfen -... | Ukraine: Hoffnung auf weitere EU-Agrarhilfen ... | Wed, 07 Aug 2024 12:33:00 GMT | https://news.google.com/rss/articles/CBMinwFBV... | https://www.agrarzeitung.de | agrarzeitung online |
| 6 | Mützenich erwartet Debatte um Ukraine-Territor... | Mützenich erwartet Debatte um Ukraine-Territor... | Thu, 01 Aug 2024 06:06:00 GMT | https://news.google.com/rss/articles/CBMiowFBV... | https://www.politico.eu | POLITICO Europe |
| 7 | Energiekrise droht EU-Land – Ukraine nach Sank... | Energiekrise droht EU-Land – Ukraine nach Sank... | Sun, 04 Aug 2024 03:02:00 GMT | https://news.google.com/rss/articles/CBMitgFBV... | https://www.fr.de | fr.de |
| 8 | Ukraine - top agrar online | Ukraine top agrar online | Thu, 01 Aug 2024 07:00:00 GMT | https://news.google.com/rss/articles/CBMiY0FVX... | https://www.topagrar.com | top agrar online |
| 9 | Selenski will Gebiet nur mit Zustimmung des Vo... | Selenski will Gebiet nur mit Zustimmung des Vo... | Fri, 02 Aug 2024 06:28:00 GMT | https://news.google.com/rss/articles/CBMizAFBV... | https://www.handelsblatt.com | Handelsblatt |
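The rows above are not in chronological order, and the published date values are plain strings. A minimal sketch of sorting the results explicitly, assuming each record carries a `published date` string in the format shown above. The three sample records reuse dates from the output, but their titles are placeholders, and `newest_first` is a hypothetical helper.

```python
from email.utils import parsedate_to_datetime

# Placeholder records shaped like the rows above (titles are made up).
articles = [
    {"title": "article A", "published date": "Tue, 06 Aug 2024 14:28:01 GMT"},
    {"title": "article B", "published date": "Mon, 05 Aug 2024 09:42:20 GMT"},
    {"title": "article C", "published date": "Wed, 07 Aug 2024 12:33:00 GMT"},
]


def newest_first(items):
    """Hypothetical helper: sort records by their parsed publication date, newest first."""
    return sorted(
        items,
        key=lambda item: parsedate_to_datetime(item["published date"]),
        reverse=True,
    )


for article in newest_first(articles):
    published = parsedate_to_datetime(article["published date"])
    print(published.isoformat(), article["title"])
```

With pandas, the same parser could be applied to the `published date` column before calling `sort_values`.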
Note

The results look promising, but each URL is an encoded Google News link rather than the publisher's address. It has to be decoded before the full text of the article can be fetched.

Extract Full Text Test

Decode the URL using code taken from a comment on GitHub

import base64
import requests

def fetch_decoded_batch_execute(article_id):
    # Request payload for the batchexecute endpoint; the article ID is spliced into the escaped JSON.
    s = (
        '[[["Fbv4je","[\\"garturlreq\\",[[\\"en-US\\",\\"US\\",[\\"FINANCE_TOP_INDICES\\",\\"WEB_TEST_1_0_0\\"],'
        'null,null,1,1,\\"US:en\\",null,180,null,null,null,null,null,0,null,null,[1608992183,723341000]],'
        '\\"en-US\\",\\"US\\",1,[2,3,4,8],1,0,\\"655000234\\",0,0,null,0],\\"' +
        article_id +
        '\\"]",null,"generic"]]]'
    )

    headers = {
        "Content-Type": "application/x-www-form-urlencoded;charset=utf-8",
        "Referer": "https://news.google.com/"
    }

    response = requests.post(
        "https://news.google.com/_/DotsSplashUi/data/batchexecute?rpcids=Fbv4je",
        headers=headers,
        data={"f.req": s}
    )

    if response.status_code != 200:
        raise Exception("Failed to fetch data from Google.")

    text = response.text
    header = '[\\"garturlres\\",\\"'
    footer = '\\",'
    if header not in text:
        raise Exception(f"Header not found in response: {text}")
    start = text.split(header, 1)[1]
    if footer not in start:
        raise Exception("Footer not found in response.")
    url = start.split(footer, 1)[0]
    return url


def decode_google_news_url(source_url):
    url = requests.utils.urlparse(source_url)
    path = url.path.split("/")
    if url.hostname == "news.google.com" and len(path) > 1 and path[-2] == "articles":
        base64_str = path[-1]
        decoded_bytes = base64.urlsafe_b64decode(base64_str + '==')
        decoded_str = decoded_bytes.decode('latin1')

        prefix = b'\x08\x13\x22'.decode('latin1')
        if decoded_str.startswith(prefix):
            decoded_str = decoded_str[len(prefix):]

        suffix = b'\xd2\x01\x00'.decode('latin1')
        if decoded_str.endswith(suffix):
            decoded_str = decoded_str[:-len(suffix)]

        bytes_array = bytearray(decoded_str, 'latin1')
        length = bytes_array[0]
        if length >= 0x80:
            decoded_str = decoded_str[2:length+1]
        else:
            decoded_str = decoded_str[1:length+1]

        if decoded_str.startswith("AU_yqL"):
            return fetch_decoded_batch_execute(base64_str)

        return decoded_str
    else:
        return source_url
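To make the offline branch of `decode_google_news_url` easier to follow, the sketch below builds a synthetic article ID from a made-up URL and decodes it again. It assumes the layout implied by the function above: a three-byte prefix, one length byte, the URL itself, and an optional three-byte suffix. `encode_article_id` and `decode_article_id` are hypothetical helpers. They only cover URLs shorter than 128 bytes and do not cover the newer `AU_yqL` IDs, which need the network request.

```python
import base64

PREFIX = b"\x08\x13\x22"  # bytes stripped from the front by the function above
SUFFIX = b"\xd2\x01\x00"  # optional trailing bytes stripped by the function above


def encode_article_id(url: str) -> str:
    """Build a synthetic old-style article ID (illustration only)."""
    raw = url.encode("latin1")
    if len(raw) >= 0x80:
        raise ValueError("this sketch only handles URLs shorter than 128 bytes")
    payload = PREFIX + bytes([len(raw)]) + raw + SUFFIX
    # Article IDs in the feed carry no '=' padding, so strip it here as well.
    return base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")


def decode_article_id(article_id: str) -> str:
    """Reverse encode_article_id: restore padding, strip the framing, read the URL."""
    padded = article_id + "=" * (-len(article_id) % 4)
    data = base64.urlsafe_b64decode(padded)
    if data.startswith(PREFIX):
        data = data[len(PREFIX):]
    if data.endswith(SUFFIX):
        data = data[:-len(SUFFIX)]
    length = data[0]  # single length byte, valid for short URLs only
    return data[1:1 + length].decode("latin1")


sample_url = "https://example.com/a"  # made-up URL
sample_id = encode_article_id(sample_url)
print(sample_id)
print(decode_article_id(sample_id))
```

The synthetic ID starts with `CBMi`, the base64 form of the three prefix bytes, which matches the start of the article IDs in the output table above.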

Now extract the full text of the article

url = decode_google_news_url(news[3]['url'])
art = newspaper.article(url)

Compare the extracted information below with the original article

print(f"Titel: {art.title}")
print(f"Autor: {art.authors}")
print(f"Datum: {art.publish_date}")
Titel: Getreideexport aus der Ukraine läuft auf Hochtouren - auch in die EU
Autor: ['Norbert Lehmann']
Datum: 2024-08-03 05:15:22+02:00

The full text of the article (first 400 characters only)

print(art.text[:400]+"..")
In der Ukraine wird gerade die dritte Getreideernte seit Beginn des russischen Überfalls eingebracht. Zu Beginn des neuen Wirtschaftsjahres 2024/25 fließen die Ausfuhren an Weizen, Gerste und Mais deutlich schneller ab als vor einem Jahr. Das liegt auch daran, dass der Frachtverkehr über das Schwarze Meer weniger stark gestört ist als noch 2023.

Nach Angaben der nationalen Zollbehörde exportierte..
Note

The extraction recognizes the metadata of the given article (author, title, date) and returns the full text. The setup is ready for further use, e.g. running a sentiment analysis on the text.

Conclusion

  • Using the Google News API is technically possible.
  • The order in which articles are returned is not always clear.
  • Search is keyword-based only, but it can be restricted to specific regions and languages.
  • The extracted full text can be processed further (the legal situation still needs to be checked, but this should fall under fair use).