Web Scraping Techniques for Fed Communications

This post explains web scraping techniques for Fed communications using Python libraries like Requests, BeautifulSoup, and pdfminer. It covers fetching JSON data, extracting text from HTML and PDF formats, and emphasizes ethical scraping practices and handling different file formats.

Xian Wed 07 August 2024

Introduction to Web Scraping

Web scraping is a technique used to extract data from websites. It involves fetching the web page content and parsing it to retrieve the required information. This is particularly useful for analyzing Federal Reserve communications, which are often published online in various formats such as HTML and PDF.

Essential Web Scraping Packages

  1. Scrapy

    Scrapy is a powerful and versatile web crawling framework designed for large-scale web scraping projects. It allows for defining navigation on websites and efficiently extracting the required data. Scrapy handles requests, responses, data extraction, and item pipelines seamlessly, making it ideal for complex and scalable scraping tasks.

  2. Selenium

    Selenium is a robust tool for automating web browsers. It's particularly useful for scraping dynamic websites where content is generated by JavaScript. Selenium can simulate user interactions like clicking buttons, filling forms, and scrolling, making it a versatile choice for scraping interactive pages.

  3. BeautifulSoup

    BeautifulSoup is a popular library for parsing HTML and XML documents. It creates a parse tree from page source code and provides Pythonic ways for navigating and modifying the parse tree. BeautifulSoup is known for its simplicity and ease of use, making it great for beginners.

  4. lxml

    lxml is a high-performance library for parsing and manipulating XML and HTML documents. Built on top of the powerful libxml2 and libxslt libraries, lxml is highly efficient and flexible, making it well-suited for handling large and complex documents with ease.

  5. Requests

    Requests is a simple and elegant HTTP library for Python, designed to make HTTP requests straightforward and human-friendly. It abstracts the complexities of making requests behind a beautiful, simple API, allowing for sending HTTP requests with minimal effort. Requests is highly reliable and widely used for accessing web pages and APIs.

Because the Federal Reserve website is fairly straightforward, this project only needs Requests for fetching pages and BeautifulSoup for parsing them; pandas is used to organize the results, and pdfminer handles the PDF-only documents.

Setting Up the Environment

Ensure the necessary libraries are installed. Use pip to install them:

pip install requests beautifulsoup4 pandas pdfminer.six

Code Explanation and Implementation

The following code demonstrates how to scrape and process Federal Reserve communications. It fetches data from JSON endpoints, processes it into a DataFrame, and extracts text from HTML pages.

Fetch/XHR and API Usage

Understanding fetch/XHR (XMLHttpRequest) is crucial for web scraping, especially for websites that load content dynamically.

Fetch/XHR: Fetch and XMLHttpRequest are APIs used by websites to asynchronously fetch data from servers. This means that parts of the web page can be updated without reloading the entire page. When inspecting a web page's network activity (using browser developer tools), these requests can be seen under the 'Network' tab, often labeled as XHR or Fetch. These requests typically fetch JSON or other structured data formats, which can then be used to update the webpage content dynamically.

To scrape such dynamic content:

  1. Inspect the Network Activity: Open the web page in a browser, right-click, and select "Inspect" to open the developer tools. Navigate to the "Network" tab and reload the page. Look for requests labeled as Fetch/XHR.
  2. Locate the API Endpoints: Find the URLs being accessed by these requests. These URLs often return JSON data, which can be easier to handle and parse in Python.
  3. Use the API Directly: Instead of scraping the rendered HTML, directly access these endpoints using the requests library to fetch the data in JSON format and process it as needed.

For example, the Federal Reserve's website at https://www.federalreserve.gov/monetarypolicy/materials/ has an API that provides FOMC materials in JSON format. By inspecting the network activity, API endpoints such as https://www.federalreserve.gov/monetarypolicy/materials/assets/final-recent.json can be located.
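
A quick way to confirm what such an endpoint returns is to request it directly and inspect the top-level keys (a minimal sketch; the 'mtgitems' key is the one consumed by the code in the next section):

import requests

# Probe the JSON endpoint found in the browser's Network tab
url = "https://www.federalreserve.gov/monetarypolicy/materials/assets/final-recent.json"
data = requests.get(url).json()

print(list(data.keys()))      # Top-level keys in the JSON payload
print(len(data["mtgitems"]))  # Number of meeting items available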

Collecting FOMC Statements and Minutes

First Step: Get Page URL

import requests
import pandas as pd

# Function to fetch and process the data from the given URL
def fetch_and_process(url):
    response = requests.get(url, verify=False)  # GET request to the URL (verify=False skips SSL certificate checks)
    data = response.json()  # Parse the JSON response

    # Normalize the JSON data into a DataFrame and filter for statements and minutes
    df = pd.json_normalize(data['mtgitems'])
    df = df[df['type'].isin(['St', 'Mn'])]

    files_data = []
    # Iterate over the rows of the DataFrame
    for i, row in df.iterrows():
        if 'files' in row and isinstance(row['files'], list):
            for file in row['files']:
                if file.get('name') in ('HTML', 'Minutes'):
                    files_data.append({
                        'date': row['d'], 'release': row['dt'], 'type': row['type'],
                        'url_1': row['url'], 'url_2': file['url']
                    })
        else:
            files_data.append({
                'date': row['d'], 'release': row['dt'], 'type': row['type'],
                'url_1': row['url'], 'url_2': None
            })

    return pd.DataFrame(files_data)  # Return the processed DataFrame
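
Note that verify=False disables SSL certificate verification, so urllib3 emits an InsecureRequestWarning on every request. If those warnings are unwanted, they can be silenced with the following optional snippet (not part of the original script):

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)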

Second Step: Get Text

from bs4 import BeautifulSoup

# Function to get the text content from the given URL
def get_text_new(url):
    if url.startswith('http'):
        full_url = url
    else:
        full_url = "https://www.federalreserve.gov" + url

    response = requests.get(full_url, verify=False)  # Make a GET request to the full URL
    soup = BeautifulSoup(response.text, "html.parser")  # Parse the HTML content
    article = soup.find('div', {'id': 'article'})  # Find the article div
    return article.get_text() if article else ""  # Return the text content if found
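
As a quick sanity check, the function can be tested on a single page before looping over everything; the path below is a placeholder and should be replaced with a real url_2 value returned by fetch_and_process:

# Hypothetical example: substitute a real relative path from the file_list DataFrame
sample = get_text_new("/newsevents/pressreleases/monetary-example.htm")  # placeholder path
print(sample[:200])  # Preview the first 200 characters of the extracted text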

Third Step: Run for All

import os

api_url = "https://www.federalreserve.gov/monetarypolicy/materials/assets/final-recent.json"
file_list = fetch_and_process(api_url)  # Fetch and process the data from the API
statements = file_list[file_list['type'].isin(['St'])]  # Filter for statements
os.makedirs("FOMC", exist_ok=True)  # Make sure the output directory exists

# Iterate over the rows of the statements DataFrame
for i, row in statements.iterrows():
    text = get_text_new(row["url_2"])  # Get the text content from the URL
    filename = row["date"] + "_" + row["release"] + "_" + row["type"]  # Create a filename
    # Write the text content to a file
    with open("FOMC/" + filename + ".txt", 'w', encoding='utf-8') as f:
        f.write(text)

For the press conference transcripts, the Federal Reserve only provides PDFs, so pdfminer (installed as pdfminer.six) is used to convert them to text.

from pdfminer.high_level import extract_text

# Function to convert PDF content to text
def pdf_to_text(pdf_path, txt_path):
    text = extract_text(pdf_path)  # Extract text from the PDF
    # Write the extracted text to a text file
    with open(txt_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)
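
A usage sketch for a single transcript follows; the URL and file names are placeholders, so substitute the actual PDF link located on the website:

import requests

# Hypothetical example: replace the placeholder with a real press conference PDF link
pdf_url = "https://www.federalreserve.gov/mediacenter/files/example-presconf.pdf"  # placeholder URL
response = requests.get(pdf_url, verify=False)

# Save the PDF locally, then convert it to plain text
with open("FOMC/example-presconf.pdf", "wb") as f:
    f.write(response.content)
pdf_to_text("FOMC/example-presconf.pdf", "FOMC/example-presconf.txt")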

The source for the speech list is at https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm. Follow the previous approach to get each speech's URL and collect the original text from each page.
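
A minimal sketch of that approach is below. The href filter is an assumption about how speech links are formatted, so confirm it against the live page; if the list turns out to be rendered client-side, fall back to the JSON-feed technique described earlier.

import requests
from bs4 import BeautifulSoup

year = 2024
list_url = f"https://www.federalreserve.gov/newsevents/speech/{year}-speeches.htm"

response = requests.get(list_url, verify=False)
soup = BeautifulSoup(response.text, "html.parser")

# Collect links that look like individual speech pages, skipping the year-index links
# (the href filter is an assumption; verify it in the browser first)
speech_urls = [
    a["href"] for a in soup.find_all("a", href=True)
    if "/newsevents/speech/" in a["href"]
    and a["href"].endswith(".htm")
    and "-speeches.htm" not in a["href"]
]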

Please note that the Federal Reserve website has a long history, and its data formats have changed over time as the site has evolved; however, the general structure remains similar.

Handling Different File Formats

The Federal Reserve provides documents in both HTML and PDF formats. Since plain text is the easiest format to work with in Python, both are converted to text files: HTML pages are parsed with BeautifulSoup and PDFs with pdfminer. This simplifies downstream text processing and analysis.

Ethical Considerations

While web scraping, it is important to respect the website's robots.txt file, which specifies the rules for web crawlers. Always ensure that scraping activities comply with the website's terms of service.
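
The standard library's urllib.robotparser can be used to check a URL against robots.txt before fetching it (a small sketch):

from urllib import robotparser

# Load the site's robots.txt and check whether a path may be crawled
rp = robotparser.RobotFileParser()
rp.set_url("https://www.federalreserve.gov/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.federalreserve.gov/monetarypolicy/materials/"))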



