
Source Scrapers

Horizon fetches content from four source types. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
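The interface described above can be sketched as follows. Only `BaseScraper`, `fetch(since)`, `ContentItem`, and the use of `asyncio.gather` come from this document; the `ContentItem` fields and the `fetch_all` helper are assumptions for illustration.

```python
# Minimal sketch of the scraper interface: an abstract fetch(since) method
# and a helper that gathers all sources concurrently.
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ContentItem:
    # Field names here are illustrative, not Horizon's actual schema.
    title: str
    url: str
    source: str


class BaseScraper(ABC):
    @abstractmethod
    async def fetch(self, since: datetime) -> list[ContentItem]:
        """Return items published after `since`."""


async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # Fetch every source concurrently, then flatten the per-source batches.
    batches = await asyncio.gather(*(s.fetch(since) for s in scrapers))
    return [item for batch in batches for item in batch]
```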

Hacker News

File: src/scrapers/hackernews.py

Uses the public Firebase HN API (https://hacker-news.firebaseio.com/v0/).

Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).

Config (sources.hackernews):

{
  "enabled": true,
  "fetch_top_stories": 30,
  "min_score": 100
}

Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.
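The comment cleanup step (HTML stripped, truncated at 500 chars) can be sketched like this. The Firebase HN API returns comment bodies as HTML in a `text` field; the helper name and regex approach are assumptions, not Horizon's actual implementation.

```python
# Hedged sketch: strip HTML tags and entities from an HN comment body,
# collapse whitespace, and truncate to 500 characters.
import re
from html import unescape

HN_API = "https://hacker-news.firebaseio.com/v0"


def clean_comment(raw_html: str, limit: int = 500) -> str:
    text = re.sub(r"<[^>]+>", " ", raw_html)  # drop tags
    text = unescape(text)                      # decode &#x27; etc.
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text[:limit]
```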

GitHub

File: src/scrapers/github.py

Uses the GitHub REST API (https://api.github.com).

Two source types are supported: user_events (a user's public activity feed) and repo_releases (a repository's published releases).

Config (sources.github, list of entries):

{
  "type": "user_events",
  "username": "torvalds",
  "enabled": true
}
{
  "type": "repo_releases",
  "owner": "golang",
  "repo": "go",
  "enabled": true
}

Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
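Building the request headers from `GITHUB_TOKEN` might look like the sketch below. The `Accept` value is GitHub's documented REST media type; the function name is an assumption.

```python
# Sketch: attach a bearer token from the environment when available.
# Authenticated requests get 5000 req/hr instead of 60.
import os


def github_headers() -> dict[str, str]:
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```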

RSS

File: src/scrapers/rss.py

Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.

Config (sources.rss, list of entries):

{
  "name": "Simon Willison",
  "url": "https://simonwillison.net/atom/everything/",
  "enabled": true,
  "category": "ai-tools"
}

Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.

Reddit

File: src/scrapers/reddit.py

Uses Reddit’s public JSON API (www.reddit.com).

Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
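The comment selection described above can be sketched like this. The `distinguished`, `score`, and `body` fields match Reddit's JSON listing format; the helper name is an assumption.

```python
# Sketch: drop moderator-distinguished comments, sort the rest by score,
# keep the top N, and truncate each body to 500 characters.
def top_comments(comments: list[dict], limit: int = 5) -> list[str]:
    eligible = [
        c for c in comments
        if c.get("distinguished") != "moderator" and c.get("body")
    ]
    eligible.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c["body"][:500] for c in eligible[:limit]]
```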

Config (sources.reddit):

{
  "enabled": true,
  "fetch_comments": 5,
  "subreddits": [
    {
      "subreddit": "MachineLearning",
      "sort": "hot",
      "fetch_limit": 25,
      "min_score": 10
    }
  ],
  "users": [
    {
      "username": "spez",
      "sort": "new",
      "fetch_limit": 10
    }
  ]
}

Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit’s API guidelines.
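The single-retry 429 handling described above can be sketched with the HTTP call injected, so the logic is testable without a network. The function name, tuple-returning `fetch` interface, and User-Agent string are assumptions, not Horizon's actual client.

```python
# Sketch: on HTTP 429, honor Retry-After, wait, and retry exactly once.
import asyncio

USER_AGENT = "horizon-scraper/1.0 (content aggregator)"  # descriptive UA, value assumed


async def get_with_retry(fetch, url: str):
    # fetch(url) -> (status, payload, headers)
    status, payload, headers = await fetch(url)
    if status == 429:
        wait = float(headers.get("Retry-After", 1))
        await asyncio.sleep(wait)
        status, payload, headers = await fetch(url)
    return status, payload
```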

Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.