Source Scrapers

Horizon fetches content from multiple source types. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
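The shared contract can be sketched as follows. Class and field names other than BaseScraper, ContentItem, and fetch(since) are illustrative, not Horizon's actual definitions:

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime

# Illustrative minimal ContentItem; Horizon's real model has more fields.
@dataclass
class ContentItem:
    source: str
    title: str
    url: str
    published: datetime

class BaseScraper:
    async def fetch(self, since: datetime) -> list[ContentItem]:
        raise NotImplementedError

class DummyScraper(BaseScraper):
    """Stand-in scraper used to show the fetch contract."""
    def __init__(self, name: str, items: list[ContentItem]):
        self.name = name
        self.items = items

    async def fetch(self, since: datetime) -> list[ContentItem]:
        # Each scraper filters its own items by the `since` cutoff.
        return [i for i in self.items if i.published >= since]

async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # All sources run concurrently via asyncio.gather; one slow
    # source does not block the others.
    batches = await asyncio.gather(*(s.fetch(since) for s in scrapers))
    return [item for batch in batches for item in batch]
```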

Hacker News

File: src/scrapers/hackernews.py

Uses the official Hacker News API hosted on Firebase (hacker-news.firebaseio.com).

Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).

Config (sources.hackernews):

{
  "enabled": true,
  "fetch_top_stories": 30,
  "min_score": 100
}

Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.

GitHub

File: src/scrapers/github.py

Uses the GitHub REST API (api.github.com).

Two source types are supported: user_events (public activity for a given user) and repo_releases (published releases for a given repository).

Config (sources.github, list of entries):

{
  "type": "user_events",
  "username": "torvalds",
  "enabled": true
}
{
  "type": "repo_releases",
  "owner": "golang",
  "repo": "go",
  "enabled": true
}

Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
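One way the token might be attached to requests (the header names follow the GitHub REST API conventions; the helper itself is hypothetical):

```python
import os

def github_headers() -> dict[str, str]:
    """Build request headers, adding auth when GITHUB_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    # Unauthenticated requests are limited to 60/hr; a token raises
    # that to 5000/hr.
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```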

RSS

File: src/scrapers/rss.py

Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.

Config (sources.rss, list of entries):

{
  "name": "Simon Willison",
  "url": "https://simonwillison.net/atom/everything/",
  "enabled": true,
  "category": "ai-tools"
}

Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.

Reddit

File: src/scrapers/reddit.py

Uses Reddit’s public JSON API (the .json endpoints on www.reddit.com).

Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
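The comment selection could look like this sketch. The distinguished, score, and body keys are the fields Reddit's JSON API uses; the function itself is illustrative:

```python
def top_comments(comments: list[dict], limit: int = 5, max_len: int = 500) -> list[str]:
    """Return up to `limit` non-moderator comment bodies, highest score first."""
    # Drop moderator-distinguished comments, then rank by score.
    eligible = [c for c in comments if c.get("distinguished") != "moderator"]
    eligible.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c.get("body", "")[:max_len] for c in eligible[:limit]]
```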

Config (sources.reddit):

{
  "enabled": true,
  "fetch_comments": 5,
  "subreddits": [
    {
      "subreddit": "MachineLearning",
      "sort": "hot",
      "fetch_limit": 25,
      "min_score": 10
    }
  ],
  "users": [
    {
      "username": "spez",
      "sort": "new",
      "fetch_limit": 10
    }
  ]
}

Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit’s API guidelines.
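The retry-once behavior can be sketched with an injected fetch callable standing in for the real HTTP client:

```python
import asyncio

async def get_with_retry(fetch, url: str):
    """Fetch a URL, retrying exactly once on HTTP 429."""
    # `fetch` is any async callable returning an object with
    # .status_code and .headers -- a stand-in for the HTTP client.
    response = await fetch(url)
    if response.status_code == 429:
        # Honor the Retry-After header before the single retry.
        delay = float(response.headers.get("Retry-After", 1))
        await asyncio.sleep(delay)
        response = await fetch(url)
    return response
```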

Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.

OpenBB

File: src/scrapers/openbb.py

Uses the OpenBB Platform Python SDK via obb.news.company() to fetch company news for one or more ticker watchlists.

The scraper imports openbb lazily. If the optional dependency is not installed, Horizon logs a warning and skips the source instead of failing the whole run.
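The lazy-import pattern amounts to something like this (a generic sketch, not Horizon's exact code):

```python
import importlib
import logging

logger = logging.getLogger("horizon.openbb")

def load_optional(module_name: str):
    """Return the module, or None (with a warning) if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Skip this source rather than failing the whole run.
        logger.warning("%s not installed; skipping source", module_name)
        return None

# Usage in the scraper: bail out early when openbb is absent.
# openbb = load_optional("openbb")
# if openbb is None:
#     return []
```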

Config (sources.openbb):

{
  "enabled": true,
  "watchlists": [
    {
      "name": "megacaps",
      "symbols": ["AAPL", "MSFT", "NVDA"],
      "enabled": true,
      "provider": "yfinance",
      "fetch_limit": 20,
      "category": "equities"
    }
  ]
}

Credentials: provider-specific secrets are resolved by the OpenBB SDK from its own environment variables or settings file. Horizon does not pass those values directly.

Extracted data: title, URL, author, published time, article body/excerpt, watchlist name, provider, category, and symbol list.

Twitter

File: src/scrapers/twitter.py

Uses the Apify platform to bypass Twitter’s anti-scraping measures. The actor altimis~scweet is called via the Apify REST API.

Flow:

  1. POST to /v2/acts/{actor_id}/runs to trigger a run
  2. Poll /v2/actor-runs/{run_id} until status is SUCCEEDED or a terminal failure
  3. GET /v2/datasets/{dataset_id}/items to retrieve results
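The polling step (2) can be sketched with an injected callable standing in for the GET /v2/actor-runs/{run_id} request; the terminal status names follow Apify's run lifecycle:

```python
import asyncio

# Apify run statuses that end the polling loop.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

async def wait_for_run(get_run, run_id: str, poll_interval: float = 5.0) -> dict:
    """Poll the run's status JSON until it reaches a terminal state."""
    # `get_run` is any async callable returning the run-status payload,
    # a stand-in for the real Apify API call.
    while True:
        run = await get_run(run_id)
        if run["data"]["status"] in TERMINAL_STATUSES:
            return run
        await asyncio.sleep(poll_interval)
```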

Config (sources.twitter):

{
  "enabled": true,
  "users": ["karpathy", "ylecun"],
  "fetch_limit": 10,
  "fetch_reply_text": false,
  "max_replies_per_tweet": 3,
  "max_tweets_to_expand": 10,
  "reply_min_likes": 5,
  "actor_id": "altimis~scweet",
  "apify_token_env": "APIFY_TOKEN"
}

Authentication: Set APIFY_TOKEN in your .env. Get a token at console.apify.com.

Extracted data: tweet text, URL, author, publish time, likes, retweets, replies, views, and (optionally) reply-thread text appended under --- Top Comments ---.