Source Scrapers
Horizon fetches content from multiple source types. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
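The shared contract looks roughly like the sketch below. BaseScraper, fetch(since), ContentItem, and the asyncio.gather fan-out come from the description above; the ContentItem fields and the fetch_all helper are illustrative, not the actual Horizon code.
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime

import httpx

@dataclass
class ContentItem:
    title: str
    url: str
    source: str
    published_at: datetime

class BaseScraper(ABC):
    def __init__(self, client: httpx.AsyncClient):
        self.client = client  # shared async HTTP client

    @abstractmethod
    async def fetch(self, since: datetime) -> list[ContentItem]:
        """Return items published after `since`."""

async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # All sources run concurrently; in this sketch, a failure in one source is
    # collected rather than cancelling the others.
    results = await asyncio.gather(*(s.fetch(since) for s in scrapers), return_exceptions=True)
    return [item for r in results if not isinstance(r, Exception) for item in r]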
Hacker News
File: src/scrapers/hackernews.py
Uses the Firebase HN API:
- GET /topstories.json – fetches top story IDs
- GET /item/{id}.json – fetches story/comment details
Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).
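A condensed sketch of that flow against the Firebase API (the endpoint paths are the ones listed above; the helper names and the use of httpx are assumptions):
import asyncio
import httpx

HN_API = "https://hacker-news.firebaseio.com/v0"

async def fetch_item(client: httpx.AsyncClient, item_id: int) -> dict | None:
    resp = await client.get(f"{HN_API}/item/{item_id}.json")
    return resp.json()

async def fetch_top_stories(client: httpx.AsyncClient, limit: int = 30, min_score: int = 100) -> list[dict]:
    ids = (await client.get(f"{HN_API}/topstories.json")).json()[:limit]
    # Fetch all story items concurrently, then drop low-scoring ones.
    stories = await asyncio.gather(*(fetch_item(client, i) for i in ids))
    return [s for s in stories if s and s.get("score", 0) >= min_score]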
Config (sources.hackernews):
{
  "enabled": true,
  "fetch_top_stories": 30,
  "min_score": 100
}
- fetch_top_stories – number of top story IDs to fetch
- min_score – minimum HN points to include a story
Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.
GitHub
File: src/scrapers/github.py
Uses the GitHub REST API:
- GET /users/{username}/events/public – user activity events
- GET /repos/{owner}/{repo}/releases – repository releases
Two source types are supported:
- user_events – tracks push, create, release, public, and watch events for a user
- repo_releases – tracks new releases for a specific repository
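A rough illustration of the repo_releases path, including the optional token described under Authentication below. The endpoint and headers follow the GitHub REST API; the httpx usage and function name are assumptions.
import os
import httpx

async def fetch_repo_releases(client: httpx.AsyncClient, owner: str, repo: str) -> list[dict]:
    headers = {"Accept": "application/vnd.github+json"}
    token = os.getenv("GITHUB_TOKEN")
    if token:
        # Authenticated requests get the higher 5000 req/hr rate limit.
        headers["Authorization"] = f"Bearer {token}"
    resp = await client.get(
        f"https://api.github.com/repos/{owner}/{repo}/releases", headers=headers
    )
    resp.raise_for_status()
    return resp.json()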
Config (sources.github, list of entries):
{
  "type": "user_events",
  "username": "torvalds",
  "enabled": true
}
{
  "type": "repo_releases",
  "owner": "golang",
  "repo": "go",
  "enabled": true
}
Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
RSS
File: src/scrapers/rss.py
Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.
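The fallback pattern looks roughly like this with feedparser; the exact field order Horizon tries may differ, and parse_entry_date is an illustrative name.
from datetime import datetime, timezone

import feedparser

def parse_entry_date(entry) -> datetime | None:
    # feedparser exposes a parsed struct_time for each date field it understands.
    for field in ("published_parsed", "updated_parsed", "created_parsed"):
        parsed = entry.get(field)
        if parsed:
            return datetime(*parsed[:6], tzinfo=timezone.utc)
    return None

feed = feedparser.parse("https://simonwillison.net/atom/everything/")
for entry in feed.entries:
    print(entry.get("title"), parse_entry_date(entry))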
Config (sources.rss, list of entries):
{
  "name": "Simon Willison",
  "url": "https://simonwillison.net/atom/everything/",
  "enabled": true,
  "category": "ai-tools"
}
- category – optional tag for grouping (e.g., "programming", "microblog")
Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.
Reddit
File: src/scrapers/reddit.py
Uses Reddit's public JSON API (www.reddit.com):
- GET /r/{subreddit}/{sort}.json – subreddit posts
- GET /user/{username}/submitted.json – user submissions
- GET /r/{subreddit}/comments/{post_id}.json – post comments
Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
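The comment selection described above can be sketched like this. The listing layout in the comment is Reddit's standard JSON shape; the helper name is illustrative.
def top_comments(comments_json: list[dict], limit: int = 5) -> list[str]:
    # comments_json is the parsed /r/{subreddit}/comments/{post_id}.json response:
    # a two-element list of listings, post first, comments second.
    children = comments_json[1]["data"]["children"]
    comments = [
        c["data"]
        for c in children
        if c.get("kind") == "t1" and c["data"].get("distinguished") != "moderator"
    ]
    comments.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c.get("body", "")[:500] for c in comments[:limit]]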
Config (sources.reddit):
{
  "enabled": true,
  "fetch_comments": 5,
  "subreddits": [
    {
      "subreddit": "MachineLearning",
      "sort": "hot",
      "fetch_limit": 25,
      "min_score": 10
    }
  ],
  "users": [
    {
      "username": "spez",
      "sort": "new",
      "fetch_limit": 10
    }
  ]
}
- sort – hot, new, top, or rising (subreddits); hot or new (users)
- time_filter – for top/rising sorts: hour, day, week, month, year, all
- min_score – minimum post score (subreddits only)
Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit's API guidelines.
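A minimal sketch of that retry behavior (the User-Agent string and helper name are illustrative):
import asyncio
import httpx

USER_AGENT = "horizon-scraper/1.0 (personal news aggregator)"

async def get_json_with_retry(client: httpx.AsyncClient, url: str) -> dict:
    resp = await client.get(url, headers={"User-Agent": USER_AGENT})
    if resp.status_code == 429:
        # Reddit sends Retry-After in seconds; wait it out and retry once.
        wait = float(resp.headers.get("Retry-After", "5"))
        await asyncio.sleep(wait)
        resp = await client.get(url, headers={"User-Agent": USER_AGENT})
    resp.raise_for_status()
    return resp.json()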
Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.
OpenBB
File: src/scrapers/openbb.py
Uses the OpenBB Platform Python SDK via obb.news.company() to fetch company news for one or more ticker watchlists.
The scraper imports openbb lazily. If the optional dependency is not installed, Horizon logs a warning and skips the source instead of failing the whole run.
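The lazy-import guard is roughly this pattern (the function name is illustrative):
import logging

logger = logging.getLogger(__name__)

def load_openbb():
    # Import only when the source runs, so a missing optional dependency
    # disables this scraper instead of crashing the whole pipeline.
    try:
        from openbb import obb
    except ImportError:
        logger.warning("openbb is not installed; skipping the OpenBB source")
        return None
    return obb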
Config (sources.openbb):
{
  "enabled": true,
  "watchlists": [
    {
      "name": "megacaps",
      "symbols": ["AAPL", "MSFT", "NVDA"],
      "enabled": true,
      "provider": "yfinance",
      "fetch_limit": 20,
      "category": "equities"
    }
  ]
}
- watchlists – each enabled watchlist triggers one news.company() call per run
- provider – OpenBB provider name for that watchlist
- symbols – tickers fetched together for the same provider
- fetch_limit – maximum rows requested from the provider
- category – optional metadata tag stored on each item
Behavior:
- Wraps the synchronous OpenBB SDK in asyncio.to_thread so the event loop stays responsive (see the sketch after this list)
- Deduplicates news that appears in multiple watchlists by article URL
- Skips malformed rows, rows without URL/title/date, and items older than the current time window
- Keeps fetching other watchlists if one provider call fails
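A minimal sketch of the asyncio.to_thread wrapping; the exact keyword arguments passed to news.company() are assumptions based on the config fields above.
import asyncio

async def fetch_watchlist_news(obb, symbols: list[str], provider: str, limit: int):
    # obb.news.company() is synchronous; running it in a worker thread keeps
    # the event loop free for the other scrapers.
    result = await asyncio.to_thread(
        obb.news.company,
        symbol=",".join(symbols),
        provider=provider,
        limit=limit,
    )
    return result.results  # the OBBject wraps the returned article rows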
Credentials: provider-specific secrets are resolved by the OpenBB SDK from its own environment variables or settings file. Horizon does not pass those values directly.
Extracted data: title, URL, author, published time, article body/excerpt, watchlist name, provider, category, and symbol list.
Twitter
File: src/scrapers/twitter.py
Uses the Apify platform to bypass Twitter's anti-scraping measures. The actor altimis~scweet is called via the Apify REST API.
Flow:
- POST to /v2/acts/{actor_id}/runs to trigger a run
- Poll /v2/actor-runs/{run_id} until status is SUCCEEDED or a terminal failure
- GET /v2/datasets/{dataset_id}/items to retrieve results
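Condensed, that run/poll/fetch sequence looks roughly like the sketch below. Polling interval and error handling are simplified, and passing the token as a query parameter is one of the ways Apify accepts it.
import asyncio
import httpx

APIFY_API = "https://api.apify.com"

async def run_apify_actor(client: httpx.AsyncClient, actor_id: str, token: str, actor_input: dict) -> list[dict]:
    # 1. Trigger the actor run.
    run = (await client.post(
        f"{APIFY_API}/v2/acts/{actor_id}/runs",
        params={"token": token},
        json=actor_input,
    )).json()["data"]

    # 2. Poll until the run reaches a terminal state.
    while run["status"] in ("READY", "RUNNING"):
        await asyncio.sleep(10)
        run = (await client.get(
            f"{APIFY_API}/v2/actor-runs/{run['id']}", params={"token": token}
        )).json()["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Apify run ended with status {run['status']}")

    # 3. Download the dataset items produced by the run.
    items = await client.get(
        f"{APIFY_API}/v2/datasets/{run['defaultDatasetId']}/items",
        params={"token": token},
    )
    return items.json()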
Config (sources.twitter):
{
  "enabled": true,
  "users": ["karpathy", "ylecun"],
  "fetch_limit": 10,
  "fetch_reply_text": false,
  "max_replies_per_tweet": 3,
  "max_tweets_to_expand": 10,
  "reply_min_likes": 5,
  "actor_id": "altimis~scweet",
  "apify_token_env": "APIFY_TOKEN"
}
- users – Twitter screen names to monitor, without the @ prefix
- fetch_limit – maximum tweets to fetch per run
- fetch_reply_text – when true, a second Apify run fetches reply bodies for each important tweet and appends them under --- Top Comments --- for AI analysis
- max_replies_per_tweet – maximum reply lines per tweet (sorted by engagement score)
- max_tweets_to_expand – cap on reply expansion runs per pipeline cycle, to control Apify credit usage
- reply_min_likes – minimum likes required for a reply to be included
- actor_id – Apify actor ID (default: altimis~scweet)
- apify_token_env – environment variable name containing the Apify API token
Authentication: Set APIFY_TOKEN in your .env. Get a token at console.apify.com.
Extracted data: tweet text, URL, author, publish time, likes, retweets, replies, views, and (optionally) reply-thread text appended under --- Top Comments ---.