Building on ATProto is a team sport. As we've shown previously, in open social, we only win when other folks in the ATmosphere win. In that effort, the Graze team is delighted to announce access, effective immediately, to two archived datasets for researchers, developers, archivists, and other folks looking to push the boundaries of the ATmosphere.
The turbostream has been available for about six months via websocket. In short, it is a stream of metadata-enriched posts in which referenced objects are hydrated: the author of the post, mentioned users, parent and quoted posts, and so forth. Under the hood, we've been storing that data in S3 for long-term archival. We've now made that S3 bucket public and set it up for Requester Pays access. In theory, nearly every post should be in this archive, enriched with these referenced objects to the greatest extent possible.
The megastream is a relatively new dataset: it is the turbostream, further enriched with ML inferences. At Graze, we run a handful of ML classifiers against every post so that our users can filter content by those classifications. We also generate several text embeddings and, as of recently, text transcriptions for every video passing through Bluesky. All of this is now generally available in the megastream bucket. While the turbostream archive begins on 2025-04-21, the megastream archive begins on 2025-09-09.
Two S3 buckets provide enriched Bluesky data snapshots as SQLite databases:
Each file contains a several-minute slice of the Bluesky firehose that has been progressively enriched:
Available from: April 21, 2025
Available from: September 9, 2025
The Megastream enrichment adds extensive analysis to each post, including:
All inference scores are included as probability values (0-1 range) for each record.
jetstream_YYYYMMDD_HHMMSS.db.zip
Example:
jetstream_20250421_235152.db.zip
mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip
Example:
mega/mega_jetstream_20250909_181102.db.zip
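Because the timestamp is embedded in each object key, a time window can be selected before anything is downloaded. The following is a minimal sketch of parsing both naming patterns. The helper name `parse_key_timestamp` is hypothetical, and treating the timestamp as UTC is an assumption, since the naming convention does not state a timezone.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches both "jetstream_YYYYMMDD_HHMMSS.db.zip" (turbostream) and
# "mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip" (megastream).
_KEY_RE = re.compile(r"(?:^|/)(?:mega_)?jetstream_(\d{8})_(\d{6})\.db\.zip$")


def parse_key_timestamp(key: str) -> Optional[datetime]:
    """Return the timestamp encoded in an archive key, or None if it does not match."""
    match = _KEY_RE.search(key)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid date/time.
        return None
    # ASSUMPTION: timestamps are UTC; the naming convention does not say.
    return parsed.replace(tzinfo=timezone.utc)
```

Keys that do not follow either pattern return `None`, so the function can be used directly as a filter over a bucket listing.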
Each .db.zip file is a compressed SQLite database containing enriched Bluesky posts from a specific time window.
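As a rough illustration of working with a downloaded file, the sketch below unzips an archive and lists the tables it contains along with their row counts. It uses only the Python standard library. The table and column layout is not documented here, so the code discovers table names from `sqlite_master` rather than assuming a schema; the function names are hypothetical, and it assumes each archive holds a single `.db` file.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path
from typing import Dict


def extract_db(zip_path: str, dest_dir: str) -> Path:
    """Extract the first .db member of the archive and return its path."""
    with zipfile.ZipFile(zip_path) as zf:
        members = [name for name in zf.namelist() if name.endswith(".db")]
        if not members:
            raise ValueError(f"no .db file found in {zip_path}")
        # ASSUMPTION: one database per archive; only the first is extracted.
        return Path(zf.extract(members[0], path=dest_dir))


def list_tables(db_path: Path) -> Dict[str, int]:
    """Return {table_name: row_count} for every table in the database."""
    # Open read-only so the snapshot is never modified by accident.
    conn = sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


def inspect_archive(zip_path: str) -> Dict[str, int]:
    """Unzip into a temporary directory and summarize the tables inside."""
    with tempfile.TemporaryDirectory() as tmp:
        return list_tables(extract_db(zip_path, tmp))
```

Calling `inspect_archive("jetstream_20250421_235152.db.zip")` on a downloaded file would return a dictionary of table names and row counts, which is a reasonable starting point before writing queries against a specific table.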
These buckets use access control via whitelist. To request access:
Both buckets use Requester Pays, which means you pay for data transfer costs when downloading files. Storage costs are covered by the bucket owner.
Turbostream archive:
aws s3 ls s3://graze-turbo-01/ --request-payer requester
Megastream archive:
aws s3 ls s3://graze-mega-02/mega/ --request-payer requester
Turbostream:
aws s3 cp s3://graze-turbo-01/jetstream_20250421_235152.db.zip . --request-payer requester
Megastream:
aws s3 cp s3://graze-mega-02/mega/mega_jetstream_20250909_181102.db.zip . --request-payer requester
Turbostream:
aws s3 sync s3://graze-turbo-01/ ./turbo-archive/ --request-payer requester
Megastream:
aws s3 sync s3://graze-mega-02/mega/ ./mega-archive/ --request-payer requester
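Since the requester pays for transfer, it may be worth checking how much data a full sync would pull before starting one. The sketch below totals object counts and sizes for a prefix using boto3's `list_objects_v2` paginator, which is needed because a single `list_objects_v2` call returns at most 1,000 keys. The helper names `iter_objects` and `summarize` are hypothetical, and the listing requests themselves are also billed to the requester.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_objects(bucket: str, prefix: str = "") -> Iterator[Dict]:
    """Yield every object under a prefix, following pagination."""
    import boto3  # imported lazily so summarize() can be used without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        RequestPayer="requester",  # required for Requester Pays buckets
    )
    for page in pages:
        yield from page.get("Contents", [])


def summarize(objects: Iterable[Dict]) -> Tuple[int, int]:
    """Return (object_count, total_bytes) for a listing."""
    count = 0
    total_bytes = 0
    for obj in objects:
        count += 1
        total_bytes += obj["Size"]
    return count, total_bytes


# Example usage (makes billable S3 requests, so it is left commented out):
# count, total_bytes = summarize(iter_objects("graze-mega-02", "mega/"))
# print(f"{count} objects, {total_bytes / 1024**3:.1f} GiB")
```

Multiplying the total by the current per-GB transfer rate from the AWS pricing page gives an approximate cost for the download.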
import boto3

s3 = boto3.client('s3')

# List turbostream files
response = s3.list_objects_v2(
    Bucket='graze-turbo-01',
    RequestPayer='requester'
)
for obj in response.get('Contents', []):
    print(obj['Key'])

# List megastream files
response = s3.list_objects_v2(
    Bucket='graze-mega-02',
    Prefix='mega/',
    RequestPayer='requester'
)
for obj in response.get('Contents', []):
    print(obj['Key'])

# Download a turbostream file
s3.download_file(
    'graze-turbo-01',
    'jetstream_20250421_235152.db.zip',
    'local_turbo.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)

# Download a megastream file
s3.download_file(
    'graze-mega-02',
    'mega/mega_jetstream_20250909_181102.db.zip',
    'local_mega.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)

AWS S3 data transfer pricing (as of 2025):
Check current pricing: https://aws.amazon.com/s3/pricing/
Contact Graze.social on BSky or via our site for assistance.