Marginalia Search Exports

For transparency, research and communal good, this page provides data exported from the Marginalia Search database, as well as other interesting data.

All data is provided as-is and is available under CC-BY-NC-SA 4.0. For commercial use, inquire at kontakt@marginalia.nu.

Some of these files are very large. Do not attempt to download them on your phone or open the larger ones in Excel, as it will likely crash. The sizes listed are for the compressed data; the uncompressed content is typically about 8-10x larger.
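Since the larger files are not practical to open in a spreadsheet, one way to inspect them is to decompress them as a stream and look at only the first few rows. The following is a minimal sketch in Python using only the standard library. The function name peek_rows is hypothetical, and the tab delimiter, UTF-8 encoding and lack of quoting are assumptions that may need adjusting per file (the .csv.gz files may well use a different delimiter).

```python
import csv
import gzip
from itertools import islice


def peek_rows(path, delimiter="\t", limit=5):
    """Return the first `limit` rows of a gzip-compressed delimited file.

    The file is decompressed as a stream, so only the rows that are
    actually returned are held in memory.
    """
    # Assumption: UTF-8 text; undecodable bytes are replaced, not fatal.
    with gzip.open(path, mode="rt", encoding="utf-8",
                   errors="replace", newline="") as handle:
        # Assumption: fields are not quoted, so quote characters are
        # kept as literal text.
        reader = csv.reader(handle, delimiter=delimiter,
                            quoting=csv.QUOTE_NONE)
        return list(islice(reader, limit))
```

For example, peek_rows("domains-26-04-05.tsv.gz") on a locally downloaded copy would return up to five rows as lists of strings, which is usually enough to confirm the delimiter and column order before processing the whole file.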

Files

File Name	Size	Fields
blacklist-23-11-01.csv.gz	192 KB	domain
blacklist-25-03-11.tsv.gz	200 KB	domain
blacklist-26-04-05.csv.gz	200 KB	domain
domains-23-12-03.tsv.gz	324 MB	id, domain, ..., status
domains-25-03-11.tsv.gz	410 MB	id, domain, ..., status
domains-26-04-05.tsv.gz	639 MB	id, domain, ..., status
linkgraph-23-12-03.tsv.gz	306 MB	source, dest (see ids in domains file)
linkgraph-25-03-11.tsv.gz	304 MB	source, dest (see ids in domains file)
linkgraph-26-04-05.csv.gz	883 MB	source, dest (see ids in domains file)
urls-meta-23-11-02.csv.gz	17 GB	url, state, title, description, format, features, data_hash, quality, pub_year
atags-23-11-03.csv.gz	2.8 GB	dest link, text, origin domain (anchor tag texts, sorted)
atags-25-04-20.csv.gz	6.1 GB	dest link, text, origin domain (anchor tag texts, sorted)
atags-23-11-03.parquet	3.0 GB	dest, url[], text[], source[]
atags-24-12-10.parquet	219 MB	dest, url[], text[], cnt[]
atags-25-04-20.parquet	538 MB	dest, url[], text[], cnt[]
feeds.csv (Known RSS/Atom feeds)	29 MB	domain, (ignore), url
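
The linkgraph files refer to domains by id, so reading them usually means joining against a domains file. The sketch below shows one way to do that in Python. It assumes that id and domain are the first two columns of the domains file (as the field list above suggests), that both files are tab-delimited, and that any header row can be recognised by a non-numeric first column. The function names are hypothetical, and the delimiter is a parameter because some of the files carry a .csv extension.

```python
import gzip


def load_domain_names(domains_path, delimiter="\t"):
    """Build a dict mapping domain id -> domain name."""
    names = {}
    with gzip.open(domains_path, "rt", encoding="utf-8",
                   errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split(delimiter)
            # Assumption: id is column 0 and domain is column 1.
            if len(fields) < 2 or not fields[0].isdecimal():
                continue  # skips a header row, if there is one
            names[int(fields[0])] = fields[1]
    return names


def iter_named_links(linkgraph_path, names, delimiter="\t"):
    """Yield (source_domain, dest_domain) pairs, one link at a time."""
    with gzip.open(linkgraph_path, "rt", encoding="utf-8",
                   errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) < 2:
                continue
            if not (fields[0].isdecimal() and fields[1].isdecimal()):
                continue
            source, dest = int(fields[0]), int(fields[1])
            # Ids missing from the domains file are skipped, not guessed.
            if source in names and dest in names:
                yield names[source], names[dest]
```

The link graph is streamed row by row, but the id-to-name dictionary is held entirely in memory, so this approach needs enough RAM for the full domains file. It also makes sense to pair a linkgraph file with the domains file from the same export date.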

About the blacklist

The blacklist contains a list of domains that for one reason or another are not indexed on Marginalia Search. There are definitely false positives. Even though the list contains a lot of sketchy stuff, it is shared in the name of transparency.

If you are concerned that a website is blacklisted when you think it shouldn't be, do reach out; the contact address is in the footer.

If you are looking to implement a clean search filter or similar, you might be interested in the UT1 blacklists, as they are labelled and curated in a way this one isn't.

Other Cool Resources

Other projects share data too, including Wikipedia and Stack Exchange.