Marginalia Search Exports

In the interest of transparency, research and the communal good, data exported from the Marginalia Search database is available here, along with other interesting data.

All data is provided as-is and is available under CC-BY-NC-SA 4.0; for commercial use, inquire at kontakt@marginalia.nu.

Some of these files are very large. Do not attempt to download them on your phone or open the larger ones in Excel, as it will likely crash. The sizes indicated are for the compressed data; the uncompressed content is typically about 8-10x larger.
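Because the uncompressed files can be many times larger than the download, it is usually easier to read them as a stream than to unpack them first. The following is a minimal sketch in Python using only the standard library. The function name `iter_rows` is made up for this example. It also assumes that the files are UTF-8 and that Python's default CSV quoting rules apply, neither of which is stated here, so adjust as needed.

```python
import csv
import gzip
from typing import Iterator, List


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield rows from a gzip-compressed delimited file, one at a time.

    Only one row is held in memory at any moment, so the size of the
    uncompressed data does not matter much.
    """
    # "rt" decompresses on the fly and decodes to text; newline="" is what
    # the csv module expects from its input.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # ASSUMPTION: default csv quoting rules fit the file being read.
        yield from csv.reader(handle, delimiter=delimiter)
```

For the `.csv.gz` files, pass `delimiter=","`. Whether a given file starts with a header row is not documented here, so it is worth printing the first few rows before processing the rest.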

Files

File Name                          Size     Fields
blacklist-23-11-01.csv.gz          200 KB   domain
domains-23-12-03.tsv.gz            324 MB   id, domain, ..., status
linkgraph-23-12-03.tsv.gz          306 MB   source, dest (see ids in domains file)
urls-meta-23-11-02.csv.gz          17 GB    url, state, title, description, format, features, data_hash, quality, pub_year
atags-23-11-03.csv.gz              2.8 GB   anchor tag texts, sorted; dest link, text, origin domain
atags-23-11-03.parquet             3.0 GB   dest, url[], text[], source[]
atags-24-12-10.parquet             219 MB   dest, url[], text[], cnt[]
feeds.csv (Known RSS/Atom feeds)   29 MB    domain, (ignore), url
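
The link graph refers to domains by id, so reading it means joining it against the domains file. Here is a rough sketch of that join in Python. It assumes that the id is the first column and the domain name the second in the domains file, and that each link graph row holds a source id followed by a dest id. The function names and the sample rows are made up for illustration and are not taken from the real exports.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def build_domain_index(domain_rows: Iterable[Sequence[str]]) -> Dict[int, str]:
    """Map domain id -> domain name from rows of the domains file."""
    index: Dict[int, str] = {}
    for row in domain_rows:
        # Skip a possible header row and anything without a numeric id.
        if len(row) < 2 or not row[0].isdigit():
            continue
        index[int(row[0])] = row[1]
    return index


def resolve_links(
    link_rows: Iterable[Sequence[str]], index: Dict[int, str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source domain, dest domain) pairs for links with known ids."""
    for row in link_rows:
        if len(row) < 2 or not (row[0].isdigit() and row[1].isdigit()):
            continue
        source = index.get(int(row[0]))
        dest = index.get(int(row[1]))
        # Links that point at ids missing from the index are dropped.
        if source is not None and dest is not None:
            yield source, dest


# Tiny in-memory stand-ins for the two files (hypothetical values).
domains = [
    ["id", "domain", "..."],
    ["1", "a.example", "..."],
    ["2", "b.example", "..."],
    ["3", "c.example", "..."],
]
links = [["1", "2"], ["2", "3"], ["3", "9"]]

index = build_domain_index(domains)
pairs = list(resolve_links(links, index))
print(pairs)  # [('a.example', 'b.example'), ('b.example', 'c.example')]
```

Both functions accept any iterable of rows, so the real files can be fed in from a streaming reader. The domain index is kept in memory while the much longer list of links is processed one row at a time.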

About the blacklist

The blacklist contains a list of domains that for one reason or another are not indexed on Marginalia Search. There are definitely false positives. Even though the list contains a lot of sketchy stuff, it is shared in the name of transparency.

If you are concerned that a website is blacklisted when it shouldn't be, please reach out; the contact address is in the footer.
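
To check whether a particular domain appears in the list, something like the following Python sketch may be enough. It assumes one domain per line and does exact matching only. Whether a blacklist entry is also meant to cover subdomains is not stated here, so the code does not guess. The helper names and sample entries are made up.

```python
from typing import Iterable, Set


def normalize(domain: str) -> str:
    """Lowercase a domain and drop surrounding whitespace and a trailing dot."""
    return domain.strip().rstrip(".").lower()


def load_blacklist(lines: Iterable[str]) -> Set[str]:
    """Build a set of normalized domains, ignoring blank lines."""
    return {normalize(line) for line in lines if line.strip()}


def is_blacklisted(domain: str, blacklist: Set[str]) -> bool:
    """Exact-match lookup; ASSUMPTION: subdomains are not matched implicitly."""
    return normalize(domain) in blacklist


# Hypothetical sample standing in for lines of the blacklist file.
sample_lines = ["spam.example\n", "Junk.Example\n", "\n"]
blacklist = load_blacklist(sample_lines)

print(is_blacklisted("SPAM.example.", blacklist))  # True
print(is_blacklisted("fine.example", blacklist))   # False
```

A set gives constant-time lookups, and a list with a 200 KB compressed download should fit in memory comfortably.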

If you are looking to implement a clean search filter or similar, you might be interested in the UT1 blacklists, as they are labelled and curated in a way this one isn't.

Other Cool Resources

Other projects share data too, including Wikipedia and Stack Exchange.