For transparency, research, and the communal good, exported data from the Marginalia Search database is available here, along with other interesting data.
All data is provided as-is and is available under CC-BY-NC-SA 4.0; for commercial use, inquire at kontakt@marginalia.nu.
Some of these files are very large. Do not attempt to download them on your phone or open the larger ones in Excel, as doing so will likely cause a crash. Sizes indicated are for the compressed data; the uncompressed content is typically about 8-10x larger.
File Name | Size | Fields |
---|---|---|
blacklist-23-11-01.csv.gz | 200 KB | domain |
domains-23-12-03.tsv.gz | 324 MB | id, domain, ..., status |
linkgraph-23-12-03.tsv.gz | 306 MB | source, dest (see ids in domains file) |
urls-meta-23-11-02.csv.gz | 17 GB | url, state, title, description, format, features, data_hash, quality, pub_year |
atags-23-11-03.csv.gz | 2.8 GB | anchor tag texts, sorted; dest link, text, origin domain |
atags-23-11-03.parquet | 3.0 GB | dest, url[], text[], source[] |
atags-24-12-10.parquet | 219 MB | dest, url[], text[], cnt[] |
feeds.csv (Known RSS/Atom feeds) | 29 MB | domain, (ignore), url |
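
Since the compressed files expand to many times their listed size, it is usually safer to stream them than to decompress them to disk or load them whole. Below is a minimal Python sketch that streams the link graph file and counts inbound links per destination id. It assumes the file is tab-separated with the source id in the first column and the dest id in the second, as the table suggests; whether a header row is present is not known, so non-numeric rows are simply skipped. The function name `count_inbound_links` and the `max_rows` parameter are made up for this illustration.

```python
import csv
import gzip
from collections import Counter


def count_inbound_links(path, max_rows=None):
    """Stream a gzipped TSV of (source, dest) id pairs and count rows per dest id.

    Assumption: column 0 is the source id and column 1 is the dest id.
    Rows whose dest is not an integer (e.g. a possible header) are skipped.
    """
    counts = Counter()
    # gzip.open in text mode decompresses on the fly, so the full
    # uncompressed file never has to exist on disk or in memory.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for n, row in enumerate(reader):
            if max_rows is not None and n >= max_rows:
                break  # stop early when only sampling the file
            if len(row) < 2:
                continue  # blank or truncated line
            try:
                dest = int(row[1])
            except ValueError:
                continue  # header or malformed row
            counts[dest] += 1
    return counts
```

Calling `count_inbound_links("linkgraph-23-12-03.tsv.gz", max_rows=1_000_000)` would sample the first million rows; the resulting ids can then be looked up in the domains file. Note that the counter itself still grows with the number of distinct destination ids.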
The blacklist contains a list of domains that for one reason or another are not indexed on Marginalia Search. There are definitely false positives. Even though the list contains a lot of sketchy stuff, it is shared in the name of transparency.
If you are concerned that a website is blacklisted when you think it shouldn't be, do reach out; the contact address is in the footer.
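
To check whether a particular domain appears in the list, something along these lines should work. This is only a sketch: it assumes the blacklist file holds one domain per line and that an exact, case-insensitive match is what you want. It does not attempt any subdomain or wildcard matching, and the helper names `load_blacklist` and `is_blacklisted` are made up for this example.

```python
import gzip


def load_blacklist(path):
    """Read a gzipped file of domains into a set.

    Assumption: one domain per line. A header line, if the file has one,
    would just end up as a harmless extra entry in the set.
    """
    domains = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            domain = line.strip().lower()
            if domain:  # ignore blank lines
                domains.add(domain)
    return domains


def is_blacklisted(domain, blacklist):
    """Exact, case-insensitive membership test against the loaded set."""
    return domain.strip().lower() in blacklist
```

For example, `is_blacklisted("example.com", load_blacklist("blacklist-23-11-01.csv.gz"))` returns `True` only if that exact domain is listed.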
If you are looking to implement a clean search filter or similar, you might be interested in the UT1 blacklists, as they are labelled and curated in a way this one isn't.
Other projects share data too, including Wikipedia and Stack Exchange.