Data Resources to Feed the Machine

Here are some data resources to feed your hungry ML models or other applications.
Remember to check each dataset's license for commercially friendly terms 🙂 Viva la profit 🙂

Ideally we can leave the web scraping and data gathering to others: that gives us at least one layer of separation and saves a lot of time.

The Granddaddy:
https://commoncrawl.org/
“We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.”
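Before downloading anything, it can help to check what a given crawl actually holds for a site. Here is a minimal sketch, assuming Common Crawl's public CDX index server at index.commoncrawl.org and its newline-delimited JSON output. The crawl ID and URL pattern are only examples, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Common Crawl's public CDX index server is reachable at this host.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(crawl_id, url_pattern):
    """Build an index query URL for one crawl and one URL pattern."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_index_lines(text):
    """Parse a newline-delimited JSON response into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def fetch_index(crawl_id, url_pattern):
    """Query the index over the network (only runs when you call it)."""
    with urlopen(build_index_url(crawl_id, url_pattern), timeout=30) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Calling something like fetch_index("CC-MAIN-2023-50", "example.com/*") should return one dict per captured page, assuming the service is up and that crawl ID is still listed. Check the current crawl list on the Common Crawl site before relying on a specific ID.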

https://oscar-project.org/
“Open Super-large Crawled Aggregated coRpus”

https://huggingface.co/datasets
Lots of categorized data goodies here

https://github.com/datasciencemasters/data
“Open Data Sources”

https://www.kaggle.com/datasets?fileType=csv

https://en.m.wikipedia.org/wiki/Wikipedia:Database_download
This dataset may need extra scrutiny and cross-referencing for accuracy, made-up facts, and skewed bias,
but it still holds a lot of potentially useful material.