Here are some data resources to feed your hungry ML models or other applications:
Remember to check each dataset's license for commercial-friendly terms 🙂 Viva la profit 🙂
Ideally we can leave the web scraping and data gathering to others: that gives us at least one layer of separation and saves a lot of time.
The Granddaddy:
https://commoncrawl.org/
“We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.”
https://oscar-project.org/
“Open Super-large Crawled Aggregated coRpus”
https://huggingface.co/datasets
Lots of categorized data goodies here
https://github.com/datasciencemasters/data
“Open Data Sources”
https://www.kaggle.com/datasets?fileType=csv
https://en.m.wikipedia.org/wiki/Wikipedia:Database_download
This dataset may need extra scrutiny and cross-referencing for accuracy, made-up facts, and skewed bias,
but there is still a lot of potentially useful stuff in it.
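
If you do grab the Wikipedia dump, the files are large XML, so it may be easier to stream them page by page than to load everything at once. Here is a minimal sketch using only the Python standard library. The helper names (local_name, iter_pages) and the tiny SAMPLE document are made up for illustration, and the namespace string in SAMPLE is an assumption: check the real dump's root element, since the export version can differ.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children so memory use stays low


# Hypothetical stand-in for a real dump file (assumed namespace, fake pages).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>First article body.</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <revision><text>Second article body.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", text)
```

The real dumps are usually bz2-compressed, so in practice you would likely pass something like bz2.open(path, "rb") to iter_pages instead of the in-memory sample.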