RE: The Fear We Will Run Out Of Data
I like your point that Web3's role in data availability for training AIs is essential. It's a given that well-resourced interests will use all the data they can. If the little guy is to keep up, then improving the ratio of public to private datasets will matter.
Synthetic data has its uses, though it can only be pushed so far before error rates become a problem. I worked on one image recognition project where the AI expert asked for 1200 training examples per item; we reduced that to 7 by applying some domain-specific knowledge and my experience with image generation to produce synthetic data.
Even so, my experience suggests synthetic data can't fully replace diverse real-world data. Imagine exploring a landscape: densely populated areas represent known information, while uncharted territories hold the potential for new discoveries. An AI trained on a limited dataset might struggle to explore those uncharted territories, which could limit its expressiveness.
So, yes, Web3 content production is crucial, not least so that the little guys have sufficient data to train on.