If we all start opting out of our posts being used for training models, doesn’t that reduce the influence of our unique voice and perspectives on those models? Increasingly, the models will be everyone’s primary window into the rest of the world. It seems like the people who care the least about these things will be the ones with the most data that ends up training the models’ default behavior.
–Data Influencer
Honestly, it's frustrating to me that users of the internet are forced to opt out of artificial intelligence training as the default. Wouldn't it be nice if affirmative consent were the norm for generative AI companies as they scrape the web, and any other data repositories they can find, to build ever-larger frontier models?
But, unfortunately, that's not the case. Companies like OpenAI and Google argue that if fair-use access to all this data were taken away from them, then none of this technology would even be possible. For now, users who don't want to contribute to the generative models are stuck with a morass of opt-out processes across different websites and social media platforms.
Even if the current bubble surrounding generative AI does pop, much like the dotcom bubble did after a few years, the models that power all of these new AI tools won't go extinct. So the ghosts of your niche forum posts and social media threads advocating for strongly held convictions will live on inside the software tools. You're right that opting out means actively attempting not to be included in a potentially long-lasting piece of culture.
To address your question directly and realistically, these opt-out processes are basically futile in their current state. Those who opt out right now are still influencing the model. Let's say you fill out a form asking a social media site not to use or sell your data for AI training. Even if that platform respects the request, there are countless startups in Silicon Valley with plucky 19-year-olds who won't think twice about scraping the data posted to that platform, even if they aren't technically supposed to. As a general rule, you can assume that anything you've ever posted online has likely made it into multiple generative models.
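For what it's worth, one concrete opt-out mechanism does exist at the website level rather than the user level: a robots.txt file listing the documented user agents of AI crawlers, such as OpenAI's GPTBot, Google's Google-Extended token, and Common Crawl's CCBot. The sketch below assumes a site operator who wants to opt out entirely; note that compliance is entirely voluntary on the crawler's part, which is exactly the problem described above.

```text
# robots.txt — ask AI crawlers not to use this site for training.
# Honoring these directives is voluntary; they carry no legal force.

User-agent: GPTBot            # OpenAI's web crawler
Disallow: /

User-agent: Google-Extended   # controls use of site content for Google's AI models
Disallow: /

User-agent: CCBot             # Common Crawl, a frequent source of training data
Disallow: /
```

An individual poster can't edit a platform's robots.txt, of course, which is why users are left with the per-platform forms instead.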
OK, but let's say you could realistically block your data from these systems, or demand it be removed after the fact. Would doing so lessen your voice or impact on the AI tools? I've been thinking about this question for a few days, and I'm still torn.
On one hand, your singular information is just an infinitesimally small contribution to the vastness of the dataset, so your voice, as a nonpublic figure or author, likely isn't nudging the model one way or another.
From this perspective, your data is just another brick in the wall of a 1,000-story building. And it's worth remembering that data collection is just the first step in creating an AI model. Researchers spend months fine-tuning the software to get the results they desire, sometimes relying on low-wage workers to label datasets and gauge the output quality for refinement. These steps may further abstract data and lessen your individual impact.
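A back-of-envelope calculation makes the "infinitesimally small" point concrete. The figures below are illustrative assumptions, not published statistics: a frontier-scale corpus on the order of 10 trillion tokens, and a very active poster with tens of thousands of short posts.

```python
# Rough estimate of one person's share of a training corpus.
# All numbers here are illustrative assumptions, not published figures.

corpus_tokens = 10_000_000_000_000    # ~10 trillion tokens, a plausible frontier-scale corpus
posts = 20_000                         # lifetime posts by a very active user
tokens_per_post = 50                   # short social-media-length posts

your_tokens = posts * tokens_per_post  # 1,000,000 tokens
share = your_tokens / corpus_tokens    # fraction of the corpus that is yours

print(f"Your share of the corpus: {share:.8%}")
```

Even under these generous assumptions, one prolific user contributes about a ten-millionth of the data, which is the "brick in a 1,000-story building" intuition in numbers.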
On the opposite end, what if we compared this to voting in an election? Millions of votes are cast in American presidential elections, yet most citizens and defenders of democracy insist that every vote matters, with a constant refrain of "make your voice heard." It's not a perfect metaphor, but what if we saw our data as having a similar impact? A small whisper among the cacophony of noise, but still impactful on the AI model's output.
I'm not fully convinced of this argument, but I also don't think the perspective should be dismissed outright. Especially for subject matter experts, your distinct insights and your way of approaching information are uniquely valuable to AI researchers. Meta wouldn't have gone through the trouble of using all those books in its new AI model if any old data would do the trick.
Looking toward the future, the true impact your data could have on these models will likely be to inspire "synthetic" data. As the companies that make generative AI systems run out of quality information to scrape, they will enter their ouroboros era: they'll start using generative AI to replicate human data, then feed that back into the system to train the next model to better replicate human responses. As long as generative AI exists, just remember that you, as a human, will always be a small part of the machine, whether you want to be or not.
