Large Language Models (LLMs) suck in all the data they can find, train a model on it, then spit out answers in varying formats to queries that come in varied formats themselves.
It’s a form of neural networking, but it’s not human thought. It’s computing.
Most humans can determine immediately whether the answer to a question is bullshit. It’s the source of both comedy and drama. A farce, after all, begins when someone tells a lie.
Computers don’t have that facility. They depend entirely on their input for their output. That’s why models are collapsing, as my friend Steven J. Vaughan-Nichols wrote this week. Without a reliable way to tell true data from false, a model will believe anything.
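To make that concrete, here’s a toy sketch (mine, not Steven’s, and nothing like a production training pipeline) of what researchers call model collapse: fit a simple model to data, sample from it, refit on those samples, and repeat.

```python
# Toy illustration of model collapse (a hypothetical demo, not real LLM code):
# each generation trains only on the previous generation's output.
import numpy as np

rng = np.random.default_rng(42)
real_data = rng.normal(loc=0.0, scale=1.0, size=25)  # generation 0: "true" data

mu, sigma = real_data.mean(), real_data.std()
for gen in range(1, 201):
    synthetic = rng.normal(mu, sigma, size=25)      # the model's own output
    mu, sigma = synthetic.mean(), synthetic.std()   # refit on that output
    if gen % 40 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, spread={sigma:.3f}")

# Over many generations the spread collapses toward zero and the estimate
# drifts: the rare detail in the tails is the first thing to disappear.
```

The lesson of the toy: once the output becomes the input, errors compound, and rare truths are the first casualties.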
You might say AIs are Trump voters.
The only way computer scientists believe they can solve this problem is to get more data, to get more truth, and hope it drives out falsity. The answer is always more. Thus, executives like Facebook’s Nick Clegg insist the models can’t work if they need permission to use all the data they want. (Note: Clegg left Facebook in January.) Right now, the regulatory environment is inclined to give them that permission, hoping artificial intelligence will force down the cost of the real kind.
But here’s a question. What if they’re wrong?
The Value of Truth
Not all data is created equal. Some things are true while some things are false. Computers have no clear way of separating the two. They seem to be engaged in a constant political campaign, hoping that most inputs are indeed true.
The problem is that they’re polluting their own data stream. Falsehoods are pollution in data, and if you take in all the data you can find, you take in a lot of pollution. In time, that pollution drives out the good stuff, the truth.
I’m going to return to my Coca-Cola analogy for a moment, but in a different context. The only way to make every Coca-Cola taste the same is to police the front end, to treat the water before it goes into the bottle. The same goes for the data an LLM uses to reach its conclusions. Data sources that police themselves, that adhere to truth, that do what we like to call journalism, are going to be worth far more than those that digitally print anything that comes in.
Treating data starts with having a good data store. There can be some pollution in it. The water a Coke bottler treats comes from local sources, but the bottler also knows something about those sources.
The answer to the problem of AI truth, then, isn’t to take more data into the model. It’s to take in less, and to assign value to the data that does come in, so the output won’t poison people. As was true in his political career, Clegg is completely wrong here. The AI masters are completely wrong.
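What would “assigning value to incoming data” look like in practice? Here’s a minimal sketch, with hypothetical source names and trust scores, of front-end curation that drops or down-weights untrusted sources before anything reaches the model:

```python
# A minimal sketch of front-end data curation. The source names and trust
# scores are hypothetical -- the point is that value is assigned before
# anything reaches the model, not after.
from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

# Hypothetical trust scores for each source; anything unlisted gets 0.0.
SOURCE_TRUST = {
    "wire_service": 0.90,
    "peer_reviewed_journal": 0.95,
    "anonymous_forum": 0.20,
    "content_farm": 0.05,
}

def training_weight(doc: Document, cutoff: float = 0.5) -> float:
    """Return a sampling weight for a document, or 0.0 to drop it entirely."""
    trust = SOURCE_TRUST.get(doc.source, 0.0)
    return trust if trust >= cutoff else 0.0

corpus = [
    Document("wire_service", "City council passes budget after six-hour session."),
    Document("content_farm", "Doctors HATE this one weird trick..."),
]

curated = [(doc, w) for doc in corpus if (w := training_weight(doc)) > 0.0]
for doc, weight in curated:
    print(f"{doc.source:12s} weight={weight:.2f}")
```

The trust scores themselves are the hard, expensive part. That’s where journalism, and paying for it, comes in.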
Truth has value, and if you don’t pay for that value on the front end, you won’t get truth on the back end.