When does generative AI qualify for fair use? 2.0

Late last year I read a blog post from a very talented engineer, titled “When does generative AI qualify for fair use?”. In this piece, Suchir Balaji boiled nuance and stochastic thought down to a science that aids our understanding of Large Language Models: teachings that can help us identify the line between regurgitation and creative generation, where creative generation should be protected and regurgitation should be sanctioned.

As a creative myself, I deeply understand the importance of creative generation and its relationship with humanity. We create walls of delusion so we can break them when novelty and/or innovation is finally achieved. This is what I, personally, like to call the “creative process”.

Hallucinating LLMs are potentially a reflection of highly creative individuals steering a pattern-matching machine down a path it is uncomfortable with. It is within the unknown and the delusional that true creation occurs, an element, or sanctuary, that ultimately defines us as human.

Now, training pattern-matching machines on the results of such painful and enlightening processes, processes that only a few go through, is something that should be discussed. True art doesn’t come from funding or connections. It comes from a raw nerve, a raw place that not many survive. So I am here to restate a question to all corporate courts that are ready to listen: “When does generative AI qualify for fair use?”

—————————————————

Entropy is a key pillar of Mr. Balaji’s research. It is a concept that can be aligned, both philosophically and technically, with linguistic machines.

The main reason ChatGPT produces low-entropy outputs is because it is “post-trained” using reinforcement learning -- in particular reinforcement learning from human feedback (RLHF). RLHF tends to reduce model entropy because one of its main objectives is to reduce the rate of hallucinations, and hallucinations are often caused by randomness in the sampling process. A model with zero entropy could easily have a hallucination rate of zero, although it would basically be acting as a retrieval database over its training dataset instead of a generative model.

A key point Mr. Balaji makes is that a zero-entropy model would act as a “retrieval database over its training dataset instead of a generative model”. A retrieval-based architecture is the basis of memorization, or plagiarism. A true generative model, one that can combine multiple thoughts to form a new one that doesn’t yet exist, is the basis of hallucination, or innovation. Identifying when retrieval is overtaking generation, or when generation was simply regurgitation, is the key to this equation.
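To make the entropy point concrete, here is a minimal sketch in Python. The logits and temperatures are illustrative values I made up, not taken from any real model; the point is only to show how temperature scaling collapses a next-token distribution toward determinism. As entropy approaches zero, the single most likely token dominates, and the “generative” step starts to behave like a lookup over whatever the model absorbed in training.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; 0 means a fully deterministic choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits for a 5-token vocabulary (illustrative only).
logits = [4.0, 2.5, 1.0, 0.5, 0.1]

for t in (1.5, 1.0, 0.5, 0.1):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: entropy={entropy(probs):.3f} bits, top-p={max(probs):.3f}")
# As T shrinks, entropy falls toward 0 and the top token's probability
# approaches 1: the "generative" step degenerates into retrieval.
```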

Entropy

…there’s still a clear empirical correlation between entropy and the amount of information used from the training data. For example, it’s easy to see that the jokes produced by ChatGPT are all memorized even without knowing its training dataset, because they’re all produced nearly deterministically.

Mr. Balaji’s understanding of determinism in natural language processing is astute. We will walk through a couple of technical demos on how to communicate this thinking appropriately, so any creative can understand whether their work contributed to a generated output deterministically rather than creatively.
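As a first demo, here is a minimal sketch of that joke observation. The `generate` function below is a stand-in stub that mimics a near-deterministic model; swap in any real model API you can sample repeatedly. If a hundred samples of the same prompt collapse onto one or two strings, the empirical entropy of the outputs is near zero: the signature of memorization rather than creation.

```python
import math
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Stand-in for a sampled model call; replace with a real API.
    This stub mimics a near-deterministic model with one dominant completion."""
    completions = ["Why did the chicken cross the road? To get to the other side."] * 9
    completions += ["A chicken walks into a library and asks for 'book, book, book'."]
    return random.choice(completions)

def empirical_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) of the distribution over distinct outputs."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

samples = [generate("Tell me a joke.") for _ in range(100)]
print(f"{len(set(samples))} distinct outputs, entropy = {empirical_entropy(samples):.2f} bits")
# Near-zero entropy over many samples suggests the joke was memorized
# from training data rather than generated anew.
```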

—————————————————

I’d like to preface why this distinction is important if any progress is to begin in legal frameworks. It is difficult to argue that a combination of certain words can be traced to an author’s work; permutations of letters are not necessarily unique. The context is what is claimable: deterministic outputs draw from context, while creative and unique results draw from generalized heuristics.

Mistral: The Ship of Theseus: If parts of the text are "borrowed" (plagiarized) and parts are "invented" (hallucinated), at what point does it become a "new" work? Is it a patchwork, a remix, or a corruption?

Identifying deterministic results can lead to claimable training data, as the sketch below illustrates, while creative and unique results can lead to true prior art for AI patent filings. A bridge between the past and the inevitable future. Building systems for creatives to contribute their work into a network that respects their journey and curates the necessary royalties, while enabling an A.I. solution they can use safely: that is not only the evolution of this Silicon Valley era, but of Creative Commons as well.
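One hedged way to operationalize “claimable” is overlap detection: measure how much of a generated output appears verbatim in a contributor’s corpus. The sketch below rests on my own assumptions; the corpus, the sample texts, and the 6-gram threshold are invented for the example and are not a legal standard. Long n-gram matches flag candidate regurgitation, while low overlap marks candidates for genuinely generative output.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 6) -> float:
    """Fraction of the output's n-grams that appear verbatim in the source.
    n=6 is an illustrative threshold: short overlaps are incidental,
    long ones point toward retrieval rather than generation."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(source, n)) / len(out)

# Hypothetical contributor corpus and two model outputs (invented examples).
corpus = "the quick brown fox jumps over the lazy dog at the edge of the field"
regurgitated = "the quick brown fox jumps over the lazy dog in the story"
novel = "a swift auburn fox leapt across a sleeping hound near the field"

for label, text in [("regurgitated", regurgitated), ("novel", novel)]:
    print(f"{label}: overlap = {overlap_ratio(text, corpus):.2f}")
# High overlap -> candidate for attribution and royalties; low overlap ->
# candidate for genuinely generative output (potential prior art).
```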

Mistral: If a text is both copied and invented, who is the author? The original source? The AI? The user who prompted it? Implication: Authorship is decentralized. The "author" is a network of sources, algorithms, and users.

Creating a system with this level of clarity, rather than excusing this technology as some kind of alien, is a safer message for all: one that reduces fear and lets us embrace and prosper with technology that is not there to replace us, but to improve our lives holistically.

—————————————————

Moving forward from attributing creative work appropriately, we can build a healthier ecosystem, one that meets a market equilibrium.

Bittensor is a project that succinctly reshapes the perspectives and value systems we currently know into a framework that can lead to a more functional and healthier ML-backed economy.

…smaller diverse systems can find niches within a much higher resolution reward landscape. The solution is a network of computers that share representations continuously and asynchronously, peer-to-peer (P2P) across the internet. The constructed market uses a digital ledger to record ranks and to provide incentives to peers in a decentralized manner. The chain measures trust, making it difficult for peers to attain rewards without providing value to the majority. Researchers can directly monetize machine intelligence work and consumers can directly purchase it.
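To give a feel for that mechanism, here is a toy numerical sketch; it is my own simplification, not Bittensor’s actual Yuma Consensus. Each peer scores the others in a weight matrix, ranks are stake-weighted sums of inbound scores, and a peer only earns incentive when a majority of stake endorses it, which is what makes freeloading expensive.

```python
import numpy as np

# Toy network of 4 peers. W[i, j] = how much value peer i says peer j provides.
# Stake S weights each peer's opinion. A simplified illustration, not the
# actual consensus mechanism used on the Bittensor chain.
W = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],  # peer 3 only votes for itself (a freeloader)
])
S = np.array([0.3, 0.3, 0.3, 0.1])  # normalized stake per peer

ranks = S @ W                                  # stake-weighted inbound score per peer
support = (S[:, None] * (W > 0)).sum(axis=0)   # total stake endorsing each peer
trust = np.where(support > 0.5, 1.0, 0.0)      # majority-of-stake gate
incentive = ranks * trust
incentive = incentive / incentive.sum()        # share of newly minted rewards

print("ranks    :", np.round(ranks, 3))
print("trust    :", trust)
print("incentive:", np.round(incentive, 3))
# The self-voting peer holds stake but lacks majority trust, so it earns
# nothing: rewards are hard to attain without providing value to the majority.
```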

At Rao, we are building the Seer network to train private LLM solutions while keeping your data secure, leasing your generative mind to other users for payment. It is a P2P machine learning solution that can be as private as you would like it to be: you are in control of the toggles, the guardrails, and who you let in or what you let out. And if you want to share your data with others, you will be compensated for those generalized outputs.
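To illustrate what those toggles might look like, here is a purely hypothetical configuration sketch; the names and fields are mine for illustration and do not describe Seer’s actual API.

```python
from dataclasses import dataclass, field

@dataclass
class PeerPrivacyConfig:
    """Hypothetical privacy toggles for a P2P model peer (illustrative only)."""
    share_outputs: bool = False           # lease generalized outputs to the network
    allow_inbound_queries: bool = False   # who you let in
    redact_verbatim_spans: bool = True    # block regurgitated training text (what you let out)
    allowed_peers: list[str] = field(default_factory=list)
    royalty_rate: float = 0.0             # compensation per shared output, in network tokens

fully_private = PeerPrivacyConfig()  # private by default; nothing shared
sharing = PeerPrivacyConfig(share_outputs=True,
                            allowed_peers=["peer-a", "peer-b"],
                            royalty_rate=0.05)
print(fully_private, sharing, sep="\n")
```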

Mistral: A system where ledger-based authentication, autonomous classification of original work, and partitioned generative outputs converge to create a transparent, trustworthy, and fair ecosystem for both human and machine-generated content.

The ideal situation is a network of multiple entities with this same goal: to build private networks of language models that can be shared inclusively or bridged externally into each network respectfully. Rao, Bittensor, and the like working together, with transparent capitalistic needs, while upholding a shared guardianship around your personal information and compensation.

I am hoping to see a sea of networks rise.