Training AI models is expensive. Hardware infrastructure such as GPUs, massive datasets, and the human experts needed to make the training process work effectively all come at a high cost.
Whatever answers AI can provide are based on the training data it has received. And what is that training data? It is what people have written on their websites, blogs, and other places on the Internet.
Because the effectiveness of an AI model depends on the data it was trained on, tech companies are now turning to books that are not on the Internet: historical works sitting in libraries, some of them written hundreds of years ago.
If AI models could train on information and gain their knowledge from documented datasets like books, that information could be more practically used to solve human problems with the help of AI.
The old books from Harvard’s library are set to go to AI researchers so they can continue training these AI models and make them more intelligent. Since the purpose of AI is to solve problems and help us reach specific goals, training it on books should be beneficial.
In addition to Harvard, a large public library in Boston will also be sharing its old newspapers and other documented information with this team of AI researchers.
AI models have been trained on some old books and archived newspapers, but that’s not much. More datasets are required, and this move will significantly improve the intelligence of AI models. The more AI chatbots learn from books, the better they become.
Some AI companies are currently being sued over allegations that their models were trained on copyrighted information and data, such as creative works that belong to artists and were used without the creators’ permission. The advantage of using these old books for training is that they are free to use and usually carry no copyright claims.
Although technically not all books in a library are copyright-free, very old books whose copyrights have expired enter the public domain.
Some legal teams have expressed the opinion that using books free from copyright issues is a great idea because it helps avoid legal troubles. Lawsuits can be very complex and expensive.
Libraries are mountains of information, and they do a great job of collecting, preserving, and sharing knowledge across generations—knowledge that is really old and written in many languages, giving contextual references to cultures that are thousands of years old. There is so much to learn from such material.
There are reports that Microsoft, the big tech company that backs OpenAI, the maker of ChatGPT, has worked with Harvard under a financial agreement to help its libraries prepare books in a way that AI can read and learn from.
Library involvement in AI training is a great way to move forward, as a Harvard researcher put it. Libraries are rich sources of knowledge that contain references to culture, language, and other valuable information.
Trust is very important when sharing information on the Internet. Books are trusted sources, so AI models are expected to become smarter and provide more accurate information because their training data will include these old books, which are normally considered reliable.
Artificial intelligence models were previously trained on data they could access from anywhere online, for example, blog posts, websites, forums like Reddit, or other random places where the information could be incorrect or unreliable.
For training, text is broken into tokens, which are like pieces of words. There is a tremendous amount of information from which these models need to learn, and the more tokens an AI model is trained on, the better it can understand, process, and provide answers.
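To make the idea of tokens concrete, here is a minimal sketch using the open-source tiktoken tokenizer. This is only an illustration; the companies discussed in this article may use different tokenizers for their own models.

```python
# Minimal illustration of tokenization: text is split into token ids,
# and each id maps back to a small piece of text (often part of a word).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common general-purpose encoding

text = "Old library books are a rich source of training data."
token_ids = enc.encode(text)                   # text -> list of integer token ids
pieces = [enc.decode([t]) for t in token_ids]  # each id decoded back to its word piece

print(len(token_ids), "tokens:", pieces)
```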
Facebook’s AI models have been trained on trillions of tokens. Compared to that, Harvard’s collection is smaller, but as discussed earlier, most of the data existing AI models have been trained on so far is already digitally available and, in many cases, may not be trustworthy or reliable.
So now, instead of using information that is readily available and in some instances less trustworthy, AI models are going to be working with libraries for their training purposes. This is because old books are more trustworthy than unverified content available online—written by people whose credentials and accuracy may not be validated.
Some of the oldest libraries have received financial assistance to share and prepare old books for AI models, so the world can benefit from the efficiency of these AI models as they become smarter by learning from the knowledge in these old books.
All the content that gets digitized will need to be in the public domain, which is what the libraries want. Freely available information can help citizens access knowledge, innovate, and build upon it, which is great for both personal and national development.
Even old French-language newspapers once read by French-speaking Canadians are going to be scanned, and they will be a great source of information for AI models to learn about culture and language. Old newspapers are like windows to the past. They’re a great way to understand how people lived, what ideas that generation had, and how society evolved.
This is a great opportunity not just for AI models but also for the public in general, because this information is expected to be made public. Anyone who wants to access it digitally can do so—not just AI models using it for learning purposes.
AI companies do not want to get into trouble for using information from copyrighted books. So this strategy of working with funded libraries that provide information from old books written hundreds of years ago is a great source of training data for these AI models and also a huge advantage for the public.
The information sources that these models will use are expected to be those that belong to the public domain and are not copyrighted.
AI models are not inherently ethical or unethical at this stage, since they are systems built by humans. But as machine learning becomes more sophisticated, it is necessary to train AI models on good datasets that are uplifting and knowledge-building for the betterment of society.
It is expected that all the information retrieved from these old books will be shared on a platform called Hugging Face, where the data will be available to anyone who wants to train or teach AI.
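As a rough sketch of what that access could look like, the snippet below loads a public-domain text collection from Hugging Face with the datasets library. The dataset name and the "text" field are hypothetical placeholders, not the actual repository this project will publish.

```python
# Sketch of pulling a public-domain text dataset from Hugging Face.
# The dataset id "example-org/public-domain-books" and the "text" field
# are hypothetical placeholders for illustration only.
from datasets import load_dataset

ds = load_dataset("example-org/public-domain-books", split="train")

# Peek at the first record to see what the digitized text looks like.
print(ds[0]["text"][:500])
```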
There is a huge cultural advantage for humanity when AI models learn from these old cultures, documented in many different languages beyond English, the primary language AI models have been trained on so far.
AI models will be somewhat like employees in an organization. Training them on content from centuries-old books helps them understand how people thought and solved problems, which is very useful for building smart AI models that act like intelligent employees and play a crucial role in organizations.
Content is a critical part of growth and must be absorbed with care and reasoning. Tech companies are aware that AI models need to avoid learning harmful content or irrelevant information that is not useful.
Building AI models that are helpful, not harmful, is very important.
Institutions like Harvard are going to play a key role in providing datasets that will be very useful for training AI models.
Updated on: June 27, 2025
As concerns have risen and many creators have filed lawsuits against AI companies over the unauthorized use of their copyrighted creative work, a federal judge has ruled that AI companies like Anthropic can train their models on published works such as books without the authors’ consent. The judge stated that this should be considered fair use, as the process is transformative in the context of AI training. Whether other courts will reference this ruling remains to be seen.