AI Explained – The Challenges of LLMs and Content Retention Policies
In the era of big data and machine learning, demand for large and diverse data sets for AI (artificial intelligence) is increasing. This poses a challenge for organizations that must follow content retention policies mandating the deletion or archiving of data after a certain period. It is especially relevant for products built on LLMs (large language models), such as the upcoming Microsoft 365 Copilot, which ground their models in copious amounts of data drawn from your private content repositories.
Scoping becomes key
A poorly scoped AI effort can spawn large secondary efforts, such as reviewing data retention policies. If you are starting your AI journey, we recommend scoping your first effort tightly to keep those secondary efforts small. Use your first efforts to inform your future direction.
So how do you prevent your content retention policies from accidentally giving your AI amnesia? Content retention policies are often designed to protect the privacy and security of data subjects and to reduce data storage and maintenance costs. However, these policies may also limit the availability and quality of data for LLMs, which depend on a large body of content to produce useful results. For example, if sales documentation is deleted or archived after a few years, the LLM that relies on it can no longer access it.
Content retention policy assumptions have changed
AI use cases change the design assumptions behind today’s content retention policies. Organizations now need to balance the competing interests of content retention and the data needs of LLMs. How can they meet the legal and ethical requirements of data protection while still supplying enough data for LLMs to use?
Review these considerations
Here are some considerations to review against your current content retention policies.
The purpose and scope of content retention policies
Organizations should review their content retention policies and ensure they align with the purpose and scope of their data collection and processing activities. For example, if an organization records meetings for the purpose of summarization, it should keep the recordings for exactly as long as they are relevant and useful for that purpose, neither longer nor shorter. In some cases, the retention period may be only as long as the LLM takes to process the original information.
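One way to picture purpose-based retention is as a rule attached to each purpose. The following is a minimal Python sketch, not a description of how any Microsoft product implements retention. The names RetentionRule, Item, and is_due_for_deletion, and the 90-day limit, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RetentionRule:
    """Hypothetical rule: how long content kept for one purpose may live."""
    purpose: str
    max_age: timedelta                    # upper bound, counted from creation
    delete_after_processing: bool = False  # True: delete once the LLM has used it


@dataclass
class Item:
    """Hypothetical content item, e.g. a meeting recording."""
    purpose: str
    created: date
    processed: Optional[date] = None      # set once the LLM has summarized it


def is_due_for_deletion(item: Item, rule: RetentionRule, today: date) -> bool:
    """Return True when the item has outlived the purpose it was kept for."""
    if item.purpose != rule.purpose:
        raise ValueError("rule does not apply to this item's purpose")
    # Case 1: retention is measured by processing, not by calendar time.
    if rule.delete_after_processing and item.processed is not None:
        return True
    # Case 2: fall back to the age limit.
    return today - item.created > rule.max_age


# Assumed example policy: recordings exist only to be summarized.
rule = RetentionRule("meeting_summarization", timedelta(days=90),
                     delete_after_processing=True)
recording = Item("meeting_summarization", created=date(2024, 1, 1))
print(is_due_for_deletion(recording, rule, today=date(2024, 1, 15)))
```

Attaching the rule to the purpose rather than to the file type keeps the policy question ("why do we hold this?") separate from the storage question ("where is it?").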
The value and quality of data for LLMs
Organizations should assess the value and quality of their data for LLMs and prioritize retaining the data that is most useful to them. For example, in theory, if an organization has code snippets that are rare, diverse, or high-quality, it should keep them longer than code snippets that are common, redundant, or low-quality. In practice, this is a much harder assessment and may require added monitoring of where this content is used.
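As a rough illustration of the "in theory" case, the sketch below scales a retention period by a simple value score. Everything in it is an assumption made for the example: the Snippet fields, the equal weighting of rarity and quality, the redundancy penalty, and the one-to-five-year range. Producing trustworthy rarity and quality numbers is the hard part in practice, and this sketch does not address it.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical code snippet with pre-computed assessment scores."""
    name: str
    rarity: float     # 0.0 (common) to 1.0 (rare); assumed to be given
    quality: float    # 0.0 (low) to 1.0 (high); assumed to be given
    duplicates: int   # number of near-identical copies held elsewhere


def value_score(s: Snippet) -> float:
    """Combine rarity and quality, discounted by redundancy (range 0..1)."""
    redundancy_penalty = 1 / (1 + s.duplicates)
    return (0.5 * s.rarity + 0.5 * s.quality) * redundancy_penalty


def retention_days(s: Snippet, min_days: int = 365, max_days: int = 5 * 365) -> int:
    """Scale the retention period linearly between min_days and max_days."""
    return round(min_days + value_score(s) * (max_days - min_days))


rare = Snippet("custom-pricing-engine", rarity=1.0, quality=1.0, duplicates=0)
common = Snippet("hello-world", rarity=0.0, quality=0.0, duplicates=3)
print(retention_days(rare), retention_days(common))
```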
The risks and benefits of data retention and deletion to LLMs
Organizations should weigh the risks and benefits of keeping or deleting data for LLMs and balance them accordingly. For example, if an organization keeps sales documentation for LLMs, it may benefit from improved LLM performance and user satisfaction with its Dynamics 365 Copilot, but it may also incur higher storage costs and greater exposure to privacy breaches. Conversely, if an organization deletes sales documentation too soon, it may reduce storage costs and privacy risks, but it may also lose valuable data and insights, hindering sales and degrading Copilot’s LLM performance and user satisfaction.
Based on these considerations, here are some recommendations on reviewing current retention policies to update them for the world of AI.
- Review the legal and ethical obligations of data protection. Organizations should ensure that their content retention policies meet the data protection obligations of their jurisdiction and industry. For example, organizations should follow the principles of data minimization, storage limitation, and accountability under the General Data Protection Regulation (GDPR) in the European Union.
- Review the business objectives and user expectations of LLMs. Organizations should ensure that their content retention policies support what the business and its users expect from LLMs. For example, organizations should retain enough data to enable LLMs to generate accurate, relevant, and diverse code suggestions for developers.
- Review the technical capabilities and limitations of LLMs. Organizations should ensure that their content retention policies work within what LLMs can and cannot use. For example, organizations should retain data in formats and on platforms that are compatible with LLMs, and delete or archive data that is outdated or incompatible with them.
In conclusion, content retention policies and the data needs of LLMs are two important but conflicting aspects of data management. Organizations should review their current retention policies against the considerations above: the purpose of the data, its value and quality, the risks and benefits of keeping it, their legal and ethical obligations, their business objectives and user expectations, and the capabilities and limitations of their LLMs. By doing so, organizations can protect their data subjects’ privacy and security while also enhancing their LLMs’ performance and user satisfaction.