
Beekeeper Boosts User Experience with AI-Driven Personalization using Amazon Bedrock

Key Takeaways:

- Beekeeper built an Amazon Bedrock-powered system that continuously evaluates LLM and prompt candidates, ranks them on a live leaderboard, and routes each request to the current best pair for its use case.
- The pipeline runs in two phases: building a baseline leaderboard from automated evaluations, then personalizing prompts with user feedback.
- Summary quality is scored with metrics such as compression ratio, presence of action items, absence of hallucinations, and vector comparison.
- The approach reduces manual model and prompt selection work, shortens the feedback cycle, and lets new models be benchmarked the same way as existing ones.

Introduction to Large Language Models
Large Language Models (LLMs) are evolving at a rapid pace, making it difficult for organizations to select the best model for each specific use case, optimize prompts for quality and cost, and adapt to changing model capabilities. As Mike Koźmiński from Beekeeper notes, "Choosing the ‘right’ LLM and prompt isn’t a one-time decision—it shifts as models, prices, and requirements change." To address this issue, Beekeeper built an Amazon Bedrock-powered system that continuously evaluates model and prompt candidates, ranks them on a live leaderboard, and routes each request to the current best choice for that use case.

Beekeeper’s Solution
Beekeeper’s solution consists of two main phases: building a baseline leaderboard and personalizing with user feedback. The system uses several AWS components, including Amazon EventBridge for scheduling, Amazon Elastic Kubernetes Service (EKS) for orchestration, AWS Lambda for evaluation functions, Amazon Relational Database Service (RDS) for data storage, and Amazon Mechanical Turk for manual validation. As Koźmiński explains, "The system mutates promising prompts to create variations, evaluates these again, and saves the best performers. When user feedback arrives, the system incorporates it through a second phase." The coordinator fetches ranked model/prompt pairs and sends them with user feedback to a mutator, which returns personalized prompts.
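
As a rough sketch of that second phase, the loop below mutates ranked prompts with user feedback and re-scores them. The mutate_prompt and evaluate callables stand in for the mutator and the Lambda-based evaluation functions; they are assumptions for illustration, not Beekeeper's API.

```python
# Rough sketch of the personalization phase. The callables passed in stand in
# for the prompt mutator and the evaluation suite; in the real system the
# coordinator runs on Amazon EKS and evaluators run as AWS Lambda functions.
from typing import Callable

def personalize(
    ranked_pairs: list[tuple[str, str]],             # (model_id, prompt) from the leaderboard
    user_feedback: list[str],                        # free-text feedback collected from users
    mutate_prompt: Callable[[str, list[str]], str],  # LLM-backed prompt rewriter
    evaluate: Callable[[str, str], float],           # metric suite returning a 0-100 score
) -> list[tuple[str, str, float]]:
    """Return (model_id, personalized_prompt, score) triples, best first."""
    personalized = []
    for model_id, prompt_template in ranked_pairs:
        new_prompt = mutate_prompt(prompt_template, user_feedback)  # incorporate feedback
        score = evaluate(model_id, new_prompt)                      # re-run the metrics
        personalized.append((model_id, new_prompt, score))
    # Best performers would then be persisted back to the leaderboard (e.g. in Amazon RDS).
    return sorted(personalized, key=lambda item: item[2], reverse=True)
```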

Evaluation Criteria
The quality of summaries generated by model/prompt pairs is measured with both quantitative and qualitative metrics, including compression ratio, presence of action items, absence of hallucinations, and vector comparison. For example, the compression ratio compares the length of the summarized text with the original and checks its adherence to a target length; the corresponding score, between 0 and 100, is computed programmatically with a Python function, calculate_compression_score(original_text, compressed_text), shared in the original post. The presence of action items is checked by comparing the summary against the ground truth, and the absence of hallucinations is evaluated using cross-LLM evaluation and manual validation.
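
The original post includes the full implementation; the sketch below only approximates the idea of scoring adherence to a target length, with an assumed target ratio and scoring curve.

```python
# Approximate sketch of a compression score in the spirit of
# calculate_compression_score; the target ratio and the scoring curve are
# assumptions, not the formula from the original post.
def calculate_compression_score(original_text: str, compressed_text: str,
                                target_ratio: float = 0.2) -> float:
    """Return a 0-100 score for how closely the summary length matches
    a target fraction of the original length."""
    original_len = len(original_text.split())
    compressed_len = len(compressed_text.split())
    if original_len == 0 or compressed_len == 0:
        return 0.0
    actual_ratio = compressed_len / original_len
    if actual_ratio >= 1.0:  # the "summary" is not shorter than the original
        return 0.0
    # Scale the deviation from the target ratio into a 0-100 score.
    deviation = abs(actual_ratio - target_ratio) / max(target_ratio, 1 - target_ratio)
    return round(100 * (1 - deviation), 2)

# calculate_compression_score(long_transcript, short_summary) -> e.g. 87.5
```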

Real-World Example: Chat Summarization
One practical application of Beekeeper’s LLM system is chat summarization. When a user returns to a shift, they might find a chat with many unread messages – instead of reading everything, they can request a summary. The system generates a concise overview with action items tailored to the user’s needs. Users can then provide feedback to improve future summaries. As the article notes, "This seemingly simple feature relies on sophisticated technology behind the scenes. The system must understand conversation context, identify important points, recognize action items, and present information concisely—all while adapting to user preferences."
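
As a hypothetical illustration of the request itself, the snippet below calls Amazon Bedrock's Converse API with a given prompt and model; the model ID, region, and prompt shown are placeholders, not Beekeeper's actual selections, which come from the leaderboard at request time.

```python
# Illustrative chat-summary request against Amazon Bedrock's Converse API.
# The model ID, region, and prompt are placeholder assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarize_chat(chat_transcript: str, prompt_template: str, model_id: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": prompt_template}],  # e.g. "Summarize the chat and list action items."
        messages=[{"role": "user", "content": [{"text": chat_transcript}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# summary = summarize_chat(unread_messages, winning_prompt,
#                          "anthropic.claude-3-haiku-20240307-v1:0")
```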

Benefits and Conclusion
The key benefit of Beekeeper’s solution is its ability to evolve rapidly and adapt to user needs. Because it combines synthetic data with real user feedback, the approach is practical even for smaller engineering teams. As Koźmiński concludes, "The proposed solution offers several notable benefits. It reduces manual labor by automating the LLM and prompt selection process, shortens the feedback cycle, enables the creation of user- or tenant-specific improvements, and provides the capacity to seamlessly integrate and estimate the performance of new models in the same manner as the previous ones." By building a similar pipeline with AWS services, organizations can create a feedback-driven system that continuously improves results for their users.

https://aws.amazon.com/blogs/machine-learning/how-beekeeper-optimized-user-personalization-with-amazon-bedrock/
