Lessons from building Sherlock: Automating security code reviews with Sourcegraph
This post covers an internal tool developed by the Sourcegraph Security team to enhance code review. While this tool isn’t publicly available, key insights are being integrated into Sourcegraph Agents. Join the waitlist to get early access.
Security teams often find themselves fighting an uphill battle when it comes to code reviews. Traditional SAST (Static Application Security Testing) tools generate a flood of alerts—many of them false positives—forcing engineers to spend hours triaging noise instead of fixing real vulnerabilities. Most tools also lack the context needed to separate genuine risks from false alarms. Meanwhile, cybersecurity benchmarks from the PurpleLlama project highlight how Large Language Models (LLMs) are becoming increasingly effective at spotting vulnerable patterns in code, pointing to a major shift in automated security analysis.
Motivated by these advancements and the limitations of traditional rule-based SAST tools, our security team built Sherlock, an internal tool that enhances code reviews by tapping into the contextual awareness of Sourcegraph Cody to provide richer insights into pull requests and diffs. Sherlock combines LLM capabilities with targeted prompts, repository metadata, and code scanner outputs to automate large parts of the security review process. This allows security engineers to focus on high-risk issues rather than getting bogged down in noise. In this post, we’ll share our journey in building and refining Sherlock, the challenges we faced, and how we successfully integrated it into our code review workflow.
How Sherlock Got Started
We started exploring AI’s potential for security reviews during a team hackathon. We experimented with Sourcegraph APIs to automate parts of the process, which led to integrating additional security context like scanner alerts and custom prompts. This evolved into Sherlock, a system that triages alerts, summarizes code changes from a security perspective, and streamlines reviews using Sourcegraph Cody’s contextual insights.
Behind the Scenes: How Sherlock Works

Sherlock integrates seamlessly with our GitHub workflow and Sourcegraph setup, enhancing security reviews without disrupting development. Whenever a pull request is opened or updated, a GitHub App streams diffs, SAST alerts, and key metadata directly into Sherlock. From there, Sherlock enriches the data by adding references, related files, and definitions before leveraging Sourcegraph Cody’s API for context-aware analysis.
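The intake-and-enrichment step described above can be sketched as follows. This is a minimal illustration, not Sherlock's actual implementation: the `ReviewContext` fields, the `index` lookup object (standing in for a Sourcegraph-backed code search), and the helper names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    """Input gathered for one pull request (illustrative field names)."""
    diff: str
    sast_alerts: list
    metadata: dict
    references: list = field(default_factory=list)
    related_files: list = field(default_factory=list)

def changed_paths(diff: str) -> list:
    """Pull changed file paths out of unified-diff '+++ b/...' headers."""
    return [line.split(" b/", 1)[1]
            for line in diff.splitlines()
            if line.startswith("+++ b/")]

def enrich(ctx: ReviewContext, index) -> ReviewContext:
    """Attach references and related files for every changed path.
    `index` is a placeholder for a code-intelligence backend exposing
    hypothetical references()/related() lookups."""
    for path in changed_paths(ctx.diff):
        ctx.references.extend(index.references(path))
        ctx.related_files.extend(index.related(path))
    return ctx
```

The enriched context, rather than the raw diff alone, is what gets handed to the LLM for analysis.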
To focus on the right areas, Sherlock uses custom security prompts tailored to the nature of our codebase and existing threat models. This approach highlights specific risks and flags edge cases that might otherwise be overlooked. Code scanner alerts are automatically fed into Sherlock, allowing it to correlate scanner findings with LLM-driven insights. This correlation process reduces noise and emphasizes issues that genuinely warrant attention.
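One way to picture the correlation step: findings that both a scanner and the LLM report on the same file and nearby lines are treated as a stronger signal than either source alone. The sketch below assumes a simple dict shape for alerts and findings; field names and the line tolerance are illustrative, not Sherlock's actual schema.

```python
def correlate(scanner_alerts, llm_findings, line_tolerance=5):
    """Split LLM findings into those corroborated by a scanner alert
    (same file, within a few lines) and those the LLM alone reported."""
    corroborated, llm_only = [], []
    for finding in llm_findings:
        match = next(
            (a for a in scanner_alerts
             if a["file"] == finding["file"]
             and abs(a["line"] - finding["line"]) <= line_tolerance),
            None,
        )
        if match:
            # Agreement between sources: carry the scanner rule along.
            corroborated.append({**finding, "scanner_rule": match["rule"]})
        else:
            llm_only.append(finding)
    return corroborated, llm_only
```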
Once the analysis is complete, Sherlock produces a concise summary of potential security concerns and prioritizes them by severity. This summary is sent to a dedicated Slack channel for quick visibility among the security team and logged in our SIEM for broader alerting workflows.
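A severity-ranked summary of this kind might be rendered like the sketch below before being posted to Slack; the message format and severity labels are illustrative, not Sherlock's actual output.

```python
# Lower number = higher priority; unknown severities sort last.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def summarize(findings):
    """Order findings by severity and render one summary line each."""
    ranked = sorted(findings,
                    key=lambda f: SEVERITY_ORDER.get(f["severity"], 3))
    return "\n".join(
        f"[{f['severity'].upper()}] {f['file']}: {f['title']}"
        for f in ranked
    )
```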
Challenges Encountered
Hallucinations and Irrelevant References
Even with ongoing improvements, LLMs occasionally flag non-existent vulnerabilities or point to unrelated files. To reduce these hallucinations, we’ve refined prompts and added contextual cues such as threat model docs, pull request metadata, and relevant code snippets backed by Sourcegraph context to keep the models focused. Another related issue is current LLMs’ tendency to suggest security best practices instead of identifying edge cases or vulnerabilities tied to specific code changes. Customizing prompts and narrowing the data fed to the models have proven effective in minimizing these off-target recommendations.
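The grounding strategy above amounts to building prompts that carry their own evidence and explicitly constrain the model to the diff at hand. This is a rough sketch of such a template, not Sherlock's actual prompt; the section wording and function signature are assumptions.

```python
PROMPT_TEMPLATE = """\
You are reviewing a pull request for security issues.

Threat model notes:
{threat_model}

Pull request metadata:
{metadata}

Changed code (with surrounding context):
{snippets}

Report only vulnerabilities introduced or affected by this change.
Do not give generic best-practice advice. For each finding, cite the
exact file and line from the diff above.
"""

def build_prompt(threat_model, metadata, snippets):
    """Assemble a review prompt grounded in threat-model docs, PR
    metadata, and the relevant code snippets."""
    return PROMPT_TEMPLATE.format(
        threat_model=threat_model,
        metadata="\n".join(f"{k}: {v}" for k, v in metadata.items()),
        snippets="\n\n".join(snippets),
    )
```

The closing instructions are the part doing the work against off-target output: they steer the model away from generic best-practice advice and toward findings it can tie to a specific line of the diff.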
Limitations of Code Navigation
While LLMs are adept at identifying common security patterns, they don’t navigate codebases the way humans do by mapping out symbols, references, and definitions across files to build a complete picture. As a result, they can sometimes report code snippets as vulnerable without understanding the broader context. Future integrations with Sourcegraph’s code navigation APIs or language server protocols should help the models form a more holistic view of the application, bringing it closer to the thoroughness of a human-driven security review.
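The kind of navigation a human reviewer does can be approximated as a bounded walk over the symbol graph: starting from the symbols touched by a diff, follow references and definitions a few hops out and feed everything reached back into the model's context. The sketch below assumes a hypothetical `nav` object with a `neighbors()` method standing in for a code-navigation backend such as Sourcegraph's APIs or an LSP server.

```python
from collections import deque

def collect_context(start_symbols, nav, max_hops=2):
    """Breadth-first walk over a symbol graph: gather every symbol
    within `max_hops` reference/definition edges of the changed ones.
    `nav.neighbors(symbol)` is an assumed interface, not a real API."""
    seen = set(start_symbols)
    queue = deque((s, 0) for s in start_symbols)
    while queue:
        symbol, hops = queue.popleft()
        if hops >= max_hops:
            continue  # don't expand beyond the hop budget
        for nxt in nav.neighbors(symbol):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen
```

Bounding the walk matters in practice: an unbounded expansion in a large codebase would blow past any model's context window.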
Key Metrics
In the last two months, Sherlock has scanned over 400 pull requests in our main repository, uncovering three high-severity and four medium-severity issues, along with 12 notable edge cases. These findings encompass both obvious, low-hanging vulnerabilities and subtler problems often overlooked by traditional scanning tools. The development team addressed these issues swiftly because the findings arrived validated and with context, but the real value lies in the time saved and the granular insights gained from systematic, automated checks.
Edge Cases and Uncovering Low-Hanging Fruit
Sherlock excels at identifying situations that fall outside standard pattern-matching approaches, particularly those that require a deeper understanding of how code changes interact with different parts of the application. When it comes to enumerating risks and edge cases, Sherlock augments the security team’s efforts by providing contextual prompts that surface nuanced issues. At the same time, its ability to flag straightforward yet critical vulnerabilities—those low-hanging fruit that code scanning tools might miss—ensures that seemingly minor flaws don’t slip through the cracks.
Proactive, Efficient Reviews
By delivering valid, contextualized alerts, Sherlock enables proactive code security reviews. Developers and security engineers receive timely notifications about pull requests, ensuring that security remains a constant focus rather than an afterthought. In practice, this approach reduces false positives and cuts manual triage efforts by at least 30 minutes per day per security engineer, freeing the security team to concentrate on more complex tasks like analyzing architecture changes or refining threat models.
Business Value
Beyond the technical gains, we believe Sherlock now delivers tangible business benefits. It boosts productivity by helping security engineers and on-call responders prioritize actual risks rather than sifting through a barrage of irrelevant alerts. This focus on high-value or high-risk areas accelerates development cycles without compromising security rigor. By enumerating edge cases and summarizing code changes through a security lens, Sherlock empowers the team to make informed, risk-based decisions quickly, ultimately enhancing confidence in the overall safety of the codebase.
Conclusion
Sherlock demonstrates the power of combining LLMs with rich context to triage security issues effectively. What started as an internal experiment quickly became a critical tool that saves the security team time and allows us to focus on more impactful work.
Building Sherlock required laying down a significant amount of foundational functionality: connecting to code hosts, making calls to LLM providers, gathering and passing context, navigating the codebase, sending Slack notifications, and integrating with external tools like SAST scanners.
We’re building Sourcegraph Agents to provide a framework for these kinds of systems—offering pre-built agents for common tasks and composable components that let you quickly build and customize agents to meet your internal needs.
At Sourcegraph, we’re focused on industrializing software development with AI agents. If you want to build with us, join the waitlist for early access.