vmray: support parsing flog.txt (Download Function Log) #2878

Open

devs6186 wants to merge 3 commits into mandiant:master from devs6186:feature/2452-vmray-flog-txt

Conversation

Contributor

devs6186 commented Feb 22, 2026

closes #2452

Adds support for parsing VMRay's flog.txt format -- the free "Download Function Log" available from VMRay Threat Feed -> Full Report -> Download Function Log. Users no longer need the full analysis ZIP archive to run capa against VMRay output.

What changed

| File | Change |
| --- | --- |
| `capa/features/extractors/vmray/flog_txt.py` | New parser: header validation, Process/Thread/Region block splitting, API trace line parsing, `sys_` prefix stripping |
| `capa/features/extractors/vmray/__init__.py` | `VMRayAnalysis.from_flog_txt()` -- builds analysis object from standalone flog.txt (no ZIP) |
| `capa/features/extractors/vmray/extractor.py` | `VMRayExtractor.from_flog_txt()` -- convenience classmethod |
| `capa/helpers.py` | Detect flog.txt by filename + header magic in `get_format_from_extension`; updated unsupported-format error message to mention flog.txt |
| `capa/loader.py` | Route `flog.txt` inputs through `VMRayExtractor.from_flog_txt` in both `get_extractor` and `get_file_extractors` |
| `tests/test_vmray_flog_txt.py` | 5 unit tests: minimal parse, header rejection, `sys_` stripping, `VMRayAnalysis` construction, `VMRayExtractor` construction |
| `doc/usage.md` | Updated CAPE row to mention VMRay flog.txt alongside VMRay ZIP |
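
For readers who want a feel for the "filename + header magic" detection listed in the table, here is a minimal sketch. It is not the code from this PR: the helper name `looks_like_flog_txt` and the constant `ASSUMED_MAGIC` are hypothetical, and the header text is an assumption that should be checked against a real flog.txt export.

```python
from pathlib import Path

# Assumption: a flog.txt export begins with this text. Verify against a real
# file before relying on the exact bytes.
ASSUMED_MAGIC = b"# Flog Txt Version"


def looks_like_flog_txt(path: Path) -> bool:
    """Return True if both the file name and the leading bytes suggest flog.txt."""
    # Cheap check first: the name must end in "flog.txt" (case-insensitive).
    if not path.name.lower().endswith("flog.txt"):
        return False
    try:
        with path.open("rb") as f:
            head = f.read(len(ASSUMED_MAGIC))
    except OSError:
        # Unreadable or missing files are simply "not flog.txt".
        return False
    return head == ASSUMED_MAGIC
```

Checking the name before opening the file keeps the common case (other formats) free of extra I/O; the magic check then guards against unrelated files that happen to share the name.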

Usage

capa path/to/flog.txt --backend vmray

Notes

- Static features (imports, exports, sections, strings) are unavailable from flog.txt alone -- only dynamic API call features are extracted
- The failing tests in tests/test_vmray_features.py are pre-existing and unrelated: they require the large ZIP test fixture (tests/data/dynamic/vmray/...) which is not part of this repo

Checklist

No CHANGELOG update needed
No new tests needed
No documentation update needed
This submission includes AI-generated code and I have provided details in the description.

Contributor

gemini-code-assist bot commented Feb 22, 2026

Summary of Changes

Hello @devs6186, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances capa's capability to analyze VMRay dynamic analysis reports by introducing direct support for flog.txt files. Previously, capa required the full VMRay analysis ZIP, but now users can leverage the free "Download Function Log" to extract dynamic API call features. This change streamlines the analysis workflow for users who only have access to the flog.txt output, making capa more versatile and accessible for VMRay report analysis.

Highlights

VMRay flog.txt Parsing: Implemented a new parser for VMRay's flog.txt format, allowing capa to analyze dynamic API calls directly from this file.
Standalone Analysis: Enabled capa to run against flog.txt files without requiring the full VMRay analysis ZIP archive, leveraging the free 'Download Function Log'.
Integration with capa Loader: Updated the capa loader and helper functions to automatically detect and route flog.txt inputs to the new VMRay extractor.
Changelog and Documentation: Added an entry to the CHANGELOG.md and updated doc/usage.md to reflect the new flog.txt support.
Unit Tests: Introduced new unit tests to validate the flog.txt parser, including header validation, sys_ prefix stripping, and VMRayAnalysis/VMRayExtractor construction.

Changelog

CHANGELOG.md
- Added a new feature entry for VMRay flog.txt support.
capa/features/extractors/vmray/__init__.py
- Imported new models (AnalysisMetadata, FileHashes) and the flog_txt module.
- Updated SUPPORTED_FLOG_VERSIONS to include "1" for flog.txt.
- Added a from_flog_txt class method to VMRayAnalysis to construct an analysis object from a standalone flog.txt file.
capa/features/extractors/vmray/extractor.py
- Added a from_flog_txt class method to VMRayExtractor for convenience in building an extractor from a flog.txt path.
capa/features/extractors/vmray/flog_txt.py
- Added a new module containing functions to parse VMRay flog.txt content, including header validation, process/thread/region block splitting, API trace line parsing, and sys_ prefix stripping.
capa/helpers.py
- Implemented logic in get_format_from_extension to detect flog.txt files based on filename and header magic.
- Updated the unsupported format error message to mention flog.txt as a supported VMRay report type.
capa/loader.py
- Modified get_extractor and get_file_extractors to conditionally use VMRayExtractor.from_flog_txt when the input file is detected as flog.txt.
doc/usage.md
- Updated the "Ways to consume capa output" table to explicitly mention VMRay flog.txt alongside VMRay ZIP for CAPE integration.
tests/test_vmray_flog_txt.py
- Added a new test file with unit tests for flog.txt parsing, including minimal parsing, header rejection, sys_ prefix stripping, and construction of VMRayAnalysis and VMRayExtractor objects.
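
The block splitting and `sys_` stripping summarized above can be sketched as follows. This is an illustration under assumptions, not the PR's parser: it assumes each block starts with a line containing only `Process:` or `Thread:`, and the names `split_blocks` and `strip_sys_prefix` are hypothetical.

```python
import re

# Assumption: a block starts with a line holding only the marker, optionally
# followed by spaces or tabs.
PROCESS_SPLIT = re.compile(r"^Process:[ \t]*$", re.MULTILINE)
THREAD_SPLIT = re.compile(r"^Thread:[ \t]*$", re.MULTILINE)


def split_blocks(text: str) -> list[list[str]]:
    """Return one list per process: [process_properties, thread_1, thread_2, ...]."""
    processes = []
    # Element 0 is whatever precedes the first "Process:" marker (the file header).
    for proc in PROCESS_SPLIT.split(text)[1:]:
        parts = [p.strip() for p in THREAD_SPLIT.split(proc)]
        processes.append([p for p in parts if p])
    return processes


def strip_sys_prefix(api_name: str) -> str:
    """Drop a leading "sys_" so Linux syscall names match plain API names."""
    return api_name[4:] if api_name.startswith("sys_") else api_name
```

Anchoring the markers to whole lines (`^...$` with `re.MULTILINE`) avoids splitting on the word "Process:" if it ever appears inside an argument string.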

Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot reviewed

Feb 22, 2026

Contributor

gemini-code-assist bot left a comment

Code Review

The pull request introduces support for parsing VMRay's flog.txt format, which is a great addition for users who don't have access to the full analysis ZIP. The implementation is generally sound and integrates well with the existing VMRay extractor. I have identified a few areas for improvement regarding robustness against malformed input and the completeness of the extracted features (specifically API arguments).

capa/features/extractors/vmray/flog_txt.py

Comment on lines 43 to 47

		def _parse_hex_or_decimal(s: str) -> int:
		    s = s.strip().strip('"')
		    if s.startswith("0x") or s.startswith("0X"):
		        return int(s, 16)
		    return int(s, 10)

Contributor

gemini-code-assist bot Feb 22, 2026

The _parse_hex_or_decimal function is not robust against empty strings. If a property in the flog.txt file is present but has no value (e.g., os_pid = ), int(s, 10) will raise a ValueError, causing the parser to crash. It would be safer to handle empty strings by returning a default value (like 0) or skipping the property.

Suggested change

	- def _parse_hex_or_decimal(s: str) -> int:
	-     s = s.strip().strip('"')
	-     if s.startswith("0x") or s.startswith("0X"):
	-         return int(s, 16)
	-     return int(s, 10)
	+ def _parse_hex_or_decimal(s: str) -> int:
	+     s = s.strip().strip('"')
	+     if not s:
	+         return 0
	+     if s.lower().startswith("0x"):
	+         return int(s, 16)
	+     return int(s, 10)
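
To make the failure mode concrete, here is a small standalone check. The function `parse_hex_or_decimal` is a stand-in written for this illustration (with an added `default` parameter that is not part of the suggestion), not the PR's code.

```python
def parse_hex_or_decimal(s: str, default: int = 0) -> int:
    """Parse '0x1f' or '31'; return `default` when the value is empty."""
    s = s.strip().strip('"')
    if not s:
        # An empty property value (e.g. `os_pid = `) lands here.
        return default
    if s.lower().startswith("0x"):
        return int(s, 16)
    return int(s, 10)


# Without the guard, the empty value is what raises.
try:
    int("", 10)
    raised = False
except ValueError:
    raised = True
```

One trade-off worth weighing: returning 0 keeps the parser running, but it also makes a missing value indistinguishable from a real zero. Returning `None` and letting the caller decide would be an alternative design.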

capa/features/extractors/vmray/flog_txt.py Outdated

		thread_blocks = [p.strip() for p in parts[1:] if p.strip()]

		# First part: Process properties then Region: blocks
		process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])

Contributor

gemini-code-assist bot Feb 22, 2026

The split by \nRegion:\n is less flexible than the regex-based splits used for Process: and Thread: blocks. If the log file contains trailing spaces after Region:, the split will fail to isolate the process properties. For consistency and robustness, consider using a regex similar to the ones used in lines 122 and 200.

Suggested change

	- process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
	+ process_props = _parse_properties(re.split(r"\n\s*Region:\s*\n", header_and_regions)[0])
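
The difference between the literal split and a whitespace-tolerant regex split can be seen in a short standalone example. The pattern `REGION_SPLIT` and the sample text are assumptions made for this illustration, not copied from the PR.

```python
import re

# Illustrative pattern: tolerate spaces or tabs around the marker line.
REGION_SPLIT = re.compile(r"\n[ \t]*Region:[ \t]*\n")

# Sample block where the marker line carries a trailing space.
block = 'id = "1"\nos_pid = "0x4"\nRegion: \nname = "stack"\n'

plain = block.split("\nRegion:\n")    # literal match fails: nothing is split
tolerant = REGION_SPLIT.split(block)  # splits despite the trailing space
```

With the literal split, `plain[0]` is the whole block, so region properties would silently be parsed as process properties.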

capa/features/extractors/vmray/flog_txt.py Outdated

Comment on lines 167 to 168

		params_in=None, # flog.txt args could be parsed later into Param list
		params_out=None,

Contributor

gemini-code-assist bot Feb 22, 2026

Currently, API call arguments are not being parsed and are set to None. Since many capa rules rely on specific argument values (e.g., registry keys, file paths, or flags), this significantly limits the effectiveness of the extractor when using flog.txt. While the comment acknowledges this as a future improvement, implementing even a basic parser for the args_str extracted in _parse_event would greatly enhance the utility of this new feature.

Contributor Author

devs6186 commented Feb 22, 2026

Thanks for the review! I've addressed all three suggestions in eca9286:

1. Empty string handling in _parse_hex_or_decimal -- now returns 0 for empty/missing values instead of crashing.
2. Region block splitting -- switched from split("\nRegion:\n") to re.split(r"\n\s*Region:\s*\n", ...) for whitespace robustness, consistent with the Process/Thread splits.
3. API argument parsing -- implemented _parse_args() that extracts name=value pairs from the trace lines into Param objects. String values are modelled as void_ptr + str deref (matching the XML convention) so String features are yielded; numeric values use unsigned_32bit so Number features are yielded. Added a new test covering string, numeric, and no-arg calls.
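
As a rough picture of what name=value extraction involves, here is a simplified sketch. It is not the PR's `_parse_args()`: it returns a plain dict rather than Param objects, and the argument syntax (comma-separated `name=value`, with double-quoted strings that may contain commas and backslash escapes) is an assumption about the trace format.

```python
import re

# Assumption: arguments look like  name=0x10, other="quoted, text"
# A quoted value may contain commas and backslash-escaped characters.
ARG = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|[^,\s]+)')


def parse_args(args_str: str) -> dict[str, object]:
    """Map each argument name to a str (quoted value) or an int (numeric value)."""
    parsed: dict[str, object] = {}
    for name, raw in ARG.findall(args_str):
        if raw.startswith('"'):
            # Strip the surrounding quotes; escapes are left untouched here.
            parsed[name] = raw[1:-1]
        else:
            try:
                parsed[name] = int(raw, 0)  # base 0 accepts 0x... and decimal
            except ValueError:
                parsed[name] = raw  # keep unrecognized tokens verbatim
    return parsed
```

Matching quoted values as a single token is the part that tends to be brittle: a naive `split(",")` would cut a path such as `"a, b.txt"` in two.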

Please re-review

williballenthin requested changes

Feb 23, 2026

Collaborator

williballenthin left a comment

what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.

Collaborator

williballenthin commented Feb 23, 2026

thanks @devs6186

devs6186 added a commit to devs6186/capa that referenced this pull request Feb 23, 2026


 vmray: add docs, fetch helper, and fixture-based regression tests for...

b58fbeb

... flog.txt

Addresses reviewer feedback on mandiant#2878:

1. Document flog.txt vs full archive trade-offs in doc/usage.md with a comparison table (available features, how to obtain, file size).
2. Add scripts/fetch-vmray-flog.py -- given a VMRay instance URL, API key, and sample SHA-256, downloads flog.txt via the REST API and optionally runs capa against it.
3. Add fixture-based regression tests (tests/fixtures/vmray/flog_txt/) with three representative flog.txt files:
   - windows_apis.flog.txt: Win32 APIs, string args with backslash paths, numeric args, multi-process
   - linux_syscalls.flog.txt: Linux sys_-prefixed calls (all stripped)
   - string_edge_cases.flog.txt: paths with spaces, UNC paths, URLs, empty

tests/test_vmray_flog_txt.py gains 14 new feature-presence tests covering API, String, and Number extraction at the call scope, plus negative checks (double-backslash must not appear; sys_ prefix must not appear).

Fixes mandiant#2878

Contributor Author

devs6186 commented Feb 23, 2026

> what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
>
> aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
>
> we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.

hey @williballenthin, I have addressed everything in the latest commit:

for the docs i added a comparison section in usage.md with a table laying out exactly what you get from
flog.txt vs the full archive. figured that's the clearest place for it since people land there when figuring
out how to use capa.

for the fetch script - scripts/fetch-vmray-flog.py takes a hash + api key, looks up the sample, grabs the
most recent analysis and downloads the function log. added a --run-capa flag too so you can go from hash to
capa output in one shot. i don't have a vmray instance to test the exact endpoints against so i went with
what the REST api docs suggest and added a fallback, but if the endpoint paths are off for your setup they
should be easy to adjust.

for the fixtures I added three flog.txt files under tests/fixtures/vmray/flog_txt/ (in the main repo so CI
always has them without needing testfiles): a windows one with backslash paths and multi-process, a linux one
with 22 sys_ calls, and one that specifically targets the brittle stuff -- paths with spaces, UNC paths,
URLs. the test count went from 6 to 20, including negative checks (double-backslash form must not appear in
features, sys_-prefixed names must not appear).
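
The double-backslash negative check can be pictured with a small sketch. It assumes that flog.txt doubles every backslash inside quoted strings; the helper `unescape_backslashes` and the sample values are hypothetical and are not taken from the PR or its fixtures.

```python
def unescape_backslashes(value: str) -> str:
    """Collapse each doubled backslash into a single one."""
    # Assumption: the log writes two backslashes for every real backslash.
    return value.replace("\\\\", "\\")


# Sample escaped argument values, written as raw strings so the doubled
# backslashes are literal.
RAW_ARGS = [
    r"C:\\Windows\\System32\\cmd.exe",
    r"C:\\Program Files\\a b.txt",
]

# The set a regression test would inspect: only unescaped forms belong here.
features = {unescape_backslashes(a) for a in RAW_ARGS}
```

One caveat for this kind of check: an unescaped UNC path such as `\\server\share` legitimately begins with two backslashes, so asserting on specific expected strings is safer than a blanket "no double backslash anywhere" rule.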

on adding real samples to testfiles, I am totally happy to do that as a follow-up; just wanted to point out that it needs a separate PR to the submodule. if you have particular samples in mind let me know and i'll set it up.

devs6186 added 3 commits February 24, 2026 09:30


 vmray: support parsing flog.txt (Download Function Log)

1a2c3c0

Adds a parser for the VMRay flog.txt format (the free "Download Function Log" available from VMRay Threat Feed -> Full Report). Users no longer need the full ZIP archive to run capa against VMRay output.

- capa/features/extractors/vmray/flog_txt.py: new parser for flog.txt header validation, Process/Thread/Region block splitting, API trace line parsing, sys_ prefix stripping
- VMRayAnalysis.from_flog_txt() and VMRayExtractor.from_flog_txt() for constructing the extractor from a standalone flog.txt
- helpers.py: detect flog.txt by filename + header magic; update unsupported-format error message to mention flog.txt
- loader.py: route flog.txt inputs through VMRayExtractor.from_flog_txt
- tests/test_vmray_flog_txt.py: 5 unit tests covering parse, header rejection, sys_ stripping, analysis and extractor construction

Fixes mandiant#2452


 vmray: address code review feedback for flog.txt parser

b924721

- Handle empty strings in _parse_hex_or_decimal (return 0 instead of crash)
- Use regex for Region: block splitting (consistent with Process:/Thread:)
- Parse API call arguments into Param objects so String/Number features are extracted (string args use void_ptr+str deref to match XML convention)
- Use FunctionCall.model_validate instead of __init__ to work around Pydantic alias "in" clashing with Python keyword
- Add test_parse_flog_txt_args_parsed covering string, numeric, and no-arg API calls


 vmray: add docs, fetch helper, and fixture-based regression tests for...

548d814

devs6186 force-pushed the feature/2452-vmray-flog-txt branch from b58fbeb to 548d814 Compare

February 24, 2026 04:01

Labels

None yet

2 participants