vmray: support parsing flog.txt (Download Function Log) #2878
devs6186 wants to merge 3 commits into mandiant:master
closes #2452
Adds support for parsing VMRay's flog.txt format -- the free "Download Function Log" available from VMRay Threat Feed -> Full Report -> Download Function Log. Users no longer need the full analysis ZIP archive to run capa against VMRay output.
What changed
| File | Change |
|---|---|
| capa/features/extractors/vmray/flog_txt.py | New parser: header validation, Process/Thread/Region block splitting, API trace line parsing, sys_ prefix stripping |
| capa/features/extractors/vmray/__init__.py | VMRayAnalysis.from_flog_txt() -- builds analysis object from standalone flog.txt (no ZIP) |
| capa/features/extractors/vmray/extractor.py | VMRayExtractor.from_flog_txt() -- convenience classmethod |
| capa/helpers.py | Detect flog.txt by filename + header magic in get_format_from_extension; updated unsupported-format error message to mention flog.txt |
| capa/loader.py | Route flog.txt inputs through VMRayExtractor.from_flog_txt in both get_extractor and get_file_extractors |
| tests/test_vmray_flog_txt.py | 5 unit tests: minimal parse, header rejection, sys_ stripping, VMRayAnalysis construction, VMRayExtractor construction |
| doc/usage.md | Updated CAPE row to mention VMRay flog.txt alongside VMRay ZIP |
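
For readers unfamiliar with the "filename + header magic" detection mentioned in the table, here is a rough sketch of the idea. The magic string `# Flog Txt Version 1` and the helper name `looks_like_flog_txt` are assumptions for illustration; the constant and function actually used in this PR may differ.

```python
from pathlib import Path

# Assumed header magic; the actual constant in flog_txt.py may differ.
FLOG_TXT_MAGIC = b"# Flog Txt Version 1"


def looks_like_flog_txt(path: Path) -> bool:
    """Return True if both the file name and the first bytes look like a flog.txt."""
    if not path.name.lower().endswith("flog.txt"):
        return False
    try:
        with path.open("rb") as f:
            head = f.read(len(FLOG_TXT_MAGIC) + 3)
    except OSError:
        return False
    # Tolerate a UTF-8 byte order mark before the header line.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.startswith(FLOG_TXT_MAGIC)
```

Checking both the name and the header keeps an unrelated text file that happens to be called flog.txt from being routed to the VMRay backend.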
Usage
capa path/to/flog.txt --backend vmray
Notes
- Static features (imports, exports, sections, strings) are unavailable from flog.txt alone -- only dynamic API call features are extracted
- The failing tests in tests/test_vmray_features.py are pre-existing and unrelated: they require the large ZIP test fixture (tests/data/dynamic/vmray/...) which is not part of this repo
Checklist
- No CHANGELOG update needed
- No new tests needed
- No documentation update needed
- This submission includes AI-generated code and I have provided details in the description.
Code Review
The pull request introduces support for parsing VMRay's flog.txt format, which is a great addition for users who don't have access to the full analysis ZIP. The implementation is generally sound and integrates well with the existing VMRay extractor. I have identified a few areas for improvement regarding robustness against malformed input and the completeness of the extracted features (specifically API arguments).
    def _parse_hex_or_decimal(s: str) -> int:
        s = s.strip().strip('"')
        if s.startswith("0x") or s.startswith("0X"):
            return int(s, 16)
        return int(s, 10)
The _parse_hex_or_decimal function is not robust against empty strings. If a property in the flog.txt file is present but has no value (e.g., os_pid = ), int(s, 10) will raise a ValueError, causing the parser to crash. It would be safer to handle empty strings by returning a default value (like 0) or skipping the property.
Suggested change:

    - def _parse_hex_or_decimal(s: str) -> int:
    -     s = s.strip().strip('"')
    -     if s.startswith("0x") or s.startswith("0X"):
    -         return int(s, 16)
    -     return int(s, 10)
    + def _parse_hex_or_decimal(s: str) -> int:
    +     s = s.strip().strip('"')
    +     if not s:
    +         return 0
    +     if s.lower().startswith("0x"):
    +         return int(s, 16)
    +     return int(s, 10)
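
The comment also mentions skipping the property instead of defaulting to 0. A minimal sketch of that alternative follows; the helper name `parse_hex_or_decimal_opt` and the property names other than `os_pid` are hypothetical, not taken from the PR.

```python
from typing import Optional


def parse_hex_or_decimal_opt(s: str) -> Optional[int]:
    """Parse a decimal or 0x-prefixed hex string; return None if empty or malformed."""
    s = s.strip().strip('"')
    if not s:
        return None
    try:
        return int(s, 16) if s.lower().startswith("0x") else int(s, 10)
    except ValueError:
        return None


# Hypothetical property lines; only os_pid appears in the review above.
props = {}
for key, raw in [("os_pid", "0x1a4"), ("os_parent_pid", ""), ("monitor_id", '"7"')]:
    value = parse_hex_or_decimal_opt(raw)
    if value is not None:  # skip properties without a usable value
        props[key] = value
```

Returning `None` keeps a missing PID distinguishable from a real PID of 0, at the cost of the callers having to handle the missing key.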
    thread_blocks = [p.strip() for p in parts[1:] if p.strip()]

    # First part: Process properties then Region: blocks
    process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
The split by \nRegion:\n is less flexible than the regex-based splits used for Process: and Thread: blocks. If the log file contains trailing spaces after Region:, the split will fail to isolate the process properties. For consistency and robustness, consider using a regex similar to the ones used in lines 122 and 200.
Suggested change:

    - process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
    + process_props = _parse_properties(re.split(r"\n\s*Region:\s*\n", header_and_regions)[0])
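
To make the difference concrete, here is a small comparison on a made-up process block with trailing spaces after `Region:`. The block contents are hypothetical.

```python
import re

# Hypothetical process block; note the two trailing spaces after "Region:".
block = 'os_pid = 0x4\nname = "a.exe"\nRegion:  \nid = 1\n'

# The literal split finds no separator and returns the whole block unchanged.
literal_parts = block.split("\nRegion:\n")

# The regex split tolerates surrounding whitespace and isolates the properties.
regex_parts = re.split(r"\n\s*Region:\s*\n", block)
```

With the literal split, the `Region:` lines would leak into the process properties. Note that `\s*` also matches newlines, so blank lines around the marker are swallowed too; `[ \t]*` would be a stricter choice if that matters.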
    params_in=None,  # flog.txt args could be parsed later into Param list
    params_out=None,
Currently, API call arguments are not being parsed and are set to None. Since many capa rules rely on specific argument values (e.g., registry keys, file paths, or flags), this significantly limits the effectiveness of the extractor when using flog.txt. While the comment acknowledges this as a future improvement, implementing even a basic parser for the args_str extracted in _parse_event would greatly enhance the utility of this new feature.
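
As a rough idea of what a basic parser for `args_str` could look like, assuming the arguments form a flat, comma-separated list of `name=value` pairs with double-quoted strings (the real flog.txt syntax may be richer, e.g. nested structures or pointers). The function name `parse_args` is hypothetical.

```python
import re
from typing import Dict, Union

# One name=value pair: the value is either a double-quoted string (with
# backslash escapes) or a bare token running up to the next comma.
_ARG_RE = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_args(args_str: str) -> Dict[str, Union[int, str]]:
    """Parse 'name=value, ...' into a dict; numbers become int, the rest stay str."""
    args: Dict[str, Union[int, str]] = {}
    for name, raw in _ARG_RE.findall(args_str):
        raw = raw.strip()
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Quoted string: drop the quotes and collapse escaped backslashes.
            args[name] = raw[1:-1].replace("\\\\", "\\")
            continue
        try:
            args[name] = int(raw, 0)  # base 0 accepts both 0x... and decimal
        except ValueError:
            args[name] = raw  # keep anything unrecognized as text
    return args


parsed = parse_args(r'lpFileName="C:\\Temp\\a, b.txt", dwDesiredAccess=0x80000000, n=3')
```

Matching the quoted alternative first means commas inside a string do not split the argument. Values that `int(raw, 0)` rejects, such as decimals with leading zeros, fall through and stay strings.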
Thanks for the review! I've addressed all three suggestions in eca9286:
Please re-review
williballenthin left a comment
- what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
- aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
- we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.
thanks @devs6186
Addresses reviewer feedback on mandiant#2878:
1. Document flog.txt vs full archive trade-offs in doc/usage.md with a
comparison table (available features, how to obtain, file size).
2. Add scripts/fetch-vmray-flog.py -- given a VMRay instance URL, API key,
and sample SHA-256, downloads flog.txt via the REST API and optionally
runs capa against it.
3. Add fixture-based regression tests (tests/fixtures/vmray/flog_txt/) with
three representative flog.txt files:
- windows_apis.flog.txt: Win32 APIs, string args with backslash paths,
numeric args, multi-process
- linux_syscalls.flog.txt: Linux sys_-prefixed calls (all stripped)
- string_edge_cases.flog.txt: paths with spaces, UNC paths, URLs, empty
tests/test_vmray_flog_txt.py gains 14 new feature-presence tests covering
API, String, and Number extraction at the call scope, plus negative checks
(double-backslash must not appear; sys_ prefix must not appear).
Fixes mandiant#2878
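
The two negative checks named in this commit message can be pictured roughly as follows. The helper names and the sample calls are hypothetical; this is a sketch of the idea, not the code in tests/test_vmray_flog_txt.py.

```python
def normalize_api_name(name: str) -> str:
    """Drop the Linux 'sys_' prefix so 'sys_open' and 'open' yield the same API name."""
    return name[len("sys_"):] if name.startswith("sys_") else name


def unescape_string_arg(value: str) -> str:
    """Collapse the doubled backslashes used inside quoted strings."""
    return value.replace("\\\\", "\\")


# Hypothetical (api, string-arg) pairs as they might appear in a trace.
raw_calls = [("sys_open", "/etc/passwd"), ("CreateFileW", "C:\\\\Temp\\\\a.txt")]
features = [(normalize_api_name(api), unescape_string_arg(arg)) for api, arg in raw_calls]

# Negative checks in the spirit of the regression tests described above.
assert not any(api.startswith("sys_") for api, _ in features)
assert not any("\\\\" in arg for _, arg in features)
```

One caveat: a UNC path legitimately begins with two backslashes after unescaping, so a blanket "no double backslash" assertion like the one here only suits non-UNC strings.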
hey @williballenthin, I have addressed everything in the latest commit:
- for the docs, I added a comparison section in usage.md with a table laying out exactly what you get from […]
- for the fetch script, […]
- for the fixtures, I added three flog.txt files under […]

on adding real samples to testfiles: I am totally happy to do that as a follow-up, just wanted to point out that it needs a separate PR to the submodule. if you have particular samples in mind, let me know and i'll set it up.
Log" available from VMRay Threat Feed -> Full Report). Users no longer
need the full ZIP archive to run capa against VMRay output.
- capa/features/extractors/vmray/flog_txt.py: new parser for flog.txt
header validation, Process/Thread/Region block splitting, API trace
line parsing, sys_ prefix stripping
- VMRayAnalysis.from_flog_txt() and VMRayExtractor.from_flog_txt() for
constructing the extractor from a standalone flog.txt
- helpers.py: detect flog.txt by filename + header magic; update
unsupported-format error message to mention flog.txt
- loader.py: route flog.txt inputs through VMRayExtractor.from_flog_txt
- tests/test_vmray_flog_txt.py: 5 unit tests covering parse, header
rejection, sys_ stripping, analysis and extractor construction
Fixes mandiant#2452
- Use regex for Region: block splitting (consistent with Process:/Thread:)
- Parse API call arguments into Param objects so String/Number features
are extracted (string args use void_ptr+str deref to match XML convention)
- Use FunctionCall.model_validate instead of __init__ to work around
Pydantic alias "in" clashing with Python keyword
- Add test_parse_flog_txt_args_parsed covering string, numeric, and
no-arg API calls
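
The keyword clash mentioned in this commit can be shown without Pydantic. The sketch below uses only the standard library; `FakeCall` is a hypothetical stand-in for a model whose input field is aliased to `in`, and its `model_validate` merely mimics the dict-based construction that Pydantic's method of the same name offers.

```python
import keyword

# "in" is a reserved word, so it can never be written as a keyword argument.
assert keyword.iskeyword("in")

try:
    compile('FunctionCall(name="CreateFileW", in=[])', "<sketch>", "eval")
    keyword_call_compiles = True
except SyntaxError:
    keyword_call_compiles = False


class FakeCall:
    """Hypothetical stand-in for a model whose input field is aliased to "in"."""

    def __init__(self, **fields):
        self.name = fields["name"]
        self.params_in = fields.get("in")

    @classmethod
    def model_validate(cls, data: dict) -> "FakeCall":
        # Building from a dict avoids spelling the reserved word as an identifier.
        return cls(**data)


call = FakeCall.model_validate({"name": "CreateFileW", "in": ["lpFileName"]})
```

Because the field name only ever appears as a dictionary key, the reserved word never has to be parsed as an identifier.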
b58fbeb to 548d814