<h1 id="introduction-to-bsim">Introduction to BSim</h1>
<p>As you’ve reverse engineered software, you’ve likely asked the following questions:</p>
<ul>
<li>Which libraries were statically linked into this executable?</li>
<li>Does this executable share some code with another executable that I’ve analyzed?</li>
<li>What are the differences between version 1 and version 2 of a given executable?</li>
<li>Does this executable share code with another executable in a large collection of binaries?</li>
<li>Was this function pulled from an open-source library?</li>
</ul>
<p>BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.</p>
<h1 id="how-does-bsim-work">How Does BSim Work?</h1>
<p>The idea behind BSim is to generate a <em>feature vector</em> for each function in a binary.
The vectors are generated by Ghidra’s decompiler.
Each feature represents a small piece of data flow and/or control flow of the associated function.
The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features.
Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.</p>
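<p>To make the normalization idea concrete, here is a toy sketch in Python. It is not Ghidra’s algorithm: the instruction tuples, the <code>normalize_op</code> and <code>toy_features</code> helpers, and the use of adjacent-operation pairs as features are all assumptions made for illustration. It only shows how ignoring constant values and register names can let two superficially different snippets produce the same features.</p>

```python
from collections import Counter

# Hypothetical, simplified instruction format: (opcode, operands).
# Real BSim features come from the decompiler's data-flow and control-flow
# analysis, not from raw instruction text.


def normalize_op(instr):
    """Keep the opcode and operand kinds; drop register names and constant values."""
    opcode, operands = instr
    kinds = tuple("const" if isinstance(op, int) else "var" for op in operands)
    return (opcode, kinds)


def toy_features(instrs):
    """Count pairs of adjacent normalized operations as stand-in 'features'."""
    ops = [normalize_op(i) for i in instrs]
    return Counter(zip(ops, ops[1:]))


# Same logic, different registers and constants.
snippet_a = [("load", ("rax",)), ("add", ("rax", 4)), ("store", ("rax",))]
snippet_b = [("load", ("r9",)), ("add", ("r9", 16)), ("store", ("r9",))]

features_a = toy_features(snippet_a)
features_b = toy_features(snippet_b)
print(features_a == features_b)  # prints True
```

<p>Both snippets reduce to the same two features because only the opcode and the kind of each operand survive normalization.</p>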
<p>BSim vectors are compared using <em>cosine similarity</em>.
If <code>foo</code> and <code>bar</code> are two versions of the same function, discrepancies between their vectors that are caused by differences in compilers, target architectures, and/or small changes to the source code typically result in vectors that are close but not identical.</p>
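<p>As a minimal sketch of the comparison, the Python below computes cosine similarity for two sparse vectors. The feature names (<code>f1</code> through <code>f5</code>) and the unit weights are assumptions for illustration; they are not how BSim names or weights its features.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Hypothetical vectors: foo and bar share three of their four features.
foo = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0}
bar = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f5": 1.0}

print(cosine_similarity(foo, foo))  # prints 1.0
print(cosine_similarity(foo, bar))  # prints 0.75
```

<p>Identical vectors score 1.0, and the score drops as fewer features are shared, which is why close but not identical vectors still register as likely matches.</p>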
<p>BSim vectors can be stored in a dedicated database.
BSim databases intended to hold large<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> numbers of vectors maintain an index based on <em>locality-sensitive hashing</em>.
The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.</p>
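<p>The following Python sketch illustrates the general idea of locality-sensitive hashing using random hyperplanes, a standard scheme for cosine similarity. It is an assumption-laden toy, not BSim’s index: the function names, the single hash table, and the choice of 8 bits are all hypothetical.</p>

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes (normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(h * x for h, x in zip(plane, vec)) >= 0.0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector names into buckets keyed by their LSH key."""
    index = defaultdict(list)
    for name, vec in vectors.items():
        index[lsh_key(vec, hyperplanes)].append(name)
    return index


def candidates(query, index, hyperplanes):
    """Return only the names in the query's bucket, skipping all other vectors."""
    return index.get(lsh_key(query, hyperplanes), [])


planes = make_hyperplanes(dim=3, bits=8)
stored = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],  # same direction as "a"
    "opposite": [-1.0, -2.0, -3.0],  # opposite direction
}
index = build_index(stored, planes)
print(candidates([1.0, 2.0, 3.0], index, planes))
```

<p>Vectors pointing in the same direction always land in the same bucket, so a query is compared only against its bucket’s members rather than against every stored vector. Practical indexes typically combine several such tables to reduce missed matches.</p>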
<p>Querying <code>foo</code> against a BSim database typically yields a number of potential matches.
Each individual match for <code>foo</code> can be compared to <code>foo</code> in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to <code>foo</code>.</p>
<p>We frequently call the BSim vector of a function its <em>BSim signature</em>, or just its <em>signature</em> when the context is clear.</p>
<h1 id="why-bsim">Why “BSim”?</h1>
<p>We can think of each feature as representing a small piece of the <em>behavior</em> of a function, analogous to a snippet of source code.
Functions whose BSim vectors are close typically have many features in common; that is, they have <em>similar behavior</em>.
Hence the name “BSim”: <strong>B</strong>ehavioral <strong>Sim</strong>ilarity.</p>
<h1 id="bsim-clients-bsim-databases-and-ghidra-projects">BSim Clients, BSim Databases, and Ghidra Projects</h1>
<p>Using BSim involves the following components:</p>
<ul>
<li>A <em>BSim Client</em>, i.e., an instance of Ghidra with the BSim plugin enabled.
<ul>
<li>This is where the reverse engineering happens.</li>
</ul>
</li>
<li>A <em>BSim Database</em>, which stores the BSim signatures.
<ul>
<li>Also stores some metadata about each function and its containing executable.</li>
<li>In particular, stores the ghidra:// URL of the associated Ghidra program.</li>
<li>Does not store disassembly or decompiled functions.</li>
</ul>
</li>
<li>A <em>Ghidra Project</em>, which stores the analyzed programs used to populate the BSim database.
<ul>
<li>Given a BSim match, the BSim client can use the ghidra:// URL to retrieve a program from a Ghidra project for side-by-side comparisons.</li>
<li>Note that a single BSim database can reference multiple Ghidra projects.</li>
</ul>
</li>
</ul>
<h1 id="database-backends">Database Backends</h1>
<p>There are three supported database backends for BSim:</p>
<ol>
<li>
<p>PostgreSQL</p>
<ul>
<li>The Ghidra distribution includes the source for PostgreSQL, a PostgreSQL plugin for BSim, and a build script.</li>
<li>Populated from shared Ghidra projects (i.e., requires a Ghidra server).</li>
<li>Server not supported on Windows (no restriction on clients).</li>
</ul>
</li>
<li>
<p>Elasticsearch</p>
<ul>
<li>The <code>BSimElasticPlugin</code> extension contains an Elasticsearch plugin for BSim.</li>
<li>This plugin must be installed into an existing Elasticsearch database.</li>
<li>Populated from shared Ghidra projects.</li>
</ul>
</li>
<li>
<p>H2</p>
<ul>
<li>Simplest way to use BSim:
<ul>
<li>Backed by files on the user’s machine (no need to install a database server).</li>
<li>Can be created and populated quickly.</li>
<li>Supported on all platforms.</li>
</ul>
</li>
<li>Does not support large collections of binaries or multiple users.</li>
<li>Can be populated from non-shared (local) or shared Ghidra projects.</li>
</ul>
</li>
</ol>
<p>Next Section: <a href="BSimTutorial_Enabling.html">Starting Ghidra and Enabling BSim</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Creating a database requires a <em>database template</em>, which determines the specifics of the index. Currently, Ghidra provides a <em>medium</em> template, intended for databases holding up to 10 million unique vectors, and a <em>large</em> template, intended for databases holding up to 100 million unique vectors. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>