ghidra/GhidraDocs/GhidraClass/BSim/BSimTutorial_Intro.html
2023-12-13 15:35:36 +00:00

115 lines
5.7 KiB
HTML
Executable File
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<h1 id="introduction-to-bsim">Introduction to BSim</h1>
<p>As youve reverse engineered software, youve likely asked the following questions:</p>
<ul>
<li>Which libraries were statically linked into this executable?</li>
<li>Does this executable share some code with another executable that Ive analyzed?</li>
<li>What are the differences between version 1 and version 2 of a given executable?</li>
<li>Does this executable share code with another executable in a large collection of binaries?</li>
<li>Was this function pulled from an open-source library?</li>
</ul>
<p>BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.</p>
<h1 id="how-does-bsim-work">How Does BSim Work?</h1>
<p>The idea behind BSim is to generate a <em>feature vector</em> for each function in a binary.
The vectors are generated by Ghidras decompiler.
Each feature represents a small piece of data flow and/or control flow of the associated function.
The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features.
Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.</p>
<p>BSim vectors are compared using <em>cosine similarity</em>.
Discrepancies between the vectors for <code>foo</code> and <code>bar</code> which are caused by differences in compilers, target architectures, and/or small changes to the source code typically result in vectors which are close but not identical.</p>
<p>BSim vectors can be stored in a dedicated database.
BSim databases intended to hold large<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> numbers of vectors maintain an index based on <em>locality-sensitive hashing</em>.
The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.</p>
<p>Querying <code>foo</code> against a BSim database typically yields a number of potential matches.
Each individual match for <code>foo</code> can be compared to <code>foo</code> in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to <code>foo</code>.</p>
<p>We frequently call BSim vectors the <em>BSim signature</em> of a function, or just the <em>signature</em> when the context is clear.</p>
<h1 id="why-bsim">Why “BSim”?</h1>
<p>We can think of each feature as representing a small piece of the <em>behavior</em> of a function, analogous to a snippet of source code.
Functions whose BSim vectors are close typically have many features in common, that is, they have <em>similar behavior</em>.
Hence the name “BSim”: <strong>B</strong>ehavioral <strong>Sim</strong>iliarity.</p>
<h1 id="bsim-clients-bsim-databases-and-ghidra-projects">BSim Clients, BSim Databases, and Ghidra Projects</h1>
<p>Using BSim involves the following components:</p>
<ul>
<li>A <em>BSim Client</em>, i.e., an instance of Ghidra with the BSim plugin enabled.
<ul>
<li>This is where the reverse engineering happens.</li>
</ul>
</li>
<li>A <em>BSim Database</em>, which stores the BSim signatures.
<ul>
<li>Also stores some metadata about each function and its containing executable.</li>
<li>In particular, stores the ghidra:// URL of the associated Ghidra program.</li>
<li>Does not store disassembly or decompiled functions.</li>
</ul>
</li>
<li>A <em>Ghidra Project</em>, which stores the analyzed programs used to populate the BSim database.
<ul>
<li>Given a BSim match, the BSim client can use the ghidra:// URL to retrieve a program from a Ghidra project for side-by-side comparisons.</li>
<li>Note that a single BSim database can reference multiple Ghidra projects.</li>
</ul>
</li>
</ul>
<h1 id="database-backends">Database Backends</h1>
<p>There are three supported database backends for BSim:</p>
<ol>
<li>
<p>PostgreSQL</p>
<ul>
<li>The Ghidra distribution includes the source for PostgreSQL, a PostgreSQL plugin for BSim, and a build script.</li>
<li>Populated from shared Ghidra projects (i.e., requires a Ghidra server).</li>
<li>Server not supported on Windows (no restriction on clients).</li>
</ul>
</li>
<li>
<p>Elasticsearch</p>
<ul>
<li>The <code>BSimElasticPlugin</code> extension contains an Elasticsearch plugin for BSim.</li>
<li>This plugin must be installed into an existing Elasticsearch database.</li>
<li>Populated from shared Ghidra projects.</li>
</ul>
</li>
<li>
<p>H2</p>
<ul>
<li>Simplest way to use BSim:
<ul>
<li>Backed by files on the users machine (dont need to install database server).</li>
<li>Can be created and populated quickly.</li>
<li>Supported on all platforms.</li>
</ul>
</li>
<li>Does not support large collections of binaries or multiple users.</li>
<li>Can be populated from non-shared (local) or shared Ghidra projects.</li>
</ul>
</li>
</ol>
<p>Next Section: <a href="BSimTutorial_Enabling.html">Starting Ghidra and Enabling BSim</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Creating a database requires a <em>database template</em>, which determines the specifics of the index. Currently, Ghidra provides a <em>medium</em> template, intended for databases holding up to 10 million unique vectors, and a <em>large</em> template, intended for databases holding up to 100 million unique vectors. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
</ol>
</div>