Merge branch 'GP-0_ryanmkurtz_bsim-class-html'

Ryan Kurtz 2024-10-30 06:51:08 -04:00
commit 3c5edba8ad
15 changed files with 9 additions and 812 deletions

@ -20,4 +20,5 @@ eclipse.project.name = '_MarkdownSupport'
dependencies {
implementation 'org.commonmark:commonmark:0.23.0'
implementation 'org.commonmark:commonmark-ext-heading-anchor:0.23.0'
implementation 'org.commonmark:commonmark-ext-footnotes:0.23.0'
}

@ -20,6 +20,7 @@ import java.util.List;
import java.util.Map;
import org.commonmark.Extension;
import org.commonmark.ext.footnotes.FootnotesExtension;
import org.commonmark.ext.heading.anchor.HeadingAnchorExtension;
import org.commonmark.node.Link;
import org.commonmark.node.Node;
@ -49,8 +50,9 @@ public class MarkdownToHtml {
throw new Exception("First argument doesn't not end with .md");
}
// Setup the CommonMark Library with the needed "anchor extension" library
List<Extension> extensions = List.of(HeadingAnchorExtension.create());
// Setup the CommonMark Library with the needed extension libraries
List<Extension> extensions =
List.of(HeadingAnchorExtension.create(), FootnotesExtension.create());
Parser parser = Parser.builder().extensions(extensions).build();
HtmlRenderer renderer = HtmlRenderer.builder()
.extensions(extensions)

@ -1,83 +0,0 @@
<h1 id="bsim-databases-from-the-command-line">BSim Databases from the Command Line</h1>
<p>The <code>bsim</code> command-line utility, located in the <code>support</code> directory of a Ghidra distribution, is used to create, populate, and manage BSim databases.
It works for all BSim database backends.
This utility offers a number of commands, many of which have several options.
In this section, we cover only a small subset of the possibilities.</p>
<p>Running <code>bsim</code> with no arguments will print a detailed usage message.</p>
<h2 id="generating-signature-files">Generating Signature Files</h2>
<p>The first step is to create signature files from the binaries in the Ghidra project.
Signature files are XML files which contain the BSim signatures and metadata needed by the BSim server.</p>
<p><strong>Important</strong>: It's simplest to exit Ghidra before performing the next steps, because:</p>
<ul>
<li>The H2-backed database can only be accessed by one process at a time.</li>
<li>If you have the <code>postgres_object_files</code> project open in Ghidra, signature generation will fail.
Non-shared projects are locked when open, and the lock will prevent the signature-generating process from accessing the project.</li>
</ul>
<p>To generate the signature files, execute the following commands in a shell (adjust as necessary for Windows).</p>
<pre><code class="language-bash">cd &lt;ghidra_install_dir&gt;/support
mkdir ~/bsim_sigs
./bsim generatesigs ghidra:/&lt;ghidra_project_dir&gt;/postgres_object_files --bsim file:/&lt;database_dir&gt;/example ~/bsim_sigs
</code></pre>
<ul>
<li>The <code>ghidra:/</code> argument is the local project which holds the analyzed binaries.
Note that there is only one forward slash in the URL for a local project.</li>
<li>The <code>--bsim</code> argument is the URL of the BSim database.
This command does not add any signatures to the database, but it does query the database for its settings.</li>
</ul>
<h2 id="committing-signature-files">Committing Signature Files</h2>
<p>Now, we commit the signatures to the BSim database with the following command (still in the <code>support</code> directory).</p>
<pre><code class="language-bash">./bsim commitsigs file:/&lt;database_dir&gt;/example ~/bsim_sigs
</code></pre>
<p>Once the signatures have been committed, start Ghidra again.</p>
<h2 id="aside-creating-a-database">Aside: Creating a Database</h2>
<p>We continue to use the database <code>example</code>, so this step isn't necessary for the exercises.</p>
<p>However, if we hadn't created <code>example</code> using <code>CreateH2BSimDatabaseScript.java</code>, we could have used the following command:</p>
<pre><code class="language-bash">./bsim createdatabase file:/&lt;database_dir&gt;/example medium_nosize
</code></pre>
<ul>
<li><code>medium_nosize</code> is a database template.
<ul>
<li>“medium” (vs. “large”) affects the vector index and is not relevant to H2 databases.</li>
<li>“nosize” means that size differences for varnodes of size four bytes and above are not incorporated into the BSim features.
This is necessary to allow matching between 32-bit and 64-bit code.</li>
</ul>
</li>
<li>The <code>createdatabase</code> command can also be used to create a BSim database on a PostgreSQL or Elasticsearch server, provided the servers are configured and running.
See the “BSim” entry in the Ghidra help for details.</li>
</ul>
<h2 id="aside-executable-categories-and-function-tags">Aside: Executable Categories and Function Tags</h2>
<p>It's worth a brief note about Executable Categories and Function Tags, although they are not used in any of the following exercises.</p>
<p>A BSim database can record user-defined metadata about an executable (executable categories) or about a function (function tags).
Categories and tags can then be used as filter elements in a BSim query.
For example, you could restrict a BSim query to search only in executables of the category “OPEN_SOURCE” or to functions which have been tagged “COMPRESSION_FUNCTIONS”.</p>
<p>Executable categories in BSim are implemented using <em>program properties</em>, and function tags in BSim correspond to function tags in Ghidra. Properties and tags both have uses in Ghidra which are independent of BSim.
So, if we want a BSim database to record a particular category or tag, we must indicate that explicitly.</p>
<p>For example, to inform the database that we wish to record the ORIGIN category, you would execute the command</p>
<pre><code class="language-bash">./bsim addexecategory file:/&lt;database_dir&gt;/example ORIGIN
</code></pre>
<p>Executable categories can be added to a program using the script <code>SetExecutableCategoryScript.java</code>.</p>
<p>Next Section: <a href="BSimTutorial_Evaluating_Matches.html">Evaluating Matches and Applying Information</a></p>

@ -1,200 +0,0 @@
<h1 id="basic-bsim-queries">Basic BSim Queries</h1>
<p>In this section, we demonstrate some applications of our BSim database.</p>
<h2 id="registering-a-bsim-database">Registering a BSim Database</h2>
<p>In order to query the database, you must register it with Ghidra:</p>
<ol>
<li>From the Code Browser, select <strong>BSim -&gt; Manage Servers</strong>.</li>
<li>In the BSim Server Manager dialog, click the green plus <img src="images/Plus2.png" alt="add server icon" />.</li>
<li>Select the <strong>File</strong> radio button and use the chooser to select <code>example.mv.db</code>.</li>
<li>Click <strong>OK</strong>.</li>
<li>Click <strong>Dismiss</strong> to close the dialog.</li>
</ol>
<h2 id="how-to-query-a-bsim-database">How to Query a BSim Database</h2>
<p>Before presenting the exercises, we describe the general mechanics of querying a BSim database.</p>
<h3 id="initiating-a-bsim-query">Initiating a BSim Query</h3>
<p>There are a number of ways to initiate a BSim query, including:</p>
<ul>
<li><strong>BSim -&gt; Search Functions…</strong> from the Code Browser.</li>
<li>Right-click in the Listing and select <strong>BSim -&gt; Search Functions…</strong></li>
<li>Click on the BSim icon <img src="images/preferences-web-browser-shortcuts.png" alt="BSim toolbar icon" /> in the Code Browser toolbar.</li>
</ul>
<p>For these cases, the function(s) being queried depend on the current selection.
If there is no selection, the function containing the current address is queried.
If there is a selection, all functions whose entry points are within the selection are queried.
An easy way to query all functions in a program is to select all addresses with <code>Ctrl-A</code> in the Listing window and then initiate a BSim query.</p>
<p>It is also possible to initiate a BSim query from the Decompiler window.
Simply right-click on a function name token and select <strong>BSim…</strong> to query the corresponding function.
This action is available on the name token in the decompiled function's signature as well as tokens corresponding to names of callees.</p>
<p>All of these actions bring up the <em>BSim Search Dialog</em>.</p>
<h3 id="the-bsim-search-dialog">The BSim Search Dialog</h3>
<p>From the BSim Search Dialog, you can</p>
<ul>
<li>Select which BSim database to query.</li>
<li>Set query thresholds.</li>
<li>Bound the number of results returned for each function.</li>
<li>Set query filters.</li>
</ul>
<p><img src="images/bsim_search_dialog.png" alt="bsim search dialog icon" /></p>
<h4 id="selecting-a-bsim-database">Selecting a BSim Database</h4>
<p>To query a registered BSim database, select that server from the <strong>BSim Server</strong> drop-down.</p>
<h4 id="setting-query-options">Setting Query Options</h4>
<p><strong>Similarity</strong> and <strong>confidence</strong> are scores used to evaluate the relationship between two vectors.
The respective fields in the dialog set lower bounds for these values for the matches returned by BSim.</p>
<ul>
<li>Similarity
<ul>
<li>Formally, the similarity of a match is the cosine of the angle between the vectors.</li>
<li>For BSim vectors, this value will always be between 0.0 and 1.0.</li>
<li>The higher the similarity score, the closer the vectors.</li>
</ul>
</li>
<li>Confidence
<ul>
<li>Intuitively, confidence quantifies the meatiness of a match.</li>
<li>Shared features increase this score and differing features decrease this score.</li>
<li>Sharing rare features contributes more to this score than sharing common features.</li>
<li>There is no upper bound for confidence when considered over all pairs of vectors.
However, if you fix a vector <em>v</em>, the greatest possible confidence score for a comparison involving <em>v</em> occurs when <em>v</em> is compared to itself.
The resulting confidence value is called the <strong>self-significance</strong> of <em>v</em>.</li>
</ul>
</li>
</ul>
<p>Confidence is used to judge the significance of a match.
For example, many executables contain a function which simply returns a constant value.
Given two executables, each with such a function, the similarity score between the corresponding BSim vectors will be 1.0.
However, the confidence score of the match will be quite low, indicating that it is not very significant that the two executables “share” this code.</p>
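<p>As a rough illustration of the similarity score described above, the following is a minimal sketch of cosine similarity between two feature vectors. The class name <code>CosineSketch</code> and the use of plain <code>double[]</code> arrays are assumptions made for this example; BSim's actual vector representation and feature weighting are not shown here. With non-negative components, as assumed in this sketch, the result falls between 0.0 and 1.0.</p>

```java
// Hypothetical sketch: not BSim's implementation, only the cosine formula.
final class CosineSketch {

    /** Returns the cosine of the angle between two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;   // sum of a[i] * b[i]
        double normA = 0.0; // squared length of a
        double normB = 0.0; // squared length of b
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

<p>Comparing a vector with itself yields a similarity of 1.0 no matter how few features it has, which is why a second score such as confidence is needed to judge whether a match is significant.</p>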
<p>In general, setting the thresholds involves a tradeoff: lower values mean that the database is more likely to return legitimate matches with significant differences, but also more likely to return matches which simply happen to share some features by chance.
The results of a BSim query can be sorted by the similarity and/or confidence of each match, so a common practice is to set the thresholds relatively low and to examine the matches in descending sort order.</p>
<p>The <strong>Matches per Function</strong> bound controls the number of results returned for a single function.
Note that in large collections, certain small or common functions might have substantial numbers of identical matches.</p>
<p>Filters are discussed in <a href="BSimTutorial_Filters.html">BSim Filters</a>.</p>
<h4 id="performing-the-query">Performing the Query</h4>
<p>Click the <strong>Search</strong> button in the dialog to perform a query.</p>
<p>After successfully issuing a query, you will also see a <strong>Search Function(s)</strong> action (without the ellipsis) in certain contexts.
This will perform a BSim query on the selected functions using the same parameters as the last query (skipping the BSim Search Dialog).</p>
<h2 id="exercises">Exercises</h2>
<p>The database <code>example</code> contains vectors from a Linux executable used by Ghidra's GNU demangler.
Ghidra ships with several other versions of this executable.
We use these different versions to demonstrate some of the capabilities of BSim.</p>
<p><strong>Note</strong>: Use the default query settings and autoanalysis options for the exercises unless otherwise specified.</p>
<h3 id="exercise-function-identification">Exercise: Function Identification</h3>
<ol>
<li>Import and analyze the binary <code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/os/win_x86_64/demangler_gnu_v2_41.exe</code>.
<ul>
<li>This executable is based on the same source code as <code>demangler_gnu_v2_41</code> but compiled with Visual Studio instead of GCC.</li>
</ul>
</li>
<li>Examine this binary in Ghidra and verify that the original function names are not present.
<ul>
<li>Note that the function names <strong>are</strong> present in <code>demangler_gnu_v2_41</code>.</li>
</ul>
</li>
<li>Using the default query options, query <code>example</code> for matches to the function at <code>140006760</code>.</li>
<li>You should see the following search results:
<img src="images/basic_query.png" alt="search results" />
<ul>
<li>In this case, there is exactly one match, the similarity is 1.0, and the matching function has a non-default name (it won't always be this easy).</li>
<li>The results window has two tables: the function-level results (upper table) and the executable-level results (lower table).
The executable-level results are covered in <a href="BSimTutorial_Exe_Results.html">From Matching Functions to Matching Executables</a>.</li>
</ul>
</li>
<li>Right-click on the row of the match and perform the <strong>Compare Functions</strong> action to bring up the side-by-side comparison.
<ul>
<li>The <strong>Listing View</strong> tab shows the disassembly.</li>
<li>The <strong>Decompiler Diff View</strong> tab shows the decompiled code.</li>
<li>Differences in the code are automatically highlighted in cyan.</li>
<li>Either view can be toggled between a horizontal split and a vertical split using the drop-down menu.</li>
</ul>
</li>
<li>Examine the diff views to verify that the match is valid.</li>
<li>Using the <strong>Apply Name</strong> action in the BSim Search Results table, apply the name from the search result to the queried function.</li>
</ol>
<p><strong>Note</strong>: We cover the Decompiler Diff View in greater detail and discuss the various “Apply” actions in <a href="BSimTutorial_Evaluating_Matches.html">Evaluating Matches and Applying Information</a>.</p>
<h3 id="exercise-changes-to-the-source-code">Exercise: Changes to the Source Code</h3>
<ol>
<li>Import and analyze the executable <code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/os/linux_x86_64/demangler_gnu_v2_24</code>.
<ul>
<li>This executable is based on an earlier version of the source code than the executable in <code>example</code>.</li>
</ul>
</li>
<li>Navigate to the function <code>expandargv</code> in <code>demangler_gnu_v2_24</code> and issue a BSim query.</li>
<li>What differences do you see in the decompiled code of the single match?
<details><summary>In demangler_gnu_v2_41...</summary> The main differences are that the call to dupargv is now in an if clause (and the decompiler creates a related local variable) and there are two additional calls to free. </details>
</li>
<li>The relevant source files are included with the Ghidra distribution:
<ul>
<li><code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/src/demangler_gnu_v2_24/c/argv.c</code></li>
<li><code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/src/demangler_gnu_v2_41/c/argv.c</code></li>
</ul>
</li>
<li>Verify that the differences you found are present in the source.</li>
</ol>
<h3 id="exercise-cross-architectural-matching">Exercise: Cross-architectural Matching</h3>
<ol>
<li>Import and analyze the executable
<code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/os/mac_arm_64/demangler_gnu_v2_41</code>.
<ul>
<li>This executable is based on the same source code as the executable in <code>example</code> but compiled for a different architecture.</li>
<li><strong>Note</strong>: this file has the same name as the one we used to populate the BSim database, so you will have to give the resulting Ghidra program a different name or import it into a different directory in your Ghidra project.</li>
</ul>
</li>
<li>Navigate to <code>_expandargv</code> and issue a BSim query.
In the decompiler diff view of the single match, what differences do you see regarding <code>memmove</code> and <code>memcpy</code>?
<details><summary>In the arm64 version...</summary> In the arm64 version, the compiler replaced these functions with __memmove_chk and __memcpy_chk. The __chk versions have an extra parameter related to preventing buffer overflows. Neither the names nor the bodies of callees are incorporated into BSim signatures, but the arguments of a call are, so this change partly explains why the BSim vectors are not identical.</details>
</li>
<li>Examine the <strong>Listing View</strong> tab and verify that the architectures are indeed different.</li>
</ol>
<h2 id="a-remark-on-query-thresholds-and-indices">A Remark on Query Thresholds and Indices</h2>
<p>Q: If you set the similarity and confidence thresholds to 0.0, will a BSim query return all of the functions in the database?</p>
<p>A: No, because</p>
<ul>
<li>For indexed databases (i.e., PostgreSQL and Elasticsearch), the index is designed so that vector comparisons are only performed between vectors which are likely to be close.
Most vectors will not even be considered as potential matches for a given queried vector.</li>
<li>Regardless of database backend, matches are only shown if the confidence score is above the confidence threshold of the query.
The interface will not allow you to set a negative confidence threshold, but confidence scores can be negative.</li>
<li>The <strong>Matches per Function</strong> parameter also controls how many functions are returned.</li>
</ul>
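<p>The second and third points above can be sketched as a simple filter. This is a hedged illustration only: the class <code>ThresholdSketch</code>, its method, and the array-based inputs are hypothetical, the comparison is assumed to be "at or above" the threshold, and the index behavior from the first point (most vectors never being considered at all) is not modeled.</p>

```java
// Hypothetical sketch: shows why thresholds of 0.0 do not return every function.
final class ThresholdSketch {

    /**
     * Counts the candidate matches for one queried function that pass both
     * lower bounds, capped at the matches-per-function limit.
     */
    static int countReturned(double[] similarity, double[] confidence,
            double minSimilarity, double minConfidence, int maxMatches) {
        int kept = 0;
        for (int i = 0; i < similarity.length; i++) {
            // Confidence can be negative, so a threshold of 0.0 still rejects candidates.
            if (similarity[i] >= minSimilarity && confidence[i] >= minConfidence) {
                kept++;
            }
        }
        return Math.min(kept, maxMatches);
    }
}
```

<p>In this sketch, a candidate with a negative confidence score is dropped even when both thresholds are 0.0, and the cap removes further candidates once the per-function limit is reached.</p>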
<p>Next Section: <a href="BSimTutorial_Ghidra_Command_Line.html">Ghidra from the Command Line</a></p>

@ -1,38 +0,0 @@
<h1 id="creating-and-populating-a-bsim-database-from-the-ghidra-gui">Creating and Populating a BSim Database from the Ghidra GUI</h1>
<p>This section explains how to create and populate an H2-backed BSim database from the Ghidra GUI.</p>
<h2 id="creating-the-database">Creating the Database</h2>
<p>To create a BSim database, first create a directory on your file system to contain the database.</p>
<p>Next, perform the following steps from the Ghidra Code Browser:</p>
<ol>
<li>Run the Ghidra script <code>CreateH2BSimDatabaseScript.java</code>.</li>
<li>In the resulting dialog:
<ol>
<li>Enter “example” in the <strong>Database Name</strong> field.</li>
<li>Select the new directory in the <strong>Database Directory</strong> field.</li>
<li>Don't change any of the other fields.</li>
</ol>
</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<h2 id="populating-the-database">Populating the Database</h2>
<p>We now populate the database with an executable which is contained in the Ghidra distribution.</p>
<ol>
<li>Import and analyze the executable <code>&lt;ghidra_install_dir&gt;/GPL/DemanglerGnu/os/linux_x86_64/demangler_gnu_v2_41</code> using the default analysis options.</li>
<li>Run the Ghidra script <code>AddProgramToH2BSimDatabaseScript.java</code> on this program.
<ul>
<li>The script will ask you to select an H2 database file. Use <code>example.mv.db</code> in the database directory.</li>
</ul>
</li>
<li>In general you can run this script on other programs to add their signatures to this database, but that's not necessary for the exercises in the next section.</li>
</ol>
<p>Next Section: <a href="BSimTutorial_Basic_Queries.html">Basic BSim Queries</a></p>

@ -1,22 +0,0 @@
<h1 id="starting-ghidra-and-enabling-the-bsim-plugin">Starting Ghidra and Enabling the BSim Plugin:</h1>
<p>To begin the tutorial, perform the following steps:</p>
<ol>
<li>Launch Ghidra.</li>
<li>Create a new non-shared project for this tutorial.</li>
<li>Launch the Code Browser.</li>
</ol>
<p>To enable BSim, perform the following steps:</p>
<ol>
<li><strong>File -&gt; Configure</strong> from the Code Browser.</li>
<li>Click on the <code>Configure</code> link of the <code>BSim</code> entry.</li>
<li>In the resulting dialog, ensure that the checkbox for <code>BSimSearchPlugin</code> is checked.</li>
</ol>
<p><img src="images/configure.png" alt="configure dialog" /></p>
<p>Next Section: <a href="BSimTutorial_Creating_Database_From_GUI.html">Creating and Populating a BSim Database from the GUI</a></p>

@ -1,135 +0,0 @@
<h1 id="evaluating-matches-and-applying-information">Evaluating Matches and Applying Information</h1>
<p>Summarizing what we've created over the last few sections, we now have:</p>
<ol>
<li>A stripped executable (<code>postgres</code>).</li>
<li>A Ghidra project containing some object files <em>with debug information</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> used to build that executable.</li>
<li>A BSim database containing the BSim signatures of the object files.</li>
</ol>
<p>We now demonstrate using BSim to help reverse engineer <code>postgres</code>.
While doing this, we'll showcase some of the features available in the decompiler diff view.</p>
<h2 id="exercise-exploring-the-highlights">Exercise: Exploring the Highlights</h2>
<p>Import the stripped <code>postgres</code> executable into the tutorial project and analyze it, then perform the following steps:</p>
<ol>
<li>Select all functions in <code>postgres</code> via <code>Ctrl-A</code> in the Listing.</li>
<li>Perform a BSim query of the database <code>example</code>.
<ul>
<li><strong>Note:</strong> We use the results of this query in the following few exercises.
If you don't close the BSim search results window, you won't have to issue the query again.</li>
</ul>
</li>
<li>Sort the rows by confidence and find the row with <code>grouping_planner</code> as the matching function.
The corresponding function in <code>postgres</code> should have a default name.</li>
<li>Examine this match in the side-by-side decompiler view.
Note that the matching function has better data type information due to the debug information.</li>
<li>Q: Why does the placement of the <code>double</code> argument differ between the functions?
<details><summary>Answer</summary> Floating point values and integer/pointer values are passed in separate sets of registers.
Neither ordering is wrong since both are consistent with the instructions of the function.
The debug info records a specific signature (and ordering) for the function, which Ghidra applies.
In the version without debug information, the decompiler used heuristics to determine the function's signature.</details>
</li>
</ol>
<p>For matches with a fair number of differences, the decompiler diff panel can get pretty colorful.
Furthermore, as you click around, tokens will gain and lose highlights of various colors.
It's worth giving a brief explanation of when highlighting happens and what the different colors mean.
Some terminology: if you click on a token in a decompiler panel, that token becomes the <em>focused token</em>.</p>
<p><img src="images/decomp_diff.png" alt="Decomp Diff Window" /></p>
<p>The colors:</p>
<ul>
<li>Cyan is used to highlight differences between the two functions.</li>
<li>Pink is used to highlight the focused token and its match.</li>
<li>Lavender is used to highlight the focused token when it does not have a match.</li>
<li>Orange is used to highlight the focused token when it is ineligible for matching.
Certain tokens, such as whitespace tokens or tokens used in variable declarations, are never assigned matching tokens.</li>
</ul>
<h2 id="exercise-locking-and-unlocking-scrolling">Exercise: Locking and Unlocking Scrolling</h2>
<p>By default, scrolling in the diff window is synchronized.
This means that scrolling within one window will also scroll within the other window.
In the decompiler diff window, scrolling works by matching one line in the left function with one line in the right function.
The two functions are aligned using those lines.
Initially, the functions are aligned using the functions' signatures.</p>
<p>As you click around in either function, the “aligning lines” will change.
If the focused token has a match, the scrolling is re-centered based on the lines containing the matched tokens.
If the focused token does not have a match, the functions will be aligned using the closest token to the focused token which does have a match.</p>
<p>Synchronized scrolling can be toggled using the <img src="images/lock.gif" alt="lock icon" /> and <img src="images/unlock.gif" alt="unlock icon" /> icons in the toolbar.</p>
<ol>
<li>Experiment with locking and unlocking synchronized scrolling.</li>
</ol>
<h2 id="exercise-applying-signatures">Exercise: Applying Signatures</h2>
<p>If you are satisfied with a given match, you might want to apply information about the matching function to the queried function.
For example, you might want to apply the name or signature of the function.
There are some subtleties which determine how much information is safe to apply.
Hence there are three actions available under the <strong>Apply From Other</strong> menu when you right-click in the left panel:</p>
<ol>
<li><strong>Function Name</strong> will apply the right function's name and namespace to the function on the left.</li>
<li><strong>Function Signature</strong> will apply the name, namespace, and “skeleton” data types.
Structure and union data types are not transferred.
Instead, empty placeholder structures are created.</li>
<li><strong>Function Signature and Data Types</strong> will apply the name and signature with full data types.
This may result in many data types being imported into the program (consider structures which refer to other structures).</li>
</ol>
<p><strong>Warning</strong>: You should be absolutely certain that the datatypes are exactly the same before applying signatures and data types.
If there have been any changes to a datatype's definition, you could end up bringing incorrect datatypes into a program, even using BSim matches with 1.0 similarity.
Applying full data types is also problematic for cross-architecture matches.</p>
<ol>
<li>Since we know it's safe, apply the function signature and data types to the left function.</li>
</ol>
<p>There are similarly-named actions available on rows of the Function Matches table in the BSim Search Results window.
The <strong>Status</strong> column contains information about which rows have had their matches applied.</p>
<h2 id="exercise-comparing-callees">Exercise: Comparing Callees</h2>
<p>The token matching algorithm matches a function call in one program to a function call in another by considering the data flow into and out of the <code>CALL</code> instruction, but it does not do anything with the bodies of the callees.
However, given a matched pair of calls, you can bring up a new comparison window for the callees with the <strong>Compare Matching Callees</strong> action.</p>
<ol>
<li>Click in the left panel of the decompiler diff window and press <code>Ctrl-F</code>.</li>
<li>Enter <code>FUN_</code> and search for matched function calls where the callee in the left window has a default name and the callee in the right window has a non-default name.</li>
<li>Right-click on one of the matched tokens and perform the <strong>Compare Matching Callees</strong> action.</li>
<li>In the comparison of the callees, apply the function signature and data types from the right function to the left function.
Verify that the update is reflected in the decompiler diff view of the callers.</li>
</ol>
<h2 id="exercise-multiple-comparisons">Exercise: Multiple Comparisons</h2>
<p>The function shown in a panel is controlled by a drop-down menu at the top of the panel.
This can be useful when you'd like to evaluate multiple matches to a single function.</p>
<p>Exercise:</p>
<ol>
<li>In the BSim Search Results window, right-click on a table column name, select <strong>Add/Remove Columns</strong>, and enable the <strong>Matches</strong> column.</li>
<li>Find two functions in <code>postgres</code>, each of which has exactly two matches.
Select the corresponding four rows in the matches table and perform the <strong>Compare Functions</strong> action.</li>
<li>Experiment with the drop-downs in each panel.</li>
</ol>
<p>In the next section, we discuss the Executable Results table.</p>
<p>Next Section: <a href="BSimTutorial_Exe_Results.html">From Matching Functions to Matching Executables</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Having debug information isn't necessary to use BSim (as we've seen in a previous exercise), but it is convenient. Note that applying debug information can change BSim signatures, which can negatively impact matching between functions with debug information and functions without it. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
</ol>
</div>

@ -1,46 +0,0 @@
<h1 id="from-matching-functions-to-matching-executables">From Matching Functions to Matching Executables</h1>
<p>In this section, we discuss the Executable Results table.
Each row of this table corresponds to one executable in the database.
The information in one row is an aggregation of all of the function-level matches into that row's executable.
Your Executable Results table from the previous query should look similar to the following:</p>
<p><img src="images/exe_results.png" alt="executable results" /></p>
<p>If you select a single row in the table and right-click on it, you will see the following actions:</p>
<ul>
<li><strong>Load Executable</strong>
Opens a read-only copy of the program in the Code Browser.</li>
<li><strong>Filter on this Executable</strong>
Applies a filter which restricts the matches shown in the Function Matches table to matches which occur in the given executable.</li>
</ul>
<h2 id="exercise">Exercise</h2>
<ol>
<li>Sort the Executable results by descending <strong>Function Count</strong>.
An entry in this column shows the number of queried functions which have at least one match in the row's executable (if <code>foo</code> has 2 or more matches into a given executable, it still only contributes 1 to the function count).
What position is <code>demangler_gnu_v2_41</code>?
<details><summary>In this table...</summary> It's in the first position.</details>
</li>
<li>An entry in the <strong>Confidence</strong> column shows the sum of the confidence scores of all matches into the corresponding executable.
If <code>foo</code> has more than one match into a given executable, only the one with the highest (function-level) confidence contributes to the (executable-level) confidence score.
Sort the Executable results by descending confidence and observe that <code>demangler_gnu_v2_41</code> is now much further down the list.
<details><summary>What could explain this?</summary> If there are many function matches but the sum of all the confidences is relatively low, it is likely that many of the matches involve small functions with common BSim signatures.</details>
</li>
<li>In the Executable match table, right-click on <code>demangler_gnu_v2_41</code> and apply the filter action.
Sort the filtered function matches by descending confidence.
Starting at the top, examine some of the matches and convince yourself that the given explanation is correct.
<ul>
<li><strong>Note</strong>: You can remove the filter using the <strong>Filter Results</strong> icon <img src="images/exec.png" alt="Filter Results" /> in the toolbar.
We'll discuss this further in <a href="BSimTutorial_Filters.html">BSim Filters</a>.</li>
</ul>
</li>
</ol>
<p>From this exercise, we see that unrelated functions can be duplicates of each other, either because they are small or because they perform a common generic action.
Keep in mind that such functions can “pollute” the results of a blanket query.
In the next section, we demonstrate a technique to restrict queries to functions which are more likely to have meaningful matches.</p>
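<p>The aggregation rules from the exercise can be sketched in a few lines. This is an illustration under stated assumptions, not Ghidra's implementation: the class <code>ExeAggregateSketch</code>, the integer ids for queried functions and executables, and the parallel-array layout are all hypothetical. Each queried function contributes at most 1 to the function count and only its highest-confidence match to the executable-level confidence.</p>

```java
// Hypothetical sketch of the executable-level aggregation described in the exercise.
final class ExeAggregateSketch {

    /**
     * Aggregates function-level matches for one executable.
     * Match i pairs queried function queriedFn[i] (an id in [0, numQueriedFns))
     * with a function in executable matchedExe[i], scored by confidence[i].
     * Returns {functionCount, confidenceSum}.
     */
    static double[] aggregate(int[] queriedFn, int[] matchedExe, double[] confidence,
            int numQueriedFns, int exe) {
        boolean[] seen = new boolean[numQueriedFns]; // has this function matched into exe?
        double[] best = new double[numQueriedFns];   // highest confidence per function
        for (int i = 0; i < queriedFn.length; i++) {
            if (matchedExe[i] != exe) {
                continue; // match belongs to a different executable
            }
            int f = queriedFn[i];
            if (!seen[f] || confidence[i] > best[f]) {
                best[f] = confidence[i];
            }
            seen[f] = true;
        }
        int count = 0;
        double sum = 0.0;
        for (int f = 0; f < numQueriedFns; f++) {
            if (seen[f]) {
                count++;        // one per queried function, however many matches it has
                sum += best[f]; // only the best match contributes
            }
        }
        return new double[] { count, sum };
    }
}
```

<p>An executable with many low-confidence matches ends up with a high function count but a modest confidence sum, which is the pattern observed for <code>demangler_gnu_v2_41</code> in the exercise.</p>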
<p>Next Section: <a href="BSimTutorial_Overview_Queries.html">Overview Queries</a></p>

@ -1,21 +0,0 @@
<h1 id="bsim-filters">BSim Filters</h1>
<p>There are a number of filters that can be applied to BSim queries, involving names, architectures, compilers, ingest dates, user-defined executable categories, and other attributes.</p>
<p>Filters can be applied <em>server-side</em> or <em>client-side</em>.
Server-side filters affect the query results sent to Ghidra from a BSim server and can be applied using the <strong>Filters</strong> drop-down in the BSim Search dialog.
Client-side filters apply to the BSim Search results table and can be added and removed at will using the <strong>Filter Results</strong> icon <img src="images/exec.png" alt="Filter Results" />.
However, to “undo” a server-side filter, you have to issue another BSim query without the filter.</p>
<h2 id="exercise-filters">Exercise: Filters</h2>
<ol>
<li>Select all functions in <code>postgres</code> and bring up the BSim Search dialog.</li>
<li>Apply an <strong>Executable name does not equal</strong> filter with <code>demangler_gnu_v2_41</code> as the name to exclude.</li>
<li>Perform the query and verify <code>demangler_gnu_v2_41</code> is not in the list of executables with matches.</li>
<li>Using the <strong>Search Info</strong> icon <img src="images/information.png" alt="Search Info" /> in the BSim Search Results toolbar, you can see the server-side filters applied to the query.
Verify that this information is correct.</li>
<li>Using the <strong>Filter Results</strong> icon <img src="images/exec.png" alt="Filter Results" />, you can apply client-side filters to the query results. Experiment with applying and removing some client-side filters.</li>
</ol>
<p>Next Section: <a href="BSimTutorial_Scripting.html">Scripting and Visualization</a></p>

@ -1,53 +0,0 @@
<h1 id="ghidra-analysis-from-the-command-line">Ghidra Analysis from the Command Line</h1>
<p>For the remaining exercises, we need to populate our BSim database with a number of binaries.
We’d like a consistent set of binaries for the tutorial, but we don’t want to clutter the Ghidra distribution with dozens of additional executables.
Fortunately, the BSim plugin includes a script for building the PostgreSQL backend, and that build process creates hundreds of object files.
So we can just build PostgreSQL and harvest the object files we need.</p>
<p><strong>Note</strong>: For the tutorial, we continue to use the H2 BSim backend.
We do not run any PostgreSQL code; we simply analyze some files produced when building PostgreSQL.</p>
<p>Note that these files must be built on a machine running Linux.
Windows users can build these files in a Linux virtual machine.</p>
<p>To build the files, execute the following commands in a shell: <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<pre><code class="language-bash">cd &lt;ghidra_install_dir&gt;/Features/BSim
export CFLAGS="-O2 -g"
./make-postgres.sh
mkdir ~/postgres_object_files
cd build
find . -name 'p*o' -size +100000c -size -700000c -exec cp {} ~/postgres_object_files/ \;
cd os/linux_x86_64/postgresql/bin
strip -s postgres
</code></pre>
<p>To continue on Windows, transfer the <code>~/postgres_object_files</code> directory and the stripped <code>postgres</code> executable to your Windows machine.</p>
<h2 id="importing-and-analyzing-the-exercise-files">Importing and Analyzing the Exercise Files</h2>
<p>Now that we have the executables, we can analyze them with the headless analyzer<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
The headless analyzer is distinct from BSim, but using it is the only feasible way to analyze substantial numbers of binaries.</p>
<p>To analyze the files in Linux, execute the following commands in a shell.</p>
<pre><code class="language-bash">cd &lt;ghidra_install_dir&gt;/support
./analyzeHeadless &lt;ghidra_project_dir&gt; postgres_object_files -import ~/postgres_object_files/*
</code></pre>
<p>(On Windows, use <code>analyzeHeadless.bat</code> and adjust paths accordingly.)</p>
<p>This will create a local Ghidra project called <code>postgres_object_files</code> in the directory <code>&lt;ghidra_project_dir&gt;</code>.</p>
<p>Next Section: <a href="BSimTutorial_BSim_Command_Line.html">BSim from the Command Line</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You may need to install additional packages and/or change some build options in order for PostgreSQL to build successfully. The error messages are generally informative. See the comments in <code>make-postgres.sh</code>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The headless analyzer has its own documentation: <code>&lt;ghidra_install_dir&gt;/support/analyzeHeadlessREADME.html</code>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
</ol>
</div>

@ -1,114 +0,0 @@
<h1 id="introduction-to-bsim">Introduction to BSim</h1>
<p>As you’ve reverse engineered software, you’ve likely asked the following questions:</p>
<ul>
<li>Which libraries were statically linked into this executable?</li>
<li>Does this executable share some code with another executable that I’ve analyzed?</li>
<li>What are the differences between version 1 and version 2 of a given executable?</li>
<li>Does this executable share code with another executable in a large collection of binaries?</li>
<li>Was this function pulled from an open-source library?</li>
</ul>
<p>BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.</p>
<h1 id="how-does-bsim-work">How Does BSim Work?</h1>
<p>The idea behind BSim is to generate a <em>feature vector</em> for each function in a binary.
The vectors are generated by Ghidra’s decompiler.
Each feature represents a small piece of data flow and/or control flow of the associated function.
The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features.
Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.</p>
<p>BSim vectors are compared using <em>cosine similarity</em>.
Discrepancies between the vectors for <code>foo</code> and <code>bar</code> which are caused by differences in compilers, target architectures, and/or small changes to the source code typically result in vectors which are close but not identical.</p>
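<p>As a reminder of the measure involved, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below is a simplification under stated assumptions: it uses small dense arrays with made-up values, and the class and method names are hypothetical. It does not show how Ghidra actually represents or weights BSim features.</p>

```java
// Illustration only: a plain cosine similarity over dense arrays.
final class CosineSimilaritySketch {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            // Convention chosen for this sketch: a zero vector is similar to nothing.
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical feature counts for foo and bar; they differ in one feature.
        double[] foo = {3, 1, 0, 2};
        double[] bar = {3, 1, 1, 2};
        System.out.println("cosine(foo, foo) = " + cosine(foo, foo));
        System.out.println("cosine(foo, bar) = " + cosine(foo, bar));
    }
}
```

<p>Vectors that differ in only a few features score close to, but below, the value obtained when a vector is compared with itself.</p>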
<p>BSim vectors can be stored in a dedicated database.
BSim databases intended to hold large<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> numbers of vectors maintain an index based on <em>locality-sensitive hashing</em>.
The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.</p>
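<p>To give a feel for how locality-sensitive hashing can cut down comparisons, here is a generic random-hyperplane sketch, one textbook LSH family suited to cosine similarity. It is not a description of the index BSim actually builds; the class name, parameters, and seed are hypothetical. Each vector is reduced to a short bit pattern, and only vectors landing in the same bucket would need a full comparison.</p>

```java
import java.util.Random;

// Illustration only: generic random-hyperplane LSH, not BSim's index.
final class HyperplaneLshSketch {

    private final double[][] planes;

    HyperplaneLshSketch(int dimensions, int bits, long seed) {
        if (bits < 1 || bits > 31) {
            throw new IllegalArgumentException("bits must be between 1 and 31");
        }
        Random rng = new Random(seed);
        planes = new double[bits][dimensions];
        for (int i = 0; i < bits; i++) {
            for (int j = 0; j < dimensions; j++) {
                planes[i][j] = rng.nextGaussian();
            }
        }
    }

    /** Returns a bucket key: bit i records which side of hyperplane i the vector lies on. */
    int bucket(double[] v) {
        int key = 0;
        for (int i = 0; i < planes.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += planes[i][j] * v[j];
            }
            if (dot >= 0.0) {
                key |= 1 << i;
            }
        }
        return key;
    }

    public static void main(String[] args) {
        HyperplaneLshSketch lsh = new HyperplaneLshSketch(4, 8, 42L);
        double[] foo = {3, 1, 0, 2};
        double[] scaledFoo = {6, 2, 0, 4}; // same direction as foo
        System.out.println("same bucket: " + (lsh.bucket(foo) == lsh.bucket(scaledFoo)));
    }
}
```

<p>Vectors pointing in the same direction always share a bucket, and nearby vectors share one with high probability; a practical index would typically combine several such tables.</p>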
<p>Querying <code>foo</code> against a BSim database typically yields a number of potential matches.
Each individual match for <code>foo</code> can be compared to <code>foo</code> in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to <code>foo</code>.</p>
<p>We frequently call BSim vectors the <em>BSim signature</em> of a function, or just the <em>signature</em> when the context is clear.</p>
<h1 id="why-bsim">Why “BSim”?</h1>
<p>We can think of each feature as representing a small piece of the <em>behavior</em> of a function, analogous to a snippet of source code.
Functions whose BSim vectors are close typically have many features in common, that is, they have <em>similar behavior</em>.
Hence the name “BSim”: <strong>B</strong>ehavioral <strong>Sim</strong>ilarity.</p>
<h1 id="bsim-clients-bsim-databases-and-ghidra-projects">BSim Clients, BSim Databases, and Ghidra Projects</h1>
<p>Using BSim involves the following components:</p>
<ul>
<li>A <em>BSim Client</em>, i.e., an instance of Ghidra with the BSim plugin enabled.
<ul>
<li>This is where the reverse engineering happens.</li>
</ul>
</li>
<li>A <em>BSim Database</em>, which stores the BSim signatures.
<ul>
<li>Also stores some metadata about each function and its containing executable.</li>
<li>In particular, stores the ghidra:// URL of the associated Ghidra program.</li>
<li>Does not store disassembly or decompiled functions.</li>
</ul>
</li>
<li>A <em>Ghidra Project</em>, which stores the analyzed programs used to populate the BSim database.
<ul>
<li>Given a BSim match, the BSim client can use the ghidra:// URL to retrieve a program from a Ghidra project for side-by-side comparisons.</li>
<li>Note that a single BSim database can reference multiple Ghidra projects.</li>
</ul>
</li>
</ul>
<h1 id="database-backends">Database Backends</h1>
<p>There are three supported database backends for BSim:</p>
<ol>
<li>
<p>PostgreSQL</p>
<ul>
<li>The Ghidra distribution includes the source for PostgreSQL, a PostgreSQL plugin for BSim, and a build script.</li>
<li>Populated from shared Ghidra projects (i.e., requires a Ghidra server).</li>
<li>Server not supported on Windows (no restriction on clients).</li>
</ul>
</li>
<li>
<p>Elasticsearch</p>
<ul>
<li>The <code>BSimElasticPlugin</code> extension contains an Elasticsearch plugin for BSim.</li>
<li>This plugin must be installed into an existing Elasticsearch database.</li>
<li>Populated from shared Ghidra projects.</li>
</ul>
</li>
<li>
<p>H2</p>
<ul>
<li>Simplest way to use BSim:
<ul>
<li>Backed by files on the user’s machine (don’t need to install a database server).</li>
<li>Can be created and populated quickly.</li>
<li>Supported on all platforms.</li>
</ul>
</li>
<li>Does not support large collections of binaries or multiple users.</li>
<li>Can be populated from non-shared (local) or shared Ghidra projects.</li>
</ul>
</li>
</ol>
<p>Next Section: <a href="BSimTutorial_Enabling.html">Starting Ghidra and Enabling BSim</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Creating a database requires a <em>database template</em>, which determines the specifics of the index. Currently, Ghidra provides a <em>medium</em> template, intended for databases holding up to 10 million unique vectors, and a <em>large</em> template, intended for databases holding up to 100 million unique vectors. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
</ol>
</div>

@ -1,51 +0,0 @@
<h1 id="overview-queries">Overview Queries</h1>
<p>An <strong>Overview Query</strong> queries a BSim database for the number of matches to each function in an executable.
The matching functions themselves are not returned.
Similarity and Confidence thresholds can be set for an Overview Query, but there is no “Matches per Function” bound and no filters can be applied.</p>
<p>To perform an Overview Query, select <strong>BSim -&gt; Perform Overview…</strong> from the Code Browser.</p>
<h2 id="exercise-hit-counts-and-self-significance">Exercise: Hit Counts and Self-Significance</h2>
<ol>
<li>Perform an Overview query on <code>postgres</code> using the default query thresholds.
You should see the following result:
<img src="images/overview_window.png" alt="overview window" /></li>
<li>Sort the table by the “Hit Count” column in ascending order. Typically, the functions with the largest hit counts will have low self-significance.
Verify that this holds for the table.</li>
<li>Q: Examine the functions with the highest hit count. Why are there so many matches for these functions?
<details><summary>Answer:</summary> These are all instances of PostgreSQL statistics-reporting functions. Their bodies are quite similar and they have identical BSim signatures.</details>
</li>
</ol>
<h2 id="exercise-selections-and-queries">Exercise: Selections and Queries</h2>
<p>Using the hit count column, it is possible to exclude functions with large numbers of matches.</p>
<ol>
<li>In the Overview Table, select all functions whose hit count is 2 or less.</li>
<li>Right-click on the selection and perform the <strong>Search Selected Functions</strong> action.
Sort the query results by descending <strong>Function Count</strong> and verify that <code>demangler_gnu_v2_41</code> is far down the list.</li>
</ol>
<h2 id="exercise-vector-hashes">Exercise: Vector Hashes</h2>
<p>Suppose <code>foo</code> and <code>bar</code> have the same number of hits in the Overview table.
There are two possibilities:</p>
<ol>
<li><code>foo</code> and <code>bar</code> have distinct feature vectors which happen to have the same number of matches.</li>
<li><code>foo</code> and <code>bar</code> have the same feature vector.</li>
</ol>
<p>An optional column, <strong>Vector Hash</strong>, can be used to distinguish between these two cases.</p>
<ol>
<li>Enable the <strong>Vector Hash</strong> Column in the Overview Table.</li>
<li>Find two functions with the same vector hash.</li>
<li>Select the two corresponding rows in the table and then transfer the selection to the Listing using the <img src="images/text_align_justify.png" alt="make selection icon" /> icon in the BSim Overview toolbar.</li>
<li>In the Listing, press <code>Shift-C</code> or right-click and perform the <strong>Compare Selected Functions</strong> action.</li>
<li>In the resulting Function Comparison window, convince yourself that these two functions should have the same BSim signature.</li>
</ol>
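<p>The same distinction can be expressed in code. The sketch below is a hypothetical illustration, not Ghidra API: it assumes each Overview row has been exported as a function name, hit count, and vector hash, and it groups function names by hash so that functions sharing a feature vector stand out. The row values are made up.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: Row is a hypothetical stand-in for one Overview table row.
final class VectorHashGroupingSketch {

    static final class Row {
        final String function;
        final int hitCount;
        final long vectorHash;

        Row(String function, int hitCount, long vectorHash) {
            this.function = function;
            this.hitCount = hitCount;
            this.vectorHash = vectorHash;
        }
    }

    /** Groups function names by vector hash, keeping hashes shared by two or more functions. */
    static Map<Long, List<String>> sharedVectors(List<Row> rows) {
        Map<Long, List<String>> byHash = new TreeMap<>();
        for (Row r : rows) {
            byHash.computeIfAbsent(r.vectorHash, k -> new ArrayList<>()).add(r.function);
        }
        // Drop hashes that belong to a single function.
        byHash.values().removeIf(names -> names.size() < 2);
        return byHash;
    }

    public static void main(String[] args) {
        // All three functions have the same hit count, but only foo and bar share a vector.
        List<Row> rows = List.of(
            new Row("foo", 3, 0xAAL),
            new Row("bar", 3, 0xAAL),
            new Row("baz", 3, 0xBBL));
        sharedVectors(rows).forEach((hash, names) ->
            System.out.println(Long.toHexString(hash) + " -> " + names));
    }
}
```

<p>Equal hit counts alone do not separate the two cases; a shared vector hash does.</p>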
<p>Next Section: <a href="BSimTutorial_Filters.html">BSim Filters</a></p>

@ -1,23 +0,0 @@
<h1 id="scripting-and-visualization">Scripting and Visualization</h1>
<p>Finally, we briefly mention a few other topics related to BSim.</p>
<h2 id="scripting-bsim">Scripting BSim</h2>
<p>There are a number of example scripts in the <code>BSim</code> script category, which demonstrate how to interact with BSim programmatically.</p>
<p><img src="images/script_manager.png" alt="script manager" /></p>
<h2 id="visualizing-features">Visualizing Features</h2>
<p>Finally, if you’d like to see the particular BSim features in a function, you can use the BSim Feature Visualizer.
This plugin allows you to highlight regions of the decompiled code corresponding to a particular feature and to display a graph representing the feature.</p>
<p>To use this plugin, first enable the <code>BSimFeatureVisualizerPlugin</code> via <strong>File -&gt; Configure</strong> from the Code Browser.
You can then bring it up via <strong>BSim -&gt; BSim Feature Visualizer</strong>.</p>
<p><img src="images/feature_visualizer.png" alt="feature visualizer" /></p>
<p>This is the end of the tutorial.</p>
<p><a href="README.html">Return to the Beginning</a></p>

@ -1,24 +0,0 @@
<h1 id="bsim-tutorial">BSim Tutorial</h1>
<p>BSim is a Ghidra plugin for finding structurally similar functions in (potentially large) collections of binaries.
It is based on Ghidra’s decompiler and can find matches across compilers, architectures, and/or small changes to source code.</p>
<p>This tutorial demonstrates how to create a small BSim database and walks through some typical use cases.</p>
<p><strong>Detailed information about BSim can be found in the “BSim” entry of the Ghidra Help</strong>.</p>
<ol>
<li><a href="BSimTutorial_Intro.html">Introduction to BSim</a></li>
<li><a href="BSimTutorial_Enabling.html">Starting Ghidra and Enabling BSim</a></li>
<li><a href="BSimTutorial_Creating_Database_From_GUI.html">Creating and Populating a BSim Database from the GUI</a></li>
<li><a href="BSimTutorial_Basic_Queries.html">Basic BSim Queries</a></li>
<li><a href="BSimTutorial_Ghidra_Command_Line.html">Ghidra from the Command Line</a></li>
<li><a href="BSimTutorial_BSim_Command_Line.html">BSim from the Command Line</a></li>
<li><a href="BSimTutorial_Evaluating_Matches.html">Evaluating Matches</a></li>
<li><a href="BSimTutorial_Exe_Results.html">From Matching Functions to Matching Executables</a></li>
<li><a href="BSimTutorial_Overview_Queries.html">Overview Queries</a></li>
<li><a href="BSimTutorial_Filters.html">BSim Filters</a></li>
<li><a href="BSimTutorial_Scripting.html">Scripting and Visualization</a></li>
</ol>
<p>Next Section: <a href="BSimTutorial_Intro.html">Introduction to BSim</a></p>

@ -55,4 +55,8 @@ rootProject.assembleMarkdownToHtml {
from ("${this.projectDir}/InstallationGuide.md") {
into "docs"
}
from ("${this.projectDir}/GhidraClass/BSim") {
include "*.md"
into "docs/GhidraClass/BSim"
}
}