Patterns & Phrases

Administrator Help | Forcepoint DLP | Version 8.8.1

Related topics:

Adding or editing a regular expression classifier

Adding a key phrase classifier

Adding a dictionary classifier

File properties

Use the Main > Policy Management > Content Classifiers > Patterns & Phrases page in the Data Security module of the Forcepoint Security Manager to view or manage a list of script, regular expression, dictionary, and key phrase content classifiers.

Use the Type column to tell whether a classifier is predefined (built-in) or user-defined. The list can be sorted by this column.

Refer to Predefined Classifiers for details about each predefined Patterns & Phrases classifier.

On this page:

Click New, then select the classifier type to add a new regular expression (regex), key phrase, or dictionary.

Select a classifier, then click Delete to remove the selected classifier.

Refer to the Used in a Policy column to determine whether or not a classifier is used. For classifiers that are in use, click Where Used to see which policies use the classifier.

Regular expression patterns

Regex patterns are special text strings that describe search patterns to be detected within content. (Content includes the body of the transmission as well as any attachments.) You define the patterns to look for in content, and you set the action to take when a pattern is found.

For example, the string "a\d+" matches all strings that start with the letter "a" and are followed by at least one digit, where "\d" represents any digit and "+" represents "at least one." When the extracted text from a transaction is scanned, Forcepoint DLP uses regular expressions to find strings in the text that match patterns for confidential information. For example, this is a very basic regular expression for catching Visa credit card numbers:

\b(4\d{3}[\-\\]\d{4}[\-\\]\d{4}[\-\\]\d{4})\b
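The two expressions above can be tried out in a short script. This is only an illustrative sketch: it uses Python's re module, which is an assumption made for demonstration and may not behave identically to the regex engine in Forcepoint DLP. The sample text is invented.

```python
import re

# "a" followed by one or more digits, as described in the text.
simple = re.compile(r"a\d+")

# The basic Visa pattern from the text, copied unchanged.
# As written, the class [\-\\] accepts a hyphen or a backslash as the separator.
visa = re.compile(r"\b(4\d{3}[\-\\]\d{4}[\-\\]\d{4}[\-\\]\d{4})\b")

# Invented sample text for illustration.
text = "Order a42 paid with 4111-1111-1111-1111; ref 5111-1111-1111-1111."

simple_hits = simple.findall(text)  # strings matching a\d+
visa_hits = visa.findall(text)      # 16-digit groups that start with 4

print(simple_hits)
print(visa_hits)
```

The second number in the sample text starts with 5, so the Visa pattern skips it. A number written with spaces between the groups would also be skipped, because the separator class lists only the hyphen and the backslash.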

Because a regular expression file contains many internal attributes, an improperly written one can create many false-positive incidents, slow down the system, and impede analysis.

One way of mitigating false positives in a pattern is to exclude certain values that falsely match it. When defining the classifier, define a "Pattern to exclude" listing words or phrases that are exceptions to the pattern rule (for example, search for all Social Security numbers except specific values that look like Social Security numbers but are not).

You can also add a "List of phrases to exclude" with words or phrases that, when found in combination with the pattern, affect whether or not the content is considered suspicious.

Another way to mitigate false positives is to consider the pattern suspicious only when some other pattern or set of words also appears in the analyzed data. To do this, create each content classifier (a pattern, a dictionary, or any other type), then combine them in a rule condition with an AND operator.
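The effect of combining two classifiers with AND can be sketched as follows. This is a hypothetical illustration, not how rule conditions are configured in the product: the simplified Social Security number pattern, the keyword set, and the function names are all assumptions.

```python
import re

# Simplified SSN pattern, for illustration only (not the predefined classifier).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical dictionary of supporting terms.
KEYWORDS = {"ssn", "social security"}


def pattern_hit(text):
    """True if the text contains at least one SSN-like string."""
    return bool(SSN.search(text))


def dictionary_hit(text):
    """True if any dictionary term appears (case-insensitive, an assumption)."""
    lowered = text.lower()
    return any(term in lowered for term in KEYWORDS)


def condition_met(text):
    """AND condition: both classifiers must match."""
    return pattern_hit(text) and dictionary_hit(text)


print(condition_met("Part no. 123-45-6789"))
print(condition_met("SSN: 123-45-6789"))
```

In this sketch, a part number that merely looks like a Social Security number does not satisfy the condition, because no supporting term appears alongside it.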

When creating a rule for a policy, specify how many instances (matches) of the pattern must be found before the content is considered suspicious enough for the configured action to be taken (for example, 4 or more Social Security numbers).

For each content transmission, the system tallies the number of instances of the pattern found in the content.

If the number of pattern matches is less than the number of matches set, the content is not considered suspicious and there is no further analysis.

If the number of pattern matches is equal to or greater than the number of matches set, the content triggers the action specified in the rule.

Example:

The pattern is Social Security numbers and the number of matches is 4.

The body of an email contains 3 Social Security numbers; the subject contains 2 Social Security numbers.

Since there are 5 pattern matches in total, which is greater than the number of matches set (4), the message triggers the action specified in the rule that uses this pattern.
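The tally in this example can be sketched in a few lines. The simplified Social Security number pattern, the sample subject and body, and the function names are assumptions for illustration. The sketch only mirrors the counting rule described above, not the product's internal analysis.

```python
import re

# Simplified SSN pattern, for illustration only (not the predefined classifier).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def count_matches(parts):
    """Tally pattern matches across all parts of one transmission."""
    return sum(len(SSN.findall(part)) for part in parts)


def triggers(parts, threshold):
    """Trigger when the tally is equal to or greater than the threshold."""
    return count_matches(parts) >= threshold


# Invented sample: 2 matches in the subject, 3 in the body.
subject = "SSNs 111-22-3333 and 222-33-4444"
body = "Also 333-44-5555, 444-55-6666, 555-66-7777"

total = count_matches([subject, body])
print(total, triggers([subject, body], threshold=4))
```

Neither the subject nor the body reaches the threshold alone. The rule triggers because the matches are tallied across the whole transmission.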

Pattern to exclude

Administrators can define a list of exceptions to a regular expression, script, or dictionary classifier. This is a list of content that matches the classifier, but should not be considered in the tally of matches. For each content item transmitted, the system tallies the number of instances of the pattern, and subtracts any matches in the Exclude list.

Example:

The pattern is Social Security numbers, the number of matches is 2, and the list of excluded patterns is: 111-11-1111, 222-22-2222, and 333-33-3333 (a total of three in the excluded list).

The email contains 7 Social Security numbers: 111-11-1111, 222-33-4444, 444-55-6666, 555-66-7777, 222-22-2222, 777-88-9999, 333-33-3333.

The number of pattern matches is 7, minus the 3 excluded values that were found in the email, which leaves 4. Since 4 is greater than the number of matches (2), the message triggers the action specified in the rule that uses this pattern.
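The arithmetic in this example can be checked with a short sketch. The function name and the use of exact string comparison for exclusions are assumptions made for illustration.

```python
def effective_matches(found, excluded):
    """Count the matches that are not on the exclude list."""
    excluded_set = set(excluded)
    return sum(1 for value in found if value not in excluded_set)


# Values taken from the example above.
found = [
    "111-11-1111", "222-33-4444", "444-55-6666", "555-66-7777",
    "222-22-2222", "777-88-9999", "333-33-3333",
]
excluded = ["111-11-1111", "222-22-2222", "333-33-3333"]
threshold = 2

remaining = effective_matches(found, excluded)  # 7 found, 3 excluded
print(remaining, remaining >= threshold)
```

With exact comparison, an excluded value must be written the same way as it appears in the content. A value listed with spaces would not cancel a match written with hyphens.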

List of phrases to exclude

Administrators can add a list of suspicious words to a regular expression, script, or dictionary classifier. For each content item transmitted, the rule applies its action only if the total number of matches reaches the configured number of matches and a string from the specified list is found. If the number of matches is reached but no strings from the list are present, no further analysis is performed.

Example:

The pattern is Social Security numbers, the number of matches is 2, and the list of phrases contains "Social Security" and "credit card". The distributed content contains 3 Social Security numbers: 111-22-3333, 222-33-4444, 444-55-6666, but none of the listed phrases. Although the number of pattern matches found (3) is greater than the number of matches (2), none of the listed phrases appear in the content, so no action is taken.
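The behavior described in this section can be sketched as a two-step check. The function name, the case-insensitive comparison, and the sample strings are assumptions for illustration. The threshold test uses "equal to or greater than", following the tally rule stated earlier on this page.

```python
def rule_applies(match_count, threshold, phrases, content):
    """Apply the rule only if the threshold is met AND a listed phrase is present."""
    if match_count < threshold:
        return False
    lowered = content.lower()  # case-insensitive comparison is an assumption
    return any(phrase.lower() in lowered for phrase in phrases)


phrases = ["Social Security", "credit card"]

# Content from the example: three numbers, none of the listed phrases.
without_phrase = "Numbers: 111-22-3333, 222-33-4444, 444-55-6666"
# Invented variant that also contains a listed phrase.
with_phrase = "Social Security numbers: 111-22-3333, 222-33-4444, 444-55-6666"

print(rule_applies(3, 2, phrases, without_phrase))
print(rule_applies(3, 2, phrases, with_phrase))
```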