Character filters examine text one character at a time and perform filtering operations. Character filters require a type field, and some take additional options as well.

```json
"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]
```
Character Filter Types
MongoDB Search supports the following types of character filters:
- htmlStrip
- icuNormalize
- mapping
- persian
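The type requirement can be checked locally before an index definition is submitted. The following is a minimal sketch in plain JavaScript, not part of MongoDB or any driver. The helper name validateCharFilters and the REQUIRED_OPTIONS table are hypothetical, and the required options are taken from the attribute tables on this page.

```javascript
// Hypothetical helper, not a MongoDB API.
// Required options per character filter type, as listed on this page.
const REQUIRED_OPTIONS = new Map([
  ["htmlStrip", ["ignoredTags"]],
  ["icuNormalize", []],
  ["mapping", ["mappings"]],
  ["persian", []],
]);

// Returns a list of human-readable problems; an empty list means
// every filter has a known type and all of its required options.
function validateCharFilters(charFilters) {
  const problems = [];
  charFilters.forEach((filter, i) => {
    if (typeof filter.type !== "string") {
      problems.push(`charFilters[${i}]: missing "type"`);
      return;
    }
    const required = REQUIRED_OPTIONS.get(filter.type);
    if (required === undefined) {
      problems.push(`charFilters[${i}]: unknown type "${filter.type}"`);
      return;
    }
    for (const option of required) {
      if (!(option in filter)) {
        problems.push(`charFilters[${i}]: "${filter.type}" requires "${option}"`);
      }
    }
  });
  return problems;
}

console.log(validateCharFilters([{ type: "htmlStrip", ignoredTags: ["a"] }])); // prints []
console.log(validateCharFilters([{ type: "mapping" }]));
```

A check like this only catches missing fields; MongoDB Search remains the authority on whether an index definition is valid.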
The following sample index definitions and queries use the sample
collection named minutes.
To follow along with these examples, load the minutes collection on your cluster
and navigate to the Create a Search Index page in the Atlas UI following the steps
in the Create a MongoDB Search Index tutorial.
Then, select the minutes collection as your data source, and follow the example procedure
to create an index from the Atlas UI or using mongosh.
htmlStrip
The htmlStrip character filter strips out HTML constructs.
Attributes
The htmlStrip character filter has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| type | string | yes | Human-readable label that identifies this character filter type. Value must be htmlStrip. |
| ignoredTags | array of strings | yes | List that contains the HTML tags to exclude from filtering. |
Example
The following index definition example indexes the text.en_US
field in the minutes collection
using a custom analyzer named htmlStrippingAnalyzer. The
custom analyzer specifies the following:
- Remove all HTML tags from the text except the a tag using the htmlStrip character filter.
- Generate tokens based on word break rules from the Unicode Text Segmentation algorithm using the standard tokenizer. 
- In the Custom Analyzers section, click Add Custom Analyzer. 
- Select the Create Your Own radio button and click Next. 
- Type htmlStrippingAnalyzer in the Analyzer Name field.
- Expand Character Filters and click Add character filter.
- Select htmlStrip from the dropdown and type a in the ignoredTags field.
- Click Add character filter. 
- Expand Tokenizer if it's collapsed and select standard from the dropdown. 
- Click Add to add the custom analyzer to your index. 
- In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field. 
- Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown. 
- In the properties section for the data type, select htmlStrippingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
- Click Add, then Save Changes. 
Replace the default index definition with the following:
```json
{
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "dynamic": true,
        "fields": {
          "en_US": {
            "analyzer": "htmlStrippingAnalyzer",
            "type": "string"
          }
        }
      }
    }
  },
  "analyzers": [{
    "name": "htmlStrippingAnalyzer",
    "charFilters": [{
      "type": "htmlStrip",
      "ignoredTags": ["a"]
    }],
    "tokenizer": {
      "type": "standard"
    },
    "tokenFilters": []
  }]
}
```
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "fields": {
        "text": {
          "type": "document",
          "dynamic": true,
          "fields": {
            "en_US": {
              "analyzer": "htmlStrippingAnalyzer",
              "type": "string"
            }
          }
        }
      }
    },
    "analyzers": [
      {
        "name": "htmlStrippingAnalyzer",
        "charFilters": [
          {
            "type": "htmlStrip",
            "ignoredTags": ["a"]
          }
        ],
        "tokenizer": {
          "type": "standard"
        },
        "tokenFilters": []
      }
    ]
  }
)
```
The following query looks for occurrences of the string head in
the text.en_US field of the minutes collection.
- Click the Query button for your index. 
- Click Edit Query to edit the query. 
- Click on the query bar and select the database and collection. 
- Replace the default query with the following and click Find:

  ```json
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  }
  ```

```
SCORE: 0.32283568382263184  _id: “2”
  message: "do not forget to SIGN-IN. See ① for details."
  page_updated_by: Object
    last_name: "OHRBACH"
    first_name: "Noël"
    email: "ohrbach@example.com"
    phone: "(123) 456 0987"
  text: Object
    en_US: "The head of the sales department spoke first."
    fa_IR: "ابتدا رئیس بخش فروش صحبت کرد"
    sv_FI: "Först talade chefen för försäljningsavdelningen"

SCORE: 0.3076632022857666  _id: “3”
  message: "try to sign-in"
  page_updated_by: Object
    last_name: "LEWINSKY"
    first_name: "Brièle"
    email: "lewinsky@example.com"
    phone: "(123).456.9870"
  text: Object
    en_US: "<body>We'll head out to the conference room by noon.</body>"
```
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])
```
```
[
  {
    _id: 2,
    text: { en_US: "The head of the sales department spoke first." }
  },
  {
    _id: 3,
    text: {
      en_US: "<body>We'll head out to the conference room by noon.</body>"
    }
  }
]
```
MongoDB Search doesn't return the document with _id: 1 because the
string head is part of the HTML tag <head>. The
document with _id: 3 contains HTML tags, but the string
head is elsewhere so the document is a match. The following
table shows the tokens that MongoDB Search generates for the text.en_US
field values in documents _id: 1, _id: 2, and  _id: 3 in
the minutes collection using the
htmlStrippingAnalyzer.
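| Document ID | Output Tokens |
|---|---|
| _id: 1 | This, page, deals, with, department, meetings |
| _id: 2 | The, head, of, the, sales, department, spoke, first |
| _id: 3 | We'll, head, out, to, the, conference, room, by, noon |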
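The behavior described above can be approximated outside of MongoDB Search. The following sketch is plain JavaScript and only a rough stand-in: stripHtml and tokenizeWords are hypothetical helpers that use regular expressions, not Lucene's HTML stripping or the Unicode Text Segmentation rules that the standard tokenizer follows. It shows why head never becomes a token for the document with _id: 1.

```javascript
// Rough approximation of htmlStrip (assumption, not Lucene's algorithm):
// replace every tag with a space unless its name is listed in ignoredTags.
function stripHtml(text, ignoredTags = []) {
  const keep = new Set(ignoredTags.map((tag) => tag.toLowerCase()));
  return text.replace(/<\/?([A-Za-z][A-Za-z0-9]*)\b[^>]*>/g, (match, name) =>
    keep.has(name.toLowerCase()) ? match : " "
  );
}

// Rough approximation of the standard tokenizer: runs of letters and digits,
// keeping word-internal apostrophes (as in "We'll").
function tokenizeWords(text) {
  return text.match(/[\p{L}\p{N}]+(?:'[\p{L}\p{N}]+)*/gu) || [];
}

// text.en_US values from the minutes collection, keyed by _id.
const htmlSamples = {
  1: "<head> This page deals with department meetings.</head>",
  2: "The head of the sales department spoke first.",
  3: "<body>We'll head out to the conference room by noon.</body>",
};

for (const [id, value] of Object.entries(htmlSamples)) {
  const tokens = tokenizeWords(stripHtml(value, ["a"]));
  console.log(`_id: ${id}`, tokens, "contains head:", tokens.includes("head"));
}
```

Because the character filter runs before the tokenizer, the tag name in the first document is discarded before any token is produced.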
icuNormalize
The icuNormalize character filter normalizes text with the ICU Normalizer. It is based on Lucene's
ICUNormalizer2CharFilter.
Attributes
The icuNormalize character filter has the following attribute:
| Name | Type | Required? | Description |
|---|---|---|---|
| type | string | yes | Human-readable label that identifies this character filter type. Value must be icuNormalize. |
Example
The following index definition example indexes the message field
in the minutes collection using a
custom analyzer named normalizingAnalyzer. The custom analyzer
specifies the following:
- Normalize the text in the message field value using the icuNormalize character filter.
- Tokenize the words in the field based on occurrences of whitespace between words using the whitespace tokenizer. 
- In the Custom Analyzers section, click Add Custom Analyzer. 
- Select the Create Your Own radio button and click Next. 
- Type normalizingAnalyzer in the Analyzer Name field.
- Expand Character Filters and click Add character filter. 
- Select icuNormalize from the dropdown and click Add character filter. 
- Expand Tokenizer if it's collapsed and select whitespace from the dropdown. 
- Click Add to add the custom analyzer to your index. 
- In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field. 
- Select message from the Field Name dropdown and String from the Data Type dropdown. 
- In the properties section for the data type, select normalizingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
- Click Add, then Save Changes. 
Replace the default index definition with the following:
```json
{
  "mappings": {
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "normalizingAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}
```
```javascript
db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "normalizingAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
})
```
The following query searches for occurrences of the string
no (for number) in the message field of the minutes
collection.
- Click the Query button for your index. 
- Click Edit Query to edit the query. 
- Click on the query bar and select the database and collection. 
- Replace the default query with the following and click Find:

  ```json
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  }
  ```

```
SCORE: 0.4923309087753296  _id: “4”
  message: "write down your signature or phone №"
  page_updated_by: Object
    last_name: "LEVINSKI"
    first_name: "François"
    email: "levinski@example.com"
    phone: "123-456-8907"
  text: Object
    en_US: "<body>This page has been updated with the items on the agenda.</body>"
    es_MX: "La página ha sido actualizada con los puntos de la agenda."
    pl_PL: "Strona została zaktualizowana o punkty porządku obrad."
```
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1,
      "title": 1
    }
  }
])
```
```
[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2',
    message: 'write down your signature or phone №'
  }
]
```
MongoDB Search matched the document with _id: 4 to the query term no
because it normalized the numero symbol № in the field using the
icuNormalize character filter and created the token no for
that typographic abbreviation of the word "number". MongoDB Search generates
the following tokens for the message field value in document
_id: 4 using the normalizingAnalyzer:
| Document ID | Output Tokens |
|---|---|
| _id: 4 | write, down, your, signature, or, phone, no |
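The normalization step can be approximated in plain JavaScript. This sketch assumes the filter applies ICU's NFKC_Casefold form, which is the default for Lucene's ICUNormalizer2CharFilter, and stands in for it with String.prototype.normalize("NFKC") followed by toLowerCase(). The two are close but not identical, and the helper names are hypothetical.

```javascript
// Approximation only: NFKC plus lowercasing stands in for NFKC_Casefold.
function normalizeText(text) {
  return text.normalize("NFKC").toLowerCase();
}

// Stand-in for the whitespace tokenizer: split on runs of whitespace.
function tokenizeOnWhitespace(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// message value from the document with _id: 4.
const message = "write down your signature or phone №";
console.log(tokenizeOnWhitespace(normalizeText(message)));
// The numero sign (U+2116) decomposes to "No", which lowercases to "no".
```

The same compatibility normalization turns the circled digit ① in the document with _id: 2 into the plain digit 1.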
mapping
The mapping character filter applies user-specified normalization
mappings to characters. It is based on Lucene's MappingCharFilter.
Attributes
The mapping character filter has the following attributes:
| Name | Type | Required? | Description |
|---|---|---|---|
| type | string | yes | Human-readable label that identifies this character filter type. Value must be mapping. |
| mappings | object | yes | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format "<original>": "<replacement>". |
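When many characters map to the same replacement, the mappings object can be generated rather than typed by hand. The following sketch is plain JavaScript; buildDeletionFilter is a hypothetical helper, not a MongoDB API. It produces a mapping character filter in which every listed string maps to the empty string, which removes it from the text.

```javascript
// Hypothetical helper: build a mapping character filter that deletes
// each listed string by mapping it to the empty string.
function buildDeletionFilter(stringsToRemove) {
  const mappings = Object.fromEntries(stringsToRemove.map((s) => [s, ""]));
  return { type: "mapping", mappings };
}

// Characters commonly used as phone number separators.
const phoneSeparators = ["-", ".", "(", ")", " "];
console.log(JSON.stringify(buildDeletionFilter(phoneSeparators), null, 2));
```

The resulting object can be placed in the charFilters array of a custom analyzer.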
Example
The following index definition example indexes the
page_updated_by.phone field in the minutes collection using a custom analyzer named
mappingAnalyzer. The custom analyzer specifies the following:
- Remove instances of hyphen (-), dot (.), open parenthesis ((), close parenthesis ()), and space characters in the phone field using the mapping character filter.
- Tokenize the entire input as a single token using the keyword tokenizer. 
- In the Custom Analyzers section, click Add Custom Analyzer. 
- Select the Create Your Own radio button and click Next. 
- Type mappingAnalyzer in the Analyzer Name field.
- Expand Character Filters and click Add character filter. 
- Select mapping from the dropdown and click Add mapping. 
- Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty.

  | Original | Replacement |
  |---|---|
  | - | |
  | . | |
  | ( | |
  | ) | |
  | {SPACE} | |
- Click Add character filter. 
- Expand Tokenizer if it's collapsed and select keyword from the dropdown. 
- Click Add to add the custom analyzer to your index. 
- In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone (nested) field. 
- Select page_updated_by.phone (nested) from the Field Name dropdown and String from the Data Type dropdown. 
- In the properties section for the data type, select mappingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
- Click Add, then Save Changes. 
Replace the default index definition with the following:
```json
{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "mappingAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "mappings": {
            "-": "",
            ".": "",
            "(": "",
            ")": "",
            " ": ""
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}
```
```javascript
db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "fields": {
        "page_updated_by": {
          "fields": {
            "phone": {
              "analyzer": "mappingAnalyzer",
              "type": "string"
            }
          },
          "type": "document"
        }
      }
    },
    "analyzers": [
      {
        "name": "mappingAnalyzer",
        "charFilters": [
          {
            "mappings": {
              "-": "",
              ".": "",
              "(": "",
              ")": "",
              " ": ""
            },
            "type": "mapping"
          }
        ],
        "tokenizer": {
          "type": "keyword"
        }
      }
    ]
  }
)
```
The following query searches the page_updated_by.phone field for
the string 1234567890.
- Click the Query button for your index. 
- Click Edit Query to edit the query. 
- Click on the query bar and select the database and collection. 
- Replace the default query with the following and click Find:

  ```json
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  }
  ```

```
SCORE: 0.5472603440284729  _id: “1”
  message: "try to siGn-In"
  page_updated_by: Object
    last_name: "AUERBACH"
    first_name: "Siân"
    email: "auerbach@example.com"
    phone: "(123)-456-7890"
  text: Object
    en_US: "<head> This page deals with department meetings.</head>"
    sv_FI: "Den här sidan behandlar avdelningsmöten"
    fr_CA: "Cette page traite des réunions de département"
```
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```
[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', phone: '(123)-456-7890' }
  }
]
```
The MongoDB Search results contain one document where the numbers in the
phone string match the query string. MongoDB Search matched the
document to the query string even though the query doesn't
include the parentheses around the phone area code and the
hyphen between the numbers because MongoDB Search removed these
characters using the mapping character filter and created a
single token for the field value. Specifically, MongoDB Search generated
the following token for the phone field in the document with
_id: 1:
| Document ID | Output Tokens |
|---|---|
| _id: 1 | 1234567890 |
MongoDB Search would also match the document with _id: 1 for searches
for (123)-456-7890, 123-456-7890, 123.456.7890, and
so on, because for string fields (see How to Index String Fields), MongoDB Search also
analyzes search query terms using the index analyzer (or, if
specified, the searchAnalyzer). The following table shows
the tokens that MongoDB Search creates by removing instances of hyphen
(-), dot (.), open parenthesis ((), close parenthesis ()),
and space characters in the query term:
| Query Term | Output Tokens |
|---|---|
| (123)-456-7890 | 1234567890 |
| 123-456-7890 | 1234567890 |
| 123.456.7890 | 1234567890 |
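The effect of the mapping character filter on both the indexed value and the query terms can be reproduced with a short sketch in plain JavaScript. applyMappings is a hypothetical helper that assumes longest-match-first replacement. It is not Lucene's MappingCharFilter, but it behaves the same way for the single-character mappings used here.

```javascript
// Rough approximation of a mapping character filter: at each position,
// apply the longest matching key; otherwise copy the character unchanged.
function applyMappings(text, mappings) {
  const keys = Object.keys(mappings).sort((a, b) => b.length - a.length);
  let out = "";
  let i = 0;
  while (i < text.length) {
    const key = keys.find((k) => k.length > 0 && text.startsWith(k, i));
    if (key === undefined) {
      out += text[i];
      i += 1;
    } else {
      out += mappings[key];
      i += key.length;
    }
  }
  return out;
}

// Same mappings as the mappingAnalyzer in the index definition.
const phoneMappings = { "-": "", ".": "", "(": "", ")": "", " ": "" };

for (const term of ["(123)-456-7890", "123-456-7890", "123.456.7890"]) {
  // The keyword tokenizer emits the whole filtered string as one token.
  console.log(term, "->", applyMappings(term, phoneMappings)); // each prints 1234567890
}
```

Because the same analyzer runs at index time and at query time, every formatting variant collapses to the same single token.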
persian
The persian character filter replaces instances of zero-width
non-joiner
with the space character. This character filter is based on Lucene's
PersianCharFilter.
Attributes
The persian character filter has the following attribute:
| Name | Type | Required? | Description |
|---|---|---|---|
| type | string | yes | Human-readable label that identifies this character filter type. Value must be persian. |
Example
The following index definition example indexes the text.fa_IR
field in the minutes collection
using a custom analyzer named persianCharacterIndex. The
custom analyzer specifies the following:
- Apply the persian character filter to replace non-printing characters in the field value with the space character.
- Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words. 
- In the Custom Analyzers section, click Add Custom Analyzer. 
- Select the Create Your Own radio button and click Next. 
- Type persianCharacterIndex in the Analyzer Name field.
- Expand Character Filters and click Add character filter. 
- Select persian from the dropdown and click Add character filter. 
- Expand Tokenizer if it's collapsed and select whitespace from the dropdown. 
- Click Add to add the custom analyzer to your index. 
- In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fa_IR (nested) field. 
- Select text.fa_IR (nested) from the Field Name dropdown and String from the Data Type dropdown. 
- In the properties section for the data type, select persianCharacterIndex from the Index Analyzer and Search Analyzer dropdowns.
- Click Add, then Save Changes. 
Replace the default index definition with the following:
```json
{
  "analyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "text": {
        "dynamic": true,
        "fields": {
          "fa_IR": {
            "analyzer": "persianCharacterIndex",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
}
```
```javascript
db.minutes.createSearchIndex("default", {
  "analyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "text": {
        "dynamic": true,
        "fields": {
          "fa_IR": {
            "analyzer": "persianCharacterIndex",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
})
```
The following query searches the text.fa_IR field for the term
صحبت.
- Click the Query button for your index. 
- Click Edit Query to edit the query. 
- Click on the query bar and select the database and collection. 
- Replace the default query with the following and click Find:

  ```json
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  }
  ```

```
SCORE: 0.13076457381248474  _id: “2”
  message: "do not forget to SIGN-IN. See ① for details."
  page_updated_by: Object
    last_name: "OHRBACH"
    first_name: "Noël"
    email: "ohrbach@example.com"
    phone: "(123) 456 0987"
  text: Object
    en_US: "The head of the sales department spoke first."
    fa_IR: "ابتدا رئیس بخش فروش صحبت کرد"
    sv_FI: "Först talade chefen för försäljningsavdelningen"
```
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fa_IR": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```
[
  {
    _id: 2,
    page_updated_by: { last_name: 'OHRBACH' },
    text: { fa_IR: 'ابتدا رئیس بخش فروش صحبت کرد' }
  }
]
```
MongoDB Search returns the document with _id: 2, which contains the query term.
MongoDB Search matches the query term to the document by first replacing
instances of zero-width non-joiners with the space character and
then creating individual tokens for each word in the field value
based on occurrences of whitespace between words. Specifically, MongoDB Search
generates the following tokens for the document with _id: 2:
| Document ID | Output Tokens |
|---|---|
| _id: 2 | ابتدا, رئیس, بخش, فروش, صحبت, کرد |
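The two steps, replacing zero-width non-joiners and then splitting on whitespace, can be sketched in plain JavaScript. The helper names are hypothetical, and the first sample word is an illustration that does not come from the minutes collection. The zero-width non-joiner (U+200C) is written as an escape sequence so that it stays visible in the source.

```javascript
// Zero-width non-joiner, written as an escape so it is visible in source code.
const ZWNJ = "\u200C";

// Approximation of the persian character filter: ZWNJ becomes a space.
function persianCharFilter(text) {
  return text.replace(/\u200C/g, " ");
}

// Stand-in for the whitespace tokenizer.
function splitOnWhitespace(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Hypothetical sample: one written word whose two parts are joined by a ZWNJ.
const joined = "می" + ZWNJ + "خواهم";
console.log(splitOnWhitespace(joined).length);                    // 1 token without the filter
console.log(splitOnWhitespace(persianCharFilter(joined)).length); // 2 tokens with the filter

// text.fa_IR value from the document with _id: 2 (contains no ZWNJ).
console.log(splitOnWhitespace(persianCharFilter("ابتدا رئیس بخش فروش صحبت کرد")));
```

The field value in the document with _id: 2 contains only ordinary spaces, so the filter leaves it unchanged and the whitespace tokenizer alone produces its six tokens.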
Learn More
To see additional index definitions and queries that use the mapping character filter, see the following reference page examples:
- shingle token filter 
- regexCaptureGroup tokenizer