[[analysis-lang-analyzer]]
=== Language Analyzers

A set of analyzers aimed at analyzing specific language text. The
following types are supported:
<<arabic-analyzer,`arabic`>>,
<<armenian-analyzer,`armenian`>>,
<<basque-analyzer,`basque`>>,
<<brazilian-analyzer,`brazilian`>>,
<<bulgarian-analyzer,`bulgarian`>>,
<<catalan-analyzer,`catalan`>>,
<<cjk-analyzer,`cjk`>>,
<<czech-analyzer,`czech`>>,
<<danish-analyzer,`danish`>>,
<<dutch-analyzer,`dutch`>>,
<<english-analyzer,`english`>>,
<<finnish-analyzer,`finnish`>>,
<<french-analyzer,`french`>>,
<<galician-analyzer,`galician`>>,
<<german-analyzer,`german`>>,
<<greek-analyzer,`greek`>>,
<<hindi-analyzer,`hindi`>>,
<<hungarian-analyzer,`hungarian`>>,
<<indonesian-analyzer,`indonesian`>>,
<<irish-analyzer,`irish`>>,
<<italian-analyzer,`italian`>>,
<<latvian-analyzer,`latvian`>>,
<<norwegian-analyzer,`norwegian`>>,
<<persian-analyzer,`persian`>>,
<<portuguese-analyzer,`portuguese`>>,
<<romanian-analyzer,`romanian`>>,
<<russian-analyzer,`russian`>>,
<<sorani-analyzer,`sorani`>>,
<<spanish-analyzer,`spanish`>>,
<<swedish-analyzer,`swedish`>>,
<<turkish-analyzer,`turkish`>>,
<<thai-analyzer,`thai`>>.

==== Configuring language analyzers

===== Stopwords

All analyzers support setting custom `stopwords` either internally in
the config, or by using an external stopwords file by setting
`stopwords_path`. Check <<analysis-stop-analyzer,Stop Analyzer>> for
more details.
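
As a minimal sketch of both options, the settings below configure two
variants of the `english` analyzer. The analyzer names, the word list and
the file path are placeholders chosen for this illustration only.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": { <1>
          "type": "english",
          "stopwords": [ "a", "an", "the" ] <2>
        },
        "my_english_from_file": { <1>
          "type": "english",
          "stopwords_path": "stopwords/english.txt" <3>
        }
      }
    }
  }
}
----------------------------------------------------
<1> Hypothetical analyzer names.
<2> An inline list that replaces the analyzer's default stopwords.
<3> A hypothetical file path, assumed here to hold one stopword per line.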

===== Excluding words from stemming

The `stem_exclusion` parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
<<analysis-keyword-marker-tokenfilter,`keyword_marker` token filter>>
with the `keywords` set to the value of the `stem_exclusion` parameter.

The following analyzers support setting a custom `stem_exclusion` list:
`arabic`, `armenian`, `basque`, `bulgarian`, `catalan`, `czech`,
`dutch`, `english`, `finnish`, `french`, `galician`, `german`,
`hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`,
`norwegian`, `portuguese`, `romanian`, `russian`, `sorani`, `spanish`,
`swedish`, `turkish`.
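
For illustration, `stem_exclusion` might be set on the `english` analyzer
as sketched below. The analyzer name and the excluded words are
assumptions made for this example.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": { <1>
          "type": "english",
          "stem_exclusion": [ "skies", "organization" ] <2>
        }
      }
    }
  }
}
----------------------------------------------------
<1> Hypothetical analyzer name.
<2> Example lowercase words that should be indexed without stemming.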

==== Reimplementing language analyzers

The built-in language analyzers can be reimplemented as `custom` analyzers
(as described below) in order to customize their behaviour.

NOTE: If you do not intend to exclude words from being stemmed (the
equivalent of the `stem_exclusion` parameter above), then you should remove
the `keyword_marker` token filter from the custom analyzer configuration.
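
As a sketch of what this note amounts to, the settings below mirror the
filter chain documented for the `armenian` analyzer with the
`keyword_marker` filter left out. The analyzer name
`armenian_no_exclusions` is a placeholder chosen for this example.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type": "stop",
          "stopwords": "_armenian_"
        },
        "armenian_stemmer": {
          "type": "stemmer",
          "language": "armenian"
        }
      },
      "analyzer": {
        "armenian_no_exclusions": { <1>
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_stemmer" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> Hypothetical analyzer name.
<2> No `keyword_marker` filter precedes the stemmer, so every token is
stemmed.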

[[arabic-analyzer]]
===== `arabic` analyzer

The `arabic` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_" <1>
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        }
      },
      "analyzer": {
        "arabic": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[armenian-analyzer]]
===== `armenian` analyzer

The `armenian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type": "stop",
          "stopwords": "_armenian_" <1>
        },
        "armenian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "armenian_stemmer": {
          "type": "stemmer",
          "language": "armenian"
        }
      },
      "analyzer": {
        "armenian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_keywords",
            "armenian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[basque-analyzer]]
===== `basque` analyzer

The `basque` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "basque_stop": {
          "type": "stop",
          "stopwords": "_basque_" <1>
        },
        "basque_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "basque_stemmer": {
          "type": "stemmer",
          "language": "basque"
        }
      },
      "analyzer": {
        "basque": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "basque_stop",
            "basque_keywords",
            "basque_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[brazilian-analyzer]]
===== `brazilian` analyzer

The `brazilian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_" <1>
        },
        "brazilian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "brazilian_stemmer": {
          "type": "stemmer",
          "language": "brazilian"
        }
      },
      "analyzer": {
        "brazilian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "brazilian_stop",
            "brazilian_keywords",
            "brazilian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[bulgarian-analyzer]]
===== `bulgarian` analyzer

The `bulgarian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "bulgarian_stop": {
          "type": "stop",
          "stopwords": "_bulgarian_" <1>
        },
        "bulgarian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "bulgarian_stemmer": {
          "type": "stemmer",
          "language": "bulgarian"
        }
      },
      "analyzer": {
        "bulgarian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "bulgarian_stop",
            "bulgarian_keywords",
            "bulgarian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[catalan-analyzer]]
===== `catalan` analyzer

The `catalan` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "catalan_elision": {
          "type": "elision",
          "articles": [ "d", "l", "m", "n", "s", "t" ]
        },
        "catalan_stop": {
          "type": "stop",
          "stopwords": "_catalan_" <1>
        },
        "catalan_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "catalan_stemmer": {
          "type": "stemmer",
          "language": "catalan"
        }
      },
      "analyzer": {
        "catalan": {
          "tokenizer": "standard",
          "filter": [
            "catalan_elision",
            "lowercase",
            "catalan_stop",
            "catalan_keywords",
            "catalan_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
be excluded from stemming.

[[cjk-analyzer]]
|
|
===== `cjk` analyzer

The `cjk` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "cjk": {
          "tokenizer": "standard",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.

[[czech-analyzer]]
===== `czech` analyzer

The `czech` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "czech_stop": {
          "type": "stop",
          "stopwords": "_czech_" <1>
        },
        "czech_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "czech_stemmer": {
          "type": "stemmer",
          "language": "czech"
        }
      },
      "analyzer": {
        "czech": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "czech_stop",
            "czech_keywords",
            "czech_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[danish-analyzer]]
===== `danish` analyzer

The `danish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_" <1>
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[dutch-analyzer]]
===== `dutch` analyzer

The `dutch` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "dutch_stop": {
          "type": "stop",
          "stopwords": "_dutch_" <1>
        },
        "dutch_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "dutch_stemmer": {
          "type": "stemmer",
          "language": "dutch"
        },
        "dutch_override": {
          "type": "stemmer_override",
          "rules": [
            "fiets=>fiets",
            "bromfiets=>bromfiets",
            "ei=>eier",
            "kind=>kinder"
          ]
        }
      },
      "analyzer": {
        "dutch": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "dutch_stop",
            "dutch_keywords",
            "dutch_override",
            "dutch_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.
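
The `dutch_override` filter in the listing above is not covered by the callouts. As a hedged sketch of how it could be maintained: each rule has the form `token=>stem`, and the `stemmer_override` filter sits before `dutch_stemmer` so that overridden tokens are protected from further stemming. The same rules could be kept in a file through the `rules_path` parameter; the path `analysis/dutch_override.txt` below is a made-up example, not a file that ships with Elasticsearch.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "dutch_override": {
          "type": "stemmer_override",
          "rules_path": "analysis/dutch_override.txt"
        }
      }
    }
  }
}
----------------------------------------------------

Under that assumption the file would hold one rule per line, for example `fiets=>fiets`.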

[[english-analyzer]]
===== `english` analyzer

The `english` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.
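
The second callout above can be illustrated by filling in the `keywords` list. In this sketch the word `skies` is an arbitrary example: with it listed, `english_keywords` marks the token as a keyword and `english_stemmer` leaves it unstemmed. Because the filter runs after `lowercase` in the chain above, the entries are assumed to be lowercase.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": [ "skies" ]
        }
      }
    }
  }
}
----------------------------------------------------

If no words need protecting, `english_keywords` can be dropped from both the `filter` definitions and the analyzer's `filter` list.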

[[finnish-analyzer]]
===== `finnish` analyzer

The `finnish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "finnish_stop": {
          "type": "stop",
          "stopwords": "_finnish_" <1>
        },
        "finnish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "finnish_stemmer": {
          "type": "stemmer",
          "language": "finnish"
        }
      },
      "analyzer": {
        "finnish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "finnish_stop",
            "finnish_keywords",
            "finnish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[french-analyzer]]
===== `french` analyzer

The `french` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles": [ "l", "m", "t", "qu", "n", "s",
                        "j", "d", "c", "jusqu", "quoiqu",
                        "lorsqu", "puisqu"
                      ]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_" <1>
        },
        "french_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "french": {
          "tokenizer": "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_keywords",
            "french_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[galician-analyzer]]
===== `galician` analyzer

The `galician` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "galician_stop": {
          "type": "stop",
          "stopwords": "_galician_" <1>
        },
        "galician_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "galician_stemmer": {
          "type": "stemmer",
          "language": "galician"
        }
      },
      "analyzer": {
        "galician": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "galician_stop",
            "galician_keywords",
            "galician_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[german-analyzer]]
===== `german` analyzer

The `german` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_" <1>
        },
        "german_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[greek-analyzer]]
===== `greek` analyzer

The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "greek_stop": {
          "type": "stop",
          "stopwords": "_greek_" <1>
        },
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        },
        "greek_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "greek_stemmer": {
          "type": "stemmer",
          "language": "greek"
        }
      },
      "analyzer": {
        "greek": {
          "tokenizer": "standard",
          "filter": [
            "greek_lowercase",
            "greek_stop",
            "greek_keywords",
            "greek_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[hindi-analyzer]]
===== `hindi` analyzer

The `hindi` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type": "stop",
          "stopwords": "_hindi_" <1>
        },
        "hindi_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "hindi_stemmer": {
          "type": "stemmer",
          "language": "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indic_normalization",
            "hindi_normalization",
            "hindi_stop",
            "hindi_keywords",
            "hindi_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[hungarian-analyzer]]
===== `hungarian` analyzer

The `hungarian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "hungarian_stop": {
          "type": "stop",
          "stopwords": "_hungarian_" <1>
        },
        "hungarian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "hungarian_stemmer": {
          "type": "stemmer",
          "language": "hungarian"
        }
      },
      "analyzer": {
        "hungarian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "hungarian_stop",
            "hungarian_keywords",
            "hungarian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[indonesian-analyzer]]
===== `indonesian` analyzer

The `indonesian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "indonesian_stop": {
          "type": "stop",
          "stopwords": "_indonesian_" <1>
        },
        "indonesian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "indonesian_stemmer": {
          "type": "stemmer",
          "language": "indonesian"
        }
      },
      "analyzer": {
        "indonesian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indonesian_stop",
            "indonesian_keywords",
            "indonesian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[irish-analyzer]]
===== `irish` analyzer

The `irish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "irish_elision": {
          "type": "elision",
          "articles": [ "h", "n", "t" ]
        },
        "irish_stop": {
          "type": "stop",
          "stopwords": "_irish_" <1>
        },
        "irish_lowercase": {
          "type": "lowercase",
          "language": "irish"
        },
        "irish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "irish_stemmer": {
          "type": "stemmer",
          "language": "irish"
        }
      },
      "analyzer": {
        "irish": {
          "tokenizer": "standard",
          "filter": [
            "irish_stop",
            "irish_elision",
            "irish_lowercase",
            "irish_keywords",
            "irish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[italian-analyzer]]
===== `italian` analyzer

The `italian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ]
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_" <1>
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "light_italian"
        }
      },
      "analyzer": {
        "italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[latvian-analyzer]]
===== `latvian` analyzer

The `latvian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "latvian_stop": {
          "type": "stop",
          "stopwords": "_latvian_" <1>
        },
        "latvian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "latvian_stemmer": {
          "type": "stemmer",
          "language": "latvian"
        }
      },
      "analyzer": {
        "latvian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "latvian_stop",
            "latvian_keywords",
            "latvian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[norwegian-analyzer]]
===== `norwegian` analyzer

The `norwegian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "norwegian_stop": {
          "type": "stop",
          "stopwords": "_norwegian_" <1>
        },
        "norwegian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "norwegian_stemmer": {
          "type": "stemmer",
          "language": "norwegian"
        }
      },
      "analyzer": {
        "norwegian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "norwegian_stop",
            "norwegian_keywords",
            "norwegian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[persian-analyzer]]
===== `persian` analyzer

The `persian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
          "type": "mapping",
          "mappings": [ "\\u200C=> "] <1>
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords": "_persian_" <2>
        }
      },
      "analyzer": {
        "persian": {
          "tokenizer": "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "lowercase",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> Replaces zero-width non-joiners with an ASCII space.
<2> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.

[[portuguese-analyzer]]
===== `portuguese` analyzer

The `portuguese` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "portuguese_stop": {
          "type": "stop",
          "stopwords": "_portuguese_" <1>
        },
        "portuguese_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "portuguese_stemmer": {
          "type": "stemmer",
          "language": "light_portuguese"
        }
      },
      "analyzer": {
        "portuguese": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "portuguese_stop",
            "portuguese_keywords",
            "portuguese_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[romanian-analyzer]]
===== `romanian` analyzer

The `romanian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "romanian_stop": {
          "type": "stop",
          "stopwords": "_romanian_" <1>
        },
        "romanian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "romanian_stemmer": {
          "type": "stemmer",
          "language": "romanian"
        }
      },
      "analyzer": {
        "romanian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "romanian_stop",
            "romanian_keywords",
            "romanian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[russian-analyzer]]
===== `russian` analyzer

The `russian` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "russian_stop": {
          "type": "stop",
          "stopwords": "_russian_" <1>
        },
        "russian_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "russian_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "russian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[sorani-analyzer]]
===== `sorani` analyzer

The `sorani` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "sorani_stop": {
          "type": "stop",
          "stopwords": "_sorani_" <1>
        },
        "sorani_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "sorani_stemmer": {
          "type": "stemmer",
          "language": "sorani"
        }
      },
      "analyzer": {
        "sorani": {
          "tokenizer": "standard",
          "filter": [
            "sorani_normalization",
            "lowercase",
            "sorani_stop",
            "sorani_keywords",
            "sorani_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[spanish-analyzer]]
===== `spanish` analyzer

The `spanish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_" <1>
        },
        "spanish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_keywords",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[swedish-analyzer]]
===== `swedish` analyzer

The `swedish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_" <1>
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[turkish-analyzer]]
===== `turkish` analyzer

The `turkish` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "turkish_stop": {
          "type": "stop",
          "stopwords": "_turkish_" <1>
        },
        "turkish_lowercase": {
          "type": "lowercase",
          "language": "turkish"
        },
        "turkish_keywords": {
          "type": "keyword_marker",
          "keywords": [] <2>
        },
        "turkish_stemmer": {
          "type": "stemmer",
          "language": "turkish"
        }
      },
      "analyzer": {
        "turkish": {
          "tokenizer": "standard",
          "filter": [
            "apostrophe",
            "turkish_lowercase",
            "turkish_stop",
            "turkish_keywords",
            "turkish_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> This filter should be removed unless there are words which should
    be excluded from stemming.

[[thai-analyzer]]
===== `thai` analyzer

The `thai` analyzer could be reimplemented as a `custom` analyzer as follows:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_" <1>
        }
      },
      "analyzer": {
        "thai": {
          "tokenizer": "thai",
          "filter": [
            "lowercase",
            "thai_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.