This library wraps java-string-similarity to provide different string similarity and distance measures as SQL functions.
- Clone or download this repository
- Run gradle build shadowJar
- Copy the file string-metrics/build/libs/string-metrics-all.jar into the OrientDB Server lib folder
- Add the following functions configuration to config/custom-sql-functions.json in the OrientDB Server
{
  "prefix": "strics",
  "class": "com.orientechnologies.extra.functions.stringmetrics.StringMetrics"
}
- Restart OrientDB
This section shows how to use the functions. For more details on a specific algorithm, see the java-string-similarity documentation.
Returns the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
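To make the definition concrete, here is a minimal Python sketch of the classic dynamic-programming computation of this distance. It only illustrates the algorithm: the name edit_distance is an assumption made for this example, and the code is not the library's Java implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (illustrative sketch)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(edit_distance("John A Smith", "Jonathan A Smith"))  # prints 4
```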
Syntax: strics_editDistance(<field|value1>, <field|value2>)
select strics_editDistance('John A Smith', 'Jonathan A Smith')
--- returns 4
Returns the normalized edit distance similarity: a value in the interval from 0 (no match) to 1.0 (perfect match).
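One plausible way to obtain such a normalized value is sketched below in Python. It assumes the similarity is 1 - distance / length of the longer string, which is consistent with the 0.75 returned for a distance of 4 between strings of at most 16 characters; the exact formula used by the library is not stated here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, used here only as a helper."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(a, b) / longest


print(normalized_similarity("John A Smith", "Jonathan A Smith"))  # prints 0.75
```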
Syntax: strics_editDistanceSimilarity(<field|value1>, <field|value2>)
select strics_editDistanceSimilarity('John A Smith', 'Jonathan A Smith')
--- returns 0.75
Similar to edit distance, but the transposition of two adjacent characters counts as a single operation.
Syntax: strics_damerauDistance(<field|value1>, <field|value2>)
select strics_damerauDistance('John A Smith', 'Jonathan A Smiht')
--- returns 5 (edit distance would return 6)
Variant of Damerau distance (transpositions allowed) with the condition that no substring is edited more than once.
Syntax: strics_optimalStringAlignmentDistance(<field|value1>, <field|value2>)
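The restriction can be seen in a short Python sketch of the textbook optimal string alignment recurrence. This is an illustration under the usual definition of the algorithm, not the library's code, and osa_distance is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau) distance, illustrative sketch."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Swap of two adjacent characters, counted as one operation.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("John A Smith", "Jonathan A Smiht"))  # prints 5
print(osa_distance("CA", "ABC"))                         # prints 3
```

The second call shows the effect of the condition: an unrestricted Damerau distance could swap "CA" to "AC" and then insert "B" in between for a cost of 2, but that would edit the same substring twice, so the restricted variant needs 3 operations.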
Variation of Damerau distance, developed in the area of record linkage (deduplication), where the substitution of 2 close characters is considered less important than the substitution of 2 characters that are far from each other. The optional last parameter (default 0.7) is the threshold above which the Winkler bonus is applied.
Syntax: strics_jaroWinklerSimilarity(<field|value1>, <field|value2> [, <threshold> ])
select strics_jaroWinklerSimilarity('John A Smith', 'Jonathan A Smith')
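--- returns 0.8298611342906952
Based on the longest common subsequence (LCS) problem, which consists of finding the longest subsequence common to two (or more) sequences. The distance is computed as |s1| + |s2| - 2 * |LCS(s1, s2)|.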
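A small Python sketch may help here. It assumes the distance is derived from the LCS length as len(a) + len(b) - 2 * LCS, which is consistent with the value 4 returned for 'AGCAT' and 'GAC'; the function names are hypothetical and the code is not the library's implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)  # extend the common subsequence
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(a: str, b: str) -> int:
    """Characters of both strings that are not part of the common subsequence."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(lcs_length("AGCAT", "GAC"))    # prints 2
print(lcs_distance("AGCAT", "GAC"))  # prints 4
```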
Syntax: strics_longestCommonSubsequenceDistance(<field|value1>, <field|value2>)
select strics_longestCommonSubsequenceDistance('AGCAT', 'GAC')
--- returns 4
Works by converting strings into sets of n-grams (sequences of n characters, sometimes also called k-shingles). Useful for large data sets; takes into account the number of occurrences of each shingle.
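The idea can be sketched in Python as the cosine of the angle between two n-gram count vectors. This is a simplified illustration with hypothetical function names; the library may preprocess its input (for example by normalizing whitespace), so this sketch is not guaranteed to reproduce the values returned by the SQL function.

```python
from collections import Counter
from math import sqrt


def profile(text: str, k: int) -> Counter:
    """Count every run of k consecutive characters in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine_similarity(a: str, b: str, k: int) -> float:
    p1, p2 = profile(a, k), profile(b, k)
    # Dot product over the n-grams of a; n-grams missing from b count as 0.
    dot = sum(count * p2[gram] for gram, count in p1.items())
    norm1 = sqrt(sum(c * c for c in p1.values()))
    norm2 = sqrt(sum(c * c for c in p2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a string shorter than k has no n-grams
    return dot / (norm1 * norm2)


# "ABAB" has profile {AB: 2, BA: 1}, "BABA" has {BA: 2, AB: 1}:
# dot product 4, both norms sqrt(5), so the result is approximately 0.8.
print(round(cosine_similarity("ABAB", "BABA", 2), 6))
```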
Syntax: strics_cosineSimilarity(<field|value1>, <field|value2>, <charLength>)
select strics_cosineSimilarity("my string, \n my song", "another string, from a song", 2)
--- returns 0.5621826951410452
Normalized N-Gram distance. Uses affixing with the special character \n to increase the weight of the first characters. The normalization is achieved by dividing the total similarity score by the original length of the longest word. Default length value is 2.
Syntax: strics_ngramDistance(<field|value1>, <field|value2> [, <length> ])
select strics_ngramDistance("ABCD", "ABTUIO")
--- returns 0.5833333134651184
The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurrences of each n-gram): SUM( |V1_i - V2_i| ). Default length value is 2.
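The formula translates almost directly into Python. The sketch below is illustrative only: the function names are hypothetical, and any preprocessing the library may apply to its input is left out.

```python
from collections import Counter


def profile(text: str, k: int) -> Counter:
    """Number of occurrences of each n-gram of length k."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def qgram_distance(a: str, b: str, k: int = 2) -> int:
    """L1 norm of the difference between the two n-gram profiles."""
    p1, p2 = profile(a, k), profile(b, k)
    return sum(abs(p1[gram] - p2[gram]) for gram in set(p1) | set(p2))


# "ABCD" -> {AB, BC, CD}, "ABCE" -> {AB, BC, CE}: only CD and CE differ.
print(qgram_distance("ABCD", "ABCE"))  # prints 2
```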
Syntax: strics_qgramDistance(<field|value1>, <field|value2> [, <length> ])
select strics_qgramDistance("ABCD", "ABCE")
--- returns 2
Like Q-Gram distance, the input strings are first converted into sets of n-grams, but this time the cardinality of each n-gram is not taken into account. Default length value is 3.
Syntax: strics_jaccardSimilarity(<field|value1>, <field|value2> [, <charSeq> ])
select strics_jaccardSimilarity("ABCDE", "ABCDF", 2)
--- returns 0.6
Jaccard distance, computed as 1 - Jaccard similarity. Default length value is 3.
Syntax: strics_jaccardDistance(<field|value1>, <field|value2> [, <charSeq> ])
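As a rough illustration of the Jaccard similarity and distance, the Python sketch below treats each string as a set of n-grams and uses the standard Jaccard index (size of the intersection divided by size of the union). The function names are hypothetical and the code is not the library's implementation.

```python
def ngram_set(text: str, k: int) -> set:
    """Distinct n-grams of length k; repeated n-grams count only once."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    s1, s2 = ngram_set(a, k), ngram_set(b, k)
    union = s1 | s2
    if not union:
        return 1.0  # assumption: two inputs without n-grams are treated as identical
    return len(s1 & s2) / len(union)


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    return 1.0 - jaccard_similarity(a, b, k)


# {AB, BC, CD, DE} and {AB, BC, CD, DF}: 3 shared n-grams out of 5 distinct ones.
print(jaccard_similarity("ABCDE", "ABCDF", 2))  # prints 0.6
print(jaccard_distance("ABCDE", "ABCDF", 2))    # prints 0.4
```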
Sorensen-Dice similarity can be considered a semimetric version of the Jaccard similarity. Default length value is 3.
Syntax: strics_sorensenDiceSimilarity(<field|value1>, <field|value2> [, <charSeq> ])
select strics_sorensenDiceSimilarity("ABCDE", "ABCDF", 2)
--- returns 0.75
Sorensen-Dice distance, computed as 1 - Sorensen-Dice similarity. Default length value is 3.
Syntax: strics_sorensenDiceDistance(<field|value1>, <field|value2> [, <charSeq> ])
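For comparison with Jaccard, here is a minimal Python sketch of the Sorensen-Dice coefficient over n-gram sets, using the standard formula 2 * |A and B| / (|A| + |B|). The function names are assumptions made for this example, and the code is not the library's implementation.

```python
def ngram_set(text: str, k: int) -> set:
    """Distinct n-grams of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def sorensen_dice_similarity(a: str, b: str, k: int = 3) -> float:
    s1, s2 = ngram_set(a, k), ngram_set(b, k)
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption: two inputs without n-grams are treated as identical
    return 2 * len(s1 & s2) / total


def sorensen_dice_distance(a: str, b: str, k: int = 3) -> float:
    return 1.0 - sorensen_dice_similarity(a, b, k)


# 3 shared n-grams, 4 n-grams in each string: 2 * 3 / (4 + 4).
print(sorensen_dice_similarity("ABCDE", "ABCDF", 2))  # prints 0.75
print(sorensen_dice_distance("ABCDE", "ABCDF", 2))    # prints 0.25
```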
General-purpose string distance algorithm inspired by Jaro-Winkler and Longest Common Subsequence. Developed to produce a distance measure that matches the human perception of string distance as closely as possible.
Syntax: strics_sift4Distance(<field|value1>, <field|value2>)
select strics_sift4Distance("This is the first string", "And this is another string")
--- returns 9