Class Name

All Implemented Interfaces:
IntFunction<CharSequence>

public class Name extends Categorize
A transliteration step that converts Unicode characters to ASCII based on their Unicode names.

This class extends Categorize to provide more granular mappings for characters that cannot be mapped simply by category. It parses the Unicode name of a character (retrieved via Character.getName(int)) to find ASCII equivalents for:

  • Latin letters (including those with diacritics)
  • Bracket types (square, curly, angle)
  • Quotation marks
  • Various symbols and punctuation
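As a rough illustration of the name lookup this class builds on, the sketch below prints the names that Character.getName(int) reports for a few sample codepoints. The class and method names (NameLookupSketch, describe) are hypothetical and are not part of this API.

```java
final class NameLookupSketch {
    // Returns the Unicode name, or a placeholder when getName reports null
    // (which it does for unassigned codepoints).
    static String describe(int codepoint) {
        String name = Character.getName(codepoint);
        return name == null ? "(unassigned)" : name;
    }

    public static void main(String[] args) {
        // A letter with a diacritic, a bracket, a quotation mark, and a slash-like symbol.
        int[] samples = {0x00C9, 0x300C, 0x201C, 0x2044};
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, describe(cp));
        }
    }
}
```

The keywords in these names (LETTER, BRACKET, QUOTATION, SLASH) are the kind of signal the methods below are described as matching on.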
  • Field Details

    • UNICODE_CIRCLED_ZERO_WITH_SLASH

      public static final int UNICODE_CIRCLED_ZERO_WITH_SLASH
      Unicode codepoint for CIRCLED ZERO WITH SLASH (U+1F10D), transliterates to 0.
    • UNICODE_TRIPLE_SOLIDUS_BINARY_RELATION

      public static final int UNICODE_TRIPLE_SOLIDUS_BINARY_RELATION
      Unicode codepoint for TRIPLE SOLIDUS BINARY RELATION (U+2AFB), transliterates to ///.
    • UNICODE_DOUBLE_SOLIDUS_OPERATOR

      public static final int UNICODE_DOUBLE_SOLIDUS_OPERATOR
      Unicode codepoint for DOUBLE SOLIDUS OPERATOR (U+2AFD), transliterates to //.
    • UNICODE_OCR_DOUBLE_BACKSLASH

      public static final int UNICODE_OCR_DOUBLE_BACKSLASH
      Unicode codepoint for OCR DOUBLE BACKSLASH (U+244A), transliterates to \\.
    • UNICODE_COLON_SIGN

      public static final int UNICODE_COLON_SIGN
      Unicode codepoint for COLON SIGN (U+20A1), the symbol for the colón currency (Costa Rica and El Salvador). Not transliterated to a colon (:).
    • UNICODE_COLON_EQUALS

      public static final int UNICODE_COLON_EQUALS
      Unicode codepoint for COLON EQUALS (U+2254), transliterates to :=.
    • UNICODE_EQUALS_COLON

      public static final int UNICODE_EQUALS_COLON
      Unicode codepoint for EQUALS COLON (U+2255), transliterates to =:.
    • UNICODE_CIRCLED_DOLLAR_SIGN_WITH_OVERLAID_BACKSLASH

      public static final int UNICODE_CIRCLED_DOLLAR_SIGN_WITH_OVERLAID_BACKSLASH
      Unicode codepoint for CIRCLED DOLLAR SIGN WITH OVERLAID BACKSLASH (U+1F10F), transliterates to $.
    • UNICODE_CIRCLED_C_WITH_OVERLAID_BACKSLASH

      public static final int UNICODE_CIRCLED_C_WITH_OVERLAID_BACKSLASH
      Unicode codepoint for CIRCLED C WITH OVERLAID BACKSLASH (U+1F16E), transliterates to C.
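The documented replacements for these constants can be summarized as a lookup. This is only a sketch: the codepoints and replacement strings are taken from the field descriptions above, while the class name SpecialCaseSketch and the method special are hypothetical, and how the real class consults its constants is not stated here.

```java
final class SpecialCaseSketch {
    // Returns the replacement documented for a special-case codepoint,
    // or null when the codepoint is not one of the listed special cases.
    static String special(int codepoint) {
        switch (codepoint) {
            case 0x1F10D: return "0";     // CIRCLED ZERO WITH SLASH
            case 0x2AFB:  return "///";   // TRIPLE SOLIDUS BINARY RELATION
            case 0x2AFD:  return "//";    // DOUBLE SOLIDUS OPERATOR
            case 0x244A:  return "\\\\";  // OCR DOUBLE BACKSLASH (two backslashes)
            case 0x2254:  return ":=";    // COLON EQUALS
            case 0x2255:  return "=:";    // EQUALS COLON
            case 0x1F10F: return "$";     // CIRCLED DOLLAR SIGN WITH OVERLAID BACKSLASH
            case 0x1F16E: return "C";     // CIRCLED C WITH OVERLAID BACKSLASH
            default:      return null;    // includes COLON SIGN (U+20A1), which has no listed replacement
        }
    }

    public static void main(String[] args) {
        System.out.println(special(0x2254));
        System.out.println(special(0x2AFB));
    }
}
```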
  • Constructor Details

    • Name

      public Name(IntFunction<CharSequence> delegate, CharSequence lineSeparator)
      Creates a new Name transliterator with the specified delegate and line separator.
      Parameters:
      delegate - the next step in the processing chain
      lineSeparator - the string to use for line separators
    • Name

      public Name(IntFunction<CharSequence> delegate)
      Creates a new Name transliterator with the specified delegate and default line separator.
      Parameters:
      delegate - the next step in the processing chain
    • Name

      public Name()
      Creates a new Name transliterator with an identity delegate and default line separator.
  • Method Details

    • process

      protected CharSequence process(int codepoint)
      Transliterates a codepoint based on its type and name.
      Overrides:
      process in class Categorize
      Parameters:
      codepoint - the Unicode codepoint to process
      Returns:
      the transliterated ASCII string, or the result of the superclass processing if no specific name-based rule applies
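One plausible reading of "based on its type" is a dispatch on the general category reported by Character.getType(int). The sketch below only reports which handler a codepoint would be routed to under that assumption; the class DispatchSketch, the method route, and the routing itself are hypothetical and may not match the actual implementation.

```java
final class DispatchSketch {
    // Names the handler a codepoint would reach if dispatch were driven
    // purely by its Unicode general category (an assumption).
    static String route(int codepoint) {
        switch (Character.getType(codepoint)) {
            case Character.UPPERCASE_LETTER:          return "uppercase";
            case Character.LOWERCASE_LETTER:          return "lowercase";
            case Character.TITLECASE_LETTER:          return "titlecase";
            case Character.START_PUNCTUATION:         return "startPunctuation";
            case Character.END_PUNCTUATION:           return "endPunctuation";
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:   return "quotePunctuation";
            default:                                  return "other"; // name matching or superclass handling
        }
    }

    public static void main(String[] args) {
        System.out.println(route(0x00C9)); // LATIN CAPITAL LETTER E WITH ACUTE
        System.out.println(route(0xFF3B)); // FULLWIDTH LEFT SQUARE BRACKET
    }
}
```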
    • uppercase

      protected CharSequence uppercase(int codepoint)
      Extracts the base ASCII character for an uppercase letter from its name.
      Parameters:
      codepoint - the uppercase codepoint to process
      Returns:
      the base letter if found in the name, otherwise an empty string
    • lowercase

      protected CharSequence lowercase(int codepoint)
      Extracts the base ASCII character for a lowercase letter from its name.
      Parameters:
      codepoint - the lowercase codepoint to process
      Returns:
      the base letter converted to lowercase if found, otherwise an empty string
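A minimal sketch of how a base letter might be pulled out of a character name, assuming the letter is the token that follows "LATIN CAPITAL LETTER " or "LATIN SMALL LETTER ". The class LetterNameSketch and the helper baseLetter are hypothetical; the real methods may parse names differently.

```java
import java.util.Locale;

final class LetterNameSketch {
    // Returns the token that follows the marker in the codepoint's name,
    // or an empty string when the name is missing or lacks the marker.
    private static String baseLetter(int codepoint, String marker) {
        String name = Character.getName(codepoint);
        if (name == null) {
            return "";
        }
        int at = name.indexOf(marker);
        if (at < 0) {
            return "";
        }
        int start = at + marker.length();
        int end = name.indexOf(' ', start);
        return end < 0 ? name.substring(start) : name.substring(start, end);
    }

    static String uppercase(int codepoint) {
        return baseLetter(codepoint, "LATIN CAPITAL LETTER ");
    }

    static String lowercase(int codepoint) {
        // Names are upper case, so the extracted letter is lowered explicitly.
        return baseLetter(codepoint, "LATIN SMALL LETTER ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(uppercase(0x00C9)); // name: LATIN CAPITAL LETTER E WITH ACUTE
        System.out.println(lowercase(0x00E9)); // name: LATIN SMALL LETTER E WITH ACUTE
    }
}
```

Under this scheme a non-Latin letter such as GREEK CAPITAL LETTER ALPHA yields an empty string, which matches the documented "otherwise an empty string" return.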
    • titlecase

      protected CharSequence titlecase(int codepoint)
      Processes titlecase characters. Only four transliterate to ASCII (the Latin digraphs U+01C5, U+01C8, U+01CB and U+01F2); the remainder are Greek. This method has no effect if a prior processing step has already decomposed the codepoint.
      Parameters:
      codepoint - the titlecase codepoint to process
      Returns:
      the ASCII equivalent, or an empty string if not found
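The four Latin titlecase digraphs can be handled with a direct lookup. In the sketch below the codepoints are the four Latin characters in the Unicode Lt category; the ASCII outputs ("Dz", "Lj", "Nj") are an assumption about what the real method returns, and TitlecaseSketch is a hypothetical name.

```java
final class TitlecaseSketch {
    // Maps the Latin titlecase digraphs to assumed ASCII spellings;
    // every other codepoint (for example the Greek titlecase letters) yields "".
    static String titlecase(int codepoint) {
        switch (codepoint) {
            case 0x01C5: // LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
            case 0x01F2: // LATIN CAPITAL LETTER D WITH SMALL LETTER Z
                return "Dz";
            case 0x01C8: // LATIN CAPITAL LETTER L WITH SMALL LETTER J
                return "Lj";
            case 0x01CB: // LATIN CAPITAL LETTER N WITH SMALL LETTER J
                return "Nj";
            default:
                return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(titlecase(0x01C8));
    }
}
```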
    • startPunctuation

      protected CharSequence startPunctuation(int codepoint)
      Maps start punctuation to ASCII brackets based on name.

      Detects:

      • Square brackets: [
      • Curly braces: {
      • Angle brackets: <
      • Others (default): (
      Overrides:
      startPunctuation in class Categorize
      Parameters:
      codepoint - the codepoint to process
      Returns:
      the corresponding ASCII opening bracket
    • endPunctuation

      protected CharSequence endPunctuation(int codepoint)
      Maps end punctuation to ASCII brackets based on name.

      Detects:

      • Square brackets: ]
      • Curly braces: }
      • Angle brackets: >
      • Others (default): )
      Overrides:
      endPunctuation in class Categorize
      Parameters:
      codepoint - the codepoint to process
      Returns:
      the corresponding ASCII closing bracket
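The bracket mappings for startPunctuation and endPunctuation can be sketched as keyword checks on the character name. The keywords SQUARE, CURLY and ANGLE are an assumption about what the real methods look for, and BracketSketch with its methods is hypothetical.

```java
final class BracketSketch {
    // Picks one of four ASCII brackets from "[{<(" or "]}>)" by name keyword.
    private static String pick(int codepoint, String choices) {
        String name = Character.getName(codepoint);
        int index = 3; // default: round bracket
        if (name != null) {
            if (name.contains("SQUARE")) {
                index = 0;
            } else if (name.contains("CURLY")) {
                index = 1;
            } else if (name.contains("ANGLE")) {
                index = 2;
            }
        }
        return String.valueOf(choices.charAt(index));
    }

    static String startPunctuation(int codepoint) {
        return pick(codepoint, "[{<(");
    }

    static String endPunctuation(int codepoint) {
        return pick(codepoint, "]}>)");
    }

    public static void main(String[] args) {
        System.out.println(startPunctuation(0xFF3B) + endPunctuation(0xFF3D)); // fullwidth square brackets
        System.out.println(startPunctuation(0x300C) + endPunctuation(0x300D)); // corner brackets
    }
}
```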
    • quotePunctuation

      protected CharSequence quotePunctuation(int codepoint)
      Maps quote punctuation to ASCII quotes based on name.

      Maps to " if the name contains "DOUBLE" or "DOTTED", otherwise maps to '.

      Overrides:
      quotePunctuation in class Categorize
      Parameters:
      codepoint - the codepoint to process
      Returns:
      the corresponding ASCII quote character
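The DOUBLE/DOTTED rule described above can be sketched in a few lines. QuoteSketch is a hypothetical stand-in, and treating a missing name as a single quote is an assumption.

```java
final class QuoteSketch {
    // Returns a double quote when the name mentions DOUBLE or DOTTED,
    // and a single quote otherwise (including when no name is available).
    static String quotePunctuation(int codepoint) {
        String name = Character.getName(codepoint);
        if (name != null && (name.contains("DOUBLE") || name.contains("DOTTED"))) {
            return "\"";
        }
        return "'";
    }

    public static void main(String[] args) {
        System.out.println(quotePunctuation(0x201C)); // LEFT DOUBLE QUOTATION MARK
        System.out.println(quotePunctuation(0x2018)); // LEFT SINGLE QUOTATION MARK
    }
}
```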
    • solidus

      protected CharSequence solidus(String name, int codepoint)
      Converts solidus (slash) and backslash characters to ASCII equivalents.
      Parameters:
      name - the Unicode name of the character
      codepoint - the codepoint to process
      Returns:
      the ASCII equivalent: \ for reverse solidus/backslash, / for solidus/slash, with special handling for double variants
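A sketch of the slash handling, assuming the multi-slash variants are recognized by codepoint (using the values from Field Details) and everything else by name keyword. SolidusSketch is hypothetical, and the order of checks is a design choice of this sketch: REVERSE SOLIDUS is tested first because that name also contains SOLIDUS.

```java
final class SolidusSketch {
    static String solidus(String name, int codepoint) {
        // Multi-character variants documented in Field Details.
        if (codepoint == 0x2AFB) {
            return "///"; // TRIPLE SOLIDUS BINARY RELATION
        }
        if (codepoint == 0x2AFD) {
            return "//";  // DOUBLE SOLIDUS OPERATOR
        }
        if (codepoint == 0x244A) {
            return "\\\\"; // OCR DOUBLE BACKSLASH (two backslashes)
        }
        // Backslash keywords are checked before the plain solidus fallback.
        if (name.contains("REVERSE SOLIDUS") || name.contains("BACKSLASH")) {
            return "\\";
        }
        return "/";
    }

    public static void main(String[] args) {
        System.out.println(solidus("FRACTION SLASH", 0x2044));
        System.out.println(solidus("FULLWIDTH REVERSE SOLIDUS", 0xFF3C));
    }
}
```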
    • equal

      protected CharSequence equal(int codepoint)
      Converts equality-related characters to ASCII equivalents.
      Parameters:
      codepoint - the codepoint to process
      Returns:
      := for COLON EQUALS, =: for EQUALS COLON, or = for other equality symbols
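The documented return values translate directly into a small lookup. EqualSketch is a hypothetical name; the codepoints come from the UNICODE_COLON_EQUALS and UNICODE_EQUALS_COLON field descriptions.

```java
final class EqualSketch {
    static String equal(int codepoint) {
        if (codepoint == 0x2254) {
            return ":="; // COLON EQUALS
        }
        if (codepoint == 0x2255) {
            return "=:"; // EQUALS COLON
        }
        return "=";      // any other equality symbol
    }

    public static void main(String[] args) {
        System.out.println(equal(0x2254));
        System.out.println(equal(0xFF1D)); // FULLWIDTH EQUALS SIGN
    }
}
```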
    • byName

      protected CharSequence byName(int codepoint)
      Converts a Unicode codepoint to ASCII by analyzing its character name.

      This method uses Character.getName(int) to retrieve the Unicode character name and matches it against known naming patterns to determine the appropriate ASCII equivalent. It handles common punctuation marks and symbols by checking if their names contain specific keywords.

      Recognized patterns include:

      • LATIN [CAPITAL|SMALL] LETTER → corresponding letter
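      • REVERSE SOLIDUS, BACKSLASH → \ (exceptions: see the backslash-related constants in Field Details)
      • SOLIDUS, SLASH → / (exceptions: see the solidus- and slash-related constants in Field Details)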
      • EQUAL → = (special cases for colon equals and equals colon)
      • AMPERSAND → &
      • FULL STOP → . (does not handle composed "[DIGIT|NUMBER] x FULL STOP")
      • APOSTROPHE → '
      • EXCLAMATION MARK → !
      • QUESTION → ?
      • INTERROBANG → ?!
      • ASTERISK → *
      • SEMICOLON → ;
      • PERCENT → %
      • PLUS SIGN → +
      • MULTIPLICATION → X
      • COMMA → ,
      • COLON → : (except COLON SIGN, U+20A1, the colón currency symbol)
      • TILDE → ~
      Parameters:
      codepoint - the Unicode codepoint to process
      Returns:
      the ASCII equivalent based on name patterns, or the original character via Categorize.identity if no pattern matches
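The keyword matching can be sketched as an ordered rule table where the first keyword found in the name wins. This covers only the simple keyword rules from the list above (letters, solidus and equality handling are left out), and the ordering is a choice of this sketch: SEMICOLON must precede COLON because the former contains the latter. ByNameSketch is hypothetical, and returning the original character as a string stands in for Categorize.identity, whose behavior is not shown here.

```java
final class ByNameSketch {
    private static final int COLON_SIGN = 0x20A1; // currency symbol, excluded from the COLON rule

    // Ordered keyword -> replacement pairs; earlier rows take precedence.
    private static final String[][] RULES = {
        {"INTERROBANG", "?!"},
        {"AMPERSAND", "&"},
        {"FULL STOP", "."},
        {"APOSTROPHE", "'"},
        {"EXCLAMATION MARK", "!"},
        {"QUESTION", "?"},
        {"ASTERISK", "*"},
        {"SEMICOLON", ";"},
        {"PERCENT", "%"},
        {"PLUS SIGN", "+"},
        {"MULTIPLICATION", "X"},
        {"COMMA", ","},
        {"COLON", ":"},
        {"TILDE", "~"},
    };

    static String byName(int codepoint) {
        String original = new String(Character.toChars(codepoint));
        String name = Character.getName(codepoint);
        if (name == null || codepoint == COLON_SIGN) {
            return original;
        }
        for (String[] rule : RULES) {
            if (name.contains(rule[0])) {
                return rule[1];
            }
        }
        return original; // no pattern matched
    }

    public static void main(String[] args) {
        System.out.println(byName(0xFF06)); // FULLWIDTH AMPERSAND
        System.out.println(byName(0x203D)); // INTERROBANG
    }
}
```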