Class Categorize

java.lang.Object
com.maybeitssquid.safeascii.Chainable
com.maybeitssquid.safeascii.Categorize
All Implemented Interfaces:
IntFunction<CharSequence>
Direct Known Subclasses:
Name

public class Categorize extends Chainable
Converts Unicode characters to their ASCII equivalents based on character categories.
  • Field Details

    • UNICODE_REPLACEMENT

      public static final char UNICODE_REPLACEMENT
      The Unicode replacement character (U+FFFD).
    • UNICODE_NEL

      public static final int UNICODE_NEL
      The Unicode next line (NEL) control character, U+0085, which acts as a line separator.
    • identity

      protected static final IntFunction<CharSequence> identity
      Identity function that converts a codepoint to its string representation without transformation.
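As a rough illustration of what such an identity step might look like, the sketch below turns a codepoint into a string without changing it. The class name IdentitySketch and the constant IDENTITY are hypothetical stand-ins, not the library's own field; only standard JDK calls are used.

```java
import java.util.function.IntFunction;

final class IdentitySketch {
    // Hypothetical stand-in for the protected `identity` field; not the library's code.
    // Character.toChars yields one char for BMP codepoints and a surrogate pair otherwise.
    static final IntFunction<CharSequence> IDENTITY =
            codepoint -> new String(Character.toChars(codepoint));

    public static void main(String[] args) {
        // A BMP codepoint becomes a one-char string.
        System.out.println(IDENTITY.apply('A'));
        // A supplementary codepoint (U+1F600) becomes a two-char surrogate pair.
        System.out.println(IDENTITY.apply(0x1F600).length());
    }
}
```

Working on int codepoints rather than char values is what lets a step like this handle characters outside the Basic Multilingual Plane.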
  • Constructor Details

    • Categorize

      public Categorize(IntFunction<CharSequence> delegate, CharSequence lineSeparator)
      Creates a new Categorize instance with the specified delegate and line separator.
      Parameters:
      delegate - the next step in the processing chain
      lineSeparator - the string to use for line separators
    • Categorize

      public Categorize(IntFunction<CharSequence> delegate)
      Creates a new Categorize instance with the specified delegate and default line separator.
      Parameters:
      delegate - the next step in the processing chain
    • Categorize

      public Categorize()
      Creates a new Categorize instance with an identity delegate and default line separator.
  • Method Details

    • getLineSeparator

      public CharSequence getLineSeparator()
      Retrieves the configured line separator.
      Returns:
      the character sequence substituted for line separators such as UNICODE_NEL
    • startPunctuation

      protected CharSequence startPunctuation(int codepoint)
      Converts start punctuation codepoints to their ASCII equivalent.
      Parameters:
      codepoint - the start punctuation codepoint
      Returns:
      ASCII open parenthesis (
    • endPunctuation

      protected CharSequence endPunctuation(int codepoint)
      Converts end punctuation codepoints to their ASCII equivalent.
      Parameters:
      codepoint - the end punctuation codepoint
      Returns:
      ASCII close parenthesis )
    • quotePunctuation

      protected CharSequence quotePunctuation(int codepoint)
      Converts quote punctuation codepoints to their ASCII equivalent.
      Parameters:
      codepoint - the quote punctuation codepoint
      Returns:
      ASCII double quote "
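The three punctuation hooks above can be pictured as a dispatch on the Unicode general category. The sketch below is a minimal illustration under two assumptions that this page does not confirm: that the category comes from Character.getType, and that "quote punctuation" means the initial (Pi) and final (Pf) quote categories. PunctuationSketch and asciiPunctuation are hypothetical names.

```java
final class PunctuationSketch {
    /**
     * Returns an ASCII stand-in for bracket and quote categories,
     * or null when the codepoint belongs to none of them.
     */
    static CharSequence asciiPunctuation(int codepoint) {
        switch (Character.getType(codepoint)) {
            case Character.START_PUNCTUATION:          // Ps, e.g. U+300C LEFT CORNER BRACKET
                return "(";
            case Character.END_PUNCTUATION:            // Pe, e.g. U+300D RIGHT CORNER BRACKET
                return ")";
            case Character.INITIAL_QUOTE_PUNCTUATION:  // Pi, e.g. U+201C LEFT DOUBLE QUOTATION MARK
            case Character.FINAL_QUOTE_PUNCTUATION:    // Pf, e.g. U+201D RIGHT DOUBLE QUOTATION MARK
                return "\"";                           // assumption: both quote categories map to "
            default:
                return null;                           // not a bracket or quote category
        }
    }
}
```

Because the real methods are protected, a subclass could presumably override one of them to return a different replacement, for example a square bracket instead of a parenthesis.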
    • process

      protected CharSequence process(int codepoint)
      Transforms a codepoint based on its Unicode category.

      This method maps specific Unicode categories (like punctuation and separators) to their ASCII equivalents. Characters falling into unhandled categories are returned as-is via identity.

      Specified by:
      process in class Chainable
      Parameters:
      codepoint - the Unicode codepoint to process
      Returns:
      the transformed string or the original character
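The fall-through behaviour described for process can be sketched as follows. This is not the library's implementation: the specific separator mappings (space separators to a plain space, line and paragraph separators to the configured line separator) are assumptions made for illustration, and ProcessSketch is a hypothetical name.

```java
final class ProcessSketch {
    /**
     * Maps separator categories to ASCII-friendly text and returns every
     * other codepoint unchanged, mirroring the "as-is via identity" rule.
     */
    static CharSequence process(int codepoint, CharSequence lineSeparator) {
        switch (Character.getType(codepoint)) {
            case Character.SPACE_SEPARATOR:      // Zs, e.g. U+00A0 NO-BREAK SPACE
                return " ";                      // assumed mapping
            case Character.LINE_SEPARATOR:       // Zl, U+2028
            case Character.PARAGRAPH_SEPARATOR:  // Zp, U+2029
                return lineSeparator;            // assumed mapping
            default:
                // Unhandled category: hand the codepoint back unchanged.
                return new String(Character.toChars(codepoint));
        }
    }
}
```

Under these assumptions, process(0x00A0, "\n") yields a single space, while a letter such as U+00E9 falls through untouched and is left for later steps in the chain.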
    • apply

      public CharSequence apply(int value)
      Applies the categorization logic to the input value.

      This method includes an optimization for ASCII characters (values < Chainable.ASCII). If the input is already ASCII, it skips the categorization process and immediately delegates to the next step in the chain.

      Specified by:
      apply in interface IntFunction<CharSequence>
      Overrides:
      apply in class Chainable
      Parameters:
      value - the input codepoint
      Returns:
      the processed character sequence
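A minimal sketch of the ASCII fast path described for apply, using a hypothetical class FastPathSketch. It assumes the threshold is 128 (the actual value of Chainable.ASCII is not shown on this page) and uses a deliberately tiny stand-in for the categorization step.

```java
import java.util.function.IntFunction;

final class FastPathSketch implements IntFunction<CharSequence> {
    // Assumption: ASCII covers codepoints 0..127; the real Chainable.ASCII may differ.
    static final int ASCII_LIMIT = 128;

    private final IntFunction<CharSequence> delegate;
    int categorized = 0; // counts codepoints that took the slow path

    FastPathSketch(IntFunction<CharSequence> delegate) {
        this.delegate = delegate;
    }

    @Override
    public CharSequence apply(int value) {
        if (value < ASCII_LIMIT) {
            // Already ASCII: skip categorization and go straight to the next step.
            return delegate.apply(value);
        }
        categorized++;
        // Stand-in categorization: only space separators are handled here.
        if (Character.getType(value) == Character.SPACE_SEPARATOR) {
            return " ";
        }
        return delegate.apply(value);
    }
}
```

The point of the shortcut is that most input is typically ASCII already, so a single comparison avoids a category lookup for the common case.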