Package com.maybeitssquid.safeascii
Class Categorize
java.lang.Object
com.maybeitssquid.safeascii.Chainable
com.maybeitssquid.safeascii.Categorize
- All Implemented Interfaces:
IntFunction<CharSequence>
- Direct Known Subclasses:
Name
Converts Unicode characters to their ASCII equivalents based on character categories.
- Nd: Character.DECIMAL_DIGIT_NUMBER characters are converted to their ASCII digit equivalents 0-9.
- Zs: Character.SPACE_SEPARATOR characters are converted to ASCII space.
- Zl, Zp, U+0085: Character.LINE_SEPARATOR, Character.PARAGRAPH_SEPARATOR and UNICODE_NEL are converted to the configured line separator. Carriage return U+000D and line feed U+000A are passed as-is.
- Pd: Character.DASH_PUNCTUATION characters are converted to ASCII hyphen-minus -
- Ps: Character.START_PUNCTUATION characters are converted to ASCII open parenthesis (
- Pe: Character.END_PUNCTUATION characters are converted to ASCII close parenthesis )
- Pc: Character.CONNECTOR_PUNCTUATION characters are converted to ASCII underscore _
- Pi, Pf: Character.INITIAL_QUOTE_PUNCTUATION and Character.FINAL_QUOTE_PUNCTUATION characters are converted to ASCII double quote "
- U+FFFD: UNICODE_REPLACEMENT is converted to ASCII question mark ?
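The category mapping above can be illustrated with a minimal, self-contained sketch built only on `java.lang.Character`. This is not the library's implementation: the class name `CategorySketch`, the method `toAscii`, and the constants `NEL` and `REPLACEMENT` are hypothetical stand-ins, and the sketch omits the delegate chain and the ASCII shortcut described under `apply`.

```java
// Hypothetical stand-in for the mapping described above; not the library's code.
final class CategorySketch {
    static final int NEL = 0x0085;          // assumed to match UNICODE_NEL
    static final int REPLACEMENT = 0xFFFD;  // assumed to match UNICODE_REPLACEMENT

    /** Maps one codepoint to ASCII text according to its Unicode general category. */
    static String toAscii(int codepoint, String lineSeparator) {
        // NEL is a control character (Cc), so it needs an explicit check.
        if (codepoint == NEL) return lineSeparator;
        // The replacement character is in category So, so it is checked explicitly too.
        if (codepoint == REPLACEMENT) return "?";
        switch (Character.getType(codepoint)) {
            case Character.DECIMAL_DIGIT_NUMBER:
                // Character.digit yields the numeric value 0-9 of any Nd character.
                return String.valueOf(Character.digit(codepoint, 10));
            case Character.SPACE_SEPARATOR:
                return " ";
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return lineSeparator;
            case Character.DASH_PUNCTUATION:
                return "-";
            case Character.START_PUNCTUATION:
                return "(";
            case Character.END_PUNCTUATION:
                return ")";
            case Character.CONNECTOR_PUNCTUATION:
                return "_";
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return "\"";
            default:
                // Unhandled categories pass through unchanged.
                return new String(Character.toChars(codepoint));
        }
    }
}
```

For instance, `CategorySketch.toAscii(0x2014, "\n")` (an em dash) returns `-`, and `CategorySketch.toAscii(0x0663, "\n")` (Arabic-Indic digit three) returns `3`.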
-
Field Summary
Fields
- protected static final IntFunction<CharSequence> identity: Identity function that converts a codepoint to its string representation without transformation.
- static final int UNICODE_NEL: The Unicode next line (NEL) character, U+0085, which is treated as a line separator.
- static final char UNICODE_REPLACEMENT: The Unicode replacement character, U+FFFD.
Constructor Summary
Constructors
- Categorize(): Creates a new Categorize instance with an identity delegate and default line separator.
- Categorize(IntFunction<CharSequence> delegate): Creates a new Categorize instance with the specified delegate and default line separator.
- Categorize(IntFunction<CharSequence> delegate, CharSequence lineSeparator): Creates a new Categorize instance with the specified delegate and line separator.
Method Summary
Methods
- CharSequence apply(int value): Applies the categorization logic to the input value.
- protected CharSequence endPunctuation(int codepoint): Converts end punctuation codepoints to their ASCII equivalent.
- getLineSeparator(): Retrieves the configured line separator.
- protected CharSequence process(int codepoint): Transforms a codepoint based on its Unicode category.
- protected CharSequence quotePunctuation(int codepoint): Converts quote punctuation codepoints to their ASCII equivalent.
- protected CharSequence startPunctuation(int codepoint): Converts start punctuation codepoints to their ASCII equivalent.
-
Field Details
-
UNICODE_REPLACEMENT
public static final char UNICODE_REPLACEMENT
The Unicode replacement character, U+FFFD.
-
UNICODE_NEL
public static final int UNICODE_NEL
The Unicode next line (NEL) character, U+0085, which is treated as a line separator.
-
identity
Identity function that converts a codepoint to its string representation without transformation.
-
-
Constructor Details
-
Categorize
Creates a new Categorize instance with the specified delegate and line separator.
- Parameters:
delegate - the next step in the processing chain
lineSeparator - the string to use for line separators
-
Categorize
Creates a new Categorize instance with the specified delegate and default line separator.
- Parameters:
delegate - the next step in the processing chain
-
Categorize
public Categorize()
Creates a new Categorize instance with an identity delegate and default line separator.
-
-
Method Details
-
getLineSeparator
Retrieves the configured line separator.
- Returns:
- the character sequence used for line separation (e.g. for UNICODE_NEL).
-
startPunctuation
Converts start punctuation codepoints to their ASCII equivalent.
- Parameters:
codepoint - the start punctuation codepoint
- Returns:
- ASCII open parenthesis (
-
endPunctuation
Converts end punctuation codepoints to their ASCII equivalent.
- Parameters:
codepoint - the end punctuation codepoint
- Returns:
- ASCII close parenthesis )
-
quotePunctuation
Converts quote punctuation codepoints to their ASCII equivalent.
- Parameters:
codepoint - the quote punctuation codepoint
- Returns:
- ASCII double quote "
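Because startPunctuation, endPunctuation and quotePunctuation are protected, a subclass can presumably refine them per codepoint. The sketch below shows that override pattern with hypothetical stand-in classes (`QuoteBase` and `SingleQuoteAware` are not part of the library, and how `Categorize` actually dispatches to these hooks is an assumption), so that it runs without the library on the classpath.

```java
// Hypothetical base class mimicking a single overridable punctuation hook.
class QuoteBase {
    /** Default hook: every quote character becomes an ASCII double quote. */
    protected CharSequence quotePunctuation(int codepoint) {
        return "\"";
    }

    /** Routes Pi/Pf codepoints to the hook and passes everything else through. */
    CharSequence process(int codepoint) {
        int type = Character.getType(codepoint);
        if (type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION) {
            return quotePunctuation(codepoint);
        }
        return new String(Character.toChars(codepoint));
    }
}

// Hypothetical subclass that maps curly single quotes to an ASCII apostrophe.
class SingleQuoteAware extends QuoteBase {
    @Override
    protected CharSequence quotePunctuation(int codepoint) {
        // U+2018 and U+2019 are the left and right single quotation marks.
        if (codepoint == 0x2018 || codepoint == 0x2019) {
            return "'";
        }
        return super.quotePunctuation(codepoint);
    }
}
```

With this sketch, `new SingleQuoteAware().process(0x2019)` yields `'`, while `new QuoteBase().process(0x2019)` yields `"`.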
-
process
Transforms a codepoint based on its Unicode category.
This method maps specific Unicode categories (such as punctuation and separators) to their ASCII equivalents. Characters in unhandled categories are returned as-is via identity.
apply
Applies the categorization logic to the input value.
This method includes an optimization for ASCII characters (values < Chainable.ASCII). If the input is already ASCII, it skips the categorization process and immediately delegates to the next step in the chain.
- Specified by:
apply in interface IntFunction<CharSequence>
- Overrides:
apply in class Chainable
- Parameters:
value - the input codepoint
- Returns:
- the processed character sequence
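The ASCII shortcut described for apply can be sketched as follows. This is a hedged illustration, not the library's code: `FastPathSketch` and `ASCII_LIMIT` are hypothetical names, the value 128 for Chainable.ASCII is an assumption, and the `process` stand-in handles only dash punctuation to keep the example short.

```java
import java.util.function.IntFunction;

// Hypothetical stand-in showing an ASCII fast path in front of a categorization step.
final class FastPathSketch implements IntFunction<CharSequence> {
    static final int ASCII_LIMIT = 128; // assumed value of Chainable.ASCII

    private final IntFunction<CharSequence> delegate;

    FastPathSketch(IntFunction<CharSequence> delegate) {
        this.delegate = delegate;
    }

    /** Simplified categorization: only dash punctuation is mapped here. */
    private CharSequence process(int codepoint) {
        if (Character.getType(codepoint) == Character.DASH_PUNCTUATION) {
            return "-";
        }
        return new String(Character.toChars(codepoint));
    }

    @Override
    public CharSequence apply(int value) {
        if (value < ASCII_LIMIT) {
            // Already ASCII: skip categorization and hand off to the next step.
            return delegate.apply(value);
        }
        return process(value);
    }
}
```

Given a delegate such as `cp -> "<" + (char) cp + ">"`, calling `apply('a')` on this sketch returns `<a>` because the ASCII input goes straight to the delegate, while `apply(0x2014)` returns `-` from the categorization step.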
-