Package com.maybeitssquid.safeascii
Class Name
java.lang.Object
com.maybeitssquid.safeascii.Chainable
com.maybeitssquid.safeascii.Categorize
com.maybeitssquid.safeascii.Name
- All Implemented Interfaces:
IntFunction<CharSequence>
A transliteration step that converts Unicode characters to ASCII based on their Unicode names.
This class extends Categorize to provide more granular mappings for characters that
cannot be mapped simply by category. It parses the Unicode name of a character (retrieved via
Character.getName(int)) to find ASCII equivalents for:
- Latin letters (including those with diacritics)
- Bracket types (square, curly, angle)
- Quotation marks
- Various symbols and punctuation
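To illustrate why category alone is not always enough, the following minimal sketch (not part of the library) prints three opening brackets that share the same general category but differ by name. The class name `NameVersusCategory` is hypothetical; only the JDK's `Character.getType(int)` and `Character.getName(int)` are used.

```java
final class NameVersusCategory {
    /** Returns true if the codepoint belongs to the "start punctuation" (Ps) category. */
    static boolean isStartPunctuation(int codepoint) {
        return Character.getType(codepoint) == Character.START_PUNCTUATION;
    }

    public static void main(String[] args) {
        // Three opening brackets: identical category, distinguishable only by name.
        int[] samples = {0xFF3B, 0xFF5B, 0x27E8};
        for (int cp : samples) {
            System.out.printf("U+%04X %s startPunctuation=%b%n",
                    cp, Character.getName(cp), isStartPunctuation(cp));
        }
    }
}
```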
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static final int UNICODE_CIRCLED_C_WITH_OVERLAID_BACKSLASH - Unicode codepoint for CIRCLED C WITH OVERLAID BACKSLASH (U+1F16E), transliterates to C.
- static final int UNICODE_CIRCLED_DOLLAR_SIGN_WITH_OVERLAID_BACKSLASH - Unicode codepoint for CIRCLED DOLLAR SIGN WITH OVERLAID BACKSLASH (U+1F10F), transliterates to $.
- static final int UNICODE_CIRCLED_ZERO_WITH_SLASH - Unicode codepoint for CIRCLED ZERO WITH SLASH (U+1F10D), transliterates to 0.
- static final int UNICODE_COLON_EQUALS - Unicode codepoint for COLON EQUALS (U+2254), transliterates to :=.
- static final int UNICODE_COLON_SIGN - Unicode codepoint for COLON SIGN (U+20A1), the colón currency symbol (Costa Rica and El Salvador).
- static final int UNICODE_DOUBLE_SOLIDUS_OPERATOR - Unicode codepoint for DOUBLE SOLIDUS OPERATOR (U+2AFD), transliterates to //.
- static final int UNICODE_EQUALS_COLON - Unicode codepoint for EQUALS COLON (U+2255), transliterates to =:.
- static final int UNICODE_OCR_DOUBLE_BACKSLASH - Unicode codepoint for OCR DOUBLE BACKSLASH (U+244A), transliterates to \\.
- static final int UNICODE_TRIPLE_SOLIDUS_BINARY_RELATION - Unicode codepoint for TRIPLE SOLIDUS BINARY RELATION (U+2AFB), transliterates to ///.
Fields inherited from class com.maybeitssquid.safeascii.Categorize
identity, UNICODE_NEL, UNICODE_REPLACEMENT
-
Constructor Summary
Constructors (Constructor, Description):
- Name() - Creates a new Name transliterator with an identity delegate and default line separator.
- Name(IntFunction<CharSequence> delegate) - Creates a new Name transliterator with the specified delegate and default line separator.
- Name(IntFunction<CharSequence> delegate, CharSequence lineSeparator) - Creates a new Name transliterator with the specified delegate and line separator.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected CharSequence byName(int codepoint) - Converts a Unicode codepoint to ASCII by analyzing its character name.
- protected CharSequence endPunctuation(int codepoint) - Maps end punctuation to ASCII brackets based on name.
- protected CharSequence equal(int codepoint) - Converts equality-related characters to ASCII equivalents.
- protected CharSequence lowercase(int codepoint) - Extracts the base ASCII character for a lowercase letter from its name.
- protected CharSequence process(int codepoint) - Transliterates a codepoint based on its type and name.
- protected CharSequence quotePunctuation(int codepoint) - Maps quote punctuation to ASCII quotes based on name.
- protected CharSequence solidus(name, codepoint) - Converts solidus (slash) and backslash characters to ASCII equivalents.
- protected CharSequence startPunctuation(int codepoint) - Maps start punctuation to ASCII brackets based on name.
- protected CharSequence titlecase(int codepoint) - Processes the titlecase characters.
- protected CharSequence uppercase(int codepoint) - Extracts the base ASCII character for an uppercase letter from its name.
Methods inherited from class com.maybeitssquid.safeascii.Categorize
apply, getLineSeparator
-
Field Details
-
UNICODE_CIRCLED_ZERO_WITH_SLASH
public static final int UNICODE_CIRCLED_ZERO_WITH_SLASH
Unicode codepoint for CIRCLED ZERO WITH SLASH (U+1F10D), transliterates to 0.
-
UNICODE_TRIPLE_SOLIDUS_BINARY_RELATION
public static final int UNICODE_TRIPLE_SOLIDUS_BINARY_RELATION
Unicode codepoint for TRIPLE SOLIDUS BINARY RELATION (U+2AFB), transliterates to ///.
-
UNICODE_DOUBLE_SOLIDUS_OPERATOR
public static final int UNICODE_DOUBLE_SOLIDUS_OPERATOR
Unicode codepoint for DOUBLE SOLIDUS OPERATOR (U+2AFD), transliterates to //.
-
UNICODE_OCR_DOUBLE_BACKSLASH
public static final int UNICODE_OCR_DOUBLE_BACKSLASH
Unicode codepoint for OCR DOUBLE BACKSLASH (U+244A), transliterates to \\.
-
UNICODE_COLON_SIGN
public static final int UNICODE_COLON_SIGN
Unicode codepoint for COLON SIGN (U+20A1), the colón currency symbol (Costa Rica and El Salvador). Not transliterated to a colon.
-
UNICODE_COLON_EQUALS
public static final int UNICODE_COLON_EQUALS
Unicode codepoint for COLON EQUALS (U+2254), transliterates to :=.
-
UNICODE_EQUALS_COLON
public static final int UNICODE_EQUALS_COLON
Unicode codepoint for EQUALS COLON (U+2255), transliterates to =:.
-
UNICODE_CIRCLED_DOLLAR_SIGN_WITH_OVERLAID_BACKSLASH
public static final int UNICODE_CIRCLED_DOLLAR_SIGN_WITH_OVERLAID_BACKSLASH
Unicode codepoint for CIRCLED DOLLAR SIGN WITH OVERLAID BACKSLASH (U+1F10F), transliterates to $.
-
UNICODE_CIRCLED_C_WITH_OVERLAID_BACKSLASH
public static final int UNICODE_CIRCLED_C_WITH_OVERLAID_BACKSLASH
Unicode codepoint for CIRCLED C WITH OVERLAID BACKSLASH (U+1F16E), transliterates to C.
-
-
Constructor Details
-
Name
Creates a new Name transliterator with the specified delegate and line separator.
- Parameters:
delegate - the next step in the processing chain
lineSeparator - the string to use for line separators
-
Name
Creates a new Name transliterator with the specified delegate and default line separator.
- Parameters:
delegate - the next step in the processing chain
-
Name
public Name()
Creates a new Name transliterator with an identity delegate and default line separator.
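Because Name implements IntFunction<CharSequence>, an instance can presumably be mapped over the codepoints of a string. The sketch below shows that pattern with a hypothetical stand-in lambda, so it runs without the library on the classpath; `ApplyStepSketch` and `transliterate` are illustrative names, not part of the package.

```java
import java.util.function.IntFunction;
import java.util.stream.Collectors;

final class ApplyStepSketch {
    /** Applies a per-codepoint step to every codepoint of the input and joins the results. */
    static String transliterate(String input, IntFunction<CharSequence> step) {
        return input.codePoints().mapToObj(step).collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Hypothetical stand-in step: keeps ASCII, replaces everything else with '?'.
        IntFunction<CharSequence> standIn =
                cp -> cp < 0x80 ? new String(Character.toChars(cp)) : "?";
        System.out.println(transliterate("na\u00EFve", standIn));
    }
}
```

With the library available, an instance such as `new Name()` could be passed as `step` in place of the stand-in.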
-
-
Method Details
-
process
Transliterates a codepoint based on its type and name.
- Overrides:
process in class Categorize
- Parameters:
codepoint - the Unicode codepoint to process
- Returns:
the transliterated ASCII string, or the result of the superclass processing if no specific name-based rule applies
-
uppercase
Extracts the base ASCII character for an uppercase letter from its name.
- Parameters:
codepoint - the uppercase codepoint to process
- Returns:
the base letter if found in the name, otherwise an empty string
-
lowercase
Extracts the base ASCII character for a lowercase letter from its name.
- Parameters:
codepoint - the lowercase codepoint to process
- Returns:
the base letter converted to lowercase if found, otherwise an empty string
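The uppercase and lowercase extraction described above can be approximated as follows. This is a sketch under stated assumptions, not the library's code: it assumes the base letter is the single A-Z token that follows "LATIN CAPITAL LETTER " or "LATIN SMALL LETTER " in the name, and it returns an empty string for multi-letter tokens such as AE. The class and helper names are hypothetical.

```java
import java.util.Locale;

final class LatinLetterSketch {
    private static final String CAPITAL = "LATIN CAPITAL LETTER ";
    private static final String SMALL = "LATIN SMALL LETTER ";

    /** Returns the single A-Z token following the marker in the name, or "" if absent. */
    static String baseLetter(int codepoint, String marker) {
        String name = Character.getName(codepoint);
        if (name == null) return "";
        int at = name.indexOf(marker);
        if (at < 0) return "";
        String rest = name.substring(at + marker.length());
        int space = rest.indexOf(' ');
        String token = space < 0 ? rest : rest.substring(0, space);
        // Assumption: only one-letter tokens count as a base letter.
        boolean single = token.length() == 1 && token.charAt(0) >= 'A' && token.charAt(0) <= 'Z';
        return single ? token : "";
    }

    static String uppercase(int codepoint) {
        return baseLetter(codepoint, CAPITAL);
    }

    static String lowercase(int codepoint) {
        // Unicode names are upper case, so the extracted letter is lowered explicitly.
        return baseLetter(codepoint, SMALL).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(uppercase(0x00C0)); // LATIN CAPITAL LETTER A WITH GRAVE
        System.out.println(lowercase(0x00E9)); // LATIN SMALL LETTER E WITH ACUTE
    }
}
```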
-
titlecase
Processes titlecase characters. Only four titlecase characters transliterate to ASCII; the remainder are Greek. This method has no effect if a prior processing step has already decomposed the codepoint.
- Parameters:
codepoint - the titlecase codepoint to process
- Returns:
the ASCII equivalent, or an empty string if not found
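The count of Latin titlecase characters can be checked against the JDK's own Unicode tables. The following sketch (with a hypothetical class name, independent of the library) scans every codepoint in the titlecase-letter category and keeps those whose name mentions LATIN.

```java
import java.util.ArrayList;
import java.util.List;

final class TitlecaseSurvey {
    /** Collects the names of all titlecase letters (category Lt) whose name contains "LATIN". */
    static List<String> latinTitlecaseNames() {
        List<String> names = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) != Character.TITLECASE_LETTER) continue;
            String name = Character.getName(cp);
            if (name != null && name.contains("LATIN")) names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        latinTitlecaseNames().forEach(System.out::println);
    }
}
```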
-
startPunctuation
Maps start punctuation to ASCII brackets based on name.
Detects:
- Square brackets: [
- Curly braces: {
- Angle brackets: <
- Others (default): (
- Overrides:
startPunctuation in class Categorize
- Parameters:
codepoint - the codepoint to process
- Returns:
the corresponding ASCII opening bracket
-
endPunctuation
Maps end punctuation to ASCII brackets based on name.
Detects:
- Square brackets: ]
- Curly braces: }
- Angle brackets: >
- Others (default): )
- Overrides:
endPunctuation in class Categorize
- Parameters:
codepoint - the codepoint to process
- Returns:
the corresponding ASCII closing bracket
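The start and end punctuation rules above might be sketched as a single keyword lookup. The keywords "SQUARE", "CURLY", and "ANGLE" are assumptions about what the name matching looks for, since the documentation does not state the exact substrings; `BracketSketch` is a hypothetical name.

```java
final class BracketSketch {
    /** Picks an ASCII bracket from keywords in the name; "" if not start or end punctuation. */
    static String bracket(int codepoint) {
        int type = Character.getType(codepoint);
        boolean open;
        if (type == Character.START_PUNCTUATION) {
            open = true;
        } else if (type == Character.END_PUNCTUATION) {
            open = false;
        } else {
            return "";
        }
        String name = Character.getName(codepoint);
        if (name == null) name = "";
        // Assumed keywords; anything unrecognized falls back to a parenthesis.
        if (name.contains("SQUARE")) return open ? "[" : "]";
        if (name.contains("CURLY")) return open ? "{" : "}";
        if (name.contains("ANGLE")) return open ? "<" : ">";
        return open ? "(" : ")";
    }

    public static void main(String[] args) {
        System.out.println(bracket(0xFF3B)); // FULLWIDTH LEFT SQUARE BRACKET
        System.out.println(bracket(0x300D)); // RIGHT CORNER BRACKET
    }
}
```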
-
quotePunctuation
Maps quote punctuation to ASCII quotes based on name.
Maps to " if the name contains "DOUBLE" or "DOTTED", otherwise maps to '.
- Overrides:
quotePunctuation in class Categorize
- Parameters:
codepoint - the codepoint to process
- Returns:
the corresponding ASCII quote character
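The quote rule is small enough to restate directly. The sketch below is an illustration of the stated rule rather than the library's implementation; treating a missing name as a single quote is an assumption, and `QuoteSketch` is a hypothetical name.

```java
final class QuoteSketch {
    /** Returns a double quote if the name mentions DOUBLE or DOTTED, otherwise a single quote. */
    static String quote(int codepoint) {
        String name = Character.getName(codepoint);
        if (name == null) return "'"; // assumption: unnamed codepoints take the default
        return (name.contains("DOUBLE") || name.contains("DOTTED")) ? "\"" : "'";
    }

    public static void main(String[] args) {
        System.out.println(quote(0x201C)); // LEFT DOUBLE QUOTATION MARK
        System.out.println(quote(0x2018)); // LEFT SINGLE QUOTATION MARK
    }
}
```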
-
solidus
Converts solidus (slash) and backslash characters to ASCII equivalents.
- Parameters:
name - the Unicode name of the character
codepoint - the codepoint to process
- Returns:
the ASCII equivalent: \ for reverse solidus/backslash, / for solidus/slash, with special handling for double variants
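One way the solidus handling could look is sketched below. It is an assumption-laden illustration: the parameter types, the order of checks, and the empty-string fallback are guesses, and the three circled codepoints listed under Field Details are left out. The point worth noting is that reverse forms must be tested first, because "REVERSE SOLIDUS" contains "SOLIDUS" and "BACKSLASH" contains "SLASH".

```java
final class SolidusSketch {
    static final int TRIPLE_SOLIDUS_BINARY_RELATION = 0x2AFB;
    static final int DOUBLE_SOLIDUS_OPERATOR = 0x2AFD;
    static final int OCR_DOUBLE_BACKSLASH = 0x244A;

    /** Maps slash-like characters to ASCII; returns "" when the name has no slash keyword. */
    static String solidus(String name, int codepoint) {
        // Multi-stroke variants are matched by codepoint before any name test.
        switch (codepoint) {
            case TRIPLE_SOLIDUS_BINARY_RELATION: return "///";
            case DOUBLE_SOLIDUS_OPERATOR: return "//";
            case OCR_DOUBLE_BACKSLASH: return "\\\\"; // two backslash characters
            default: break;
        }
        // Reverse forms first: their names also contain the forward-slash keywords.
        if (name.contains("REVERSE SOLIDUS") || name.contains("BACKSLASH")) return "\\";
        if (name.contains("SOLIDUS") || name.contains("SLASH")) return "/";
        return "";
    }

    public static void main(String[] args) {
        int[] samples = {0xFF3C, 0x2044, 0x2AFD};
        for (int cp : samples) {
            System.out.println(Character.getName(cp) + " -> " + solidus(Character.getName(cp), cp));
        }
    }
}
```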
-
equal
Converts equality-related characters to ASCII equivalents.
- Parameters:
codepoint - the codepoint to process
- Returns:
:= for COLON EQUALS, =: for EQUALS COLON, or = for other equality symbols
-
byName
Converts a Unicode codepoint to ASCII by analyzing its character name.
This method uses Character.getName(int) to retrieve the Unicode character name and matches it against known naming patterns to determine the appropriate ASCII equivalent. It handles common punctuation marks and symbols by checking whether their names contain specific keywords.
Recognized patterns include:
- LATIN [CAPITAL|SMALL] LETTER → corresponding letter
- REVERSE SOLIDUS, BACKSLASH → \ (with exceptions; see solidus)
- SOLIDUS, SLASH → / (with exceptions; see solidus)
- EQUAL → = (special cases for COLON EQUALS and EQUALS COLON)
- AMPERSAND → &
- FULL STOP → . (does not handle composed "[DIGIT|NUMBER] x FULL STOP")
- APOSTROPHE → '
- EXCLAMATION MARK → !
- QUESTION → ?
- INTERROBANG → ?!
- ASTERISK → *
- SEMICOLON → ;
- PERCENT → %
- PLUS SIGN → +
- MULTIPLICATION → X
- COMMA → ,
- COLON → : (except COLON SIGN, U+20A1, the colón currency symbol)
- TILDE → ~
- Parameters:
codepoint - the Unicode codepoint to process
- Returns:
the ASCII equivalent based on name patterns, or the original character via Categorize.identity if no pattern matches
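The keyword matching described for byName can be sketched as an ordered lookup table. This covers only a subset of the listed patterns and is not the library's implementation: the table order, the class name `ByNameSketch`, and returning the original character for COLON SIGN are assumptions. An ordered map is used because some keywords contain others (SEMICOLON contains COLON), so the more specific keyword has to be tested first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ByNameSketch {
    static final int COLON_SIGN = 0x20A1; // currency sign, must not become ':'

    // Insertion order matters: SEMICOLON is tested before COLON.
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("AMPERSAND", "&");
        KEYWORDS.put("FULL STOP", ".");
        KEYWORDS.put("EXCLAMATION MARK", "!");
        KEYWORDS.put("INTERROBANG", "?!");
        KEYWORDS.put("QUESTION", "?");
        KEYWORDS.put("ASTERISK", "*");
        KEYWORDS.put("SEMICOLON", ";");
        KEYWORDS.put("PERCENT", "%");
        KEYWORDS.put("PLUS SIGN", "+");
        KEYWORDS.put("MULTIPLICATION", "X");
        KEYWORDS.put("COMMA", ",");
        KEYWORDS.put("COLON", ":");
        KEYWORDS.put("TILDE", "~");
    }

    /** Returns the first matching ASCII replacement, or the original character if none matches. */
    static String byName(int codepoint) {
        String original = new String(Character.toChars(codepoint)); // stand-in for the identity fallback
        String name = Character.getName(codepoint);
        if (name == null || codepoint == COLON_SIGN) return original;
        for (Map.Entry<String, String> entry : KEYWORDS.entrySet()) {
            if (name.contains(entry.getKey())) return entry.getValue();
        }
        return original;
    }

    public static void main(String[] args) {
        int[] samples = {0xFF06, 0x203D, 0xFF1B, 0xFF1A, 0x20A1};
        for (int cp : samples) {
            System.out.println(Character.getName(cp) + " -> " + byName(cp));
        }
    }
}
```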
-