ASTNode.getText() for elements of a language that has been injected into a string literal returns the literal (escaped) text as written in the host file, even though the lexer has been passed the unescaped text. This is inconsistent and dangerous when relying on an element's text for further processing.
To work around that, I'd need to know whether and where the element is injected, and unescape the text myself before processing it further. It's great that the text is parsed in its unescaped form, but the AST/PSI must be consistent with that.
Example:
String re = "\b" -> Lexer for injected language gets a single character: '\b' but ASTNode.getText() returns "\\b".
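To make the mismatch concrete, here is a minimal sketch of the manual decoding step described above. It assumes the literal content in the Java source is the three characters \\b, and it uses a hypothetical helper, unescapeJava, that is not part of the IntelliJ API and handles only a handful of escapes.

```java
final class LiteralTextDemo {
    /**
     * Hypothetical helper: decodes a small subset of Java escapes
     * (\b \t \n \r \f \" \' \\). Unknown escapes are kept verbatim.
     */
    static String unescapeJava(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 == raw.length()) {
                out.append(c); // ordinary character or trailing backslash
                continue;
            }
            char next = raw.charAt(++i);
            switch (next) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 'f': out.append('\f'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default: out.append('\\').append(next); // not decoded in this sketch
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String asWrittenInFile = "\\\\b";          // what getText() is reported to return
        String seenByLexer = unescapeJava(asWrittenInFile);
        System.out.println(asWrittenInFile.length()); // length of the escaped form
        System.out.println(seenByLexer.length());     // length of the decoded form
    }
}
```

The two lengths differ, which is exactly why offsets and values computed from getText() cannot be trusted to match what the injected lexer saw.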
Issue was resolved
It would not be the end of the world if it stays like it is, but then the behavior should be documented.
There is a contract stating that text obtained from PSI should be the same as the file text.
I.e. the following should be true:
document.getText().equals(psiFile.getText())
And, moreover, this should also hold for any PSI element, i.e.
document.getText().substring(element.getTextRange().getStartOffset(), element.getTextRange().getEndOffset()).equals(element.getText())
must be true for any PSI element.
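The contract above can be illustrated without the IntelliJ SDK. The sketch below models the document as a plain String and an element as a start/end range; holdsContract is a hypothetical name chosen for this illustration, not an existing API.

```java
final class TextRangeContractDemo {
    /**
     * Checks the stated contract for one element: the document text inside
     * the element's range must equal the text the element reports.
     */
    static boolean holdsContract(String documentText, int start, int end, String elementText) {
        return documentText.substring(start, end).equals(elementText);
    }

    public static void main(String[] args) {
        // Host document as it appears in the editor: String re = "\\b";
        String documentText = "String re = \"\\\\b\";";
        int start = documentText.indexOf('"') + 1;   // first character inside the quotes
        int end = documentText.lastIndexOf('"');     // closing quote (exclusive end)

        String escapedText = "\\\\b"; // text as written in the file
        String decodedText = "\\b";   // text after unescaping

        System.out.println(holdsContract(documentText, start, end, escapedText));
        System.out.println(holdsContract(documentText, start, end, decodedText));
    }
}
```

Only the escaped form can satisfy the range check, because the decoded form is shorter than the range it is supposed to cover. That is the reason getText() on an injected element reports the escaped text.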
Suppose I have a token INTEGER_LITERAL: if it is injected into an XML attribute, the text can be the plain "1234" or an entity-encoded form of "1234". In both cases, the text passed to my lexer is the same, which is good. But how am I supposed to deal with that when I want to calculate the literal's value? There doesn't even seem to be any utility function that could help me decode that myself.
Suggestion: Add a method getDecodedText() (or similar) to ASTNode and/or ASTWrapperPsiElement that at least provides a convenient solution for the case when I need to process the text myself.
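As a rough idea of what such a decoding step would have to do for the XML case, here is a minimal sketch. decodeXml and decodeReference are hypothetical names, not existing IntelliJ methods, and the sketch only covers the five predefined XML entities and numeric character references. The encoded input "&#49;&#50;34" is an assumed example of an entity-encoded integer literal.

```java
final class DecodedTextDemo {
    /** Hypothetical helper: replaces XML references with the characters they stand for. */
    static String decodeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            int semi = (c == '&') ? raw.indexOf(';', i) : -1;
            String decoded = (semi < 0) ? null : decodeReference(raw.substring(i + 1, semi));
            if (decoded == null) {
                out.append(c); // not a reference this sketch understands
                i++;
            } else {
                out.append(decoded);
                i = semi + 1;
            }
        }
        return out.toString();
    }

    /** Returns the decoded text for a reference name such as "amp" or "#49", or null if unknown. */
    static String decodeReference(String name) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default: break;
        }
        if (name.length() < 2 || name.charAt(0) != '#') {
            return null;
        }
        try {
            boolean hex = name.charAt(1) == 'x';
            int codePoint = hex
                    ? Integer.parseInt(name.substring(2), 16)
                    : Integer.parseInt(name.substring(1), 10);
            return new String(Character.toChars(codePoint));
        } catch (IllegalArgumentException e) {
            return null; // malformed number or invalid code point
        }
    }

    public static void main(String[] args) {
        // Both spellings should yield the same integer value once decoded.
        System.out.println(Integer.parseInt(decodeXml("1234")));
        System.out.println(Integer.parseInt(decodeXml("&#49;&#50;34")));
    }
}
```

A real getDecodedText() would additionally need to know which escaping rules the injection host applies, which is the information the element currently does not expose.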
It makes a difference whether an element is part of the prefix/suffix of an injected fragment or not. While elements that are part of e.g. a string literal return the escaped text, elements from the prefix/suffix return the unescaped text.
Even though this appears logical at first glance, it is kind of a showstopper because it makes it impossible to decide whether the text has to be decoded manually or not (e.g. by checking getContainingFile().getContext() instanceof PsiLiteralExpression).
I see the following possibilities to address this (in order of preference):
- fix this in a way that any language can be transparently injected, or
- add a method getDecodedText() (see comment above) that handles text of prefix/suffix correctly, or
- apply escaping rules of injection context to getText() of nodes in prefix/suffix as well
Please respond ASAP. Thanks.
com.intellij.psi.impl.source.tree.injected.InjectedLanguageUtil#isInInjectedLanguagePrefixSuffix
which of course will be changed in the future, and so on, so on.
Overall, things like prefix/suffix handling should be reviewed, since currently a single quote character (' or ") typed into injected JavaScript breaks all prefix/suffix handling: everything after the quote becomes part of one long string literal spanning the rest of the injected text, including the suffix.
All suggestions about transparent injection possibilities are very much welcome.