Project
IntelliJ IDEA
Priority
Normal
Type
Bug
Fix versions
No Fix versions
State
Fixed
Assignee
Alexey Kudravtsev
Subsystem
Plugin Support. API
Affected versions
No Affected versions
Fixed in build
108.65  
  • Created by   Sascha Weinreuter
    5 years ago (29 Jul 2006 23:04)
  • Updated by   root
    2 years ago (17 Jan 2010 20:31)
  • Jira: IDEADEV-11213
 
IDEA-33064 ASTNode.getText() returns escaped text for Injected Language
ASTNode.getText() for elements of a language that has been injected into a String-literal returns the literal content of the text, even though the lexer has been passed the unescaped text. This is inconsistent and dangerous when relying on the text of an element for its further processing.

To work around that, I'd need to know whether and where the element is injected and unescape the text myself before processing it further. It's great that the text is parsed in its unescaped form, but the AST/PSI must be consistent with it.

Example:
String re = "\b" -> The lexer for the injected language gets a single character, '\b' (backspace), but ASTNode.getText() returns the two-character source text, i.e. "\\b" when written as a Java literal.
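
The mismatch can be reproduced outside the IDE with plain strings. The sketch below is only an illustration: LiteralTextDemo and its unescape helper are hypothetical names, not part of the IntelliJ API, and the helper handles just a handful of Java escapes.

```java
final class LiteralTextDemo {
    /**
     * Hypothetical helper: decodes a small subset of Java escapes
     * (\b \t \n \r \f \" \' \\). Unknown escapes are kept as written.
     */
    static String unescape(String literalBody) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literalBody.length(); i++) {
            char c = literalBody.charAt(i);
            if (c != '\\' || i + 1 == literalBody.length()) {
                out.append(c);
                continue;
            }
            char next = literalBody.charAt(++i);
            switch (next) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 'f': out.append('\f'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default: out.append('\\').append(next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The characters between the quotes in the host file: backslash, 'b'.
        String sourceText = "\\b";
        // What the injected lexer is handed after the host escapes are decoded.
        String lexerInput = unescape(sourceText);
        System.out.println(sourceText.length()); // 2
        System.out.println(lexerInput.length()); // 1
    }
}
```

Code that derives a value from getText() therefore sees two characters where the lexer saw one.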

Issue was resolved
Comments (10)
 
Sascha Weinreuter
29 Jul 2006 23:08
5 years ago
formatting
Sascha Weinreuter
12 Aug 2006 18:01
5 years ago
While not a showstopper, I think this is rather odd behavior, at least from the API user's point of view. Is this by design, or is there a chance that it will change for the final 6.0?

It would not be the end of the world if it stays like it is, but then the behavior should be documented.
Alexey Kudravtsev
08 Sep 2006 13:40
5 years ago
This behaviour is by design.
There is a contract stating that text obtained from PSI should be the same as the file text.
I.e., the following should be true:
document.getText().equals(psiFile.getText())

Moreover, the analogous condition should hold for any PSI element, i.e.
document.getText().substring(element.getTextRange().getStartOffset(), element.getTextRange().getEndOffset()).equals(element.getText())

must be true.
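
A minimal, self-contained sketch of what this contract implies for the "\b" example from the description. Element here is a hypothetical stand-in for a PSI element (just offsets plus text), not the real PsiElement interface, and the document is an ordinary string.

```java
final class TextRangeContractDemo {
    /** Hypothetical stand-in for a PSI element: its text range and its text. */
    static final class Element {
        final int startOffset;
        final int endOffset;
        final String text;

        Element(int startOffset, int endOffset, String text) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.text = text;
        }
    }

    /** True when the element's text equals the document slice covered by its range. */
    static boolean matchesDocument(String documentText, Element element) {
        return documentText
                .substring(element.startOffset, element.endOffset)
                .equals(element.text);
    }

    public static void main(String[] args) {
        // Host file content: String re = "\b";
        String documentText = "String re = \"\\b\";";
        int start = documentText.indexOf('"') + 1;
        int end = documentText.lastIndexOf('"');

        Element escaped = new Element(start, end, "\\b");  // backslash + 'b', as in the file
        Element unescaped = new Element(start, end, "\b"); // single backspace character

        System.out.println(matchesDocument(documentText, escaped));   // true
        System.out.println(matchesDocument(documentText, unescaped)); // false
    }
}
```

Under this contract only the escaped text can be returned for a range inside the literal, since the unescaped text is shorter than the range it would have to cover.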
Sascha Weinreuter
08 Sep 2006 14:01
5 years ago
I agree that this applies to PsiElements, but not necessarily to ASTNodes (even if, from the internal implementation's point of view, they are the same). The end result is that a language needs to know which context it is injected into (if it is injected at all):

Suppose I have a token INTEGER_LITERAL: If injected into an XML attribute, the text can be "1234" or "&#x31;&#x32;&#x33;&#x34;". In both cases, the text passed to my lexer is the same, which is good. But how am I supposed to deal with that when I want to calculate the literal's value? There doesn't even seem to be any utility function that could help me decode that myself.

Suggestion: Add a method getDecodedText() (or similar) to ASTNode and/or ASTWrapperPsiElement that at least provides a convenient solution for the case when I need to process the text myself.
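
As a rough illustration of the decoding an injected language would otherwise have to do by hand in the XML case: the sketch below resolves numeric character references only. CharRefDecoder is a hypothetical name and not an existing utility; named entities such as &amp; and invalid code points are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefDecoder {
    // Matches &#xHH; (hexadecimal) or &#DD; (decimal) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** Replaces numeric character references with the characters they denote. */
    static String decode(String raw) {
        Matcher m = NUMERIC_REF.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Both spellings of the attribute value yield the same integer.
        System.out.println(Integer.parseInt(decode("1234")));
        System.out.println(Integer.parseInt(decode("&#x31;&#x32;&#x33;&#x34;")));
    }
}
```

Even with such a helper, the language still has to know that it sits inside an XML attribute before calling it, which is the coupling described above.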
Sascha Weinreuter
08 Sep 2006 17:31
5 years ago
Stupid JIRA formatting: Of course I meant "& #x31;& #x32;& #x33;& #x34;" (without the spaces)
Sascha Weinreuter
17 Sep 2006 21:29
5 years ago
Ok, here's another problem:

It makes a difference whether an element is part of the prefix/suffix of an injected fragment or not. While elements that are part of e.g. a String literal return the escaped text, elements from the prefix/suffix return the unescaped text.

Even though this appears logical at first glance, it is kind of a showstopper, because it makes it impossible to decide (e.g. through getContainingFile().getContext() instanceof PsiLiteralExpression) whether the text must be decoded manually or not.

I see the following possibilities to address this (in order of preference):

  • fix this in a way that any language can be transparently injected
or
  • add a method getDecodedText() (see comment above) that handles text of prefix/suffix correctly
or
  • apply escaping rules of injection context to getText() of nodes in prefix/suffix as well
or
  • provide some way to determine whether an ASTNode/PsiElement is part of the prefix/suffix and doesn't need to be unescaped

Please respond ASAP. Thanks.
Alexey Kudravtsev
18 Sep 2006 14:20
5 years ago
All I can do in the meantime is refer you to the highly obscure and implementation-tied method

com.intellij.psi.impl.source.tree.injected.InjectedLanguageUtil#isInInjectedLanguagePrefixSuffix

which will of course be changed in the future, and so on.

Overall, things like prefix/suffix handling should be reviewed: at the moment, a single quote character (' or ") typed into injected JavaScript breaks all prefix/suffix handling, because it turns all text after the quote into one long string literal spanning the rest of the injected text, including the suffix.
All suggestions about transparent injection possibilities are very much welcome.
Sascha Weinreuter
18 Sep 2006 15:24
5 years ago
Well, that's good enough for me for the moment. Thanks a lot for the hint, this at least helps me to deal with the issue in a new language that is explicitly meant to be injected into strings.
Alexey Kudravtsev
21 Nov 2007 16:18
4 years ago
InjectedLanguageManager.getUnescapedText
Sascha Weinreuter
21 Nov 2007 16:53
4 years ago
Hmm, that still requires the language to be aware that it is potentially injected. I was looking for a more transparent solution, but I guess this would be too hard because it violates certain assumptions about PSI & text. But the new method is a good start anyway. Thanks.