Archive for March, 2010

The sorry state of Avira anti-virus heuristics

Wednesday, March 17th, 2010

UPDATE (July 7, 2010) Daniel Herding sent me a story about his adventures with Avira’s latest version. Looks like they’ve just paved over the problem.

We’ve been seeing a number of reports that our DotSpots Chrome extension is being flagged as infected with the HTML/Crypted.Gen malware. There’s not a lot of information on Avira’s site about this malware, except that it’s a ‘trojan’ with ‘low damage potential’.

I previously submitted a sample of the false positive, which solved the problem until we pushed out a new version of our code. Since we can’t submit each of our builds to Avira for approval ahead of time, I had to spend some time figuring out exactly what was causing them to think our code was malicious.

I started by downloading a trial version of their anti-virus product. Immediately after restart, it detected the HTML/Crypted.Gen malware in the Chrome extension that was already installed. I extracted the script from the extension to my desktop and it continued to pop up infection warnings on the file. Now that I had a reliable reproduction case, I could start working on narrowing down what triggered the alert.

I loaded up the file in my trusty analysis tool, Notepad. Starting from the original file, I deleted large swaths of code until it was no longer detected as a virus, then restored those pieces and deleted smaller chunks. Eventually I reduced it down to a couple of lines, which were then reduced down to a few strings of characters. At the end, a file with only a few hundred characters would trigger the signature:
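The delete-and-restore loop described above can be automated. What follows is a minimal Python sketch, not the process I actually ran (that was done by hand in Notepad). The `is_flagged` oracle is hypothetical: in practice it would write the candidate to disk and check whether the scanner complains. Here a stand-in oracle that looks for two keywords takes its place so the example runs on its own.

```python
from typing import Callable, List


def reduce_lines(lines: List[str],
                 is_flagged: Callable[[List[str]], bool]) -> List[str]:
    """Greedily delete chunks of lines while the oracle still flags the rest."""
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and is_flagged(candidate):
                lines = candidate   # the chunk was irrelevant; leave it deleted
            else:
                i += chunk          # the chunk is needed; restore it, move on
        chunk //= 2                 # retry with smaller chunks
    return lines


def demo_oracle(lines: List[str]) -> bool:
    """Stand-in for the real scanner (assumption): flags 'for' plus 'eval'."""
    text = "\n".join(lines).lower()
    return "for" in text and "eval" in text


sample = [
    "var a = 1;",
    "for (;;) {}",
    "var b = 2;",
    "eval(x);",
    "var c = 3;",
]
reduced = reduce_lines(sample, demo_oracle)
```

With the stand-in oracle, `reduced` ends up holding only the `for` line and the `eval` line. Going from lines down to individual characters is the same loop run over a list of characters instead of a list of lines.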

.fromCharCode
.charCodeAt
for
eval
0,0,0,0,0,0
Math.min

Aside: my original set of pattern strings included “nodeValue” rather than “eval”. The patterns are all case-insensitive and don’t ensure matches happen on JS token boundaries. When I went character-by-character to simplify the triggers further, I discovered that it was the ‘eVal’ in ‘nodeValue’ causing issues.

When I create a file with those six strings in it on a website, Avira will attempt to block the download. These appeared to be the most specific components of the signature. Putting those keywords into Google, I found a few references to the malware it detected. The malicious script seems to construct an iframe from an array of characters, then insert it into the document to download malware from a third-party site.

Unfortunately, these keywords also end up in the compiled JavaScript of nearly every Google Web Toolkit application, giving Avira anti-virus users false positives when viewing many of these applications.

I posted a report to the GWT contributor Google Group with my findings. I had expected that since I posted the offending signature in the message, Avira would warn me that the web page I was reading was malicious. It didn’t.

I ran the message page through the same process that I used to figure out what triggered the signature, this time looking for the smallest piece of text that disables detection of the virus. It turns out that this text is the word ‘google’. So, the heuristic looks for the presence of the six character strings above, but also the absence of the word ‘google’.
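Put together, the observed behaviour is consistent with something as simple as the Python sketch below. This is my reconstruction from black-box testing, not Avira's actual implementation: the function name, the `TRIGGERS` list and the `SAFEWORDS` list are assumptions based only on the six strings and the one safeword found above.

```python
# Reconstruction of the observed behaviour only; the real signature engine
# is unknown and may check more than this.
TRIGGERS = [
    ".fromcharcode",
    ".charcodeat",
    "for",
    "eval",
    "0,0,0,0,0,0",
    "math.min",
]
SAFEWORDS = ["google"]  # only one safeword was found; there may be others


def looks_like_crypted_gen(text: str) -> bool:
    """Case-insensitive substring match with no regard for JS token boundaries."""
    haystack = text.lower()
    if any(word in haystack for word in SAFEWORDS):
        return False
    return all(trigger in haystack for trigger in TRIGGERS)


# 'nodeValue' contains 'eVal', so this harmless snippet matches every trigger.
harmless = (
    "String.fromCharCode(s.charCodeAt(0)); for (;;) {} "
    "n.nodeValue; [0,0,0,0,0,0]; Math.min(1, 2);"
)
flagged = looks_like_crypted_gen(harmless)
flagged_with_safeword = looks_like_crypted_gen(harmless + " // google")
```

Here `flagged` is true even though the snippet never calls `eval`, and `flagged_with_safeword` is false because of the trailing comment alone.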

It’s a little disappointing to see how poorly this anti-virus product implements heuristic detection of this particular scripting pattern. It was trivial for me to figure out the pattern. I could have worked around it in any number of ways: by adding whitespace to the array of zeros, or by using Math['min'] or String['from' + 'CharCode'], any of which breaks this pattern recognition. Having the word ‘google’ disable detection of the virus made my job even easier. It’s possible that there is a set of other safewords that do the same thing. If I were writing malware or viruses, I’d definitely spend time altering them to work around this sort of heuristic.

Considering that the risk of false positives is so high (and users might be trained to ignore other, potentially valid virus warnings), I’d say that users are worse off with this virus definition than they are without.

You can find me on Twitter as @mmastrac

A decade and three blogging platforms later

Sunday, March 14th, 2010

Looking back through the archive.org history of grack.com, I realized that I’ve used three off-the-shelf products to blog over the last decade: CityDesk, Typo (a Ruby blog engine) and WordPress.

My favorite of the three was CityDesk. It was a very simple CMS with a decent custom scripting engine. Its big disadvantages were that 1) its data was stored in a Microsoft Access database, 2) that data was basically a big binary blob that didn’t work well at all with source control, 3) I couldn’t push stuff into it from the command line (e.g., various testing snippets that I like to save) and 4) the scripting language was too limited at times to do what I wanted.

Typo was a great engine, but terribly slow on my webhost, 1&1. A few years in I managed to accidentally delete the .htaccess file that configured the Ruby magic and couldn’t get it working again.

WordPress is both fast and stable, but I find that it’s near-impossible to customize it as I like, not being terribly experienced with PHP.

I’ll probably be migrating off WordPress to something a little more custom in the next few weeks. I’d like to go through my entire blog history (from 1998 to present) and put it into a single, source-controlled repository that I can use going forward.

I reserve the right to remove some of the more embarrassing posts from the Internet entirely, however. :)

XML 1.1 EBNF

Tuesday, March 9th, 2010

I’ve been searching for a complete EBNF for XML 1.1 without much success. I found one for XML 1.0, but I was hoping to avoid manually patching it for the XML 1.1 changes.

In the end, I decided that it would be easiest to just parse the EBNF directly out of the specification. Here it is, for reference:

[1] document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
[3] S ::= (#x20 | #x9 | #xD | #xA)+
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
[6] Names ::= Name (#x20 Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
[9] EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"'
|  "'" ([^%&'] | PEReference | Reference)* "'"
[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
|  "'" ([^<&'] | Reference)* "'"
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
[22] prolog ::= XMLDecl Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[25] Eq ::= S? '=' S?
[26] VersionNum ::= '1.1'
[27] Misc ::= Comment | PI | S
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type]
[WFC: External Subset]
[28a] DeclSep ::= PEReference | S [WFC: PE Between Declarations]
[28b] intSubset ::= (markupdecl | DeclSep)*
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment [VC: Proper Declaration/PE Nesting]
[WFC: PEs in Internal Subset]
[30] extSubset ::= TextDecl? extSubsetDecl
[31] extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*
[32] SDDecl ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"')) [VC: Standalone Document Declaration]
[39] element ::= EmptyElemTag
| STag content ETag [WFC: Element Type Match]
[VC: Element Valid]
[40] STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]
[41] Attribute ::= Name Eq AttValue [VC: Attribute Value Type]
[WFC: No External Entity References]
[WFC: No < in Attribute Values]
[42] ETag ::= '</' Name S? '>'
[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
[44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>' [WFC: Unique Att Spec]
[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>' [VC: Unique Element Type Declaration]
[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
[47] children ::= (choice | seq) ('?' | '*' | '+')?
[48] cp ::= (Name | choice | seq) ('?' | '*' | '+')?
[49] choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')' [VC: Proper Group/PE Nesting]
[50] seq ::= '(' S? cp ( S? ',' S? cp )* S? ')' [VC: Proper Group/PE Nesting]
[51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
| '(' S? '#PCDATA' S? ')' [VC: Proper Group/PE Nesting]
[VC: No Duplicate Types]
[52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
[53] AttDef ::= S Name S AttType S DefaultDecl
[54] AttType ::= StringType | TokenizedType | EnumeratedType
[55] StringType ::= 'CDATA'
[56] TokenizedType ::= 'ID' [VC: ID]
[VC: One ID per Element Type]
[VC: ID Attribute Default]
| 'IDREF' [VC: IDREF]
| 'IDREFS' [VC: IDREF]
| 'ENTITY' [VC: Entity Name]
| 'ENTITIES' [VC: Entity Name]
| 'NMTOKEN' [VC: Name Token]
| 'NMTOKENS' [VC: Name Token]
[57] EnumeratedType ::= NotationType | Enumeration
[58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')' [VC: Notation Attributes]
[VC: One Notation Per Element Type]
[VC: No Notation on Empty Element]
[VC: No Duplicate Tokens]
[59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')' [VC: Enumeration]
[VC: No Duplicate Tokens]
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED'
| (('#FIXED' S)? AttValue) [VC: Required Attribute]
[VC: Attribute Default Value Syntactically Correct]
[WFC: No < in Attribute Values]
[VC: Fixed Attribute Default]
[WFC: No External Entity References]
[61] conditionalSect ::= includeSect | ignoreSect
[62] includeSect ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>' [VC: Proper Conditional Section/PE Nesting]
[63] ignoreSect ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>' [VC: Proper Conditional Section/PE Nesting]
[64] ignoreSectContents ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)
[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';' [WFC: Legal Character]
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';' [WFC: Entity Declared]
[VC: Entity Declared]
[WFC: Parsed Entity]
[WFC: No Recursion]
[69] PEReference ::= '%' Name ';' [VC: Entity Declared]
[WFC: No Recursion]
[WFC: In DTD]
[70] EntityDecl ::= GEDecl | PEDecl
[71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
[72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
[73] EntityDef ::= EntityValue | (ExternalID NDataDecl?)
[74] PEDef ::= EntityValue | ExternalID
[75] ExternalID ::= 'SYSTEM' S SystemLiteral
| 'PUBLIC' S PubidLiteral S SystemLiteral
[76] NDataDecl ::= S 'NDATA' S Name [VC: Notation Declared]
[77] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
[78] extParsedEnt ::= ( TextDecl? content ) - ( Char* RestrictedChar Char* )
[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */
[82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
[83] PublicID ::= 'PUBLIC' S PubidLiteral
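
A listing in this plain-text shape is easy to load back into a table of productions. The Python sketch below is not the script I used to pull the grammar out of the specification; it only assumes the layout shown above: a bracketed production number, a name, `::=`, the body, optional continuation lines, and `[VC: ...]` / `[WFC: ...]` constraint notes that are dropped. The function and pattern names are made up for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Matches lines such as "[4a] NameChar ::= NameStartChar | ..."
PRODUCTION = re.compile(r"^\[(\d+[a-z]?)\]\s+(\w+)\s+::=\s+(.*)$")
# Matches constraint notes such as "[VC: Element Valid]" or "[WFC: In DTD]"
CONSTRAINT = re.compile(r"\s*\[(?:VC|WFC):[^\]]*\]")


def parse_productions(listing: str) -> Dict[str, Tuple[str, str]]:
    """Map each production name to (number, body), joining continuation lines."""
    productions: Dict[str, Tuple[str, str]] = {}
    current: Optional[str] = None
    for raw in listing.splitlines():
        line = CONSTRAINT.sub("", raw).strip()
        if not line:
            continue  # blank line, or a line holding only a constraint note
        match = PRODUCTION.match(line)
        if match:
            number, name, body = match.groups()
            productions[name] = (number, body)
            current = name
        elif current is not None:
            # A continuation line: append it to the production in progress.
            number, body = productions[current]
            productions[current] = (number, body + " " + line)
    return productions


sample = """\
[39] element ::= EmptyElemTag
| STag content ETag [WFC: Element Type Match]
[VC: Element Valid]
[42] ETag ::= '</' Name S? '>'
"""
table = parse_productions(sample)
```

For the sample, `table["element"]` comes out as `("39", "EmptyElemTag | STag content ETag")`. Character classes such as `[^<&]` are left alone because only brackets that begin with `VC:` or `WFC:` are treated as notes.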