[cmake-developers] slow regex implementation in RegularExpression

Alexandru Ciobanu alex at rogue-research.com
Wed Nov 23 12:20:06 EST 2011


Hi Bill,

On 2011-11-23, at 10:36 AM, Bill Hoffman wrote:

> I am pretty sure the last time we talked about adding a new regex we talked about requiring explicit requests.  I think this would be a much safer approach.  I am really scared that this regex will not be compatible with the old one, and it will break lots of stuff in very subtle ways that are hard for people to detect.  It is not that much code to have both.  Where performance is an issue, we can swap it out, and when people need better regex they can use TRE as well.  I don't think the pain will be worth getting rid of the old usage.

I agree with you on multiple points here.

[1]
Your intuition is right. Just minutes before reading your email I tried to compile ITK using CMake+TRE. And there was at least one regex that TRE refused to compile. 

So yes, we cannot use TRE for *all* regex needs in CMake.

[2]
>  It is not that much code to have both. 
I agree, and I think Brad King does too.

The patch I submitted does exactly that:
    - old regex code stays unchanged 
              ---> Source/kwsys/RegularExpression.{cxx,hxx.in} 
    - new TRE regex code added
              ---> Utilities/cmtre

So, yes, we intend to keep both.

[3]
I think there's a simple solution to our problem. I think we can:
  - solve the performance problem
  - keep the old behaviour, i.e. not break any projects out there

The solution is to use the old regex code everywhere, except for the very specific place where it causes problems.

By looking at the bug report for 12381 (http://public.kitware.com/Bug/view.php?id=12381), you can see that the only place we have to use TRE is:
   Source/CTest/cmCTestBuildHandler.cxx

What cmCTestBuildHandler does is very simple: for every single line of output from the build process it tries to match about 100 regular expressions, in order to find error or warning messages. These 100 regular expressions are defined in the same file in four static arrays, that look something like this:

    static const char* cmCTestErrorMatches[] = { 
      "^[Bb]us [Ee]rror", 
      "^[Ss]egmentation [Vv]iolation", 
      "^[Ss]egmentation [Ff]ault", 
      ":.*[Pp]ermission [Dd]enied", 
      "([^ :]+):([0-9]+): ([^ \\t])", 
      "([^:]+): error[ \\t]*[0-9]+[ \\t]*:", 
      // AND SO ON …
   };
  
  // + other 3 arrays

In this case, it is safe to use TRE, because we (the CMake developers) write these regular expressions, and we can make sure they work with TRE.

All other regular expressions, including those written by users in their CMakeList.txt files will run on the old regex code, and thus behave normally.

CONCLUSION:
  - we can keep old behaviour and solve the performance problem
  - solution in part [3]

If this solution is acceptable, I'll have to recreate the patch.

sincerely,
Alex Ciobanu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://public.kitware.com/pipermail/cmake-developers/attachments/20111123/94365ab3/attachment.html>


More information about the cmake-developers mailing list