From edi at agharta.de Thu Aug 12 10:03:05 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 12 Aug 2004 12:03:05 +0200
Subject: [regex-coach] Re: about Regex Coach
In-Reply-To: <5D7D85C4DFC1D411BD8700B0D07810E003010557@KUFMXS04> (Eugeny Sattler's message of "Thu, 12 Aug 2004 13:08:01 +0400")
References: <5D7D85C4DFC1D411BD8700B0D07810E003010557@KUFMXS04>
Message-ID: <87n010u10m.fsf@bird.agharta.de>

Hi Eugene!

On Thu, 12 Aug 2004 13:08:01 +0400, Eugeny.Sattler at RU.NESTLE.com wrote:

> 1) I have discovered that \w does not match with russian characters
> on my russian winNT.
>
> \w should take locale settings into account , shouldn't it?
>
> character class like [firstRussianLetter-LastRussianLetter] does work
>
> 2) I have discovered that I can not turn off case sensitivity using
> construction like (?-i)regex_here

Regex Coach doesn't work well with Eastern European character sets - see:

This is something I can't fix, sorry.

> My version is 0.5.2

The current version is 0.6.7, you should upgrade. (That won't help
with your problems, though.)

Cheers,
Edi.

PS: Please use the mailing list for bug reports,

From johnjc-regex at publicinfo.net Sun Aug 22 13:04:47 2004
From: johnjc-regex at publicinfo.net (John Clements)
Date: Sun, 22 Aug 2004 14:04:47 +0100
Subject: [regex-coach] ".+" and ".+?" with optional parenthesized text
Message-ID: <6.1.0.6.2.20040822134658.02f2e170@pop3.attglobal.net>

Hello,

This is my first post to this list. I have looked through the archives
(searched on "greedy" and some other terms, actually) but don't find
anything that seems to relate to my problem. So I'm writing to see if
anyone else on the list has encountered something like this.

There is something about the "." operator, especially the "non-greedy"
version of it, and in particular its behaviour when used in conjunction
with a parenthesized term which is optional. I've put the pattern and a
sample target string and written comments about the results I get from
Regex Coach.
I ran the pattern with "i" checked.

Pattern:

^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}.+?(between)?

Target string:

An appeal against the judgment delivered on 15 January 2003 by the Second Chamber (Extended Composition) of the Court of First Instance of the European Communities in joined cases T-377/00 (1), T-379/00 (2), T-380/00 (2), T-260/01 (3) and T-272/01 (4) between Philip Morris International, Inc., R.J. Reynolds Tobacco Holdings, Inc., RJR Acquisition Corp., R.J. Reynolds Tobacco Company, R.J. Reynolds Tobacco International Inc., and Japan Tobacco, Inc., and Commission of the European Communities, supported by European Parliament, Kingdom of Spain, French Republic, Italian Republic, Portuguese Republic, Republic of Finland, Federal Republic of Germany, Hellenic Republic, Kingdom of the Netherlands, was brought before the Court of Justice of the European Communities on 25 March 2003 by R.J. Reynolds Tobacco Holdings, Inc., established in Winston-Salem, North Carolina (United States), RJR Acquisition Corp., established in Wilmington, Delaware (United States), R.J. Reynolds Tobacco Company, established in Winston-Salem, North Carolina (United States), R.J. Reynolds Tobacco International Inc., established in Winston-Salem, North Carolina (United States) and Japan Tobacco, Inc., established in Tokyo (Japan), represented by O.W. Brouwer, lawyer, and P. Lomas, solicitor.

============

What I want it to do is match the string from the beginning through
"between", and when there is no instance of "between", I want it to
match the entire string.

I would expect the example above to give me a match on 0-259 (i.e.
through "between"). But instead I get a match only on 0-189 (through
the first case number). This makes no sense to me whatsoever. I would
consider it a bug but Regex Coach and Perl v5.8.3 on FreeBSD give me
the same results.

^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}.+(between)?

gives me a match on 0-1279 (the whole string).
Why doesn't it stop when it finds "between"?

^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}.+?(between)

gives me the match I expect, 0-259.

^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}.+(between)

also gives me the match I expect, 0-259.

But if I make the "(between)" optional, by putting a "?" after it,

- the regex engine doesn't stop there when the ".+" is greedy, and
- the regex engine doesn't find "between" when the ".+" is non-greedy,
  i.e. ".+?"

Can anyone enlighten me?

Many thanks,

John Clements

From edi at agharta.de Sun Aug 22 13:55:09 2004
From: edi at agharta.de (Edi Weitz)
Date: Sun, 22 Aug 2004 15:55:09 +0200
Subject: [regex-coach] ".+" and ".+?" with optional parenthesized text
In-Reply-To: <6.1.0.6.2.20040822134658.02f2e170@pop3.attglobal.net> (John Clements's message of "Sun, 22 Aug 2004 14:04:47 +0100")
References: <6.1.0.6.2.20040822134658.02f2e170@pop3.attglobal.net>
Message-ID: <87fz6fl1ky.fsf@bird.agharta.de>

On Sun, 22 Aug 2004 14:04:47 +0100, John Clements wrote:

> I ran the pattern with "i" checked.

I guess you also had "s" checked because your target string contained
line breaks.

> What I want it to do is match the string from the beginning through
> "between", and when there is no instance of "between", I want it to
> match the entire string.

This regex should work:

  ^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}(.+?between|.*)

The behaviour you saw was right. (As a rule of thumb Regex Coach is
always right as long as it does the same as Perl... :)

You had ".+?(between)?" which meant "match as few characters as
possible up to ..." where ... was "the string 'between' OR ANYTHING"
because you made 'between' optional, i.e. your regex was equivalent to
".+?". So, the regex engine matched exactly zero characters.

Does that help?

Cheers,
Edi.
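The behaviour discussed in this message can be reproduced outside
Regex Coach. Below is a minimal sketch in Python, on the assumption
that its re module handles lazy quantifiers, optional groups and
alternation the same way Perl does for these patterns. The shortened
target string, the simplified prefix and the helper name match_end are
made up for illustration and do not come from the thread. One detail
worth noting: ".+?" still has to consume one character, so the first
variant ends one position past the case number.

```python
import re

# Hypothetical, shortened stand-in for the long target string in the thread.
text = ("An appeal against the judgment in joined cases T-377/00 (1) and "
        "T-272/01 (4) between A and B, was brought before the Court.")

# Simplified version of the prefix used in the thread (an assumption).
prefix = r"^\s*An appeal.+?Cases? T-\d{1,3}/\d{2}"


def match_end(tail, target=text):
    """Return the end offset of prefix + tail in target, or None if no match."""
    # IGNORECASE and DOTALL correspond to the "i" and "s" switches.
    m = re.search(prefix + tail, target, re.IGNORECASE | re.DOTALL)
    return m.end() if m else None


after_case = text.index("T-377/00") + len("T-377/00")
after_between = text.index("between") + len("between")
no_between = text.replace(" between A and B", "")

# Lazy ".+?" plus an optional group: one character is enough, so it stops.
print(match_end(r".+?(between)?") == after_case + 1)
# Greedy ".+" plus an optional group: runs to the end of the string.
print(match_end(r".+(between)?") == len(text))
# Alternation: stop after "between" if present, otherwise take the rest.
print(match_end(r"(.+?between|.*)") == after_between)
print(match_end(r"(.+?between|.*)", no_between) == len(no_between))
```

All four lines should print True under these assumptions.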
From johnjc-regex at publicinfo.net Sun Aug 22 16:37:36 2004
From: johnjc-regex at publicinfo.net (John Clements)
Date: Sun, 22 Aug 2004 17:37:36 +0100
Subject: [regex-coach] ".+" and ".+?" with optional parenthesized text
In-Reply-To: <87fz6fl1ky.fsf@bird.agharta.de>
References: <6.1.0.6.2.20040822134658.02f2e170@pop3.attglobal.net> <87fz6fl1ky.fsf@bird.agharta.de>
Message-ID: <6.1.0.6.2.20040822171822.02469850@mail.publicinfo.net>

That is absolutely brilliant, Edi! Thank you so much!

At 14:55 22/08/04, you wrote:
>On Sun, 22 Aug 2004 14:04:47 +0100, John Clements wrote:
>
> > I ran the pattern with "i" checked.
>
>I guess you also had "s" checked because your target string contained
>line breaks.

If it had line breaks they were introduced by the mailer(s) because
Regex Coach didn't show any.

> > What I want it to do is match the string from the beginning through
> > "between", and when there is no instance of "between", I want it to
> > match the entire string.
>
>This regex should work:
>
>^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}(.+?between|.*)

I was just looking over some tutorial material which was talking about
what you enclose in parentheses and what not, and it hadn't dawned on
me that it was relevant to my problem!

Yes, putting the ".+?" inside the parenthesis does the trick. And the
"|.*" makes perfect sense. It says so directly "or the rest of the
string".

I had settled for a solution that used the "greedy" version of ".+"
before "between", which in the presence of a second instance of the
word "between" would have brought in unwanted text. Now it's just
right. I really appreciate this!

>The behaviour you saw was right. (As a rule of thumb Regex Coach is
>always right as long as it does the same as Perl... :)

Yeah, that's what I thought, too. :)

>You had ".+?(between)?" which meant "match as few characters as
>possible up to ..." where ... was "the string 'between' OR ANYTHING"
>because you made 'between' optional, i.e.
>your regex was equivalent to ".+?". So, the regex engine matched
>exactly zero characters.
>
>Does that help?

Indeed, indeed! Thanks for that explanation, too. I need to see the
logic of something to really absorb it. I had accepted what I saw as
the limitation of the regex engine but without understanding its logic
hadn't worked out how to get that refinement that I needed.

All the best,

John

John Clements
john.clements at publicinfo.net
+44(0)20 8959-6432
http://www.publicinfo.net
PublicInfo.Net Ltd.
29 Gibbs Green
Edgware, Middlesex
United Kingdom HA8 9RS

From edi at agharta.de Sun Aug 22 18:39:07 2004
From: edi at agharta.de (Edi Weitz)
Date: Sun, 22 Aug 2004 20:39:07 +0200
Subject: [regex-coach] ".+" and ".+?" with optional parenthesized text
In-Reply-To: <6.1.0.6.2.20040822171822.02469850@mail.publicinfo.net> (John Clements's message of "Sun, 22 Aug 2004 17:37:36 +0100")
References: <6.1.0.6.2.20040822134658.02f2e170@pop3.attglobal.net> <87fz6fl1ky.fsf@bird.agharta.de> <6.1.0.6.2.20040822171822.02469850@mail.publicinfo.net>
Message-ID: <87d61jggqc.fsf@bird.agharta.de>

On Sun, 22 Aug 2004 17:37:36 +0100, John Clements wrote:

>>^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}(.+?between|.*)
>
> Yes, putting the ".+?" inside the parenthesis does the trick. And
> the "|.*" makes perfect sense. It says so directly "or the rest of
> the string".

What I forgot to say: Note that the order is important. This regex

  ^\s*An appeal.+?(Joined )?Cases? ?t ?[-?] ?\d{1,3}\/ ?\d{2}(.*|.+?between)

won't work because the engine will try the "rest of the string" first
and will succeed, so it will stop.

Cheers,
Edi.
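The point about the order of the alternatives can be checked with a
short sketch. It is written in Python on the assumption that its re
module tries alternatives left to right like Perl does; the target
string, the simplified prefix and the variable names are hypothetical
and not taken from the thread.

```python
import re

# Hypothetical short target, not the string from the thread.
text = "joined cases T-377/00 (1) between A and B"

# Simplified prefix (an assumption); only the final group matters here.
prefix = r"T-\d{1,3}/\d{2}"

# Specific alternative first: the engine stops right after "between".
specific_first = re.search(prefix + r"(.+?between|.*)", text, re.DOTALL)

# Catch-all first: ".*" succeeds at once, so "between" is never tried.
catch_all_first = re.search(prefix + r"(.*|.+?between)", text, re.DOTALL)

print(repr(specific_first.group(1)))   # ' (1) between'
print(repr(catch_all_first.group(1)))  # ' (1) between A and B'
```

Putting the more specific alternative first is the usual way to say
"this if possible, otherwise everything" to a backtracking engine.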