[sip-comm-dev] Re: svn commit: r7627 - trunk/src/net/java/sip/communicator/util/Html2Text.java

Matthieu Casanova chocolat.mou at gmail.com
Thu Aug 26 15:13:49 CEST 2010


ok, I'll fix that

Matthieu

On Thu, Aug 26, 2010 at 2:53 PM, Emil Ivov <emcho at sip-communicator.org> wrote:
> Hey Matthieu,
>
> Nice catch! Want to take care of it or should it log an issue?
>
> Cheers,
> Emil
>
> На 25.08.10 15:22, Matthieu Casanova написа:
>> Hi,
>> I may be wrong because it seems that this class was already in use,
>> but this class is not threadsafe, if the
>> extractText(String html) method is called twice from different threads
>> at the same time it can result in troubles since the StringBuilder
>> which contains the data is shared. Are we sure that it cannot happens
>> ?
>>
>> Matthieu
>>
>> 2010/8/21  <lubomir_m at dev.java.net>:
>>> Author: lubomir_m
>>> Date: 2010-08-20 23:04:04+0000
>>> New Revision: 7627
>>>
>>> Modified:
>>>   trunk/src/net/java/sip/communicator/util/Html2Text.java
>>>
>>> Log:
>>> Fixes a memory leak in Html2Text which used to keep the plain text extracted from a specific HTML text after it was no longer necessary.
>>>
>>> Modified: trunk/src/net/java/sip/communicator/util/Html2Text.java
>>> Url: https://sip-communicator.dev.java.net/source/browse/sip-communicator/trunk/src/net/java/sip/communicator/util/Html2Text.java?view=diff&rev=7627&p1=trunk/src/net/java/sip/communicator/util/Html2Text.java&p2=trunk/src/net/java/sip/communicator/util/Html2Text.java&r1=7626&r2=7627
>>> ==============================================================================
>>> --- trunk/src/net/java/sip/communicator/util/Html2Text.java     (original)
>>> +++ trunk/src/net/java/sip/communicator/util/Html2Text.java     2010-08-20 23:04:04+0000
>>> @@ -4,7 +4,6 @@
>>>  * Distributable under LGPL license.
>>>  * See terms of license at gnu.org.
>>>  */
>>> -
>>>  package net.java.sip.communicator.util;
>>>
>>>  import java.io.*;
>>> @@ -13,25 +12,34 @@
>>>  import javax.swing.text.html.parser.*;
>>>
>>>  /**
>>> - * A utility class that allows to extract the text content of an html page
>>> + * A utility class that allows to extract the text content of an HTML page
>>>  * stripped from all formatting tags.
>>>  *
>>>  * @author Emil Ivov <emcho at sip-communicator.org>
>>>  * @author Yana Stamcheva
>>> + * @author Lubomir Marinov
>>>  */
>>>  public class Html2Text
>>>  {
>>> +    /**
>>> +     * The <tt>Logger</tt> used by the <tt>Html2Text</tt> class for logging
>>> +     * output.
>>> +     */
>>>     private static final Logger logger
>>>         = Logger.getLogger(Html2Text.class);
>>>
>>> -    private static HTMLParserCallBack parser;
>>> +    /**
>>> +     * The HTML parser used by {@link #extractText(String)} to parse HTML so
>>> +     * that plain text can be extracted from it.
>>> +     */
>>> +    private static HTMLParserCallback parser;
>>>
>>>     /**
>>> -     * A utility method that allows to extract the text content of an html page
>>> +     * A utility method that allows to extract the text content of an HTML page
>>>      * stripped from all formatting tags. Method is synchronized to avoid
>>> -     * concurrent access to the underlying html editor kit.
>>> +     * concurrent access to the underlying <tt>HTMLEditorKit</tt>.
>>>      *
>>> -     * @param html the html string that we will extract the text from.
>>> +     * @param html the HTML string that we will extract the text from.
>>>      * @return the text content of the <tt>html</tt> parameter.
>>>      */
>>>     public static synchronized String extractText(String html)
>>> @@ -40,59 +48,86 @@
>>>             return null;
>>>
>>>         if (parser == null)
>>> -            parser = new HTMLParserCallBack();
>>> +            parser = new HTMLParserCallback();
>>>
>>>         try
>>>         {
>>>             StringReader in = new StringReader(html);
>>> -            parser.parse(in);
>>> -            in.close();
>>>
>>> -            return parser.getText();
>>> +            try
>>> +            {
>>> +                return parser.parse(in);
>>> +            }
>>> +            finally
>>> +            {
>>> +                in.close();
>>> +            }
>>>         }
>>> -        catch (Exception exc)
>>> +        catch (Exception ex)
>>>         {
>>>             if (logger.isInfoEnabled())
>>> -                logger.info("Failed to extract plain text from html="+html, exc);
>>> +                logger.info("Failed to extract plain text from html="+html, ex);
>>>             return html;
>>>         }
>>>     }
>>>
>>>     /**
>>> -     * The ParserCallback that will parse the html.
>>> +     * The ParserCallback that will parse the HTML.
>>>      */
>>> -    private static class HTMLParserCallBack extends HTMLEditorKit.ParserCallback
>>> +    private static class HTMLParserCallback
>>> +        extends HTMLEditorKit.ParserCallback
>>>     {
>>> -        StringBuffer s;
>>> +        /**
>>> +         * The <tt>StringBuilder</tt> which accumulates the parsed text while it
>>> +         * is being parsed.
>>> +         */
>>> +        private StringBuilder sb;
>>>
>>>         /**
>>>          * Parses the text contained in the given reader.
>>>          *
>>>          * @param in the reader to parse.
>>> +         * @return the parsed text
>>>          * @throws IOException thrown if we fail to parse the reader.
>>>          */
>>> -        public void parse(Reader in) throws IOException
>>> +        public String parse(Reader in)
>>> +            throws IOException
>>>         {
>>> -            s = new StringBuffer();
>>> -            ParserDelegator delegator = new ParserDelegator();
>>> -            // the third parameter is TRUE to ignore charset directive
>>> -            delegator.parse(in, this, Boolean.TRUE);
>>> +            sb = new StringBuilder();
>>> +
>>> +            String s;
>>> +
>>> +            try
>>> +            {
>>> +                new ParserDelegator().parse(in, this, /* ignoreCharSet */ true);
>>> +                s = sb.toString();
>>> +            }
>>> +            finally
>>> +            {
>>> +                /*
>>> +                 * Since the Html2Text class keeps this instance in a static
>>> +                 * reference, the field sb should be reset to null as soon as
>>> +                 * completing its goad in order to avoid keeping the parsed
>>> +                 * text in memory after it is no longer needed i.e. to prevent
>>> +                 * a memory leak. This method has been converted to return the
>>> +                 * parsed string instead of having a separate getter method for
>>> +                 * the parsed string for the same purpose.
>>> +                 */
>>> +                sb = null;
>>> +            }
>>> +            return s;
>>>         }
>>>
>>>         /**
>>>          * Appends the given text to the string buffer.
>>> +         *
>>> +         * @param text
>>> +         * @param pos
>>>          */
>>> +        @Override
>>>         public void handleText(char[] text, int pos)
>>>         {
>>> -            s.append(text);
>>> -        }
>>> -
>>> -        /**
>>> -         * Returns the parsed text.
>>> -         */
>>> -        public String getText()
>>> -        {
>>> -            return s.toString();
>>> +            sb.append(text);
>>>         }
>>>     }
>>>  }
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: commits-unsubscribe at sip-communicator.dev.java.net
>>> For additional commands, e-mail: commits-help at sip-communicator.dev.java.net
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe at sip-communicator.dev.java.net
>> For additional commands, e-mail: dev-help at sip-communicator.dev.java.net
>>
>>
>
> --
> Emil Ivov, Ph.D.                               67000 Strasbourg,
> Project Lead                                   France
> SIP Communicator
> emcho at sip-communicator.org                     PHONE: +33.1.77.62.43.30
> http://sip-communicator.org                    FAX:   +33.1.77.62.47.31
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe at sip-communicator.dev.java.net
> For additional commands, e-mail: dev-help at sip-communicator.dev.java.net
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe at sip-communicator.dev.java.net
For additional commands, e-mail: dev-help at sip-communicator.dev.java.net





More information about the dev mailing list