Change history for HTML-Parser

3.76 2021-03-04
    * Add a fix for a stack confusion error on `eof`. (GH#21) (Matthew Horsfall
      and Chase Whitener)

3.75 2020-08-30
    * Cleanup the prereqs a bit
    * Mark HTML::Filter as deprecated as the docs point out
    * Move Parser.pm into the lib directory with the others. This will help
      with everything from auto version bumps after releases, to scanning for
      prerequisites and spelling errors.
    * Fix a few spelling errors in the POD for HTML::Parser
    * Clean up the spacing on many examples in HTML::Parser

3.74 2020-08-30
    * Fix the order of date and version in this change log. (Thanks, haarg)
    * Convert to Dist::Zilla
    * Build all prereqs from our cpanfile
    * Go through all test files and:
        * perltidy
        * Use strict/warnings
        * Get rid of two-arg open
        * Get rid of BAREWORD filehandles
        * Fix the eval pattern used
        * Only use -w where we catch $SIG{__WARN__}
        * Fix encoding problems
        * use utf8 where we have unicode in the source
        * Fix a typo here and there
    * perltidy all of the example apps in eg/
    * Add comments explaining the apps in eg/ (GH#13 Thanks, Salvatore
      Bonaccorso)
    * Print out UTF-8 encoded data where sensible in eg/

3.73 2020-08-24
    * Cleaned up this changes log.
    * Added a .mailmap file to organize contributions accurately.
    * Ensure all versions are equal and on the current version
    * Add the .mailmap to the MANIFEST
    * Change the META information to point to the new GH repository
    * Add a .perltidyrc to use going forward
    * Add hctype.h and pfunc.h to the dist as static files and stop asking for
      them to be built on the user's end.
    * Remove t/pod.t from userland testing
    * Remove t/pod-coverage.t from userland testing
    * Clean up the MANIFEST
    * Start testing via GitHub Actions/Workflows
    * Protect active parser from being freed (PR 13, RT #115034)

3.72 2016-01-19
    * Avoid more clang casting warnings
    * Remove trailing whitespace
    * Ensure entities expand to utf8 sequences under 'utf8_mode' [RT#99755]
    * typo fixes (David Steinbrunner)
    * Silence clang warning (Jacques Germishuys)
    * const+static-ing (bulk88)

3.71 2013-05-09
    * Transform ':' in headers to '-' [RT#80524]

3.70 2013-03-28
    * Fix for cross-compiling with Buildroot (François Perrad)
    * Comment typo fix
    * Fix Issue #3 / RT #84144: HTML::Entities::decode_entities() needs to
      call SV_CHECK_THINKFIRST() before checking READONLY flag (Yves Orton)

3.69 2011-10-15
    * Documentation fix; encode_utf8 mixup [RT#71151]
    * Make it clearer that there are 2 (actually 3) options for handling
      "UTF-8 garbage"
    * Github is the official repo
    * Can't be bothered to try to fix the failures that occur on perl-5.6
    * fix to TokeParser to correctly handle option configuration (Barbie)
    * Aesthetic change: remove extra ; (Jon Jensen)
    * Trim surrounding whitespace from extracted URLs. (Ville Skyttä)

3.68 2010-09-01
    * Declare the encoding of the POD to be utf8

3.67 2010-08-17
    * bleadperl 2154eca7 breaks HTML::Parser 3.66 [RT#60368] (Nicholas Clark)

3.66 2010-07-09
    * Fix entity decoding in utf8_mode for the title header

3.65 2010-04-04
    * Eliminate buggy entities_decode_old
    * Fixed endianness typo [RT#50811] (Salvatore Bonaccorso)
    * Documentation Fixes. (Ville Skyttä)

3.64 2009-10-25
    * Convert files to UTF-8
    * Don't allow decode_entities() to generate illegal Unicode chars
    * Copyright 2009
    * Remove redundant (repeated) test
    * Make parse_file() method use 3-arg open [RT#49434]

3.63 2009-10-22
    * Take more care to prepare the char range for encode_entities [RT#50170]
    * decode_entities confused by trailing incomplete entity

3.62 2009-08-13
    * Doc patch: Make it clearer what the return value from ->parse is
    * HTTP::Header doc typo fix. (Ville Skyttä)
    * Do not bother tracking style or script, they're ignored. (Ville Skyttä)
    * Bring HTML 5 head elements up to date with WD-html5-20090423.
      (Ville Skyttä)
    * Improve HeadParser performance. (Ville Skyttä)

3.61 2009-06-20
    * Test that triggers the crash that Chip fixed
    * Complete documented list of literal tags
    * Avoid crash (referenced pend_text instead of skipped_text)
      (Chip Salzenberg)
    * Reference HTML::LinkExtor [RT#43164] (Antonio Radici)

3.60 2009-02-09
    * Spelling fixes. (Ville Skyttä)
    * Test multi-value headers. (Ville Skyttä)
    * Documentation improvements. (Ville Skyttä)
    * Do not terminate head parsing on the <object> element (added in
      HTML 4.0). (Ville Skyttä)
    * Add support for HTML 5 <meta charset> and new HEAD elements.
      (Ville Skyttä)
    * Short description of the htextsub example (Damyan Ivanov)
    * Suppress warning when encode_entities is called with undef [RT#27567]
      (Mike South)
    * HTML::Parser doesn't compile with perl 5.8.0. (Zefram)

3.59 2008-11-24
    * Restore perl-5.6 compatibility for HTML::HeadParser.
    * Improved META.yml

3.58 2008-11-17
    * Suppress "Parsing of undecoded UTF-8 will give garbage" warning with
      attr_encoded [RT#29089]
    * HTML::HeadParser:
        - Recognize the Unicode BOM in utf8_mode as well [RT#27522]
        - Avoid ending up with '/' keys attribute in Link headers.

3.57 2008-11-16
    * The <iframe> element content is now parsed in literal mode.
    * Parsing of <script> and <style> content ends on the first end tag even
      when that tag was in a quoted string. That seems to be the behaviour of
      all modern browsers.
    * Implement backquote() attribute as requested by Alex Kapranoff.
    * Test and documentation tweaks from Alex Kapranoff.

3.56 2007-01-12
    * Cloning of parser state for compatibility with threads. Fixed by
      Bo Lindbergh <blgl@hagernas.com>.
    * Don't require whitespace between declaration tokens.
      <http://rt.cpan.org/Ticket/Display.html?id=20864>

3.55 2006-07-10
    * Treat <> at the end of document as text. Used to be reported as a
      comment.
    * Improved Firefox compatibility for bad HTML:
        - Unclosed <script>, <style> are now treated as empty tags.
        - Unclosed <textarea>, <xmp> and <plaintext> treat rest as text.
        - Unclosed <title> closes at next tag.
    * Make <!a'b> a comment by itself.

3.54 2006-04-28
    * Yaakov Belch discovered yet another issue with <script> parsing.
      Enabling of 'empty_element_tags' got the parser confused if it found
      such a tag for elements that are normally parsed in literal mode. Of
      these <script src="..."/> is the only one likely to be found in
      documents. <http://rt.cpan.org//Ticket/Display.html?id=18965>

3.53 2006-04-27
    * When ignore_element was enabled it got confused if the corresponding
      tags did not nest properly; the end tag was treated as if it was a
      start tag. Found and fixed by Yaakov Belch <code@yaakovnet.net>.
      <http://rt.cpan.org/Ticket/Display.html?id=18936>

3.52 2006-04-26
    * Make sure the 'start_document' fires exactly once for each document
      parsed. For earlier releases it did not fire at all for empty documents
      and could fire multiple times if parse was called with empty chunks.
    * Documentation tweaks and typo fixes.

3.51 2006-03-22
    * Named entities outside the Latin-1 range are now only expanded when
      properly terminated with ";". This makes HTML::Parser compatible with
      Firefox/Konqueror/MSIE when it comes to how these entities are expanded
      in attribute values. Firefox does expand unterminated non-Latin-1
      entities in plain text, so here HTML::Parser only stays compatible with
      Konqueror/MSIE. Fixes <http://rt.cpan.org/Ticket/Display.html?id=17962>.
    * Fixed some documentation typos spotted by <william@knowmad.com>.
      <http://rt.cpan.org/Ticket/Display.html?id=18062>

3.50 2006-02-14
    * The 3.49 release didn't compile with VC++ because it mixed code and
      declarations. Fixed by Steve Hay <steve.hay@uk.radan.com>.

3.49 2006-02-08
    * Events could sometimes still fire after a handler has signaled eof.
    * Marked_sections with text ending in square bracket parsed wrong. Fix
      provided by <paul.bijnens@xplanation.com>.
      <http://rt.cpan.org/Ticket/Display.html?id=16749>

3.48 2005-12-02
    * Enabling empty_element_tags by default for HTML::TokeParser was a
      mistake. Reverted that change.
      <http://rt.cpan.org/Ticket/Display.html?id=16164>
    * When processing a document with "marked_sections => 1", the skipped
      text missed the first 3 bytes "<![".
      <http://rt.cpan.org/Ticket/Display.html?id=16207>

3.47 2005-11-22
    * Added empty_element_tags and xml_pic configuration options. These make
      it possible to enable these XML features without enabling the full
      XML-mode.
    * The empty_element_tags is enabled by default for HTML::TokeParser.

3.46 2005-10-24
    * Don't try to treat a literal &nbsp; as space. This breaks Unicode
      parsing. <http://rt.cpan.org/Ticket/Display.html?id=15068>
    * The unbroken_text option is now on by default for HTML::TokeParser.
    * HTML::Entities::encode will now encode "'" by default.
    * Improved report/ignore_tags documentation by Norbert Kiesel
      <nkiesel@tbdnetworks.com>.
    * Test suite now uses Test::More, by Norbert Kiesel
      <nkiesel@tbdnetworks.com>.
    * Fix HTML::Entities typo spotted by Stefan Funke <bundy@adm.arcor.net>.
    * Faster load time with XSLoader (perl-5.6 or better now required).
    * Fixed POD markup errors in some of the modules.

3.45 2005-01-06
    * Fix stack memory leak caused by missing PUTBACK. Only code that used
      $p->parse(\&cb) form was affected. Fix provided by Gurusamy Sarathy
      <gsar@sophos.com>.

3.44 2004-12-28
    * Fix confusion about nested quotes in <script> and <style> text.

3.43 2004-12-06
    * The SvUTF8 flag was not propagated correctly when replacing
      unterminated entities.
    * Fixed test failure because of missing binmode on Windows.

3.42 2004-12-04
    * Avoid sv_catpvn_utf8_upgrade() as that macro was not available in
      perl-5.8.0. Patch by Reed Russell <Russell.Reed@acxiom.com>.
    * Add casts to suppress compilation warnings for char/U8 mismatches.
    * HTML::HeadParser will always push new header values. This makes sure we
      never lose old header values.

3.41 2004-11-30
    * Fix unresolved symbol error with perl-5.005.

3.40 2004-11-29
    * Make utf8_mode only available on perl-5.8 or better. It produced
      garbage with older versions of perl.
    * Emit warning if entities are decoded and something in the first chunk
      looks like hi-bit UTF-8. Previously this warning was only triggered for
      documents with BOM.

3.39_92 2004-11-23
    * More documentation of the Unicode issues. Moved around HTML::Parser
      documentation a bit.
    * New boolean option; $p->utf8_mode to allow parsing of raw UTF-8.
    * Documented that HTML::Entities::decode_entities() can take multiple
      arguments.
    * Unterminated entities are now decoded in text (compatibility with MSIE
      misfeature).
    * Document HTML::Entities::_decode_entities(); this variation of the
      decode_entities() function has been available for a long time, but has
      not been documented until now.
    * HTML::Entities::_decode_entities() can now be told to try to expand
      unterminated entities.
    * Simplified Makefile.PL

3.39_91 2004-11-23
    * The HTML::HeadParser will skip Unicode BOM. Previously it would
      consider the <head> section done when it saw the BOM.
    * The parser will look for Unicode BOM and give appropriate warnings if
      the form found indicates trouble.
    * If no matching end tag is found for <script>, <style>, <xmp>, <title>,
      <textarea> then generate one where the next tag starts.
    * For <script> and <style> recognize quoted strings and don't consider
      end element if the corresponding end tag is found inside such a string.

3.39_90 2004-11-17
    * The <title> element is now parsed in literal mode, which means that
      other tags are not recognized until </title> has been seen.
    * Unicode support for perl-5.8 and better.
    * Decoding Unicode entities always enabled; no longer a compile time
      option.
    * Propagation of UTF8 state on strings. Patch contributed by John
      Gardiner Myers <jgmyers@proofpoint.com>.
    * Calculate offsets and lengths in chars for Unicode strings.
    * Fixed link typo in the HTML::TokeParser documentation.

3.38 2004-11-11
    * New boolean option; $p->closing_plaintext Contributed by Alex Kapranoff
      <alex@kapranoff.ru>

3.37 2004-11-10
    * Improved handling of HTML encoded surrogate pairs and illegally encoded
      Unicode; <http://rt.cpan.org/Ticket/Display.html?id=7785>. Patch by
      John Gardiner Myers <jgmyers@proofpoint.com>.
    * Avoid generating bad UTF8 strings when decoding entities representing
      chars beyond #255 in 8-bit strings. Such bad UTF8 sometimes made
      perl-5.8.5 and older segfault.
    * Undocument v2 style subclassing in synopsis section.
    * Internal cleanup: Make 'gcc -Wall' happier.
    * Avoid modification of PVs during parsing of attrspec. Another patch by
      John Gardiner Myers.

3.36 2004-04-01
    * Improved MSIE/Mozilla compatibility. If the same attribute name repeats
      for a start tag, use the first value instead of the last. Patch by Nick
      Duffek <html-parser@duffek.com>.
      <http://rt.cpan.org/Ticket/Display.html?id=5472>

3.35 2003-12-12
    * Documentation fixes by Paul Croome <Paul.Croome@softwareag.com>.
    * Removed redundant dSP.

3.34 2003-10-27
    * Fix segfault that happened when the parse callback caused the stack to
      get reallocated. The original bug report was
      <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217616>

3.33 2003-10-14
    * Perl 5.005 or better is now required. For some reason we get a test
      failure with perl-5.004 and I don't really feel like debugging that
      perl any more. Details about this failure can be found at
      <http://rt.cpan.org/Ticket/Display.html?id=4065>.
    * New HTML::TokeParser method called 'get_phrase'. It returns all current
      text while ignoring any phrase-level markup.
    * The HTML::TokeParser method 'get_text' now expands skipped
      non-phrase-level tags as a single space.

3.32 2003-10-10
    * If the document parsed ended with some kind of unterminated markup,
      then the parser state was not reset properly and this piece of markup
      would show up in the beginning of the next document parsed.
      <http://rt.cpan.org/Ticket/Display.html?id=3954>
    * The get_text and get_trimmed_text methods of HTML::TokeParser can now
      take multiple end tags as argument. Patch by <siegmann@tinbergen.nl> at
      <http://rt.cpan.org/Ticket/Display.html?id=3166>.
    * Various documentation tweaks.
    * Included another example program: hdump

3.31 2003-08-19
    * The -DDEBUGGING fix in 3.30 was not really there :-(

3.30 2003-08-17
    * The previous release failed to compile on a -DDEBUGGING perl like the
      one provided by Redhat 9.
    * Got rid of references to perl-5.7.
    * Further fixes to avoid warnings from Visual C. Patch by Steve Hay
      <steve.hay@uk.radan.com>.

3.29 2003-08-14
    * Setting xml_mode now implies strict_names also for end tags.
    * Avoid warning from Visual C. Patch by <gsar@activestate.com>.
    * 64-bit fix from Doug Larrick <doug@ties.org>
      http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=195500
    * Try to parse similar to Mozilla/MSIE in certain edge cases. All these
      are outside of the official definition of HTML but HTML spam often
      tries to take advantage of these.
        - New configuration attribute 'strict_end'. Unless enabled we will
          allow end tags to contain extra words or stuff that look like
          attributes before the '>'. This means that tags like these:

              </foo foo="<ignored>">
              </foo ignored>
              </foo ">" ignored>

          are now all parsed as a 'foo' end tag instead of text. Even if the
          extra stuff looks like attributes they will not be reported if
          requested via the 'attr' or 'tokens' argspecs for the 'end'
          handler.
        - Parse '</:comment>' and '</ comment>' as comments unless
          strict_comment is enabled. Previous versions of the parser would
          report these as text. If these comments contain quoted words
          prefixed by space or '=' these words can contain '>' without
          terminating the comment.
        - Parse '<! "<>" foo>' as comment containing ' "<>" foo'. Previous
          versions of the parser would terminate the comment at the first
          '>' and report the rest as text.
        - Legacy comment mode: Parse with comments terminated with a lone
          '>' if no '-->' is found before eof.
        - Incomplete tag at eof is reported as a 'comment' instead of 'text'
          unless strict_comment is enabled.

3.28 2003-04-16
    * When 'strict_comment' is off (which it is by default) treat anything
      that matches <!...> a comment.
    * Should now be more efficient on threaded perls.

3.27 2003-01-18
    * Typo fixes to the documentation.
    * HTML::Entities::escape_entities_numeric contributed by Sean M. Burke
      <sburke@cpan.org>.
    * Included one more example program 'hlc' that shows how to downcase all
      tags in an HTML file.

3.26 2002-03-17
    * Avoid core dump in some cases where the callback croaks. The
      perl_call_method and perl_call_sv need the G_EVAL flag to be safe.
    * New parser attributes; 'attr_encoded' and 'case_sensitive'. Contributed
      by Guy Albertelli II <guy@albertelli.com>.
    * HTML::Entities
        - don't encode \r by default as suggested by Sean M. Burke.
    * HTML::HeadParser
        - ignore empty http-equiv
        - allow multiple <link> elements. Patch by Timur I. Bakeyev
          <timur@gnu.org>
    * Avoid warnings from bleadperl on the uentities test.

3.25 2001-05-11
    * Minor tweaks for build failures on perl5.004_04, perl-5.6.0, and for
      macro clash under Windows.
    * Improved parsing of <plaintext>... :-)

3.24 2001-05-09
    * $p->parse(CODE)
    * New events: start_document, end_document
    * New argspecs: skipped_text, offset_end
    * The offset/line/column counters were not properly reset after eof.

3.23 2001-05-01
    * The $p->ignore_elements filter did not work as it should if handlers
      for start/end events were not registered.

3.22 2001-04-17
    * The <textarea> element is now parsed in literal mode, i.e. no other
      tags recognized until the </textarea> tag is seen. Unlike other literal
      elements, the text content is not 'cdata'.
    * The XML &apos; entity is decoded. The apos-char itself is still encoded
      as &#39; as &apos; is not really an HTML tag, and not recognized by
      many HTML browsers.

3.21 2001-04-10
    * Fix a memory leak which occurred when using filter methods.
    * Avoid a few compiler warnings (DEC C):
        - Trailing comma found in enumerator list
        - "unsigned char" is not compatible with "const char".
    * Doc update.

3.20 2001-04-02
    * Some minor documentation updates.

3.19_94 2001-03-30
    * Implemented 'tag', 'line', 'column' argspecs.
    * HTML::PullParser doc update. eg/hform is an example of HTML::PullParser
      usage.

3.19_93 2001-03-27
    * Shorten 'report_only_tags' to 'report_tags'. I think it reads better.
    * Bleadperl portability fixes.

3.19_92 2001-03-25
    * HTML::HeadParser made more efficient by using 'ignore_elements'.
    * HTML::LinkExtor made more efficient by using 'report_only_tags'.
    * HTML::TokeParser generalized into HTML::PullParser. HTML::PullParser
      only supports the get_token/unget_token interface of HTML::TokeParser,
      but is more flexible because the information that makes up a token is
      customisable. HTML::TokeParser is made into an HTML::PullParser
      subclass.

3.19_91 2001-03-19
    * Array references can be passed to the filter methods. Makes it easier
      to use them as constructor options.
    * Example programs updated to use filters.
    * Reset ignored_element state on EOF.
    * Documentation updates.
    * The netscape_buggy_comment() method now generates mandatory warning
      about its deprecation.

3.19_90 2001-03-13
    * This is a developer-only release. It contains some new experimental
      features. The interface to these might still change.
    * Implemented filters to reduce the numbers of callbacks generated:
        - $p->ignore_tags()
        - $p->report_only_tags()
        - $p->ignore_elements()
    * New @attr argspec. Less overhead than 'attr' and allows compatibility
      with XML::Parser style start events.
    * The whole argspec can be wrapped up in @{...} to signal flattening.
      Only makes a difference when the target is an array.

3.19 2001-03-09
    * Avoid the entity2char global. That should make the module more thread
      safe. Patch by Gurusamy Sarathy <gsar@ActiveState.com>.

3.18 2001-02-24
    * There was a C++ style comment left in util.c. Strict C compilers do not
      like that kind of stuff.

3.17 2001-02-23
    * The 3.16 release broke MULTIPLICITY builds. Fixed.

3.16 2001-02-22
    * The unbroken_text option now works across ignored tags.
    * Fix casting of pointers on some 64 bit platforms.
    * Fix decoding of Unicode entities. Only optionally available for
      perl-5.7.0 or better.
    * Expose internal decode_entities() function at the Perl level.
    * Reindented some code.

3.15 2000-12-26
    * HTML::TokeParser's get_tag() method now takes multiple tags to match.
      Hopefully the documentation is also a bit clearer.
    * #define PERL_NO_GET_CONTEXT: Should speed up things for thread enabled
      versions of perl.
    * Quote some more entities that also happen to be perl keywords. This
      avoids warnings on perl-5.004.
    * Unicode entities only triggered for perl-5.7.0 or higher.

3.14 2000-12-03
    * If a handler triggered by flushing text at eof called the eof method
      then infinite recursion occurred. Fixed. Bug discovered by Jonathan
      Stowe <gellyfish@gellyfish.com>.
    * Allow <!doctype ...> to be parsed as declaration.

3.13 2000-09-17
    * Experimental support for decoding of Unicode entities.

3.12 2000-09-14
    * Some tweaks to get it to compile with "Optimierender Microsoft (R)
      32-Bit C/C++-Compiler, Version 12.00.8168, fuer x86." Patch by Matthias
      Waldorf <matthias.waldorf@zoom.de>.
    * HTML::Entities documentation spelling patch by David Dyck
      <dcd@tc.fluke.com>.

3.11 2000-08-22
    * HTML::LinkExtor and eg/hrefsub now obtain %linkElements from the
      HTML::Tagset module.

3.10 2000-06-29
    * Avoid core dump when stack gets relocated as the result of text handler
      invocation while $p->unbroken_text is enabled. Needed to refresh the
      stack pointer.

3.09 2000-06-28
    * Avoid core dump if somebody clobbers the aliased $self argument of a
      handler.
    * HTML::TokeParser documentation update suggested by Paul Makepeace
      <Paul.Makepeace@realprogrammers.com>.

3.08 2000-05-23
    * Fix core dump for large start tags. Bug spotted by Alexander Fraser
      <green795@hotmail.com>
    * Added yet another example program: eg/hanchors
    * Typo fix by Jamie McCarthy <jamie@mccarthy.org>

3.07 2000-03-20
    * Fix perl5.004 builds (was broken in 3.06)
    * Declaration parsing mode now only triggers for <!DOCTYPE ...> and
      <!ENTITY ...>. Based on patch by la mouton <kero@3sheep.com>.

3.06 2000-03-06
    * Multi-threading/MULTIPLICITY compilation fix. Both Doug MacEachern
      <dougm@pobox.com> and Matthias Urlichs <smurf@noris.net> provided a
      patch.
    * Avoid some "statement not reached" warnings from picky compilers.
    * Remove final commas in enums as ANSI C does not allow them and some
      compilers actually care. Patch by James Walden
      <jamesw@ichips.intel.com>
    * Added eg/htextsub example program.

3.05 2000-01-22
    * Implemented $p->unbroken_text option
    * Don't parse content of certain HTML elements as CDATA when xml_mode is
      enabled.
    * Offset was reported with wrong sign for text at end of chunk.

3.04 2000-01-15
    * Backed out 3.03-patch that checked for legal handler and attribute
      names in the HTML::Parser constructor.
    * Documentation typo fixed by Michael.

3.03 2000-01-14
    * We did not get out of comment mode for comments ending with an odd
      number of "-" before ">". Patch by la mouton <kero@3sheep.com>
    * Documentation patch by Michael.

3.02 1999-12-21
    * Hide ~-magic IV-pointer to 'struct p_state' behind a reference. This
      allows copying of the internal _hparser_xs_state element, and will make
      HTML-Tree-0.61 work again.
    * Introduced $p->init() which might be useful for subclasses that only
      want the initialization part of the constructor.
    * Filled out DIAGNOSTICS section of the HTML::Parser POD.

3.01 1999-12-19
    * Rely on ~-magic instead of a DESTROY method to deallocate the internal
      'struct p_state'. This avoids memory leaks when people simply wipe out
      the content of the object hash.
    * One of the assertions in hparser.c had opposite logic. This made the
      parser fail when compiled with a -DDEBUGGING perl.
    * Don't assume any specific order of hash keys in the t/cases.t. This
      test failed with some newer development releases of perl.

3.00 1999-12-14
    * Documentation update (most of it from Michael)
    * Minor patch to eg/hstrip so that it uses a "" handler instead of
      &ignore.
    * Test suite patches from Michael

2.99_96 1999-12-13
    * Patches from Michael:
        - A handler of "" means that the event will be ignored. More
          efficient than using 'sub {}' as handler.
        - Don't use a perl hash for looking up argspec keywords.
        - Documentation tweaks.

2.99_95 1999-12-09
    * (this is a 3.00 candidate)
    * Fixed core dump when "<" was followed by an 8-bit character. Spotted
      and test case provided by Doug MacEachern. Doug had been running
      HTML-Parser-XS through more than 1 million urls that had been
      downloaded via LWP.
    * Handlers can now invoke $p->eof to request the parsing to terminate.
      HTML::HeadParser has been simplified by taking advantage of this. Also
      added a title-extraction example that uses this.
    * Michael once again fixed my bad English in the HTML::Parser
      documentation.
    * netscape_buggy_comment will carp instead of warn
    * updated TODO/README
    * Documented that HTML::Filter is deprecated.
    * Made backslash reserved in literal argspec strings.
    * Added several new test scripts.

2.99_94 1999-12-08
    * (should almost be a 3.00 candidate)
    * Renamed 'cdata_flag' as 'is_cdata'.
    * Dropped support for wrapping callback handler and argspec in an array
      and passing a reference to $p->handler. It created ambiguities when you
      want to pass an array as handler destination and not update argspec.
      The wrapping for constructor arguments is unchanged.
    * Reworked the documentation after updates from Michael.
    * Simplified internal check_handler(). It should probably simply be
      inlined in handler() again.
    * Added argspec 'length' and 'undef'
    * Fix statement-less label. Fix suggested by Matthew Langford
      <langfml@Eng.Auburn.EDU>.
    * Added two more example programs: eg/hstrip and eg/htext.
    * Various minor patches from Michael.

2.99_93 1999-12-07
    * Documentation update
    * $p->bool_attr_value renamed as $p->boolean_attribute_value
    * Internal renaming: attrspec --> argspec
    * Introduced internal 'enum argcode' in hparser.c
    * Added eg/hrefsub

2.99_92 1999-12-05
    * More documentation patches from Michael
    * Renamed 'token1' as 'token0' as suggested by Michael
    * For artificial end tags we now report 'tokens', but not 'tokenpos'.
    * Boolean attribute values show up as (0, 0) in 'tokenpos' now.
    * If $p->bool_attr_value is set it will influence 'tokens'
    * Fix for core dump when parsing <a "> when $p->strict_names(0). Based on
      fix by Michael.
    * Will av_extend() the tokens/tokenspos arrays.
    * New test suite script by Michael: t/attrspec.t

2.99_91 1999-12-04
    * Implemented attrspec 'offset'
    * Documentation patch from Michael
    * Some more cleanup/updated TODO

2.99_90 1999-12-03
    * (first beta for 3.00)
    * Using "realloc" as a parameter name in grow_tokens created problems for
      some people. Fix by Paul Schinder <schinder@pobox.com>
    * Patch by Michael that makes array handler destinations really work.
    * Patch by Michael that makes HTML::TokeParser use this. This gave a
      speedup of about 80%.
    * Patch by Michael that makes t/cases into a real test.
    * Small HTML::Parser documentation patch by Michael.
    * Renamed attrspec 'origtext' to 'text' and 'decoded_text' to 'dtext'
    * Split up Parser.xs. Moved stuff into hparser.c and util.c
    * Dropped html_ prefix from internal parser functions.
    * Renamed internal function html_handle() as report_event().

2.99_17 1999-12-02
    * HTML::Parser documentation patch from Michael.
    * Fix memory leaks in html_handler()
    * Patch that makes an array legal as handler destination. Also from
      Michael.
    * The end of marked sections does not eat successive newline any more.
    * The artificial end event for empty tag in xml_mode did not report an
      empty origtext.
    * New constructor option: 'api_version'

2.99_16 1999-12-01
    * Support "event" in argspec. It expands to the name of the handler
      (minus "default").
    * Fix core dump for large start tags. The tokens_grow() routine needed an
      adjustment. Added test for this; t/largstags.t.

2.99_15 1999-11-30
    * Major restructuring/simplification of callback interface based on
      initial work by Michael. The main news is that you now need to tell
      what arguments you want to be provided to your callbacks.
    * The following parser options have been eliminated:
          $p->decode_text_entities
          $p->keep_case
          $p->v2_compat
          $p->pass_self
          $p->attr_pos

2.99_14 1999-11-26
    * Documentation update by Michael A. Chase.
    * Fix for declaration parsing by Michael A. Chase.
    * Workaround for perl5.004_05 bug. Can't return &PL_sv_undef.

2.99_13 1999-11-22
    * New Parser.pm POD based on initial work by Michael A. Chase. All new
      features should now be described.
    * $p->callback(start => undef) will not reset the callback.
    * $p->xml_mode() did not parse attributes correctly because the
      HCTYPE_NOT_SPACE_EQ_SLASH_GT flag was never set.
    * A few more tests.

2.99_12 1999-11-18
    * Implemented $p->attr_pos attribute. This causes attr positions within
      $origtext of the start tag to be reported instead of the attribute
      values. The positions are reported as 4 numbers; end of previous attr,
      start of this attr, start of attr value, and end of attr. This should
      make substr() manipulations of $origtext easy.
    * Implemented $p->unbroken_text attribute. This makes sure that text
      segments are never broken and given back as separate text callbacks. It
      delays text callbacks until some other markup has been recognized.
    * More English corrections by Michael A. Chase.
    * HTML::LinkExtor now recognizes even more URI attributes as suggested by
      Sean M. Burke <sburke@netadventure.net>
    * Completed marked sections support. It is also now a compile time
      decision if you want this supported or not. The only drawback of
      enabling it should be a possible parsing speed reduction. I have not
      measured this yet.
    * The keys for callbacks initialized in the constructor are now suffixed
      with "_cb".
    * Renamed $p->pass_cbdata to $p->pass_self.
    * Added magic number to the p_state struct.

2.99_11 1999-11-17
    * Don't leak $@ modifications from HTML::Parser constructor.
    * Included HTML::Parser POD.
    * Marked sections almost work. CDATA and RCDATA should work.
    * For tags that take us into literal_mode; <script>, <style>, <xmp>, we
      did not recognize the end tag unless it was written in all lower case.

2.99_10 1999-11-16
    * The mkhctype and mkpfunc scripts were using \z inside RE. This did not
      work for perl5.004. Replaced them with plain old dollar signs.

2.99_09 1999-11-15
    * Grammar fixes by Michael A. Chase <mchase@ix.netcom.com>
    * Some more test suite patches for Win32 by Michael A. Chase
      <mchase@ix.netcom.com>
    * Implemented $p->strict_names attribute. By default we now allow almost
      anything in tag and attribute names. This is much closer to the
      behaviour of some popular browsers. This allows us to parse broken tags
      like this example from the LWP mailing list:

          <IMG ALIGN=MIDDLE SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>

    * Introduced some tables in "hctype.h" and "pfunc.h". These are built by
      the corresponding "mk..." script.

2.99_08 1999-11-10
    * Make Parser.xs compile on perl5.004_05 too.
    * New callback called 'default'. This will be called for any document
      text no other callback shows an interest in.
    * Patch by Michael A. Chase <mchase@ix.netcom.com> that should help clean
      up files for the test suite on Win32.
    * Can now set up various attributes with key/value pairs passed to the
      constructor.
    * $p->parse_file() will open the file in binmode()
    * Pass complete processing instruction tag as second argument to process
      callback.
    * New boolean attribute v2_compat. This influences how attributes are
      reported for start tags.
    * HTML::Filter now filters process instructions too.
    * Faster HTML::LinkExtor by taking advantage of the new callback
      interface. The module now also uses URI.pm (instead of the old
      URI::URL) to absolutize URIs.
    * Faster HTML::TokeParser by taking advantage of new accum interface.

2.99_07 1999-11-09
    * Entities in attribute values are now always expanded.
    * If you set the $p->decode_text_entities to a true value, then you don't
      have to decode the text yourself.
    * In xml_mode we don't report empty element tags as a start tag with an
      extra parameter any more. Instead we generate an artificial end tag.
    * 'xml_mode' now implies 'keep_case'.
    * The parser now keeps its own copy of the bool_attr_value value.
    * Avoid memory leak for text callbacks
    * Avoid using ERROR as a goto label.
    * Introduced common internal accessor function for all boolean parser
      attributes.
    * Tweaks to make Parser.xs compile under perl5.004.

2.99_06 1999-11-08
    * Internal fast decode_entities(). By using it we are able to make the
      HTML::Entities::decode function 6 times faster than the old one
      implemented in pure Perl.
    * $p->bool_attr_value() can be set to influence the value that boolean
      attributes will be assigned. The default is to assign a value identical
      to the attribute name.
    * Process instructions are reported as "PI" in @accum
    * $p->xml_mode(1) modifies how processing instructions are terminated and
      allows "/>" at the end of start tags.
    * Turn off optimizations when compiling with gcc on Solaris. Avoids what
      we believe to be a compiler bug. Should probably figure out which
      versions of gcc have this bug.

2.99_05 1999-11-05
    * The previous release did not even compile. I forgot to try 'make test'
      before uploading.

2.99_04 1999-11-05
    * Generalized <XMP>-support to cover all literal parsing. Currently
      activated for <script>, <style>, <xmp> and <plaintext>.

2.99_03 1999-11-05
    * <XMP>-support.
    * Allow ":" in tag and attribute names
    * Include rest of the HTML::* files from the old HTML::Parser package.
      This should make testing easier.

2.99_02 1999-11-04
    * Implemented keep_case() option. If this attribute is true, then we
      don't lowercase tag and attribute names.
    * Implemented accum() that takes an array reference. Tokens are pushed
      onto this array instead of sent to callbacks.
    * Implemented strict_comment().

2.99_01 1999-11-03
    * Baseline of XS implementation

2.25 1999-11-05
    * Allow ":" in attribute names as a workaround for Microsoft Excel 2000
      which generates such files.
    * Make deprecate warning if netscape_buggy_comment() method is used. The
      method is used in strict_comment().
    * Avoid duplication of parse_file() method in HTML::HeadParser.

2.24 1999-10-29
    * $p->parse_file() will not close a handle passed to it any more. If
      passed a filename that can't be opened it will return undef instead of
      raising an exception, and strings like "*STDIN" are not treated as
      globs any more.
    * HTML::LinkExtor knows about background attribute of <tables>. Patch by
      Clinton Wong <clintdw@netcom.com>
    * HTML::TokeParser will parse large inline strings much faster now. The
      string holding the document must not be changed during parsing.

2.23 1999-06-09
    * Documentation updates.

2.22 1998-12-18
    * Protect HTML::HeadParser from evil $SIG{__DIE__} hooks.

2.21 1998-11-13
    * HTML::TokeParser can now parse strings directly and does the right
      thing if you pass it a GLOB. Based on patch by Sami Itkonen <si@iki.fi>.
    * HTML::Parser now allows space before and after "--" in Netscape
      comments. Patch by Peter Orbaek <poe@daimi.au.dk>.

2.20 1998-07-08
    * Added HTML::TokeParser. Check it out!

2.19 1998-07-07
    * Don't end a text chunk with space when we try to avoid breaking up
      words.

2.18 1998-06-22
    * HTML::HeadParser->parse_file will now stop parsing when the <body>
      starts as it should.
    * HTML::LinkExtor more easily subclassable by introducing the
      $self->_found_link method.

2.17 1998-04-28
    * Never split words (a sequence of non-space) between two invocations
      of $self->text. This is just a simplification of the code that tried
      not to break entities.
    * HTML::Parser->parse_file now uses smaller chunks as already suggested
      by the HTML::Parser documentation.

2.16 1998-04-02
    * The HTML::Parser could sometimes break hex entities (like &#xFFFF;)
      in the middle.
    * Removed remaining forced dependencies on libwww-perl modules. It
      means that all tests should now pass, even if libwww-perl was not
      installed previously.
    * More tests.

2.14 1998-04-01
    * HTML::* modules unbundled from libwww-perl-5.22

eg/hform

#!/usr/bin/perl

# Print information about forms and their controls present in the HTML.
# See also HTML::Form module

use strict;
use warnings;
use HTML::PullParser ();
use HTML::Entities qw(decode_entities);
use Data::Dumper qw(Dumper);

my @FORM_TAGS = qw(form input textarea button select option);

my $p = HTML::PullParser->new(
    file        => shift || "xxx.html",
    start       => 'tag, attr',
    end         => 'tag',
    text        => '@{text}',
    report_tags => \@FORM_TAGS,
) || die "$!";

# a little helper function
sub get_text {
    my ($p, $stop) = @_;
    my $text;
    while (defined(my $t = $p->get_token)) {
        if (ref $t) {
            $p->unget_token($t) unless $t->[0] eq $stop;
            last;
        }
        else {
            $text .= $t;
        }
    }
    return $text;
}

my @forms;
while (defined(my $t = $p->get_token)) {
    next unless ref $t;    # skip text
    if ($t->[0] eq "form") {
        shift @$t;
        push(@forms, $t);
        while (defined(my $t = $p->get_token)) {
            next unless ref $t;    # skip text
            last if $t->[0] eq "/form";
            if ($t->[0] eq "select") {
                my $sel = $t;
                push(@{$forms[-1]}, $t);
                while (defined(my $t = $p->get_token)) {
                    next unless ref $t;    # skip text
                    last if $t->[0] eq "/select";

                    #print "select ", Dumper($t), "\n";
                    if ($t->[0] eq "option") {
                        my $value = $t->[1]->{value};
                        my $text  = get_text($p, "/option");
                        unless (defined $value) {
                            $value = decode_entities($text);
                        }
                        push(@$sel, $value);
                    }
                    else {
                        warn "$t->[0] inside select";
                    }
                }
            }
            elsif ($t->[0] =~ /^\/?option$/) {
                warn "option tag outside select";
            }
            elsif ($t->[0] eq "textarea") {
                push(@{$forms[-1]}, $t);
                $t->[1]{value} = get_text($p, "/textarea");
            }
            elsif ($t->[0] =~ m,^/,) {
                warn "stray $t->[0] tag";
            }
            else {
                push(@{$forms[-1]}, $t);
            }
        }
    }
    else {
        warn "form tag $t->[0] outside form";
    }
}

print Dumper(\@forms), "\n";

eg/htitle

#!/usr/bin/perl

# This program will print out the title of an HTML document.
use strict;
use warnings;
use HTML::Parser ();

sub title_handler {
    my $self = shift;
    $self->handler(text => sub { print @_ }, "dtext");
    $self->handler(end  => "eof", "self");
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&title_handler, "self"],
    report_tags => ['title'],
);
$p->parse_file(shift || die) || die $!;
print "\n";

eg/hbody

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser ();

my $doc = <<'EOT';
<!-- This is not where <BODY> starts -->
<title>foo</title>
<script language="Perl" description="Print out <BODY>">
   open(BODY, "body.txt");
   while (<BODY>) {
       print;
   }
</script>
<!-- The next thing will be <BODY> the body -->
<Body>
Howdy!
</body>
EOT

my $body_offset;

HTML::Parser->new(
    start_h => [
        sub {
            return unless shift eq "body";
            $body_offset = shift;
            shift->eof;    # tell the parser to stop
        },
        "tagname,offset,self"
    ]
)->parse($doc);

die "No <body> found" unless defined $body_offset;

my $head = substr($doc, 0, $body_offset, "");
print $doc;

eg/hlc

#!/usr/bin/perl

# This script will assume that the first command line argument
# is a file containing HTML, and return a version
# where all the tags are converted to lowercase.

use strict;
use warnings;
use HTML::Parser ();

HTML::Parser->new(
    start_h   => [\&start_lc, "tokenpos, text"],
    end_h     => [sub { print lc shift }, "text"],
    default_h => [sub { print shift }, "text"],
)->parse_file(shift) || die "Can't open file: $!\n";

sub start_lc {
    my ($tpos, $text) = @_;
    for (my $i = 0; $i < @$tpos; $i += 2) {
        next if $i && ($i / 2) % 2 == 0;    # skip attribute values
        $_ = lc $_ for substr($text, $tpos->[$i], $tpos->[$i + 1]);
    }
    print $text;
}

eg/hrefsub

#!/usr/bin/perl

# Perform transformations on link attributes in an HTML document.
# Examples:
#
#  $ hrefsub 's/foo/bar/g' index.html
#  $ hrefsub '$_=URI->new_abs($_, "http://foo")' index.html
#
# The first argument is a perl expression that might modify $_.
# It is called for each link in the document with $_ set to
# the original value of the link URI.  The variables $tag and
# $attr can be used to access the tagname and attributename
# within the tag where the current link is found.
#
# The second argument is the name of a file to process.

use strict;
use warnings;

use HTML::Parser ();
use HTML::Tagset ();
use URI;

# Construct a hash of tag names that may have links.
my %link_attr;
{
    # To simplify things, reformat the %HTML::Tagset::linkElements
    # hash so that it is always a hash of hashes.
    while (my ($k, $v) = each %HTML::Tagset::linkElements) {
        if (ref($v)) {
            $v = {map { $_ => 1 } @$v};
        }
        else {
            $v = {$v => 1};
        }
        $link_attr{$k} = $v;
    }

    # Uncomment this to see what HTML::Tagset::linkElements thinks are
    # the tags with link attributes
    #use Data::Dump; Data::Dump::dump(\%link_attr); exit;
}

# Create a subroutine named 'edit' to perform the operation
# passed in from the command line.  The code should modify $_
# to change things.
my $code = shift;
$code = 'sub edit { local $_ = shift; my($attr, $tag) = @_; no strict; ' . ($code // '') . '; $_; }';

#print $code;
eval $code;
die $@ if $@;

# Set up the parser.
my $p = HTML::Parser->new(api_version => 3);

# The default is to print everything as is.
$p->handler(default => sub { print @_ }, "text");

# All links are found in start tags.  This handler will evaluate
# &edit for each link attribute found.
$p->handler(
    start => sub {
        my ($tagname, $pos, $text) = @_;
        if (my $link_attr = $link_attr{$tagname}) {
            while (4 <= @$pos) {

                # use attribute sets from right to left
                # to avoid invalidating the offsets
                # when replacing the values
                my ($k_offset, $k_len, $v_offset, $v_len) = splice(@$pos, -4);
                my $attrname = lc(substr($text, $k_offset, $k_len));
                next unless $link_attr->{$attrname};
                next unless $v_offset;    # 0 v_offset means no value
                my $v = substr($text, $v_offset, $v_len);
                $v =~ s/^([\'\"])(.*)\1$/$2/;
                my $new_v = edit($v, $attrname, $tagname);
                next if $new_v eq $v;
                $new_v =~ s/\"/&quot;/g;    # since we quote with ""
                substr($text, $v_offset, $v_len) = qq("$new_v");
            }
        }
        print $text;
    },
    "tagname, tokenpos, text"
);

# Parse the file passed in from the command line
my $file = shift || usage();
$p->parse_file($file) || die "Can't open file $file: $!\n";

sub usage {
    my $progname = $0;
    $progname =~ s,^.*/,,;
    die "Usage: $progname <perlexpr> <filename>\n";
}

eg/hdisable

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser ();
use HTML::Entities qw(encode_entities);

sub disable_tags_but {
    my ($text, $allowed_tags) = @_;
    my @text;
    my %allowed_tag = map { $_ => 1 } @{$allowed_tags || []};
    my $tag_h = sub {
        my ($tag, $text) = @_;
        $text = encode_entities($text, "<") unless $allowed_tag{$tag};
        push(@text, $text);
    };
    HTML::Parser->new(
        start_h   => [$tag_h, 'tagname, text'],
        end_h     => [$tag_h, 'tagname, text'],
        default_h => [\@text, '@{text}'],
    )->parse($text)->eof;
    return join("", @text);
}

#
# Test it
#
print disable_tags_but(<<EOT, [qw(a br)]) unless caller;
Test <foo> <a href="...">...</a> </bar>
EOT

eg/hanchors

#!/usr/bin/perl

use strict;
use warnings;

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.
#
# See also HTML::LinkExtor

use Encode;
use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&a_start_handler, "self,tagname,attr"],
    report_tags => [qw(a img)],
);
$p->parse_file(shift || die) || die $!;

sub a_start_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq "a";
    return unless exists $attr->{href};
    print "A $attr->{href}\n";

    $self->handler(text  => [], '@{dtext}');
    $self->handler(start => \&img_handler);
    $self->handler(end   => \&a_end_handler, "self,tagname");
}

sub img_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
}

sub a_end_handler {
    my ($self, $tag) = @_;
    my $text = encode('utf8', join("", @{$self->handler("text")}));
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";

    $self->handler("text",  undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end",   undef);
}

eg/htextsub

#!/usr/bin/perl

# Shows how to mangle all plain text in an HTML document, using an arbitrary
# Perl expression. Plain text is all text not within a tag declaration, i.e.
# not in <p ...>, but possibly between <p> and </p>
# Example (Reverse 'Debian' in all text) :
# lynx -dump -source -raw http://www/debian.org > /tmp/a.txt
# ./htextsub '$_ =~ s/Debian/Naibed/gi' /tmp/a.txt

use strict;
use warnings;
use HTML::Parser ();

my $code = shift || usage();
$code = 'sub edit_print { local $_ = shift; ' . $code .
  '; print }';

#print $code;
eval $code;
die $@ if $@;

my $p = HTML::Parser->new(
    unbroken_text => 1,
    default_h     => [sub { print @_; }, "text"],
    text_h        => [\&edit_print, "text"],
);

my $file = shift || usage();
$p->parse_file($file) || die "Can't open file $file: $!\n";

sub usage {
    my $progname = $0;
    $progname =~ s,^.*/,,;
    die "Usage: $progname <perlexpr> <filename>\n";
}

eg/hstrip

#!/usr/bin/perl

# This script cleans up an HTML document

use strict;
use warnings;
use HTML::Parser ();

# configure these values
my @ignore_attr = qw(bgcolor background color face style link alink vlink text
  onblur onchange onclick ondblclick onfocus onkeydown onkeyup onload
  onmousedown onmousemove onmouseout onmouseover onmouseup
  onreset onselect onunload
);
my @ignore_tags     = qw(font big small b i);
my @ignore_elements = qw(script style);

# make it easier to look up attributes
my %ignore_attr = map { $_ => 1 } @ignore_attr;

sub tag {
    my ($pos, $text) = @_;
    if (@$pos >= 4) {

        # kill some attributes
        my ($k_offset, $k_len, $v_offset, $v_len) = @{$pos}[-4 .. -1];
        my $next_attr = $v_offset ?
          $v_offset + $v_len : $k_offset + $k_len;
        my $edited;
        while (@$pos >= 4) {
            ($k_offset, $k_len, $v_offset, $v_len) = splice @$pos, -4;
            if ($ignore_attr{lc substr($text, $k_offset, $k_len)}) {
                substr($text, $k_offset, $next_attr - $k_offset) = "";
                $edited++;
            }
            $next_attr = $k_offset;
        }

        # if we killed all attributes, kill any extra whitespace too
        $text =~ s/^(<\w+)\s+>$/$1>/ if $edited;
    }
    print $text;
}

sub decl {
    my $type = shift;
    print shift if $type eq "doctype";
}

sub text {
    print shift;
}

HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&tag, "tokenpos, text"],
    process_h       => ["", ""],
    comment_h       => ["", ""],
    declaration_h   => [\&decl, "tagname, text"],
    default_h       => [\&text, "text"],
    ignore_tags     => \@ignore_tags,
    ignore_elements => \@ignore_elements,
)->parse_file(shift) || die "Can't open file: $!\n";

eg/hdump

#!/usr/bin/perl

# This script will output event information as it parses the HTML document.
# This gives the user a "Parser's eye view" of an HTML document.

use strict;
use warnings;

use HTML::Parser ();
use Data::Dumper qw(Dumper);

sub h {
    my ($event, $line, $column, $text, $tagname, $attr) = @_;

    my @d = (uc(substr($event, 0, 1)) . " L$line C$column");
    substr($text, 40) = "..." if length($text) > 40;
    push(@d, $text);
    push(@d, $tagname) if defined $tagname;
    push(@d, $attr)    if $attr;

    print Dumper(@d), "\n";
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler(default => \&h, "event, line, column, text, tagname, attr");

$p->parse_file(@ARGV ?
  shift : *STDIN);

eg/htext

#!/usr/bin/perl

# Extract all plain text from an HTML file

use strict;
use warnings;
use Encode qw(encode);    # import encode(); "use Encode ()" would import nothing
use HTML::Parser ();

my %inside;

sub tag {
    my ($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";    # not for all tags
}

sub text {
    return if $inside{script} || $inside{style};
    print encode('utf8', $_[0]);
}

HTML::Parser->new(
    api_version => 3,
    handlers    => [
        start => [\&tag,  "tagname, '+1'"],
        end   => [\&tag,  "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";

TODO

 - Check how we compare to the HTML5 parsing rules
 - limit the length of markup elements that never end. Perhaps by
   configurable limits on the length that markup can have and still
   be recognized. Report stuff as 'text' when this happens?
 - remove 255 char limit on literal argspec strings
 - implement backslash escapes in literal argspec string
 - <![%app1;[...]]> (parameter entities)
 - make literal tags configurable. The current list is hardcoded to be
   "script", "style", "title", "iframe", "textarea", "xmp", and
   "plaintext".

SGML FEATURES WE WILL PROBABLY IGNORE FOREVER
 - Empty tags: <> </> (repeat previous start tag)
 - <foo<bar> (same as <foo><bar>)
 - NET tags <name/.../

MINOR "BUGS" (alias FEATURES)
 - no way to clear "boolean_attribute_value".
 - <style> and <script> do not end with the first "</".

MSIE bug compatibility
 - recognize server side includes as comments; <% ... %>
   if no matching %> found treat "<% ..." as text
 - skip quoted strings when looking for PIC

README

NAME
    HTML::Parser - HTML parser class

SYNOPSIS
        use strict;
        use warnings;
        use HTML::Parser ();

        # Create parser object
        my $p = HTML::Parser->new(
            api_version     => 3,
            start_h         => [\&start, "tagname, attr"],
            end_h           => [\&end, "tagname"],
            marked_sections => 1,
        );

        # Parse document text chunk by chunk
        $p->parse($chunk1);
        $p->parse($chunk2);

        # ...
# signal end of document $p->eof; # Parse directly from file $p->parse_file("foo.html"); # or open(my $fh, "<:utf8", "foo.html") || die; $p->parse_file($fh); DESCRIPTION Objects of the "HTML::Parser" class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked. "HTML::Parser" is not a generic SGML parser. We have tried to make it able to deal with the HTML that is actually "out there", and it normally parses as closely as possible to the way the popular web browsers do it instead of strictly following one of the many HTML specifications from W3C. Where there is disagreement, there is often an option that you can enable to get the official behaviour. The document to be parsed may be supplied in arbitrary chunks. This makes on-the-fly parsing as documents are received from the network possible. If event driven parsing does not feel right for your application, you might want to use "HTML::PullParser". This is an "HTML::Parser" subclass that allows a more conventional program structure. METHODS The following method is used to construct a new "HTML::Parser" object: $p = HTML::Parser->new( %options_and_handlers ) This class method creates a new "HTML::Parser" object and returns it. Key/value argument pairs may be provided to assign event handlers or initialize parser options. The handlers and parser options can also be set or modified later by the method calls described below. If a top level key is in the form "<event>_h" (e.g., "text_h") then it assigns a handler to that event, otherwise it initializes a parser option. The event handler specification value must be an array reference. Multiple handlers may also be assigned with the 'handlers => [%handlers]' option. See examples below. If new() is called without any arguments, it will create a parser that uses callback methods compatible with version 2 of "HTML::Parser". 
See the section on "version 2 compatibility" below for details. The special constructor option 'api_version => 2' can be used to initialize version 2 callbacks while still setting other options and handlers. The 'api_version => 3' option can be used if you don't want to set any options and don't want to fall back to v2 compatible mode. Examples: $p = HTML::Parser->new( api_version => 3, text_h => [ sub {...}, "dtext" ] ); This creates a new parser object with a text event handler subroutine that receives the original text with general entities decoded. $p = HTML::Parser->new( api_version => 3, start_h => [ 'my_start', "self,tokens" ] ); This creates a new parser object with a start event handler method that receives the $p and the tokens array. $p = HTML::Parser->new( api_version => 3, handlers => { text => [\@array, "event,text"], comment => [\@array, "event,text"], } ); This creates a new parser object that stores the event type and the original text in @array for text and comment events. The following methods feed the HTML document to the "HTML::Parser" object: $p->parse( $string ) Parse $string as the next chunk of the HTML document. Handlers invoked should not attempt to modify the $string in-place until $p->parse returns. If an invoked event handler aborts parsing by calling $p->eof, then $p->parse() will return a FALSE value. Otherwise the return value is a reference to the parser object ($p). $p->parse( $code_ref ) If a code reference is passed as the argument to be parsed, then the chunks to be parsed are obtained by invoking this function repeatedly. Parsing continues until the function returns an empty (or undefined) result. When this happens $p->eof is automatically signaled. Parsing will also abort if one of the event handlers calls $p->eof. 
The effect of this is the same as: while (1) { my $chunk = &$code_ref(); if (!defined($chunk) || !length($chunk)) { $p->eof; return $p; } $p->parse($chunk) || return undef; } But it is more efficient as this loop runs internally in XS code. $p->parse_file( $file ) Parse text directly from a file. The $file argument can be a filename, an open file handle, or a reference to an open file handle. If $file contains a filename and the file can't be opened, then the method returns an undefined value and $! tells why it failed. Otherwise the return value is a reference to the parser object. If a file handle is passed as the $file argument, then the file will normally be read until EOF, but not closed. If an invoked event handler aborts parsing by calling $p->eof, then $p->parse_file() may not have read the entire file. On systems with multi-byte line terminators, the values passed for the offset and length argspecs may be too low if parse_file() is called on a file handle that is not in binary mode. If a filename is passed in, then parse_file() will open the file in binary mode. $p->eof Signals the end of the HTML document. Calling the $p->eof method outside a handler callback will flush any remaining buffered text (which triggers the "text" event if there is any remaining text). Calling $p->eof inside a handler will terminate parsing at that point and cause $p->parse to return a FALSE value. This also terminates parsing by $p->parse_file(). After $p->eof has been called, the parse() and parse_file() methods can be invoked to feed new documents with the parser object. The return value from eof() is a reference to the parser object. Most parser options are controlled by boolean attributes. Each boolean attribute is enabled by calling the corresponding method with a TRUE argument and disabled with a FALSE argument. The attribute value is left unchanged if no argument is given. The return value from each method is the old attribute value. 
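The return-value rules described above can be checked with a short script. This is only a sketch: the variable names are illustrative, and it relies solely on the documented behaviour that a boolean attribute method returns the old value and that parse() returns a FALSE value once a handler has called eof.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);

# Setting a boolean attribute returns the value it had before the call.
my $old = $p->unbroken_text(1);

# Calling the method without an argument only queries the attribute.
my $now = $p->unbroken_text;

printf "unbroken_text was %d, is now %d\n", ($old ? 1 : 0), ($now ? 1 : 0);

# A handler that calls eof aborts parsing, so parse() returns a false value.
my $aborting = HTML::Parser->new(
    api_version => 3,
    start_h     => [sub { shift->eof }, "self"],
);
my $ret = $aborting->parse("<p>some text");
print "parse() returned ", ($ret ? "the parser object" : "a false value"), "\n";
```

Testing the return value of parse() this way is how a caller can tell that a handler stopped the parse early.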
    Methods that can be used to get and/or set parser options are:

    $p->attr_encoded
    $p->attr_encoded( $bool )
        By default, the "attr" and @attr argspecs will have general entities
        for attribute values decoded. Enabling this attribute leaves
        entities alone.

    $p->backquote
    $p->backquote( $bool )
        By default, only ' and " are recognized as quote characters around
        attribute values. MSIE also recognizes backquotes for some reason.
        Enabling this attribute provides compatibility with this behaviour.

    $p->boolean_attribute_value( $val )
        This method sets the value reported for boolean attributes inside
        HTML start tags. By default, the name of the attribute is also used
        as its value. This affects the values reported for "tokens" and
        "attr" argspecs.

    $p->case_sensitive
    $p->case_sensitive( $bool )
        By default, tag names and attribute names are down-cased. Enabling
        this attribute leaves them as found in the HTML source document.

    $p->closing_plaintext
    $p->closing_plaintext( $bool )
        By default, a "plaintext" element can never be closed. Everything up
        to the end of the document is parsed in CDATA mode. This historical
        behaviour is what at least MSIE does. Enabling this attribute makes
        the closing "</plaintext>" tag effective, and the parsing process
        will resume after seeing this tag. This emulates early gecko-based
        browsers.

    $p->empty_element_tags
    $p->empty_element_tags( $bool )
        By default, empty element tags are not recognized as such and the
        "/" before ">" is just treated like a normal name character (unless
        "strict_names" is enabled). Enabling this attribute makes
        "HTML::Parser" recognize these tags.

        Empty element tags look like start tags, but end with the character
        sequence "/>" instead of ">". When recognized by "HTML::Parser" they
        cause an artificial end event in addition to the start event. The
        "text" for the artificial end event will be empty and the "tokenpos"
        array will be undefined even though the token array will have one
        element containing the tag name.
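As an illustration of "empty_element_tags", the following sketch records the events produced for a single "<br/>" tag. The @events array and the message strings are hypothetical names chosen for the example; the point is only that one start event and one artificial end event with empty source text are expected.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;
my $p = HTML::Parser->new(
    api_version        => 3,
    empty_element_tags => 1,
    start_h => [sub { push @events, "start $_[0]" }, "tagname"],
    end_h   => [
        sub {
            my ($tagname, $text) = @_;
            # The artificial end event carries no source text of its own.
            my $len = defined $text ? length $text : 0;
            push @events, "end $tagname (source text length $len)";
        },
        "tagname, text",
    ],
);
$p->parse('<br/>')->eof;

print "$_\n" for @events;
```

Without the option, the "/" would be treated as part of the tag name and no end event would be generated.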
$p->marked_sections $p->marked_sections( $bool ) By default, section markings like <![CDATA[...]]> are treated like ordinary text. When this attribute is enabled section markings are honoured. There are currently no events associated with the marked section markup, but the text can be returned as "skipped_text". $p->strict_comment $p->strict_comment( $bool ) By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Mozilla, Opera and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--". The official behaviour is enabled by enabling this attribute. Enabling of 'strict_comment' also disables recognizing these forms as comments: </ comment> <! comment> $p->strict_end $p->strict_end( $bool ) By default, attributes and other junk are allowed to be present on end tags in a manner that emulates MSIE's behaviour. The official behaviour is enabled with this attribute. If enabled, only whitespace is allowed between the tagname and the final ">". $p->strict_names $p->strict_names( $bool ) By default, almost anything is allowed in tag and attribute names. This is the behaviour of most popular browsers and allows us to parse some broken tags with invalid attribute values like: <IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0> By default, "LIST]" is parsed as a boolean attribute, not as part of the ALT value as was clearly intended. This is also what Mozilla sees. The official behaviour is enabled by enabling this attribute. If enabled, it will cause the tag above to be reported as text since "LIST]" is not a legal attribute name. 
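The difference that "strict_names" makes for the broken tag above can be sketched as follows. The script feeds the same markup to two parsers and only restates what the description says: by default "LIST]" is reported as an attribute name, and with "strict_names" enabled no start event is produced and the markup is reported as text. The variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $html = '<IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>';

# Default mode: the broken tag is still reported as a start event.
my @names;
HTML::Parser->new(
    api_version => 3,
    start_h     => [sub { @names = @{ $_[0] } }, "attrseq"],
)->parse($html)->eof;
print "default attribute names: @names\n";

# strict_names: no start event is expected; the markup arrives as text.
my ($starts, $text) = (0, '');
HTML::Parser->new(
    api_version  => 3,
    strict_names => 1,
    start_h      => [sub { $starts++ }, "tagname"],
    text_h       => [sub { $text .= $_[0] }, "text"],
)->parse($html)->eof;
print "strict_names: $starts start event(s), text: $text\n";
```

The text handler concatenates its input because text may be delivered in more than one piece unless "unbroken_text" is enabled.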
    $p->unbroken_text
    $p->unbroken_text( $bool )
        By default, blocks of text are given to the text handler as soon as
        possible (but the parser takes care always to break text at a
        boundary between whitespace and non-whitespace so single words and
        entities can always be decoded safely). This might create breaks
        that make it hard to do transformations on the text. When this
        attribute is enabled, blocks of text are always reported in one
        piece. This will delay the text event until the following (non-text)
        event has been recognized by the parser.

        Note that the "offset" argspec will give you the offset of the first
        segment of text and "length" is the combined length of the segments.
        Since there might be ignored tags in between, these numbers can't be
        used to directly index in the original document file.

    $p->utf8_mode
    $p->utf8_mode( $bool )
        Enable this option when parsing raw undecoded UTF-8. This tells the
        parser that the entities expanded for strings reported by "attr",
        @attr and "dtext" should be expanded as decoded UTF-8 so they end up
        compatible with the surrounding text.

        If "utf8_mode" is enabled then it is an error to pass strings
        containing characters with code above 255 to the parse() method, and
        the parse() method will croak if you try.

        Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when
        UTF-8 encoded. The character can also be represented by the entity
        "&hearts;" or "&#x2665". If we feed the parser:

            $p->parse("\xE2\x99\xA5&hearts;");

        then "dtext" will be reported as "\xE2\x99\xA5\x{2665}" without
        "utf8_mode" enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
        The latter string is what you want.

        This option is only available with perl-5.8 or better.

    $p->xml_mode
    $p->xml_mode( $bool )
        Enabling this attribute changes the parser to allow some XML
        constructs.
        This enables the behaviour controlled individually by the
        "case_sensitive", "empty_element_tags", "strict_names" and "xml_pic"
        attributes and also suppresses special treatment of elements that
        are parsed as CDATA for HTML.

    $p->xml_pic
    $p->xml_pic( $bool )
        By default, *processing instructions* are terminated by ">". When
        this attribute is enabled, processing instructions are terminated by
        "?>" instead.

    As markup and text are recognized, handlers are invoked. The following
    method is used to set up handlers for different events:

    $p->handler( event => \&subroutine, $argspec )
    $p->handler( event => $method_name, $argspec )
    $p->handler( event => \@accum, $argspec )
    $p->handler( event => "" );
    $p->handler( event => undef );
    $p->handler( event );
        This method assigns a subroutine, method, or array to handle an
        event.

        Event is one of "text", "start", "end", "declaration", "comment",
        "process", "start_document", "end_document" or "default".

        The "\&subroutine" is a reference to a subroutine which is called to
        handle the event.

        The $method_name is the name of a method of $p which is called to
        handle the event.

        The @accum is an array that will hold the event information as
        sub-arrays.

        If the second argument is "", the event is ignored. If it is undef,
        the default handler is invoked for the event.

        The $argspec is a string that describes the information to be
        reported for the event. Any requested information that does not
        apply to a specific event is passed as "undef". If argspec is
        omitted, then it is left unchanged.

        The return value from $p->handler is the old callback routine or a
        reference to the accumulator array.

        Any return values from handler callback routines/methods are always
        ignored. A handler callback can request parsing to be aborted by
        invoking the $p->eof method. A handler callback is not allowed to
        invoke the $p->parse() or $p->parse_file() method. An exception will
        be raised if it tries.
Examples: $p->handler(start => "start", 'self, attr, attrseq, text' ); This causes the "start" method of object $p to be called for 'start' events. The callback signature is "$p->start(\%attr, \@attr_seq, $text)". $p->handler(start => \&start, 'attr, attrseq, text' ); This causes subroutine start() to be called for 'start' events. The callback signature is start(\%attr, \@attr_seq, $text). $p->handler(start => \@accum, '"S", attr, attrseq, text' ); This causes 'start' event information to be saved in @accum. The array elements will be ['S', \%attr, \@attr_seq, $text]. $p->handler(start => ""); This causes 'start' events to be ignored. It also suppresses invocations of any default handler for start events. It is in most cases equivalent to $p->handler(start => sub {}), but is more efficient. It is different from the empty-sub-handler in that "skipped_text" is not reset by it. $p->handler(start => undef); This causes no handler to be associated with start events. If there is a default handler it will be invoked. Filters based on tags can be set up to limit the number of events reported. The main bottleneck during parsing is often the huge number of callbacks made from the parser. Applying filters can improve performance significantly. The following methods control filters: $p->ignore_elements( @tags ) Both the "start" event and the "end" event as well as any events that would be reported in between are suppressed. The ignored elements can contain nested occurrences of itself. Example: $p->ignore_elements(qw(script style)); The "script" and "style" tags will always nest properly since their content is parsed in CDATA mode. For most other tags "ignore_elements" must be used with caution since HTML is often not *well formed*. $p->ignore_tags( @tags ) Any "start" and "end" events involving any of the tags given are suppressed. To reset the filter (i.e. don't suppress any "start" and "end" events), call "ignore_tags" without an argument. 
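A minimal sketch of the "ignore_elements" filter, assuming a document in which only the visible text is wanted. The input string and the $text variable are made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { $text .= $_[0] }, "dtext"],
);

# Suppress script and style elements together with everything inside them.
$p->ignore_elements(qw(script style));

$p->parse('<p>Hello</p><script>var x = "<b>";</script><style>p {}</style> world');
$p->eof;

print "$text\n";
```

Because the filter is applied inside the parser, the text handler is never called for the script or style content, which is where the performance benefit mentioned above comes from.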
$p->report_tags( @tags ) Any "start" and "end" events involving any of the tags *not* given are suppressed. To reset the filter (i.e. report all "start" and "end" events), call "report_tags" without an argument. Internally, the system has two filter lists, one for "report_tags" and one for "ignore_tags", and both filters are applied. This effectively gives "ignore_tags" precedence over "report_tags". Examples: $p->ignore_tags(qw(style)); $p->report_tags(qw(script style)); results in only "script" events being reported. Argspec Argspec is a string containing a comma-separated list that describes the information reported by the event. The following argspec identifier names can be used: "attr" Attr causes a reference to a hash of attribute name/value pairs to be passed. Boolean attributes' values are either the value set by $p->boolean_attribute_value, or the attribute name if no value has been set by $p->boolean_attribute_value. This passes undef except for "start" events. Unless "xml_mode" or "case_sensitive" is enabled, the attribute names are forced to lower case. General entities are decoded in the attribute values and one layer of matching quotes enclosing the attribute values is removed. The Unicode character set is assumed for entity decoding. @attr Basically the same as "attr", but keys and values are passed as individual arguments and the original sequence of the attributes is kept. The parameters passed will be the same as the @attr calculated here: @attr = map { $_ => $attr->{$_} } @$attrseq; assuming $attr and $attrseq here are the hash and array passed as the result of "attr" and "attrseq" argspecs. This passes no values for events besides "start". "attrseq" Attrseq causes a reference to an array of attribute names to be passed. This can be useful if you want to walk the "attr" hash in the original sequence. This passes undef except for "start" events. Unless "xml_mode" or "case_sensitive" is enabled, the attribute names are forced to lower case. 
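The "attr" and "attrseq" argspecs are often requested together so that attributes can be walked in source order. The sketch below assumes a made-up "<INPUT>" tag and collects one line per attribute in the hypothetical @lines array.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @lines;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($attr, $attrseq) = @_;
            # attrseq keeps the source order; attr maps name to value.
            push @lines, "$_=$attr->{$_}" for @$attrseq;
        },
        "attr, attrseq",
    ],
);
$p->parse('<INPUT Type="checkbox" NAME=opt CHECKED>')->eof;

print "$_\n" for @lines;
```

The names come out in lower case because neither "xml_mode" nor "case_sensitive" is enabled, and the boolean attribute "checked" gets its own name as its value because boolean_attribute_value has not been set.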
"column" Column causes the column number of the start of the event to be passed. The first column on a line is 0. "dtext" Dtext causes the decoded text to be passed. General entities are automatically decoded unless the event was inside a CDATA section or was between literal start and end tags ("script", "style", "xmp", "iframe", "title", "textarea" and "plaintext"). The Unicode character set is assumed for entity decoding. With Perl version 5.6 or earlier only the Latin-1 range is supported, and entities for characters outside the range 0..255 are left unchanged. This passes undef except for "text" events. "event" Event causes the event name to be passed. The event name is one of "text", "start", "end", "declaration", "comment", "process", "start_document" or "end_document". "is_cdata" Is_cdata causes a TRUE value to be passed if the event is inside a CDATA section or between literal start and end tags ("script", "style", "xmp", "iframe", "title", "textarea" and "plaintext"). if the flag is FALSE for a text event, then you should normally either use "dtext" or decode the entities yourself before the text is processed further. "length" Length causes the number of bytes of the source text of the event to be passed. "line" Line causes the line number of the start of the event to be passed. The first line in the document is 1. Line counting doesn't start until at least one handler requests this value to be reported. "offset" Offset causes the byte position in the HTML document of the start of the event to be passed. The first byte in the document has offset 0. "offset_end" Offset_end causes the byte position in the HTML document of the end of the event to be passed. This is the same as "offset" + "length". "self" Self causes the current object to be passed to the handler. If the handler is a method, this must be the first element in the argspec. An alternative to passing self as an argspec is to register closures that capture $self by themselves as handlers. 
        Unfortunately this creates circular references which prevent the HTML::Parser object from being garbage collected. Using the "self" argspec avoids this problem.

    "skipped_text"
        Skipped_text returns the concatenated text of all the events that have been skipped since the last time an event was reported. Events might be skipped because no handler is registered for them or because some filter applies. Skipped text also includes marked section markup, since there are no events that can catch it.

        If an ""-handler is registered for an event, then the text for this event is not included in "skipped_text". Skipped text both before and after the ""-event is included in the next reported "skipped_text".

    "tag"
        Same as "tagname", but prefixed with "/" if it belongs to an "end" event and "!" for a declaration. The "tag" does not have any prefix for "start" events, and is in this case identical to "tagname".

    "tagname"
        This is the element name (or *generic identifier* in SGML jargon) for start and end tags. Since HTML is case insensitive, this name is forced to lower case to ease string matching.

        Since XML is case sensitive, the tagname case is not changed when "xml_mode" is enabled. The same happens if the "case_sensitive" attribute is set.

        The declaration type of declaration elements is also passed as a tagname, even if that is a bit strange. In fact, in the current implementation tagname is identical to "token0" except that the name may be forced to lower case.

    "token0"
        Token0 causes the original text of the first token string to be passed. This should always be the same as $tokens->[0].

        For "declaration" events, this is the declaration type.

        For "start" and "end" events, this is the tag name.

        For "process" and non-strict "comment" events, this is everything inside the tag.

        This passes undef if there are no tokens in the event.

    "tokenpos"
        Tokenpos causes a reference to an array of token positions to be passed.
        For each string that appears in "tokens", this array contains two numbers. The first number is the offset of the start of the token in the original "text" and the second number is the length of the token.

        Boolean attributes in a "start" event will have (0,0) for the attribute value offset and length.

        This passes undef if there are no tokens in the event (e.g., "text") and for artificial "end" events triggered by empty element tags.

        If you are using these offsets and lengths to modify "text", you should either work from right to left, or be very careful to calculate the changes to the offsets.

    "tokens"
        Tokens causes a reference to an array of token strings to be passed. The strings are exactly as they were found in the original text, no decoding or case changes are applied.

        For "declaration" events, the array contains each word, comment, and delimited string starting with the declaration type.

        For "comment" events, this contains each sub-comment. If $p->strict_comments is disabled, there will be only one sub-comment.

        For "start" events, this contains the original tag name followed by the attribute name/value pairs. The values of boolean attributes will be either the value set by $p->boolean_attribute_value, or the attribute name if no value has been set by $p->boolean_attribute_value.

        For "end" events, this contains the original tag name (always one token).

        For "process" events, this contains the process instructions (always one token).

        This passes "undef" for "text" events.

    "text"
        Text causes the source text (including markup element delimiters) to be passed.

    "undef"
        Pass an undefined value. Useful as padding where the same handler routine is registered for multiple events.

    '...'
        A literal string of 0 to 255 characters enclosed in single (') or double (") quotes is passed as entered.

    The whole argspec string can be wrapped up in '@{...}' to signal that the resulting event array should be flattened.
    This only makes a difference if an array reference is used as the handler target. Consider this example:

        $p->handler(text => [], 'text');
        $p->handler(text => [], '@{text}');

    With two text events, "foo" and "bar", the first example will end up with [["foo"], ["bar"]] and the second with ["foo", "bar"] in the handler target array.

  Events
    Handlers for the following events can be registered:

    "comment"
        This event is triggered when a markup comment is recognized.

        Example:

            <!-- This is a comment -- -- So is this -->

    "declaration"
        This event is triggered when a *markup declaration* is recognized.

        For typical HTML documents, the only declaration you are likely to find is <!DOCTYPE ...>.

        Example:

            <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                "http://www.w3.org/TR/html4/strict.dtd">

        DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.

    "default"
        This event is triggered for events that do not have a specific handler. You can set up a handler for this event to catch stuff you did not want to catch explicitly.

    "end"
        This event is triggered when an end tag is recognized.

        Example:

            </A>

    "end_document"
        This event is triggered when $p->eof is called and after any remaining text is flushed. There is no document text associated with this event.

    "process"
        This event is triggered when processing instruction markup is recognized.

        The format and content of processing instructions are system and application dependent.

        Examples:

            <? HTML processing instructions >
            <? XML processing instructions ?>

    "start"
        This event is triggered when a start tag is recognized.

        Example:

            <A HREF="http://www.perl.com/">

    "start_document"
        This event is triggered before any other events for a new document. A handler for it can be used to initialize stuff. There is no document text associated with this event.

    "text"
        This event is triggered when plain text (characters) is recognized. The text may contain multiple lines.
        A sequence of text may be broken between several text events unless $p->unbroken_text is enabled.

        The parser will make sure that it does not break a word or a sequence of whitespace between two text events.

  Unicode
    "HTML::Parser" can parse Unicode strings when running under perl-5.8 or better. If Unicode is passed to $p->parse() then chunks of Unicode will be reported to the handlers. The offset and length argspecs will also report their position in terms of characters.

    It is safe to parse raw undecoded UTF-8 if you either avoid decoding entities and make sure to not use *argspecs* that do, or enable the "utf8_mode" for the parser. Parsing of undecoded UTF-8 might be useful when parsing from a file where you need the reported offsets and lengths to match the byte offsets in the file.

    If a filename is passed to $p->parse_file() then the file will be read in binary mode. This will be fine if the file contains only ASCII or Latin-1 characters. If the file contains UTF-8 encoded text then care must be taken when decoding entities as described in the previous paragraph, but better is to open the file with the UTF-8 layer so that it is decoded properly:

        open(my $fh, "<:utf8", "index.html") || die "...: $!";
        $p->parse_file($fh);

    If the file contains text encoded in a charset besides ASCII, Latin-1 or UTF-8 then decoding will always be needed.

VERSION 2 COMPATIBILITY
    When an "HTML::Parser" object is constructed with no arguments, a set of handlers is automatically provided that is compatible with the old HTML::Parser version 2 callback methods.
    This is equivalent to the following method calls:

        $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
        $p->handler(end     => "end",     "self, tagname, text");
        $p->handler(text    => "text",    "self, text, is_cdata");
        $p->handler(process => "process", "self, token0, text");
        $p->handler(
            comment => sub {
                my($self, $tokens) = @_;
                for (@$tokens) {$self->comment($_);}
            },
            "self, tokens"
        );
        $p->handler(
            declaration => sub {
                my $self = shift;
                $self->declaration(substr($_[0], 2, -1));
            },
            "self, text"
        );

    Setting up these handlers can also be requested with the "api_version => 2" constructor option.

SUBCLASSING
    The "HTML::Parser" class is able to be subclassed. Parser objects are plain hashes and "HTML::Parser" reserves only hash keys that start with "_hparser". The parser state can be set up by invoking the init() method, which takes the same arguments as new().

EXAMPLES
    The first simple example shows how you might strip out comments from an HTML document. We achieve this by setting up a comment handler that does nothing and a default handler that will print out anything else:

        use HTML::Parser;
        HTML::Parser->new(
            default_h => [sub { print shift }, 'text'],
            comment_h => [""],
        )->parse_file(shift || die) || die $!;

    An alternative implementation is:

        use HTML::Parser;
        HTML::Parser->new(
            end_document_h => [sub { print shift }, 'skipped_text'],
            comment_h      => [""],
        )->parse_file(shift || die) || die $!;

    This will in most cases be much more efficient since only a single callback will be made.

    The next example prints out the text that is inside the <title> element of an HTML document. Here we start by setting up a start handler.
    When it sees the title start tag it enables a text handler that prints any text found and an end handler that will terminate parsing as soon as the title end tag is seen:

        use HTML::Parser ();

        sub start_handler {
            return if shift ne "title";
            my $self = shift;
            $self->handler(text => sub { print shift }, "dtext");
            $self->handler(
                end => sub {
                    shift->eof if shift eq "title";
                },
                "tagname,self"
            );
        }

        my $p = HTML::Parser->new(api_version => 3);
        $p->handler(start => \&start_handler, "tagname,self");
        $p->parse_file(shift || die) || die $!;
        print "\n";

    More examples are found in the eg/ directory of the "HTML-Parser" distribution: the program "hrefsub" shows how you can edit all links found in a document; the program "htextsub" shows how to edit the text only; the program "hstrip" shows how you can strip out certain tags/elements and/or attributes; and the program "htext" shows how to obtain the plain text, but not any script/style content.

    You can browse the eg/ directory online from the *[Browse]* link on the http://search.cpan.org/~gaas/HTML-Parser/ page.

BUGS
    The <style> and <script> sections do not end with the first "</", but need the complete corresponding end tag. The standard behaviour is not really practical.

    When the *strict_comment* option is enabled, we still recognize comments where there is something other than whitespace between even and odd "--" markers.

    Once $p->boolean_attribute_value has been set, there is no way to restore the default behaviour.

    There is currently no way to get both quote characters into the same literal argspec.

    Empty tags, e.g. "<>" and "</>", are not recognized. SGML allows them to repeat the previous start tag or close the previous start tag respectively.

    NET tags, e.g. "code/.../" are not recognized. This is SGML shorthand for "<code>...</code>".

    Incomplete start or end tags, e.g. "<tt<b>...</b</tt>" are not recognized.

DIAGNOSTICS
    The following messages may be produced by HTML::Parser.
    The notation in this listing is the same as used in perldiag:

    Not a reference to a hash
        (F) The object blessed into or subclassed from HTML::Parser is not a hash as required by the HTML::Parser methods.

    Bad signature in parser state object at %p
        (F) The _hparser_xs_state element does not refer to a valid state structure. Something must have changed the internal value stored in this hash element, or the memory has been overwritten.

    _hparser_xs_state element is not a reference
        (F) The _hparser_xs_state element has been destroyed.

    Can't find '_hparser_xs_state' element in HTML::Parser hash
        (F) The _hparser_xs_state element is missing from the parser hash. It was either deleted, or not created when the object was created.

    API version %s not supported by HTML::Parser %s
        (F) The constructor option 'api_version' with an argument greater than or equal to 4 is reserved for future extensions.

    Bad constructor option '%s'
        (F) An unknown constructor option key was passed to the new() or init() methods.

    Parse loop not allowed
        (F) A handler invoked the parse() or parse_file() method. This is not permitted.

    marked sections not supported
        (F) The $p->marked_sections() method was invoked in a HTML::Parser module that was compiled without support for marked sections.

    Unknown boolean attribute (%d)
        (F) Something is wrong with the internal logic that set up aliases for boolean attributes.

    Only code or array references allowed as handler
        (F) The second argument for $p->handler must be either a subroutine reference, the name of a subroutine or method, or a reference to an array.

    No handler for %s events
        (F) The first argument to $p->handler must be a valid event name; i.e. one of "start", "end", "text", "process", "declaration" or "comment".

    Unrecognized identifier %s in argspec
        (F) The identifier is not a known argspec name. Use one of the names mentioned in the argspec section above.

    Literal string is longer than 255 chars in argspec
        (F) The current implementation limits the length of literals in an argspec to 255 characters. Make the literal shorter.

    Backslash reserved for literal string in argspec
        (F) The backslash character "\" is not allowed in argspec literals. It is reserved to permit quoting inside a literal in a later version.

    Unterminated literal string in argspec
        (F) The terminating quote character for a literal was not found.

    Bad argspec (%s)
        (F) Only identifier names, literals, spaces and commas are allowed in argspecs.

    Missing comma separator in argspec
        (F) Identifiers in an argspec must be separated with ",".

    Parsing of undecoded UTF-8 will give garbage when decoding entities
        (W) The first chunk parsed appears to contain undecoded UTF-8 and one or more argspecs that decode entities are used for the callback handlers.

        The result of decoding will be a mix of encoded and decoded characters for any entities that expand to characters with code above 127. This is not a good thing.

        The recommended solution is to apply Encode::decode_utf8() on the data before feeding it to the $p->parse(). For $p->parse_file() pass a file that has been opened in ":utf8" mode.

        The alternative solution is to enable the "utf8_mode" and not decode before passing strings to $p->parse(). The parser can process raw undecoded UTF-8 sanely if the "utf8_mode" is enabled, or if the "attr", @attr or "dtext" argspecs are avoided.

    Parsing string decoded with wrong endian selection
        (W) The first character in the document is U+FFFE. This is not a legal Unicode character but a byte swapped "BOM". The result of parsing will likely be garbage.

    Parsing of undecoded UTF-32
        (W) The parser found the Unicode UTF-32 "BOM" signature at the start of the document. The result of parsing will likely be garbage.

    Parsing of undecoded UTF-16
        (W) The parser found the Unicode UTF-16 "BOM" signature at the start of the document. The result of parsing will likely be garbage.
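
The recommended fix for the UTF-8 warning above can be sketched as follows: decode the bytes with Encode::decode_utf8() before handing them to $p->parse(), so that literal characters and decoded entities end up in the same character representation. The sample input and the variable names ($bytes, @text, $decoded) are assumptions chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser ();

# Hypothetical raw input: "\xC3\xA9" is the UTF-8 encoding of U+00E9,
# and &euro; is an entity that expands to U+20AC (above 127).
my $bytes = "<p>caf\xC3\xA9 &euro;5</p>";

my @text;
my $p = HTML::Parser->new(
    api_version => 3,
    # "dtext" is one of the argspecs that decodes entities.
    text_h => [ sub { push @text, shift }, 'dtext' ],
);

# Decode the bytes to characters first, then parse.
$p->parse( decode_utf8($bytes) );
$p->eof;

# Join the pieces: text may arrive in more than one event.
my $decoded = join '', @text;

# $decoded now holds 7 characters: "caf", U+00E9, a space, U+20AC, "5".
```

The pieces are joined because a run of text may be delivered in several text events unless $p->unbroken_text is enabled.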

SEE ALSO
    HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser, HTML::LinkExtor, HTML::Form

    HTML::TreeBuilder (part of the *HTML-Tree* distribution)

    <http://www.w3.org/TR/html4/>

    More information about marked sections and processing instructions may be found at <http://www.is-thought.co.uk/book/sgml-8.htm>.

COPYRIGHT
    Copyright 1996-2016 Gisle Aas. All rights reserved.
    Copyright 1999-2000 Michael A. Chase. All rights reserved.

    This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.