Needlebase provides a point-and-click interface for extracting
structured information from web pages. As a user, you select elements on
an example page that contain the data you’re interested in, and the tool
then uses the patterns you’ve defined to pull out information from other
pages on a site with a similar structure. For example, you might want to
extract product names and prices from a shopping site. With the tool, you
could find a single product page, select the product name and price, and
then the same elements would be pulled for every other page it crawled
from the site. It relies on the fact that most web pages are generated by
combining templates with information retrieved from a database, and so
have a very consistent structure.

Once you’ve gathered the data, it offers some features that are a
bit like Google Refine’s for de-duplicating and cleaning up the data. All
in all, it’s a very powerful tool for turning web content into structured
information, with a very approachable interface.
from Big Data Glossary by Pete Warden
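
To make the template idea in the excerpt concrete, here is a minimal sketch in Python. It is not Needlebase's implementation and does not come from the book. It assumes that a "pattern" learned from one example page can be reduced to a tag name plus a CSS class, and that duplicates can be spotted by normalizing the product name, which is far simpler than what Google Refine offers. The sample pages, the class names `product-name` and `price`, and the helpers `FieldExtractor`, `extract`, and `dedupe` are all hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of the first element matching each (tag, class) pattern."""

    def __init__(self, patterns):
        super().__init__()
        self.patterns = patterns   # field name -> (tag, class), hypothetical format
        self.record = {}           # field name -> list of text chunks
        self._field = None         # field currently being captured, if any
        self._tag = None           # tag name of the element being captured
        self._depth = 0            # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Track nested elements with the same tag so we stop at the right close.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (ptag, pclass) in self.patterns.items():
            if tag == ptag and pclass in classes and field not in self.record:
                self._field, self._tag, self._depth = field, tag, 1
                self.record[field] = []
                break

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field].append(data)


def extract(html, patterns):
    """Apply the same patterns to one page and return a dict of cleaned fields."""
    parser = FieldExtractor(patterns)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so values from different pages are comparable.
    return {k: " ".join("".join(v).split()) for k, v in parser.record.items()}


def dedupe(records, key="name"):
    """Keep the first record for each case-insensitive value of `key`."""
    seen, unique = set(), []
    for rec in records:
        marker = rec.get(key, "").casefold()
        if marker not in seen:
            seen.add(marker)
            unique.append(rec)
    return unique


# Three made-up pages generated from the same template.
PAGES = [
    '<html><body><h1 class="product-name">Blue Widget</h1>'
    '<span class="price">$9.99</span></body></html>',
    '<html><body><h1 class="product-name">Red Widget</h1>'
    '<span class="price">$12.50</span></body></html>',
    '<html><body><h1 class="product-name">  blue widget </h1>'
    '<span class="price">$9.99</span></body></html>',
]

# The "selection" made on the example page, expressed as tag + class.
PATTERNS = {"name": ("h1", "product-name"), "price": ("span", "price")}

records = [extract(page, PATTERNS) for page in PAGES]
unique = dedupe(records)
print(unique)
```

Because every page comes from the same template, the selection made once in `PATTERNS` is enough to pull the name and price from all of them. Real sites would need more robust patterns, for example full element paths, and a real HTML parsing library.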