MINING THE ‘DEEP WEB’ WITH SPECIALIZED DRILLS

January 26, 2001

SCIENCE AND ENGINEERING NEWS

Lisa Guernsey reported for The NY Times: Two weeks ago, online newspapers and magazines were buzzing with news about Linda Chavez, President Bush’s first choice for labor secretary.

But from the results coming up in most popular search engines, you would never have known it. Instead of retrieving articles about an illegal immigrant who had lived in Ms. Chavez’s home, a Google search on “chavez” led to several encyclopedia entries on Cesar Chavez, the American labor leader and advocate of farmworkers’ rights.

Lycos turned up several Web sites with information about Eric Chavez, an Oakland A’s third baseman. On Alta Vista, some of the first results linked to Ms. Chavez’s old columns for an online magazine, but none of the links provided even a hint of the fact that she had become front-page news.

“I don’t see anything that anyone would feel is relevant to her given the context of this past week,” said Danny Sullivan, the editor of SearchEngineWatch.com, as he typed “chavez” into other search engines.

His demonstration illustrated a problem that has long been apparent to anyone casting about for online news reports: search engines can be pitifully inadequate, partly because they rely on Web-page indexes that were compiled weeks before. It is not just timely material that seems to escape their reach. Pages deep within Web sites are also often missed, as are multimedia files, bibliographies, the bits of information in databases and pages that come in P.D.F., Adobe’s portable document format.

In fact, traditional search engines have access to only a fraction of 1 percent of what exists on the Web. As many as 500 billion pieces of content are hidden from the view of those search engines, according to BrightPlanet.com, a search company that has tried to tally them. To many search experts, this is the “invisible Web.” BrightPlanet prefers the term “deep Web,” an online frontier that it estimates may be 500 times larger than the surface Web that search engines try to cover. And that uncharted territory does not include Web pages that are behind firewalls or part of intranets.

To dig deeper into the Web, a new breed of search engine has cropped up that takes a different approach to Web page retrieval. Instead of broadly scanning the Web by indexing pages from any links they can find, these search engines are devoted to drilling further into specialty areas–medical sites, legal documents, even Web pages dedicated to jokes and parody. Looking for timely financial data? Try FinancialFind.com. Seeking sketches of molecular structures or even scientific humor? Biolinks.com may help.

“Instead of grabbing everything on the Web and then trying to deal with this big mess,” Mr. Sullivan said, these boutique search engines have decided to do some filtering. “They may say, we’ll pick 40 sites that we know are related to this topic,” he said. “And that means you won’t get these irrelevant search results.”

Some search engines go even further, sending out finely tuned software agents, or bots, that learn not only which pages to search, but also what information to grab from those pages. Either way, the theory is the same: The smaller the haystack, the better chance of finding the needle.

Finding those smaller haystacks can be a challenge in itself. It is the same problem faced by patrons who walk into a library, said Gary Price, a librarian at George Washington University and co-author of the forthcoming book “The Invisible Web” (CyberAge Books). People may know to come to the library, but they probably do not know which reference books to pull off the shelf. Of course, in such cases, patrons can at least consult a reference librarian. On the Web, people are usually fending for themselves.

“The end user should have a better idea of all the different options that exist,” Mr. Price said. “But this is easier said than done.”

Lately, however, a few specialty search engines have been popping up on lists of most-visited Web sites–evidence that people are learning to find them. MySimon, a service that specializes in culling product prices and information across 2,500 shopping sites, is one of the most popular. In December, the site attracted 5 million unique visitors, a huge increase from its 1.9 million visitors a year before, according to Jupiter Media Metrix, an Internet research firm. FindLaw.com, a search engine and Web-based directory of legal information, has as many as 900,000 visitors a month.

Moreover.com, a site that opened in 1999 with a search engine that gathers headlines from 1,800 online news sources, has also appeared on Jupiter Media Metrix’s reports of Web use, which track only sites with at least 200,000 visitors a month. Last month, about 340,000 people visited Moreover.com’s pages–and that is without any consumer marketing from the company, which offers the search engine free as a teaser for businesses that might buy its search software.

Like most specialty search engines, Moreover manages to find those news stories because its bots have been designed to hunt for only specific pages within a specific realm of the Web. They are like sniffing dogs that have been given a whiff of a scent and are taught to disregard everything else. Font tags in the source code underlying the Web page, for example, are a giveaway. Between 6 and 18 words in large type near the top of a Web page look a lot like headlines. In most cases they are, and the site’s bots retrieve them, using the headline as the link in the list of search results.
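The headline rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Moreover's actual code: the class name HeadlineFinder, the function find_headlines and the size threshold of 5 are all assumptions, and the sketch does not model the "near the top of the page" condition.

```python
from html.parser import HTMLParser


class HeadlineFinder(HTMLParser):
    """Collect text in large <font> tags that is 6 to 18 words long.

    Hypothetical sketch: the size threshold is an assumption, and
    position on the page is not taken into account.
    """

    def __init__(self, min_size=5, min_words=6, max_words=18):
        super().__init__()
        self.min_size = min_size
        self.min_words = min_words
        self.max_words = max_words
        self._stack = []      # one flag per open <font>: is it large?
        self._buffer = []     # text seen while inside a large <font>
        self.candidates = []  # strings that look like headlines

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        size = dict(attrs).get("size") or ""
        # Relative sizes such as "+2" are treated as not large here.
        self._stack.append(size.isdigit() and int(size) >= self.min_size)

    def handle_data(self, data):
        if any(self._stack):
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "font" or not self._stack:
            return
        was_large = self._stack.pop()
        if was_large and not any(self._stack):
            # Leaving the outermost large <font>: judge the collected text.
            text = " ".join("".join(self._buffer).split())
            self._buffer = []
            if self.min_words <= len(text.split()) <= self.max_words:
                self.candidates.append(text)


def find_headlines(html):
    """Return headline-like strings found in an HTML document."""
    finder = HeadlineFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

A short string in large type, such as a one-word section label, falls below the six-word floor and is ignored, which is the point of the word-count window.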

Once in a while, however, those supposed headlines turn out to be something else, like a copyright disclaimer page. So to filter further, Moreover’s spiderlike bots learn the structure of the Web address, noting which words and numbers show up between the slashes. If an address ends with the word “copyright,” a bot may decide to disregard that page. Similar rules are used to categorize the news articles so that people can narrow their searches before even entering a search term. “Our spiders are very good readers,” said Nick Denton, Moreover’s chief executive.
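The address-based filtering can be sketched the same way. The names below (SKIP_WORDS, path_segments, should_skip, categorize) are hypothetical, and the category table is whatever mapping a caller supplies; the article itself names only the word "copyright" as an example of a page to disregard.

```python
from urllib.parse import urlparse

# Words that mark a page to ignore when they end the address.
# Only "copyright" comes from the article; the set is an assumption.
SKIP_WORDS = {"copyright"}


def path_segments(url):
    """Return the lowercased words and numbers between the slashes."""
    path = urlparse(url).path
    return [segment.lower() for segment in path.split("/") if segment]


def should_skip(url):
    """True if the address ends with a word such as 'copyright'."""
    segments = path_segments(url)
    if not segments:
        return False
    # Drop a trailing extension such as ".html" before comparing.
    last_word = segments[-1].rsplit(".", 1)[0]
    return last_word in SKIP_WORDS


def categorize(url, categories):
    """Return the category of the first path segment found in `categories`."""
    for segment in path_segments(url):
        if segment in categories:
            return categories[segment]
    return None
```

The same segment list serves both purposes: the last segment decides whether a page is discarded, and any earlier segment can assign the article to a category.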

MySimon also employs bots that are designed to hunt for very specific information. But first the bots must watch the click-through routines of MySimon employees who have learned the ins and outs of particular online shops–like exactly which pages typically provide prices, sizes or shipping fees. Once trained, the bots follow those paths themselves, prowling shops for information to put into databases and then display online. For example, one bot is assigned to Amazon.com’s bookshelves; another is assigned to its electronics merchandise.

“What we’re doing is teaching our agents to shop on behalf of consumers,” said Josh Goldman, president of MySimon.

Meanwhile, general search engines have also decided to offer smaller fields for foraging. Northern Light has a news search service that searches a two-week archive of articles on 56 news wires. It also offers a “geosearch” service that allows people to look for businesses based within a few miles of a given address. Google recently opened an “Uncle Sam” area, where people can search for governmental material.

Services that limit searches to audio or video files–typically found under the heading “multimedia search”–are now offered on sites like Alta Vista, Excite and Lycos. And shopping search engines are linked from almost all of the major search sites.

But again, many Web users do not know that the narrow searching tools exist. So reference librarians and library Web sites are now directing their patrons to those areas on the Web. Mr. Sullivan, Mr. Price and Chris Sherman, a search guide on About.com who is working with Mr. Price on the “Invisible Web” book, are among the several information-retrieval experts who have built online directories of specific search sites. Another tool is the LexiBot, a downloadable program designed by BrightPlanet to demonstrate the search technology it sells to businesses. The LexiBot, which costs $89.95 but is free for the first 30 days, gathers information simultaneously from 600 search sites and databases–including the databases that form the basis of specialty search engines.

The harder part may be to change people’s behavior. All the boutique search engines in the world will not alter the fact that the majority of Web surfers are still inclined to type a single keyword into a huge, general search engine and hope for the best. The thought of narrowing a search–by either going to a specialty search page or clicking through a menu of choices on a general search site–does not seem to occur to most users, Mr. Sullivan said.

He poses this challenge to the major search sites: Wouldn’t search engines be more helpful if they would automatically narrow a search without requiring their users to make that realization on their own?

“Can you automatically detect what database to search,” he asked in posing his challenge, “based on what people have typed in?” During the second week of January, for example, perhaps a search engine could have been directed to steer people to news sites whenever they typed in words that made headlines, like “chavez.”
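One way to picture Mr. Sullivan's suggestion is a small routing step that sends a query to a news database when it contains a word currently common in headlines. The sketch below is an assumption-laden illustration, not a description of any search engine's method: the function names, the stopword list and the threshold of three headlines are all invented for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def headline_terms(headlines, min_count=3):
    """Return words that appear in at least `min_count` recent headlines."""
    counts = Counter()
    for headline in headlines:
        words = set(re.findall(r"[a-z]+", headline.lower()))
        counts.update(words - STOPWORDS)
    return {word for word, count in counts.items() if count >= min_count}


def choose_database(query, hot_terms, default="web"):
    """Route to 'news' if the query shares a word with current headlines."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    return "news" if query_words & hot_terms else default
```

Under this scheme the routing changes as the headlines change: once a name drops out of the recent headlines, the same query falls back to the general index.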

A few search engines have tried to take that step, with mixed results. For example, when Mr. Sullivan typed “chavez” into the search box at Ask Jeeves earlier this month, the site pointed to a recent news story–a link provided by Ask Jeeves’ editors who were assembling information about potential members of a Bush cabinet. Using the same search a few weeks later, the news reports were nowhere to be found. (Paul Stroube, the company’s vice president for Web production, said that the news link disappeared because Ms. Chavez was taken off Ask Jeeves’ list of President Bush’s nominees.)

Unless the big search engines get better at delivering timely information, searchers might be better off with Moreover.com and other news-oriented search services. With those, Mr. Sullivan has found success. Two weeks ago, in a Moreover search using the word “chavez,” more than 30 relevant stories appeared, at least half of which had been posted that day.
