Not To Be Scrapped: Web Scraping Is Here To Stay
People may typically resort to the plain copy-and-paste procedure when collecting information online, but when dealing with voluminous data, web scraping is the way to do it.
Consider a probable scenario mentioned by computer science website GeeksforGeeks.
If you’re looking for information about a former American president, for example, you may go to Wikipedia to obtain it. As you go over the content, you can simply copy and paste the text you need, and that’s it.
In the case of “large amounts of information,” however, GeeksforGeeks strongly stated that such an approach “will not work!”
Likewise, what if the copy-and-paste task entails doing it “1,000 or more times,” as asked by Shubham Prasad on Quora?
What GeeksforGeeks and Prasad, an Indian digital marketing professional with experience in web scraping, are pointing out is that time is an important factor to consider in data collection.
To collect tons of information “as quickly as possible,” according to GeeksforGeeks, web scraping is the method deemed fit to accomplish such an enormous task.
Data scraping, web data extraction, and web harvesting are alternative terms to web scraping.
Generally speaking, Techopedia says that web scraping refers to a method “used to collect information from across the Internet.”
Specifically, web scraping is the automated process of extracting “structured web data from any public website,” according to Zyte (formerly Scrapinghub).
This “extraction of data from a website,” said ParseHub, can also be done manually. However, as emphasized earlier, when it comes to data collection, time is of the essence because in business, time is money.
So, doing it automatically is preferred over doing it manually.
ParseHub gives an idea on how web scraping is done. As an example, it chose to collect information regarding couches sold by international furniture and home accessories company IKEA.
ParseHub’s screenshots compare the two approaches: done manually, the collected data has to be typed into Excel row by row; done automatically, the same structured table is produced in a fraction of the time.
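To make the idea concrete, here is a minimal sketch of what the automated approach looks like in code, using only Python’s standard-library `html.parser`. The sample HTML, the `product`/`name`/`price` class names, and the product names are made-up stand-ins for a listing page like the IKEA one ParseHub describes; a real scraper would first fetch the page over HTTP.

```python
from html.parser import HTMLParser

# Made-up stand-in for a product listing page; a real scraper would
# download the page first (e.g., with urllib.request) and feed it in.
SAMPLE_HTML = """
<div class="product"><span class="name">KLIPPAN</span><span class="price">$399</span></div>
<div class="product"><span class="name">EKTORP</span><span class="price">$599</span></div>
"""

class ProductParser(HTMLParser):
    """Collects (name, price) pairs from spans tagged with those classes."""

    def __init__(self):
        super().__init__()
        self.field = None      # which span we are currently inside
        self.rows = []         # collected (name, price) pairs
        self._current = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("name", "price"):
            self.field = css_class

    def handle_data(self, data):
        if self.field:
            self._current[self.field] = data.strip()
            if "name" in self._current and "price" in self._current:
                self.rows.append((self._current["name"], self._current["price"]))
                self._current = {}
            self.field = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)  # → [('KLIPPAN', '$399'), ('EKTORP', '$599')]
```

Each tuple in `rows` corresponds to one spreadsheet row; once the data is structured like this, writing it out to CSV or Excel is a one-liner.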
According to Sequentum, an award-winning New York-based software development company, web scraping can be both easy and difficult.
It’s easy if the data will be extracted from:
(a) static websites (websites whose HTML-coded content is fixed; it doesn’t change, or remains “static,” for all website users/viewers, according to Canadian marketing agency H&C)
The level of difficulty increases as the amount of data to be collected grows, and also when you extract the data from:
(a) dynamic websites (websites where contents and information change every time the database gets updated, according to H&C)
(b) “complex websites” (those where visitor and website interaction can happen “beyond a simple information request form,” according to Arizona-based business solutions company Lodestone Systems)
Examples of complex websites include e-commerce sites; content-heavy, information-based, magazine-style, and service-oriented websites; and websites about celebrities, fashion, gaming, news, and videos.
(c) “non-HTML content”
(d) “websites that use deterrents”
One good example of a deterrent is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). It distinguishes real users (actual human beings) from automated users (e.g., bots).
Per Christensson, the creator of TechTerms.com, said that bots are used in web scraping (which explains why some websites use CAPTCHA).
Christensson also mentioned that manual web scraping can be carried out through the “File – Save As” command and copy-paste.
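Before a bot runs into deterrents like CAPTCHA, a well-behaved scraper can first check whether the site even allows automated access, via its robots.txt file. The sketch below uses Python’s standard-library `urllib.robotparser`; the rules and URLs are invented for illustration, and a real bot would load the live file with `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical robots.txt rules; a real scraper would fetch them with
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

print(rp.can_fetch("*", "https://example.com/products"))      # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

Checking `can_fetch()` before each request is a simple courtesy that keeps a scraper on the right side of a site’s stated access policy.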
Automated web scraping uses software tools such as the following (in alphabetical order):
(1) Bright Data (formerly Luminati)
(6) Scraper API
In his Wired article “Beyond the Information Age” (2014), Professor Julian Birkinshaw mentioned how, for decades, companies have participated in “harnessing information and knowledge.”
According to Prof. Birkinshaw, who teaches Strategy and Entrepreneurship at the London Business School, companies do it to cut operating costs, improve their products, stay afloat, and remain competitive.
Web scraping is generally used to collect prices (for monitoring and comparison), website content, and statistics to be analyzed for marketing and for research and development.
Indian software company Juppiter AI Labs enumerated the following activities/businesses/fields where web scraping is applied:
(a) academic research
(c) international news
(d) real estate
(e) sports betting
Techopedia mentioned auction details and weather reports as two examples of information that some companies find useful for their business operation.
Online education platform Edureka added that web scraping is used to collect email addresses and job listings. Social media sites are targets, too, for spotting trending topics.
Investment companies, according to Zyte, go for news scraping for their investment plans and strategies.
So, if web scraping is so useful, then why did it “become uncool” in 2021, according to ScrapeOps, a web scraping DevOps tool?
By “uncool,” ScrapeOps means that web scraping lost its popularity.
ScrapeOps cited two main reasons: the increasing number of companies that provide data (so no need for other firms to scrape data) and legal issues.
In the U.S., “no law directly applies to web scraping,” according to Atty. Kieran McCarthy, founder of Colorado-based McCarthy Garber Law.
That explains why, according to him, lawyers turn to “using judicial frameworks designed for other purposes” when dealing with cases on web scraping.
Worse, Atty. McCarthy doubts that Congress will draft some web scraping laws anytime soon.
ScrapeOps thinks legalities concerning web scraping became “a little less gray” because of the outcome of the 2021 case Van Buren v. United States: for the U.S. Supreme Court, web scraping didn’t violate the U.S. Computer Fraud and Abuse Act (CFAA).
However, in his Hacker News comment, Atty. McCarthy (user “KieranMac”) disagreed with ScrapeOps, saying that it’s actually “a darker shade of gray.”
In his opinion, the path towards lawsuit-free web scraping could be navigable; nonetheless, Atty. McCarthy cautioned that it could still get “tricky.”
We could presume that companies will continue to collect and compile information through web scraping for their day-to-day operations.
In 2021, per FinancesOnline, the volume of data created daily reached 1.134 trillion MB.
ScrapeOps predicted that web scraping will overcome the usual challenges, as it has before; thus, it concluded that the “future is looking bright” for it.
Still, remember Atty. McCarthy’s words: companies and individuals should remain vigilant in their web scraping practices.