Gold mine of real-time insights from the internet

Computer expert G.K. Kulatilleke (BSc Eng. (Computer), MSc (Networking), MSc (Data Science), ACMA, CGMA) speaks about the automated extraction of web-based information, which is no longer an option but a must for a competitive business in today's world.

Q. What is web scraping?

A. Web scraping, web harvesting, or web data extraction refers to the automated extraction of web-based information. Web-based information consists of indicators such as search queries, clicks on specific pages, commercial information (prices, rates) and text posted online. ‘Scraping’ involves sending requests to a website’s server in much the same way a standard internet browser does – however, instead of displaying the information, the scraping tool uses the code behind the website to pick out useful information.
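The idea of "picking out useful information from the code behind the website" can be sketched in a few lines. The snippet below, using only Python's standard library, walks a small invented HTML fragment and collects the text inside elements tagged with a hypothetical `price` class; the page markup and class name are assumptions for illustration, not any real site's layout.

```python
from html.parser import HTMLParser

# A minimal scraper core: instead of rendering the page, walk the HTML
# and collect the text inside elements marked with a "price" class.
# The HTML snippet and the class name are illustrative only.
class PriceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

html = ('<ul><li>Rice <span class="price">Rs. 210</span></li>'
        '<li>Sugar <span class="price">Rs. 245</span></li></ul>')

parser = PriceExtractor()
parser.feed(html)
print(parser.prices)  # the extracted values, not the rendered page
```

In a real scraper the HTML string would come from an HTTP request to the target site, exactly as a browser would fetch it; the extraction step afterwards is what distinguishes scraping from browsing.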

One of the main benefits of scraping is its ability to generate information and insights from (near) real-time data, since there is no data collection lag. Data is obtained directly from the source, with less possibility of being altered or influenced; it is up to date and gives an exact picture at that specific instant in time. The internet allows scraping of any publicly available content worldwide.

There is generally no cost to acquiring the data. Given this plethora of advantages, it is no wonder that the information-hungry private sector is already on the bandwagon, using scraped data sets to produce novel insights and timely indicators. Governments, regulators and civic groups can derive intelligence on public sentiment, policy transmission, early warning signals and other socio-political indications on a near real-time basis. Scraped data can also provide a rich source of diverse information for robust modelling and forecasting.

Q. Is scraping legal?

A. Web scraping is the automated gathering of data from a public third-party website, generally for a purpose different from its original intent. When is it okay to grab data from someone else's website without their explicit permission? The legality of web scraping varies across the world. In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

In the US, website owners can use copyright infringement and violation of the Computer Fraud and Abuse Act to prevent scraping; this is, however, subject to meeting various criteria, and the case law is still evolving. While scraping is not illegal in the UK, the information itself may be subject to copyright. In Canada, the legality of web scraping has not been fully defined.

In 2017, a Californian federal judge ruled in favour of hiQ (a startup that used web scraping), prohibiting LinkedIn (hiQ's scraping source) from preventing access, copying, or use of public profiles on LinkedIn's website. LinkedIn was further ordered to remove any technical or legal mechanisms put in place to prevent such access, and to refrain from implementing these means in the future.

Generally, however, websites will allow third-party scraping if due care is taken that the web-scraping process does not affect the performance or bandwidth of the web server. Google's search engine builds its search database by scraping the web, and many other organisations, including government institutions, use scraping to build their databases.
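"Due care" in practice usually means honouring a site's robots.txt rules and throttling request rates so the scrape never burdens the server. The sketch below shows how Python's standard library can check those rules; the robots.txt content and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Honour the site's published crawling rules before scraping.
# This robots.txt content and these URLs are made-up examples.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

allowed = rp.can_fetch("*", "https://example.com/products")
blocked = rp.can_fetch("*", "https://example.com/private/accounts")
print(allowed, blocked)  # True False

# A polite scraper would also pause between requests, e.g.
# time.sleep(5), matching the advertised crawl delay.
```

In production the robots.txt would be fetched from the live site (RobotFileParser also offers `set_url()` and `read()` for that), but the courtesy principle is the same: stay out of disallowed paths and keep the request rate low.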

Q. What are the technical challenges?

A. Scraped data is characteristically incomplete and erroneous, with missing values, and so requires a cleaning and filtering process. Another salient aspect of web scraping is that the source can and will change abruptly as websites are redesigned, which requires constant vigilance from the scraping tools and team, with the scraping code updated accordingly.

Unlike in surveys, where specific questions can be used to obtain the expected information, a common issue in scraping is that websites do not contain the desired variables directly. Researchers need to identify a variable that is not in itself directly relevant but that serves in place of an unobservable or immeasurable one, known as a proxy variable. An example of a proxy variable is the use of GDP as an indicator of quality of life.

Some websites are known to actively discourage scraping for various reasons. Poorly designed sites, or those that use complex AJAX-based technology may also be hard to scrape. Fortunately, such instances seem to be infrequent so far.

Q. What are the global examples of scraping?

A. Bundesbank, the German central bank, uses scraped web data to understand depositors' expectations and attitudes towards funds held in banks. This is done by counting the number of times people look up (google) the term 'deposit insurance'. It also uses scraping to obtain interest rates and the balances of overnight banking deposits, in addition to using the (near) real-time availability of scraped web data to anticipate trends in deposits and the risk of bank runs.

The Central Bank of Armenia collects prices posted online by supermarkets daily, which allows advanced estimates of consumer inflation to be computed. It also scrapes housing prices from real estate agencies to create a housing price index. Data on music downloads from Spotify, used as a proxy for sentiment, has been shown to be as good as a standard consumer confidence survey.
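An advanced inflation estimate of the kind described can be sketched as a fixed-basket index over daily scraped prices. The items, prices and base period below are invented numbers; the formula shown is a basic unweighted mean of price relatives (a Carli-style index), not necessarily the method any central bank actually uses.

```python
# Fixed-basket price index over scraped prices: compare today's price of
# each item to a base period, then average the relatives.
# All items and prices below are illustrative.
base_prices = {"bread": 100.0, "milk": 180.0, "rice": 210.0}
today_prices = {"bread": 104.0, "milk": 189.0, "rice": 214.2}

# Unweighted average of price relatives (Carli-style index, base = 100).
relatives = [today_prices[item] / base_prices[item] for item in base_prices]
index = 100 * sum(relatives) / len(relatives)
print(round(index, 1))  # 103.7, i.e. roughly 3.7% above the base period
```

Because scraping delivers fresh prices daily, such an index can be recomputed every day, well ahead of the monthly survey-based figure.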

Many organisations use scraping to obtain contact information such as emails, URLs and phone numbers for creating contact lists and lead generation. Social media is scraped to gather data on customers and trends. Data is also scraped for market analysis and research, product and service pricing, competitors' price analysis, monitoring prices in different markets, and understanding customer feedback and reviews.
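Contact-list scraping often reduces to pattern matching over page text. The page snippet and regular expressions below are simplified examples; production-grade email and phone patterns are considerably more involved.

```python
import re

# Lead generation as pattern matching: pull emails and URLs out of
# scraped page text. The snippet and patterns are simplified examples.
page_text = """
Contact our sales team at sales@example.com or support@example.com.
Hotline: +94 11 234 5678. Visit https://example.com/contact for more.
"""

emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", page_text)
urls = re.findall(r"https?://\S+", page_text)
print(sorted(set(emails)), urls)
```

Deduplicating (the `set`) matters in practice, since the same address typically appears on many pages of a site.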

A property investor can extract a wealth of information from property rental and sales sites, allowing them to determine the locations that receive the highest reviews. Investment analysts and traders can obtain massive amounts of fundamental, price and volume data. There is an exciting new branch of finance that uses textual data for sentiment analysis and sales prediction.

Q. How can we exploit the opportunity?

A. Every day, the web grows by a further 200,000 pages and is becoming more versatile and, more importantly, more structured. We already see the semantic web and technologies such as HTML5 striding steadfastly in this direction. Ultimately, smart tools and algorithms will be able to locate and scrape data automatically, analysing published web documents and indexing them in a more flexible, semantic format.
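A small taste of that structured future already exists: many pages embed machine-readable JSON-LD blocks that a scraper can lift out without guessing at page layout. The product snippet below is an invented example of such embedded data.

```python
import json
from html.parser import HTMLParser

# The semantic web in miniature: pages increasingly embed machine-
# readable JSON-LD, which a scraper can extract without parsing layout.
# The product markup below is an invented example.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
</head><body>...</body></html>
"""

class JsonLdExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_ld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_ld = True

    def handle_data(self, data):
        if self.in_ld and data.strip():
            self.items.append(json.loads(data))
            self.in_ld = False

extractor = JsonLdExtractor()
extractor.feed(html)
print(extractor.items[0]["name"])  # structured data, no layout guessing
```

The more sites publish data this way, the less scraping code has to be rewritten every time a page is redesigned.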

As a data source, scraping is unchallenged. As time progresses, companies that want to get ahead will seize the opportunities presented by scraping and capitalise on them, tapping into this wealth of real-time information. Meanwhile, web data is growing at a pace far faster than anyone could manually access, copy and paste. Often, websites will allow third-party scraping; most websites, for example, give Google express or implied permission to index their pages. Although scraping is ubiquitous, its legality is not always clear-cut, and a variety of laws may apply to unauthorised scraping, including contract, copyright and trespass-to-chattels laws.

Web scraping is no longer an option, but a must for a competitive business.



From Daily News
