Is it possible to extract text from Google Scholar results?


Yes and no.  I have been experimenting with this for a long time on my website, Meta-Guide.com.  AFAIK, there is no Google Scholar API and no XML search feed; however, email alerts are available, and these can be partially parsed into a feed with the Gmail API.
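To make the alert idea a bit more concrete, here is a minimal sketch of the parsing half only: turning the HTML body of one alert email into feed-like (title, link) items. Fetching the message through the Gmail API is left out. The sample markup, the class name `AlertLinkParser` and the function name `alert_to_items` are my own assumptions for illustration; real alert emails may be structured differently.

```python
from html.parser import HTMLParser


class AlertLinkParser(HTMLParser):
    """Collect (title, url) pairs from every <a href=...> in an HTML body."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished (title, url) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            title = " ".join("".join(self._text).split())
            if title:
                self.items.append((title, self._href))
            self._href = None


def alert_to_items(html_body):
    """Return a list of (title, url) tuples found in the HTML body."""
    parser = AlertLinkParser()
    parser.feed(html_body)
    parser.close()
    return parser.items


# Hypothetical alert body; the real markup is an assumption here.
sample = """
<h3><a href="https://example.org/paper1.pdf">A <b>chatbot</b> survey</a></h3>
<div>Snippet text ...</div>
<h3><a href="https://example.org/paper2">Dialogue systems</a></h3>
"""

for title, url in alert_to_items(sample):
    print(title, "->", url)
```

The resulting pairs are the raw material for a feed; a real version would probably also want to filter out navigation and unsubscribe links.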

Scraping the search results is made more difficult by the maximum of 20 items per page.  I normally work with 100 items regardless, so I tune the search parameters to get the most out of that 100-item window.  I operate on the premise that my searches are sophisticated and likely unique; since the historical body of academic literature does not change much, the results of those searches are value-added and worth keeping.
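The arithmetic of that window is simple: 100 items at 20 per page means five result pages, at offsets 0, 20, 40, 60 and 80. A rough sketch of building those page URLs follows. The parameter names `q`, `start` and `num` are assumptions based on how result-page URLs appear in a browser, not a documented interface, and the function name `result_page_urls` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed results endpoint, taken from the browser address bar.
BASE_URL = "https://scholar.google.com/scholar"


def result_page_urls(query, total=100, per_page=20):
    """Return one URL per results page needed to cover `total` items."""
    urls = []
    for start in range(0, total, per_page):
        # q, start and num are assumed parameter names (see note above).
        params = {"q": query, "start": start, "num": per_page}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = result_page_urls('"dialog system" evaluation')
print(len(urls), "pages")
for url in urls:
    print(url)
```

This only builds the URLs; it does not fetch anything.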

Further, there is considerable AI behind Google snippets.  I operate on the additional premise of massive multiplicity, or redundancy: if something is said enough times, in enough different ways, then it must tell us something.  ;^)  For instance, I believe there is enough information in 100 targeted Google snippets to (automatically) create a Wikipedia entry on any given topic.
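One very crude way to look at the redundancy premise is to count how many distinct snippets each term appears in and keep the terms that recur. This is only a toy illustration, nowhere near generating an encyclopedia entry; the stopword list, the sample snippets and the function name `recurring_terms` are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
             "the", "to", "with"}


def recurring_terms(snippets, min_snippets=2):
    """Return (term, count) for terms found in at least `min_snippets` snippets."""
    doc_freq = Counter()
    for snippet in snippets:
        # Use a set so each term counts at most once per snippet.
        words = set(re.findall(r"[a-z]+", snippet.lower()))
        doc_freq.update(words - STOPWORDS)
    kept = [(term, n) for term, n in doc_freq.items() if n >= min_snippets]
    # Most widespread terms first; ties broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Made-up snippets standing in for captured search results.
snippets = [
    "Chatbots are evaluated with user satisfaction surveys.",
    "User satisfaction is a common measure for chatbots.",
    "We survey evaluation of chatbots in education.",
]

for term, count in recurring_terms(snippets):
    print(term, count)
```

Note that "survey" and "surveys" are counted separately here; catching things said "in enough different ways" would need at least stemming, and realistically much more.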

Parsing captured links, PDF or otherwise, is not rocket science; for instance, there are plenty of examples of so-called full-text RSS feed parsers.  Of course, parsing a PDF is a little more challenging than parsing regular web text.  Mendeley is an interesting case in point: there you can see which PDFs are more easily parsed than others, and where the mistakes occur.  Google Scholar is certainly not perfect at parsing PDFs either, particularly theses.  I am impressed, though, with how Google Scholar turns up papers published inside books, via Google Books.
