Perplexity has hit a PR Roadblock due to its Aggregation Misconduct
The warped reality of search bots trying to aggregate citations and mostly failing.
Hey Everyone,
I hope you are enjoying your summer.
So what does the Perplexity vs. Forbes story tell us about the future of media? It’s actually a more interesting story than it sounds like on paper.
I was struck by how Perplexity’s conduct is catching up to its reputation in June 2024. Robb Knight wrote this op-ed and since then things have become a bit heated on the internet.
In the week that followed, Perplexity has been under the microscope of the likes of Wired and Platformer. The story is fairly nuanced, but indicative of how toxic Generative AI search bots and chatbots are for the future of media, objectivity and copyright.
“Plagiarism engines” like Perplexity and Arc Search (according to Platformer) have attracted millions of users by ripping off other people’s work, depriving publishers of the traffic and advertising revenue that once sustained them.
Wired calls it a “Bullshit Machine.” Forbes is upset.
Perplexity’s “answer engine” works by crawling large swaths of information on the web and then creating a big database (an index) of content it grabs from web pages.
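To make "an index" a bit more concrete, here is a minimal sketch of the general idea: take some page text and record which pages contain which words. This is a toy inverted index, not a description of how Perplexity actually builds or stores its index; the URLs, page text and function names are all made up for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages standing in for crawled web content.
PAGES = {
    "https://example.com/drones": "Exclusive report on a secretive drone project",
    "https://example.com/search": "How answer engines index the web",
}
INDEX = build_index(PAGES)
```

A query like `search(INDEX, "drone project")` returns only the first URL. The point for publishers is that once text is in the index, the engine can answer from it without the reader ever visiting the original page.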
Randall Lane, Forbes’ chief content officer, explained in a blog post.
By May 30th, Perplexity had launched Pages, a new tool that transforms a researched topic into a well-structured, beautifully formatted article with ease. The problem? It was ripping off content, sometimes word for word, and often content that sat behind a paywall.
Ben Thompson even came to their defense.
Since OpenAI plans to launch its own web-search product, this may also be their PR/Comms team’s MO of trying to hurt future competitors. Why would it come out of Forbes, of all places? It doesn’t make a whole lot of sense.
Forbes is, how do we put this, not exactly the most original of publications.
Rough Edges, Side-Stepping robots.txt, and Citation Abuse
Forbes' Accusation: Forbes reported that Perplexity had plagiarized its journalism by repurposing an exclusive scoop about Eric Schmidt's new drone project for Perplexity's 'Pages' product. This product creates automated web pages based on user prompts. Forbes noted that Perplexity's version included eerily similar wording, lifted fragments, and even an illustration from Forbes' previous stories without proper attribution.
Wired's Investigation: Wired found compelling evidence that Perplexity was ignoring the Robots Exclusion Protocol, which websites use to block automated crawlers. Wired verified this by creating a new website and monitoring its server logs, confirming that Perplexity's bot accessed their site using unpublicized IP addresses.
Perplexity's Response: Perplexity's CEO, Aravind Srinivas, acknowledged the issues as 'rough edges' of a newly released product and promised improvements in attribution over time. The company also stated that it is committed to working with media companies to create aligned incentives and is developing a revenue-sharing program for media companies.
Legal Actions: Following these revelations, Forbes sent a cease-and-desist letter to Perplexity and threatened legal action. This highlights the severity of the accusations and the potential legal consequences for Perplexity.
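Wired's approach, as described above, was to stand up a fresh website and watch the server logs. A rough sketch of that kind of check might look like the following: scan access-log lines and flag requests from IP addresses outside the range a crawler operator has published. This is not Wired's actual tooling. The IP ranges are reserved documentation addresses, and the log lines, dates and function name are invented for illustration.

```python
import re
from ipaddress import ip_address, ip_network

# Hypothetical "published" crawler range (a reserved documentation network).
PUBLISHED_RANGES = [ip_network("192.0.2.0/24")]

# Matches the start of a common-log-format line: IP, timestamp, request, status.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def unpublished_hits(log_lines, published=PUBLISHED_RANGES):
    """Return (ip, path) pairs for requests from outside the published ranges."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip = ip_address(match.group("ip"))
        if not any(ip in network for network in published):
            hits.append((str(ip), match.group("path")))
    return hits


# Made-up sample log: one request from the published range, one from outside it.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET /test-page HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2024:10:00:05 +0000] "GET /test-page HTTP/1.1" 200 512',
]
```

On a brand-new site that nobody links to, a hit from an unlisted address right after you prompt a chatbot about that page is the kind of signal such a test is looking for.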
The Wacky World of Web Scraped Answers and Search Bots that don’t do Citations Well
Perplexity was side-stepping robots.txt. Wired then said Perplexity was probably using a "secret IP address" to access content not intended for AI.
Perplexity unwisely sent one of its curated Pages to subscribers.
Apparently, Perplexity's AI-generated webpage didn't prominently cite Forbes, Paczkowski said, and elevated other news coverage of the story in its citations — including an article from Business Insider about the Forbes piece — over Forbes' original reporting.
Despite claiming to respect website operators' wishes, Perplexity appears to scrape content from sites that have blocked its crawler, using an undisclosed IP address, a Wired investigation found.
Perplexity, of course, considers itself an aggregator of information. However, its version of aggregation seems prone to all sorts of problematic practices: poor citations, copyright infringement, plagiarism and hallucinated responses. All this even though many white-collar workers prefer Perplexity to Google.
Perplexity Pages and Genspark.AI
These articles, stitched together from multiple sources, do of course produce a kind of contextual gibberish that is far from perfect. How Perplexity was marketing Pages also seemed a bit deceptive.
Perplexity, an AI-powered search startup, has been accused of plagiarizing content from various news publications. Reports from Forbes, Wired, and other sources indicate that Perplexity's new feature, Perplexity Pages, repurposes journalistic content without proper attribution. There is some attempt at attribution and citations, it’s just not very good.
According to Axios, who would know a thing or two about this, media publishers and self-employed creators fighting to protect their intellectual property are stuck in a game of whack-a-mole as they confront not just the biggest AI companies, like OpenAI or Google, but also smaller AI startups.
The response of the CEO of Perplexity to the drama was also rather weak. It’s a PR drama they didn’t need just as they were getting some momentum and managed to get more funding. Wired also said Perplexity was probably bypassing publishers who indicated via their website code that AI web scraping was off-limits.
Web Scraping and Robots Exclusion Protocol Norms
About 4 days ago Perplexity said:
“Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,” said Perplexity cofounder and CEO Aravind Srinivas in a phone interview Friday. “I think there is a basic misunderstanding of the way this works,” Srinivas said. “We don’t just rely on our own web crawlers, we rely on third-party web crawlers as well.”
Perplexity says they use third-party web crawlers as well.
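Why does the third-party point matter? Rules in robots.txt are keyed by user agent, so a block aimed at one named bot says nothing about a crawler that identifies itself differently. Here is a small sketch using Python's standard `urllib.robotparser`. It assumes a site blocks the `PerplexityBot` user agent; `ExampleThirdPartyBot` is a made-up name, and nothing here claims to show what any real crawler did.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one named crawler from the whole site.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named bot is blocked; a differently named crawler is not covered by the rule.
BLOCKED = allowed("PerplexityBot", "https://example.com/article")
THIRD_PARTY = allowed("ExampleThirdPartyBot", "https://example.com/article")
```

A publisher who wants to shut out everyone has to use a wildcard rule (`User-agent: *`), and even then the protocol is a convention that crawlers choose to honor, not an enforcement mechanism.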
The News Media Alliance, a trade group representing more than 2,200 U.S.-based publishers, expressed concern about the impact that ignoring "do not crawl" signals could have on its members.
The Future of Search is Getting Weird