Hi Robert, 

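Maybe you could use a regular expression that detects titles made up of one long run of digits with no spaces or other separators. In your example the title looks hexadecimal (digits 0-9 plus the letters a-f), so the pattern would need to allow those letters as well.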
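
Here is a rough Python sketch of what I mean. It is only a guess at a filter, not a confirmed explanation of where these titles come from. The names MIN_HEX_CHARS, looks_hex_encoded and decode_hex_title are made up for illustration, and the length threshold of 16 is an arbitrary assumption you would want to tune against real titles (a legitimate title such as a year or a short hex-looking word should not be flagged).

```python
import re

# Assumption: a suspicious title consists only of lowercase hex digits
# and is fairly long. The threshold below is a guess, not a known rule.
MIN_HEX_CHARS = 16
HEX_TITLE = re.compile(r"[0-9a-f]{%d,}" % MIN_HEX_CHARS)


def looks_hex_encoded(title):
    """True if the whole title is a long, even-length run of hex digits."""
    return len(title) % 2 == 0 and HEX_TITLE.fullmatch(title) is not None


def decode_hex_title(title):
    """Try to read the title as hex-encoded UTF-8 bytes.

    Returns the decoded text, or None if the title does not look
    hex-encoded or the bytes are not valid UTF-8.
    """
    if not looks_hex_encoded(title):
        return None
    try:
        return bytes.fromhex(title).decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    sample = "4567797074e280934d6f726f63636f5f72656c6174696f6e73"
    print(looks_hex_encoded(sample))
    print(decode_hex_title(sample))
```

Run on the title from your message, the decode step gives "Egypt–Morocco_relations", so the value appears to be the UTF-8 bytes of an ordinary title written out in hex. If that holds for the other rows, decoding them may be more useful than filtering them out.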

Regards

Imene

On Monday, February 11, 2013, wrote:
Send Xmldatadumps-l mailing list submissions to
        xmldatadumps-l@lists.wikimedia.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
or, via email, send a message with subject or body 'help' to
        xmldatadumps-l-request@lists.wikimedia.org

You can reach the person managing the list at
        xmldatadumps-l-owner@lists.wikimedia.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Xmldatadumps-l digest..."


Today's Topics:

   1. Weird page titles in page table (Robert Crowe)
   2. Re: Weird page titles in page table (Ariel T. Glenn)


----------------------------------------------------------------------

Message: 1
Date: Sun, 10 Feb 2013 14:08:56 -0800
From: "Robert Crowe" <robert@ourwebhome.com>
To: <xmldatadumps-l@lists.wikimedia.org>
Subject: [Xmldatadumps-l] Weird page titles in page table
Message-ID: <010301ce07db$39477660$abd66320$@com>
Content-Type: text/plain;       charset="us-ascii"

I'm seeing rows in the page table that have weird titles, and I'd like to be
able to identify and filter them out, but I don't see properties that seem
to identify them.  For example:

page.page_id = 21441554
page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73

What should I look for to identify pages like that?

Thanks,

Robert





------------------------------

Message: 2
Date: Mon, 11 Feb 2013 07:51:03 +0200
From: "Ariel T. Glenn" <ariel@wikimedia.org>
To: Robert Crowe <robert@ourwebhome.com>
Cc: xmldatadumps-l@lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] Weird page titles in page table
Message-ID: <1360561863.18140.5.camel@trouble.localdomain>
Content-Type: text/plain; charset="UTF-8"

On Sun, 10-02-2013, at 14:08 -0800, Robert Crowe wrote:
> I'm seeing rows in the page table that have weird titles, and I'd like to be
> able to identify and filter them out, but I don't see properties that seem
> to identify them.  For example:
>
> page.page_id = 21441554
> page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73
>
> What should I look for to identify pages like that?

Which dump is this from?

Ariel




------------------------------


End of Xmldatadumps-l Digest, Vol 36, Issue 1
*********************************************